Abstract
Takers of educational tests often receive proficiency levels instead of or in addition to scaled scores. For example, proficiency levels are reported for the Advanced Placement (AP®) and U.S. Medical Licensing examinations. Technical difficulties and other unforeseen events occasionally lead to missing item scores and hence to incomplete data on these tests. The reporting of proficiency levels to the examinees with incomplete data requires estimation of the performance of the examinees on the missing part and essentially involves imputation of missing data. In this article, six approaches from the literature on missing data analysis are brought to bear on the problem of reporting of proficiency levels to the examinees with incomplete data. Data from several large-scale educational tests are used to compare the performances of the six approaches to the approach that is operationally used for reporting proficiency levels for these tests. A multiple imputation approach based on chained equations is shown to lead to the most accurate reporting of proficiency levels for data that were missing at random or completely at random, while the model-based approach of Holman and Glas performed the best for data that are missing not at random. Several recommendations are made on the reporting of proficiency levels to the examinees with incomplete data.
1. Introduction
Takers of several large-scale educational tests receive a pass–fail status or a classification or a proficiency level instead of or in addition to a scaled score. For example, those taking any title of the Advanced Placement (AP®) examination receive a proficiency classification that is an integer between 1 and 5 indicating how qualified they are to receive college credit and placement; a classification of 5 is equivalent to a college course grade A or A+ (e.g., Patterson & Ewing, 2013). Takers of the Praxis® tests and the U.S. Medical Licensing Examination® (USMLE) examinations receive a pass or a fail classification (e.g., Educational Testing Service, 2020; USMLE, 2020). For convenience, tests (such as AP, Praxis, and USMLE) that involve the reporting of classifications will henceforth be referred to as classification tests.
Occasionally, portions of educational tests are lost, missing, or unscorable due to unforeseen events that are unrelated to examinee behavior. For example, (a) Gayle (2017) reported that portions of the AP test were lost for dozens of students from a county, (b) poor audio quality or excessive background noise during recording may render some item responses unscorable on speaking items of a test such as the Praxis® Spanish World Language test (Educational Testing Service, n.d.) that includes Speaking items, and (c) technology-related disruptions may lead to missing item scores, as was observed for a state test described by Byrne (2017). The loss, missingness, or unscorability of a portion of a classification test results in missing item scores and prevents the computation of raw/scaled/pattern scores for the corresponding examinees. Because the classification for an examinee is computed from the examinee’s score, missing scores create difficulty in reporting classifications for the corresponding examinees. The test administrators have the option of not reporting classifications to such examinees. Alternately, they have the option of reporting classifications after employing an approach for imputation/projection/estimation of the corresponding classifications. Students whose AP test responses are lost have the option of accepting a projected score based on the rest of their scored examination, having the score canceled, or taking the missing portion of the examination, as implied by Codes 37 and 38 on Page 6 of College Board (n.d.) and as stated by Gayle (2017). Imputed classifications are also reported for the USMLE when some item scores are missing due to reasons such as technical problems (e.g., Jodoin & Rubright, 2020).
In this article, the procedure used to report classifications to examinees whose data are incomplete due to unforeseen events that are unrelated to examinee behavior is referred to as “imputation” and the outcome of this imputation procedure is referred to as an “imputed” classification. Imputation of a classification involves making inferences on the missing portion of the test based on the nonmissing portion and is a special case of imputation of missing data (e.g., Little & Rubin, 2002, p. 20) where an investigator makes inferences from data that include some missing observations. While researchers have examined various problems related to missing data in educational measurement (e.g., De Ayala et al., 2001; Finch, 2008; Holman & Glas, 2005; Sijtsma & van der Ark, 2003; Smits et al., 2002; Xiao & Bulut, 2020), there is a lack of research on imputing classifications in the presence of missing item scores, Feinberg (2021) and Sinharay (2021a) being exceptions. Accordingly, the goal of this article is to explore imputation approaches that allow test administrators to impute and report accurate, and hence fair and valid, classifications to the examinees with missing item scores.
Section 2 includes a review of the literature on the imputation of missing data for educational and psychological tests. Section 3 includes brief descriptions of how several imputation approaches that have been used to impute missing data in other contexts in educational measurement can be used to impute classification. Section 4 includes a comparison of seven imputation approaches using real data from four classification tests. Section 5 includes discussions and conclusions.
This article does not include any investigation on the estimation of any model parameters (like item response theory or IRT item parameters) or summary statistics (like mean scores or reliability) in the presence of missing item scores. Researchers such as Edwards and Finch (2018), Finch (2008), and Sijtsma and van der Ark (2003) investigated these problems, and the model parameters for the classification tests considered in this article are accurately estimated because of the availability of a large sample of examinees with no missing data. Item responses are often missing due to various types of examinee behavior. The most common of these types of missing responses are omitted and/or not-reached responses. Examples of operational practice to deal with omitted and/or not-reached responses can be found in Allen et al. (2001). Researchers such as De Ayala et al. (2001) and Glas and Pimentel (2008) explored various approaches for handling omitted and not-reached responses—this article does not include an examination of these approaches and does not involve imputation of scores on items with omitted and not-reached responses. Instead, examinee records with such responses, which constitute a very small part of the available data sets, are not included in any computations in this article.
2. Literature Review
Graham (2009), Graham (2012), Schafer (1997), Schafer and Graham (2002), and Sinharay et al. (2001) provided reviews of the literature on missing data analysis in general.
Table 1 shows a list of several research studies that dealt with the problem of missing item scores in the context of educational or psychological measurement. The studies (listed in the second column of the table) are grouped according to their main focus (first column). Some of the studies included in Table 1 have multiple foci (e.g., Huisman & Molenaar, 2001)—so they appear in multiple rows.
Table 1. The Focus Areas of Some of the Existing Studies on Missing Data in Educational and Psychological Measurement
Table 1 indicates that a wide variety of problems related to missing item scores have been the focus of existing research. However, there is a lack of research, with the exceptions of Feinberg (2021) and Sinharay (2021a), on the reporting of examinee classifications, which is the problem of interest in the current article.
Feinberg (2021) considered the problem of reporting classifications to examinees who are interrupted during a test and cannot finish the test, so that scores are missing on the items toward the end of the test. He focused on classification tests that only involve “pass” and “fail” classes (i.e., examinees are classified in two classes) and include only dichotomous items and suggested four approaches for estimating the probability of passing on an incomplete classification test. Feinberg (2021) found that none of the approaches is an overwhelming winner while an IRT-based approach performs the best overall. Sinharay (2021a) considered the more general problem of estimating the probability of passing on an incomplete classification test that could include polytomous items and could have missing item scores anywhere (and not necessarily at the end) on the test. He suggested two approaches—one each based on IRT and classical test theory—that were found to perform better than the approaches of Feinberg (2021). The four approaches of Feinberg (2021) and the two approaches of Sinharay (2021a) are briefly described in Online Appendix A.
The current article focuses on a slightly different and more general problem where (a) the interest lies in imputing the classifications themselves and not in estimating the probabilities of classifications, (b) the missing scores could occur on any item and not necessarily on those toward the end of the test, (c) the test may include polytomous items and the item scores are linearly weighted to produce a weighted sum score or total composite score (TCS) that is then converted to a classification using one or more cut scores, and (d) the examinees may be classified into more than two classes. Therefore, the approaches of Feinberg (2021) and Sinharay (2021a) do not apply to the problem considered in this article. Also, a difficulty of applying the approaches of Feinberg (2021) and Sinharay (2021a) to the problem of interest in this article is that Feinberg (2021) and Sinharay (2021a) placed some examinees into an indeterminate class, whereas, for the operational tests considered in this article, the classification resulting from the imputation procedure is not allowed to be “indeterminate.” 1 The imputation approach based on IRT models, which is considered later in this article, is close in spirit to the IRT-based approaches that Feinberg (2021) and Sinharay (2021a) found the best overall.
3. Imputation Approaches
Seven imputation approaches were considered in this article. One of the approaches (linking) is used operationally to impute missing classifications for the tests considered later in this article. The other six approaches have been used in the analysis of missing data in educational measurement but never have been applied to impute missing classifications. The first two imputation approaches are not based on any rigorous prediction models while the last five approaches are. It is important for an imputation approach to be based on a rigorous prediction model in case the imputed classifications are questioned by the users of these high-stakes tests. The imputation approaches are explained below using the hypothetical example of a classification test that includes nine items and a hypothetical Examinee 1 for whom scores on Items 1 and 2 are missing and scores on Items 3 through 9 are available. Let us consider that the TCS on the test is computed as the weighted sum of the item scores before being converted to a classification using a set of cut scores. Let us further assume that the possible scores on Item 1 are 0, 1, 2, 3, and 4 and those on Item 2 are 0, 1, 2, and 3 and that the generalized partial credit model (GPCM; Muraki, 1992) provides an adequate fit to the data from the test.
3.1. Person-Mean Imputation (PMI)
The PMI or proration approach to impute missing scores (e.g., Huisman, 1999) involves the imputation of each missing item score of an examinee by the mean of the nonmissing item scores for that examinee. If the maximum possible score varies over the items, all item scores are converted to a proportional score (by dividing the item scores by the maximum possible score on the respective items) before applying the PMI approach to impute a proportional score; then the imputed proportional score is multiplied by the maximum possible score on the item to obtain the imputed item score. To apply this approach to Examinee 1, one has to first compute the average of the proportional scores over Items 3 through 9; imputed scores on Items 1 and 2 for the examinee can be obtained by multiplying this average by 4 and 3, respectively. Then, the imputed TCS for Examinee 1 can be computed as the weighted sum of the actual/observed scores on Items 3 through 9 (or, the observed partial composite score or PCS on Items 3–9) plus the imputed PCS on Items 1 and 2, which is the weighted sum of the imputed scores on Items 1 and 2. Then, the abovementioned cut scores can be used to convert the imputed TCS to an imputed classification for Examinee 1. The PMI approach is not based on any statistical theory or prediction model but is simple and does not require any specialized software. Because this approach implicitly assumes that all items are of equal difficulty, the approach may lead to inaccurate imputation when the items with missing scores differ in difficulty from the other items.
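As an illustration only, the PMI computation for Examinee 1 can be sketched in Python as follows. The item scores, the unit weights, the cut scores, the assumption that Items 3 through 9 are dichotomous, and the rule that a TCS at or above a cut score falls in the higher class are all hypothetical choices made for this example; they are not taken from any operational test, and the function names are likewise illustrative.

```python
def pmi_impute_tcs(scores, max_scores, weights):
    """Person-mean imputation of a total composite score (TCS).

    scores: item scores, with None marking a missing score.
    max_scores: maximum possible score on each item.
    weights: weight of each item in the composite (hypothetical here).
    """
    observed = [i for i, s in enumerate(scores) if s is not None]
    if not observed:
        raise ValueError("at least one item score must be observed")
    # Mean proportional score over the items with available scores.
    mean_prop = sum(scores[i] / max_scores[i] for i in observed) / len(observed)
    # Each missing score is replaced by mean_prop times that item's maximum.
    filled = [mean_prop * max_scores[i] if s is None else s
              for i, s in enumerate(scores)]
    return sum(w * x for w, x in zip(weights, filled))


def classify(tcs, cut_scores):
    """Return 1 plus the number of cut scores reached (assumed convention)."""
    return 1 + sum(tcs >= c for c in sorted(cut_scores))


# Hypothetical Examinee 1: Items 1 and 2 missing (maxima 4 and 3),
# Items 3 through 9 assumed dichotomous for the sake of the example.
scores = [None, None, 1, 0, 1, 1, 0, 1, 1]
max_scores = [4, 3, 1, 1, 1, 1, 1, 1, 1]
weights = [1.0] * 9            # illustrative unit weights
cut_scores = [4, 8]            # illustrative cut scores, three classes

imputed_tcs = pmi_impute_tcs(scores, max_scores, weights)
imputed_class = classify(imputed_tcs, cut_scores)
```

In this made-up case the average proportional score over Items 3 through 9 is 5/7, so the imputed scores on Items 1 and 2 are 20/7 and 15/7, and the imputed TCS is 5 + 5 = 10.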
3.2. Linking
In the application of the linking approach to impute a score for Examinee 1, the observed PCS based on the last seven items is linked (e.g., Kolen & Brennan, 2014, p. 487) to the observed TCS using the data from the subsample of examinees whose scores are available on all nine items on the test. Then, the imputed TCS of the examinee can be obtained as the value of the TCS that is equivalent to (or linked to) the examinee’s PCS. Finally, the cut scores for the test can be used to convert the imputed TCS to an imputed classification for Examinee 1. The single-group equipercentile equating (e.g., Kolen & Brennan, 2014, p. 14) approach was used to perform score linking in this article. In this approach, one finds the linked/equivalent TCS corresponding to a PCS of S as the value of the TCS that has the same percentile rank as S. Thus, if G and F, respectively, denote the cumulative distribution functions of the TCS and the PCS, then the linked TCS corresponding to a PCS of S is obtained as $G^{-1}(F(S))$.
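The percentile-rank matching behind single-group equipercentile linking can be sketched in Python as follows. This is a simplified illustration that uses raw empirical distribution functions and a generalized inverse; operational equipercentile methods typically add continuization or smoothing, which is omitted here. The function name and the small data set are hypothetical.

```python
from bisect import bisect_right


def link_pcs_to_tcs(pcs_value, complete_pcs, complete_tcs):
    """Link a PCS to the TCS scale by matching empirical percentile ranks.

    complete_pcs, complete_tcs: PCS and TCS of the examinees with complete
    records (the single group used for linking).
    Returns the smallest observed TCS t such that G(t) >= F(pcs_value),
    where F and G are the empirical CDFs of the PCS and the TCS.
    """
    pcs_sorted = sorted(complete_pcs)
    tcs_sorted = sorted(complete_tcs)
    m, n = len(pcs_sorted), len(tcs_sorted)
    j = bisect_right(pcs_sorted, pcs_value)   # F(S) = j / m
    rank = -(-j * n // m)                     # ceil(j * n / m), integer arithmetic
    return tcs_sorted[max(rank, 1) - 1]       # clamp so that a very low PCS maps to the lowest TCS


# Hypothetical complete-record sample (eight examinees).
complete_pcs = [2, 3, 3, 4, 5, 6, 6, 7]
complete_tcs = [5, 7, 8, 9, 11, 13, 14, 16]

linked_tcs = link_pcs_to_tcs(4, complete_pcs, complete_tcs)
```

Here four of the eight complete-record examinees have a PCS of 4 or less, so the linked TCS is the fourth smallest TCS in the sample.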
3.3. Imputation Based on Linear Regression
In the application of the imputation approach based on regression, a linear regression that predicts
3.4. Imputation Based on Cumulative Logistic Regression
Because the proficiency level or classification of an examinee is an ordinal variable, the cumulative logistic regression model or the proportional odds model (e.g., Agresti, 2013, p. 301) is a natural approach for imputing the classification. Let the variable
where
3.5. Imputation Based on IRT
The missing score on an item for an examinee can be imputed by its posterior expectation given the available item scores under an IRT model after the model parameters have been estimated using an examinee sample (e.g., Korobko et al., 2008). In this article, the three-parameter logistic model (3PLM) was used for the dichotomous items and the GPCM was used for the polytomous items—this combination of IRT models is used in several large-scale assessments including the National Assessment of Educational Progress (Allen et al., 2001, pp. 229–230). The steps for the IRT-based imputation approach for Examinee 1 are the following:
Estimate the parameters of the IRT model from the data. Ignore the items with missing scores in this step.
Impute the PCS on Items 1 and 2 for the examinee as
Compute the imputed TCS of Examinee 1 as the imputed PCS on Items 1 and 2 plus the actual PCS on Items 3 through 9.
Use the abovementioned cut scores to convert the imputed TCS to an imputed classification for Examinee 1.
This approach is similar in spirit to the (IRT-based) Lord–Wingersky approach of Feinberg (2021) and the modified Lord–Wingersky approach of Sinharay (2021a). The R package mirt (Chalmers, 2012) was used to fit the IRT models in this article. The integral in Equation 2 was approximated using numerical integration.
In application of this approach to the operational data sets in this article, the few omitted and not-reached responses that are observed for the data sets are not included in any computations. The application of the approach involves the assumption that the IRT model fits the data. So, the approach may not perform well when there is misfit of the IRT model to the data.
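A rough Python sketch of imputation by posterior expectation is given below. It is not the mirt-based implementation used in this article and does not reproduce Equation 2 verbatim. It assumes known item parameters, a standard normal prior on the ability, a simple equally spaced quadrature grid, and the GPCM for every item (a dichotomous item is treated as a two-category GPCM item, so no guessing parameter is included, unlike the 3PLM). All parameter values and function names are hypothetical.

```python
import math


def gpcm_probs(theta, a, steps):
    """Category probabilities P(X = 0, ..., K | theta) under the GPCM."""
    # Exponent of category k is the sum of a * (theta - b_v) for v = 1..k.
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + a * (theta - b))
    top = max(exponents)                      # subtract the maximum for numerical stability
    numer = [math.exp(e - top) for e in exponents]
    total = sum(numer)
    return [x / total for x in numer]


def impute_missing_pcs(observed, missing, grid=None):
    """Posterior expected weighted score on the items with missing scores.

    observed: list of (score, a, steps) for items with available scores.
    missing: list of (weight, a, steps) for items with missing scores.
    """
    if grid is None:
        grid = [-4 + 0.1 * q for q in range(81)]   # assumed quadrature grid
    # Unnormalized posterior: N(0, 1) prior times likelihood of observed scores.
    post = []
    for theta in grid:
        dens = math.exp(-0.5 * theta * theta)
        for score, a, steps in observed:
            dens *= gpcm_probs(theta, a, steps)[score]
        post.append(dens)
    norm = sum(post)
    expected = 0.0
    for theta, p in zip(grid, post):
        for weight, a, steps in missing:
            probs = gpcm_probs(theta, a, steps)
            item_mean = sum(k * pk for k, pk in enumerate(probs))
            expected += (p / norm) * weight * item_mean
    return expected


# Hypothetical parameters: Items 1 and 2 (maxima 4 and 3) are missing.
missing = [(1.0, 1.0, [-1.0, -0.3, 0.3, 1.0]), (1.0, 1.2, [-0.5, 0.0, 0.5])]
difficulties = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
all_correct = [(1, 1.0, [b]) for b in difficulties]
all_wrong = [(0, 1.0, [b]) for b in difficulties]

pcs_high = impute_missing_pcs(all_correct, missing)
pcs_low = impute_missing_pcs(all_wrong, missing)
```

The imputed PCS returned by the sketch would then be added to the observed PCS on Items 3 through 9 and compared with the cut scores, as in the steps listed above.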
3.6. Multiple Imputation (MI) Using Data Augmentation and Chained Equations
Researchers in several fields including education and psychology (e.g., Finch, 2008; Smits et al., 2002; Sulis & Porcu, 2017) have found the MI approach (e.g., Little & Rubin, 2002, p. 85) to lead to the most accurate estimation of various quantities of interest (such as item parameters and reliability) in the presence of missing data. To apply MI, one assumes a probability model for the data, computes a predictive (or conditional) distribution of the missing data given the observed data, and draws multiple values from this predictive distribution.
A recent MI approach that is gaining popularity is MI using chained equations (MICE), also known as fully conditional specification (FCS; Raghunathan et al., 2001). The MICE approach specifies the imputation model on a variable-by-variable basis by a set of conditional densities, one for each incomplete variable. Starting from an initial imputation, the MICE approach draws imputations by iterating over the univariate conditional densities. Variables are imputed one at a time, as opposed to all being simultaneously imputed as in some other MI approaches. A major advantage of the MICE approach over other MI approaches is that the conditional distributions of the variables can be specified to be models such as the ordered logit model that are appropriate for item scores that are the variables of interest in this article. The MICE approach has been found to lead to accurate imputation in comparison studies by researchers such as Horton and Lipsitz (2001). However, application of the MICE approach to educational measurement is rare, with the exception of Edwards and Finch (2018) and Xiao and Bulut (2020). While the R package mice (e.g., van Buuren & Groothuis-Oudshoorn, 2011) can be used to implement the FCS approach, the stand-alone BLIMP software (Enders et al., 2018) was used for the data examples in this article. 3 The BLIMP software uses an ordered probit model to impute incomplete ordinal variables and utilizes a fully Bayesian estimation and imputation (e.g., Enders et al., 2020). In this article, five sets of draws/imputations of missing item scores were used for the MICE approach; each set of draws was used to compute an imputed TCS; the final imputed TCS was obtained as the simple average of the five imputed TCSs. This strategy is in agreement with the recommendation on combining results from MIs by Rubin (1987, p. 76), who suggested estimating a quantity of interest by the simple average of the quantity computed from the MIs. 
Finally, the abovementioned cut scores were used to convert the final imputed TCS to an imputed classification for the examinee. This approach is based on the assumption that the ordered probit model assumed by the BLIMP software fits the data.
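The combining step described above (one imputed TCS per set of draws, a simple average, then the cut scores) can be sketched in Python as follows. The draws themselves would come from an MI program such as BLIMP or mice; the five sets of draws, the weights, the observed PCS, and the cut scores below are made-up values, and the function name is hypothetical.

```python
from statistics import mean


def combine_imputations(observed_pcs, imputed_draws, weights, cut_scores):
    """Average the imputed TCSs over the sets of draws and classify.

    observed_pcs: weighted sum of the available item scores.
    imputed_draws: one tuple of imputed item scores per set of draws.
    weights: composite weights of the items with missing scores.
    """
    imputed_tcs = [
        observed_pcs + sum(w * x for w, x in zip(weights, draw))
        for draw in imputed_draws
    ]
    final_tcs = mean(imputed_tcs)             # simple average over the draws
    # Assumed convention: a TCS at or above a cut score is in the higher class.
    level = 1 + sum(final_tcs >= c for c in sorted(cut_scores))
    return final_tcs, level


# Hypothetical values: five sets of draws for Items 1 and 2 of Examinee 1.
draws = [(3, 2), (2, 2), (4, 3), (3, 1), (3, 2)]
final_tcs, level = combine_imputations(
    observed_pcs=12.0, imputed_draws=draws, weights=(1.5, 1.0),
    cut_scores=[10, 18, 24],
)
```

With these values the five imputed TCSs are 18.5, 17.0, 21.0, 17.5, and 18.5, and their average is 18.5.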
3.7. A Method Based on Modeling of Nonignorable or Missing Not at Random (MNAR) Data
Sinharay (2021b) provided an example where item responses missing apparently due to technical problems could be MNAR. So, there was the need to include an approach that can model MNAR data. Holman and Glas (2005) suggested such an approach in which one assumes that an IRT model fits the item scores $X_i$. One also defines, for item i, a missing-score indicator variable $d_i$ that is 1 or 0, respectively, depending on whether the score on item i is missing or not. One also assumes a probability model for the $d_i$s. While Holman and Glas (2005) suggested several models for the $d_i$s, the model assumed in this article is given by 4
where
where
where the
for
4. Methods: Comparison of Imputation Approaches for Four Classification Tests
The seven imputation approaches were compared using one data set each from four high-stakes operational tests. The data sets included only examinees with complete records, that is, examinees with scores available on all items. These four data sets are referred to as the “complete data sets.” 5 In the comparison study, different parts of the complete data sets were assumed missing in various ways and the missing parts were imputed by the seven imputation approaches.
4.1. The Four Tests and the Data Sets
Three of the four data sets were from three titles of a large-scale classification test. These three titles are henceforth referred to as Tests A1, A2, and A3 and are intended to measure the mastery of the examinees on an arts subject, a science subject, and a language subject, respectively. The fourth data set originated from another large-scale classification test—this test is henceforth referred to as Test B. For each test, a weighted sum of the item scores is computed to yield a TCS for each examinee. Several cut scores are used to convert the TCSs to classifications that are reported to the examinees.
Occasionally, portions of these classification tests are lost, missing, or unscorable due to various reasons, which leads to the problem of missing item scores, or an incomplete test, for the corresponding examinees. When scores on a small portion of the test are missing for an examinee on any of these tests, an imputed classification, which is based on the available item scores, is made available to the examinee. No imputation is performed for the examinees for whom the missing portion of the test contributes roughly more than 50% to the TCS; these examinees are allowed a free retest.
Table 2 includes some information about the four tests including the number of items, mean interitem correlation, rough sample size, average percent scores, 6 reliability, and the number of proficiency levels or classes for the available data sets. The table also shows the percentages of misfitting items that are the percentages of the
Table 2. Some Information About the Four Tests
Note. The three mean interitem correlations, respectively, represent the mean interitem correlation among the constructed response (CR) items, among the multiple-choice (MC) items, and among combinations of an MC and a CR item. SD = standard deviation; TCS = total composite score.
4.2. Study Design and Computation
Descriptions of the factors that were varied in the comparison study are provided below, followed by a description of the steps of the comparison study.
4.2.1. Missing score patterns considered
Historically, for each of Tests A1 through A3 and B, numerous patterns of missing item scores have been observed due to various factors that are not related to examinee behavior. To make the comparison study realistic while keeping the size of the study manageable, the following five patterns of missing item scores are included in the comparisons: (a) 100% MC item scores missing, (b) 100% CR item scores missing, (c) 50% MC item scores missing, (d) 50% CR item scores missing, and (e) 25% CR item scores missing. These are the most common patterns of missing item scores across these four tests.
For Tests A1 through A3 and B, the numbers of CR items are not necessarily multiples of 4, but the percentage of CR item scores missing over all the replications of the comparison study is equal to 25% for the fifth missing score pattern. 7
4.2.2. Missing data mechanisms considered
For a data set that includes some missing values, the probability that a value is missing is related to the underlying values of the variables in the data set according to one of the following three missing data mechanisms (e.g., Little & Rubin, 2002, p. 11): (a) missing completely at random (MCAR), (b) missing at random (MAR), and (c) MNAR. The approaches used to analyze missing data should ideally depend on the nature of the dependencies implied by these missing data mechanisms. For data from classification tests such as Tests A1 through A3 or B, missing data for an examinee are MCAR when the missingness, or the probability of a value being missing, is unrelated to any data including the item scores (either observed or unobserved) and examinee covariates. Given that this article focuses on scores that are missing due to unforeseen events (such as technology-related disruptions) that are unrelated to examinee behavior, most such scores are expected to be MCAR, so the MCAR mechanism was considered. The missing item scores for these tests are MAR if the probability of a missing score for an examinee on an item depends on the examinee’s scores on the items on which scores are available; this type of missingness may occur if one who has performed poorly on the other items of the test makes random noises on a speaking item to produce a poor/unscorable audio recording. 8 The MNAR missing data mechanism may arise for classification tests if the probability of a missing score for an examinee on an item depends on the (hypothetical) score the examinee would have received on the item if it were not missing; this type of missingness may occur if one who is likely to perform poorly on a speaking item makes random noises to produce an unscorable audio recording. Sinharay (2021b) provided examples where item responses missing apparently due to technical problems could actually be MAR and also MNAR. Therefore, MAR and MNAR mechanisms were also considered in the comparison study.
In the comparison study, to simulate under the MCAR mechanism, the examinees and the items with missing scores were chosen at random. To simulate missing scores under the MAR mechanism for, say, the “50% missing CR” case, the PCS was calculated for each examinee as the weighted sum of the scores on all the MC items. The examinees were then divided into four groups based on their PCSs using the three quartiles of the sample distribution of the PCS. The four groups, from the lowest to the highest PCS, were assigned the probabilities .14, .03, .02, and .01 of missing responses, respectively, so that smaller scores had larger probabilities of missing responses. The mean of these probabilities across the groups is equal to .05, so about 5% of the examinees had 50% missing CR item scores. For each examinee, a uniform random number u between 0 and 1 was generated and compared with the probability of missing responses (p) assigned to the examinee. The scores on 50% randomly chosen CR items were marked missing for the examinees for whom u was smaller than p. The set of 50% CR items with missing scores varied over the examinees.
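The MAR simulation just described can be sketched in Python as follows. The quartile-based grouping and the four probabilities follow the description above; the function names, the seed, the rounding rule used to pick “50%” of the CR items, and the toy data are assumptions made for the illustration.

```python
import random
from statistics import quantiles


def simulate_mar_flags(pcs, probs=(0.14, 0.03, 0.02, 0.01), seed=0):
    """Flag examinees whose CR item scores will be marked missing under MAR.

    pcs: partial composite score (weighted sum of the MC item scores) of
    each examinee. The lowest-scoring quartile group receives probs[0].
    """
    rng = random.Random(seed)
    q1, q2, q3 = quantiles(pcs, n=4)          # three sample quartiles
    flags = []
    for s in pcs:
        group = (s > q1) + (s > q2) + (s > q3)   # 0 (lowest) to 3 (highest)
        u = rng.random()                      # uniform random number in [0, 1)
        flags.append(u < probs[group])
    return flags


def draw_missing_items(cr_items, fraction, rng):
    """Randomly choose the CR items to mark missing for one examinee."""
    k = round(fraction * len(cr_items))       # assumed rounding rule
    return sorted(rng.sample(cr_items, k))


# Toy data: 20,000 examinees with PCSs spread evenly over 0..99.
pcs = [i % 100 for i in range(20000)]
flags = simulate_mar_flags(pcs, seed=1)
rng = random.Random(2)
missing_items = draw_missing_items(["CR1", "CR2", "CR3", "CR4"], 0.5, rng)
```

Because the four probabilities average to .05, roughly 5% of the examinees are flagged, with most of them in the lowest PCS group.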
The model of Holman and Glas (2005), described in Equation 5, was used to simulate MNAR scores. To simulate MNAR scores,
4.2.3. Steps in the comparison and computation
The comparison of the imputation approaches was based on 1,000 replications of the following steps for each missing data pattern for each missingness mechanism for each of the four tests:
Draw the examinee-item combinations whose scores are to be treated as missing in the comparison:
• If the missingness mechanism is MCAR, then draw 5% of the examinees randomly from the full sample. If the missing score pattern is “100% MC item scores missing” or “100% CR item scores missing,” then mark the scores on all MC or CR items of the sampled examinees as missing. For any of the other three missing data patterns, for each sampled examinee, randomly draw the set of appropriate items (such as 50% MC items) and mark their scores as missing.
• If the missingness mechanism is MAR or MNAR, then follow the steps described earlier (in Section 4.2.2) to simulate missing item scores. For the missing data patterns other than the patterns “100% MC item scores missing” and “100% CR item scores missing,” the set of items with missing scores differs over the examinees in each replication.
Estimate the “imputation models,” which are the psychometric/statistical models underlying the imputation approaches, based on the 95% of the examinees who did not have any missing item scores; data for these examinees constitute the model-building data. For example, for the IRT-based approach, this step involves the fitting of an IRT model to the model-building data.
Impute the classifications for the 5% of the examinees drawn in the first step (data for these examinees constitute the test data set) using the imputation model estimated in the second step and their item scores that are not marked missing in the first step above. For example, for the IRT-based approach, this step involves the application of Equation 2 to impute the PCS on the items with missing scores followed by the computation of the imputed TCS on the total test and, finally, the imputed classification for the 5% of the examinees with missing scores.
The above steps provided 1,000 sets of imputed classifications for 5% examinees for each imputation approach for each combination of a missing data pattern (of a total of five), a missing data mechanism (of three), and a test (of four). Because the actual or observed classifications of all examinees were available, it is possible to compare the imputed classifications with the corresponding actual classifications to evaluate the accuracy of the imputation approaches. The following measures, which are appropriate agreement measures for ordinal data (e.g., Williamson et al., 2012), were used in the comparison of the different imputation approaches:
the percent exact agreement between the actual and imputed classifications or the percentage of examinees for whom the imputed classifications were identical to the actual classifications;
the classification boundary percentage agreement (CBPA) between the actual and imputed classifications; these are the percentage of times when both the actual and imputed classifications indicated that the examinees had a classification of c or higher. For Test B that involves three possible classifications (1, 2, and 3), there are two CBPAs, one each for
Cohen’s κ: Experts such as Williamson et al. (2012) recommended the use of a measure like Cohen’s κ (Cohen, 1960) instead of or in addition to the percentage agreement measure. The Cohen’s κ, or, κ henceforth, is defined as
Accurate imputation of the missing classifications would result in each of the percent exact agreement, the CBPAs, and κ being close to 100. In addition, because the accuracy in imputing the TCSs may be of interest, the bias and the root mean squared difference (RMSD) in imputing the TCSs were also computed for the various imputation approaches, where the bias was computed as the mean of the differences between the imputed and actual TCSs and the RMSD was computed as the square root of the mean of the squared differences between the imputed and actual TCSs. The TCS is not imputed in the logistic regression approach, so bias and RMSD were not computed for this approach.
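The percent exact agreement, bias, and RMSD defined above can be sketched in Python as follows. The κ function uses the standard unweighted definition of Cohen (1960), (p_o - p_e) / (1 - p_e), and returns it on its usual 0 to 1 scale; whether the article rescales κ to a percent scale is not assumed here. The function names and the small data vectors are hypothetical.

```python
import math


def exact_agreement(actual, imputed):
    """Percentage of examinees whose imputed and actual classes are identical."""
    return 100 * sum(a == b for a, b in zip(actual, imputed)) / len(actual)


def cohens_kappa(actual, imputed):
    """Unweighted Cohen's kappa (standard definition, 0 to 1 scale)."""
    n = len(actual)
    p_o = sum(a == b for a, b in zip(actual, imputed)) / n
    levels = set(actual) | set(imputed)
    # Chance agreement from the two marginal distributions.
    p_e = sum((actual.count(c) / n) * (imputed.count(c) / n) for c in levels)
    return (p_o - p_e) / (1 - p_e)


def bias_and_rmsd(actual_tcs, imputed_tcs):
    """Mean and root mean square of the differences (imputed minus actual)."""
    diffs = [i - a for a, i in zip(actual_tcs, imputed_tcs)]
    bias = sum(diffs) / len(diffs)
    rmsd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return bias, rmsd


# Hypothetical classifications and TCSs for a handful of examinees.
actual_class = [1, 2, 3, 3, 2, 1, 3, 2]
imputed_class = [1, 2, 3, 2, 2, 1, 3, 3]
agreement = exact_agreement(actual_class, imputed_class)
kappa = cohens_kappa(actual_class, imputed_class)
bias, rmsd = bias_and_rmsd([10, 20, 30, 40], [12, 18, 31, 39])
```

In this toy example six of the eight classifications agree, so the percent exact agreement is 75, while κ is smaller because part of that agreement is expected by chance.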
Online Appendix B includes R code for imputation of classification using the PMI, linking, linear regression, logistic regression, and MICE approaches for one replication for the case when 100% MC items are missing for several examinees (the steps for simulating missingness are not shown in the code; instead, it is assumed that the missingness has been simulated and the data set was partitioned into a model-building data set that does not include any missing item scores and a test data set that includes missing item scores).
4.3. Results Under the MCAR Missingness Mechanism
Figures 1 and 2 show the RMSDs and percent exact agreements for all the imputation approaches for the five missing score patterns for the four tests for the MCAR mechanism. In these figures, the missing score patterns are shown along the horizontal axis and the accuracy measures are shown along the vertical axis. In each figure, the range of the vertical axis is the same over the four panels. The modeling of nonignorable data (Holman & Glas, 2005), linear regression, cumulative logistic regression, and linking approaches are denoted as MNI, LR, CLR, and Link, respectively. The CBPAs and κ show the same patterns as the exact agreement—Table C1 and Figure C1 in Online Appendix C show their values. The values of bias for all imputation approaches and all missing score patterns were 0.0 up to one decimal place except for being equal to 0.1 for the PMI approach for “100% MC item scores missing” and “100% CR item scores missing” patterns—so the values of bias are not presented in this article.

Figure 1. Root mean squared differences of the imputation approaches for missing completely at random data. Link = linking; LR = linear regression; MNI = modeling of nonignorable data.

Figure 2. Percentage agreement of the imputation approaches for missing completely at random data. MNI = modeling of nonignorable data; LR = linear regression; CLR = cumulative logistic regression; Link = linking.
To put the number of missing item scores into context, the reliability of the PCS on the nonmissing part, as a percentage of the reliability of the TCS, where the reliability is computed using the Feldt-Raju procedure (Qualls, 1995), is provided in Table 3.
Table 3. The Reliabilities as a Percentage of Reliability of the Whole Test When Scores Are Missing Completely at Random
Note. MC = multiple choice.
Figures 1 and 2 show that when data are MCAR,
The extent of the agreement between the actual and imputed TCS and between the actual and imputed classifications increases on average as the number of missing item scores decreases. For a given data set and a given imputation approach, each of percent exact agreement, CBPA, and κ increases and RMSD decreases as the percentage of missing CR item scores decreases from 100 to 50 to 25.
The values of percentage agreement for Test A3 are smaller overall than those for the other tests. This is presumably due to the smaller interitem correlations (see Table 2) and smaller percentage of examinees in the two extreme proficiency levels for Test A3 compared to the other tests. The values of percentage agreement are larger in general for Test A2, presumably due to the larger interitem correlations and large score reliability for the test.
For each test, the performances of the imputation approaches do not differ much, with larger differences being observed for more missing item scores. For example, for Test A1, the values of percentage agreement for the approaches vary over a narrow range of 87 to 90 for 25% missing CR item scores, while they vary over a wider range of 60 to 68 for 100% missing CR item scores (see Figure 2).
The MICE approach seems to lead to the most accurate imputation overall, closely followed by the MNI and regression approaches. The other approaches did not perform much worse, except for the PMI approach, which was considerably worse than the other approaches. A primary contributor to the accurate imputation for all approaches, presumably, is that the multiple correlation coefficient of the TCS on the nonmissing item scores is larger than .93 on average for all the conditions.
4.4. Results Under the MAR Missingness Mechanism
The performance of any imputation approach under the MAR mechanism was very similar to that under the MCAR mechanism. The bias, RMSD, and proportion agreement for any combination of imputation approach, missing score pattern, and test were the same up to two decimal places over the MCAR and MAR missingness mechanisms. The similarity of the results over the MCAR and MAR mechanisms is expected from the discussion in Finch (2008), Graham (2009, p. 553), and Schafer and Graham (2002), who asserted that the imputation approaches that operate under the assumption that data are MAR perform quite well under the MAR mechanism.
4.5. Results Under the MNAR Missingness Mechanism
The means of the TCS for the model-building data (which comprised 95% of the sample examinees) and the test data (5% of the sample examinees) under the MNAR mechanism for Test A1 were 82.6 and 74.3, respectively, for
Figures 3 through 5 show the bias, RMSD, and percent exact agreements for all the imputation approaches for the five missing score patterns for the four tests for the MNAR mechanism when the correlation (

Figure 3. Bias of the imputation approaches for missing not at random data with

Figure 4. Root mean squared differences of the imputation approaches for missing not at random data with

Figure 5. Percentage agreement of the imputation approaches for missing not at random data with
Figures 6 through 8 show the bias, RMSD, and percent exact agreements for all the imputation approaches for the five missing score patterns for the four tests for the MNAR mechanism when

Figure 6. Bias of the imputation approaches for missing not at random data with

Figure 7. Root mean squared differences of the imputation approaches for missing not at random data with

Figure 8. Percentage agreement of the imputation approaches for missing not at random data with
The figures show that the MNI approach is the most accurate, followed by the MICE approach, in imputing missing classifications for MNAR data. A comparison of Figures 1 and 4 and of Figures 2 and 5 shows that for any given missing data pattern, the RMSD or percent exact agreement for any imputation approach other than the MNI approach under a moderate extent of nonignorability (
However, a comparison of Figures 1 and 7 and of Figures 2 and 8 shows that for any given missing data pattern, the RMSD or percent exact agreement for any approach except for the MNI approach is considerably worse under the MNAR mechanism than under the MCAR mechanism for the stronger nonignorability condition (
The relative ranking of the imputation approaches with respect to RMSD is similar to that with respect to exact agreement, but the differences in RMSDs between the approaches are often larger than the differences in exact agreement. For 100% missing CR or MC item scores and for Tests A2 or B, the RMSD for the PMI approach is occasionally more than twice that of the MICE approach (Figures 1, 4, and 7), but Figures 2, 5, and 8 show that the percent exact agreement of the PMI approach for the two tests is at least 90% of that of the MICE approach. This phenomenon is presumably due to the loss of information resulting from the conversion of the TCSs to a few classifications before computing the percent exact agreement.
4.6. Discussion on the Results of the Comparison of the Imputation Approaches
The results from the above comparisons have the practical implication that if a practitioner has reasons to believe that the missing data are MNAR and is willing to use an approach that is complex and computation-intensive, then the MNI approach (Holman & Glas, 2005) should be the method of choice. If the practitioner thinks that the missing data are more likely to be MAR, then they should choose the MICE approach if computational burden is not an issue and the linear regression approach if a simple approach is preferred (Sijtsma & van der Ark, 2003, emphasized the importance of using simple approaches in operational practice). A couple of approaches that are more computation-intensive than linear regression, namely the IRT approach and the logistic regression approach, provided no substantial benefits over the regression approach. Further analysis showed that most incorrect classifications occurred when the individuals were in the middlemost classes. Also, more than 98% of the incorrect classifications were off by only one class.
5. Conclusions and Recommendations
Six imputation approaches were brought to bear on the problem of imputing classifications of examinees with incomplete data. These approaches and the operational approach currently used for four high-stakes classification tests were compared with respect to their accuracy in imputing classifications using data from the corresponding tests. When the data were MNAR, an approach based on modeling the nonignorable missingness (Holman & Glas, 2005) led to the most accurate imputation of classification. An approach based on MI—the MICE approach (Raghunathan et al., 2001)—led to the most accurate imputation of the classification for data that were MCAR or MAR. All approaches except the simple PMI approach performed better than the operational approach. Given that the operational approach (based on linking) is not based on a (statistical) prediction model and the other approaches are, this article demonstrates how statistical methods can be used to improve upon operational practice in an important problem in educational testing.
The differences in the accuracy of classification between the imputation approaches were small for MCAR and MAR missingness and a moderate extent of nonignorability, a result that is in agreement with the finding of various imputation approaches leading to similar estimates in Finch (2008), Huisman and Molenaar (2001), Sinharay (2021b), and Xiao and Bulut (2020) and is an outcome of the large correlations between the item scores on the tests. The difference between the imputation approaches was larger when the extent of nonignorability was strong, that is, when the correlation between examinee ability and missingness propensity was set equal to −.8.
The accurate imputation of the classification for the multiple linear regression approach (which performed only slightly worse than the best-performing MICE and MNI approaches) for MCAR and MAR missingness and a moderate extent of nonignorability should be good news to practitioners and operational testing programs given the simplicity and ubiquitous nature of linear regression. Researchers such as Sarle (1998) noted that the literature on missing data deals almost exclusively with estimation and that the use of linear regression to impute missing data typically leads to poor estimation. Sarle (1998) also noted, however, that when prediction of values is required in the presence of missing data (as in the context of this article), linear regression may lead to excellent predictions. Therefore, the satisfactory performance of the multiple linear regression approach in this article may not be surprising.
One important and somewhat surprising finding in this article is the accurate imputation from several of the imputation approaches including the linear regression and MICE approaches for up to a moderate extent of nonignorability even though these approaches did not explicitly model the missingness. This finding is essentially an outcome of the high interitem correlations, agrees with similar findings in van Ginkel et al. (2010) and Xiao and Bulut (2020), and has the practical implication that the consequences of MNAR missingness in the context of imputing classifications may not be as disastrous as in other contexts (such as in Enders, 2011, who found MAR-based approaches to be highly biased for MNAR data in latent growth curve analysis), at least when faced with a small to moderate extent of missing item scores. Box and Draper (1987, p. 54) commented that all models are wrong, but some are useful. The results of this article indicate that MAR-based approaches may be wrong but are useful when applied to impute classifications in the presence of up to 50% MNAR data when the extent of nonignorability is small to moderate.
One practical implication of the results from the comparison study is that practitioners may use the MICE or the regression-based approach if computational complexity is an issue and (a) there is a small to moderate extent of missing item scores or (b) the practitioner strongly believes that the extent of nonignorability is not large. Note that it is possible to get some idea of the extent of nonignorability for item-response data using approaches used by, for example, Holman and Glas (2005) and Rose et al. (2010). However, if computational complexity is not an issue, then practitioners should use the MNI approach that would protect them from the worst scenario of strong nonignorability in the data.
The practitioners who would like to adopt an approach to impute classifications in the face of missing item scores for their own data sets could perform a simulation study as performed in this article to compare various approaches. Ideally, they would start with a representative subset of examinees with no missing item scores and then artificially generate missing data according to the mechanism that is anticipated to have occurred for the data, where the missingness pattern is what they actually observed for their data (i.e., if they want to impute classifications of some examinees with scores missing on Items 1 through 10 on a 50-item test, then they should start with the examinees with no missing item scores, simulate missing scores on Items 1 through 10 under different assumptions about the missingness mechanism, and impute their classifications using various approaches). It would then be straightforward to compare the various approaches because the actual classifications are available for the subset of examinees that they started with.
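The design of such a simulation study can be sketched in Python as follows. Everything in the sketch is hypothetical: the simulated item scores, unit weights, and cut scores are invented; the ability-dependent selection is only a crude stand-in for nonignorable missingness and is not the Holman and Glas (2005) model; and the two imputation functions (a regression on the observed items and a person-mean-style rescaling of the observed score) are simplified illustrations rather than the approaches as implemented in this article.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical complete data: 3,000 examinees, 50 items scored 0/1, unit weights.
n, n_items = 3000, 50
ability = rng.normal(size=n)
difficulty = rng.uniform(-2, 2, size=n_items)
items = (rng.random((n, n_items)) < 1 / (1 + np.exp(-(ability[:, None] - difficulty)))).astype(float)
tcs = items.sum(axis=1)
cuts = np.array([19.5, 29.5])          # hypothetical cut scores giving classes 1, 2, 3
observed_items = np.arange(10, n_items)  # Items 1 through 10 are the ones set to missing

def classify(score):
    return np.searchsorted(cuts, score) + 1

def select_mcar(frac):
    """Every examinee is equally likely to have missing item scores."""
    return rng.random(n) < frac

def select_ability_dependent(frac):
    """Crude nonignorable stand-in: lower ability means a higher chance of missingness."""
    p = 1 / (1 + np.exp(ability))
    return rng.random(n) < p * frac / p.mean()

def impute_regression(train, test):
    """Regress the TCS on the observed items in the complete cases, then predict."""
    X = np.column_stack([np.ones(train.sum()), items[train][:, observed_items]])
    beta, *_ = np.linalg.lstsq(X, tcs[train], rcond=None)
    return np.column_stack([np.ones(test.sum()), items[test][:, observed_items]]) @ beta

def impute_rescale(train, test):
    """Scale the observed-part score up to the full test length (training data unused)."""
    return items[test][:, observed_items].sum(axis=1) * n_items / len(observed_items)

results = {}
for mech, selector in [("MCAR", select_mcar), ("ability-dependent", select_ability_dependent)]:
    test = selector(0.05)              # about 5% of examinees get missing scores
    actual = classify(tcs[test])       # known because the missingness is artificial
    for name, impute in [("regression", impute_regression), ("rescale", impute_rescale)]:
        imputed = classify(impute(~test, test))
        results[(mech, name)] = 100 * float(np.mean(imputed == actual))

for key, value in results.items():
    print(key, round(value, 1))
```

In practice, the selection functions would be replaced by the mechanism the practitioner anticipates, and the loop would be repeated over many replications before the approaches are compared.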
One important consideration regarding imputation of classifications is the uncertainty inherent in the imputed classifications, its impact on the decisions based on the classifications, and the ways to report the uncertainty to the score users. One way to report the uncertainty would involve the reporting of the probabilities of the various classifications, which can be readily computed from the cumulative logistic regression approach described earlier. For example, for 100% missing MC items and for Test B, the probabilities of classifications of 1, 2, and 3, respectively, are .94, .04, and .02 for one examinee, while they are .59, .24, and .17 for another examinee. While both of them would receive a reported classification of 1, the uncertainty associated with the classification is much smaller for the first examinee than for the second. However, a problem with classification probabilities is that they may not be easily understood by the score users.
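The comparison of the two examinees can be made concrete with a short Python sketch that uses the probabilities quoted above. Taking the reported classification to be the most probable class and summarizing the uncertainty by one minus the largest probability (or by the entropy) are assumptions made here for illustration; the article does not prescribe a particular summary.

```python
import numpy as np

# Classification probabilities quoted in the text for classes 1, 2, and 3.
classes = np.array([1, 2, 3])
probs = np.array([[0.94, 0.04, 0.02],    # first examinee
                  [0.59, 0.24, 0.17]])   # second examinee

# Assumed reporting rule: report the most probable class.
reported = classes[np.argmax(probs, axis=1)]

# Probability, under the fitted model, that the reported class is not the actual class.
p_other = 1.0 - probs.max(axis=1)

# Entropy of each probability vector; larger values mean more uncertainty.
entropy = -np.sum(probs * np.log(probs), axis=1)

for i in range(len(probs)):
    print(f"Examinee {i + 1}: class {reported[i]}, P(other class) = {p_other[i]:.2f}")
```

Both examinees are assigned class 1, but the probability of a different class is .06 for the first examinee and .41 for the second, which is the contrast described in the text.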
Although the findings of this article may have important practical implications, this article has several limitations, and, consequently, it is possible to perform future research in several related areas. First, one could compare the imputation approaches using more data sets, both simulated and real, preferably from other types of classification tests including tests that (a) have smaller reliability and/or interitem correlation compared to the tests considered in this article, (b) include items with low psychometric quality (e.g., data from field trials), and (c) include testlets, with missingness depending on testlet membership. Second, it is possible to employ, in future comparison studies, more advanced approaches such as other model-based approaches that apply when data are MNAR (e.g., Enders, 2011; Glas & Pimentel, 2008; Rose et al., 2017) and data mining methods (e.g., Hastie et al., 2009). However, these approaches, especially those designed for the MNAR data, are case-specific and may not be easy or practical to use for large-scale tests due to their complexity. Third, the operational rules for handling missing item scores (such as rules for treating omitted and not-reached responses and threshold for imputation) for the tests were not questioned in this article, but future research could examine whether such rules are too strict or too liberal. Fourth, the results from this study apply only to tests for which the composite score is a linear combination of the item scores and the weights in the linear combination are known in advance—future research could evaluate the comparative performance of the imputation approaches for tests in which the composite score is not a linear combination of the item scores or the weights in the linear combination are not known in advance. 
Fifth, the MNI approach performed well under the specific MNAR mechanism (the one based on the model of Holman & Glas, 2005) that was considered here and may not perform as well under other MNAR mechanisms. Finally, it is possible to examine whether the relative performance of the imputation approaches is similar over relevant demographic subgroups.
Supplemental Material
Supplemental Material, sj-docx-1-jeb-10.3102_10769986211051379 for Reporting Proficiency Levels for Examinees With Incomplete Data by Sandip Sinharay in Journal of Educational and Behavioral Statistics
Footnotes
Acknowledgments
The author wishes to express sincere appreciation and gratitude to Steven Culpepper, the editor, and the four anonymous reviewers for their helpful comments. The author would also like to thank Gautam Puhan, Hongwen Guo, and Sooyeon Kim for their helpful comments on an earlier version.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The author prepared the work as an employee of Educational Testing Service. Any opinions expressed in this publication are those of the author and not necessarily of Educational Testing Service.
