Abstract
Large amounts of missing data could distort item parameter estimation and lead to biased ability estimates in educational assessments. Therefore, missing responses should be handled properly before estimating any parameters. In this study, two Monte Carlo simulation studies were conducted to compare the performance of four methods in handling missing data when estimating ability parameters. The methods were full-information maximum likelihood (FIML), zero replacement, and multiple imputation with chained equations utilizing classification and regression trees (MICE-CART) and random forest imputation (MICE-RFI). For the two imputation methods, missing responses were also treated as a valid response category to enhance the accuracy of imputations. Bias, root mean square error, and the correlation between true ability parameters and estimated ability parameters were used to evaluate the accuracy of ability estimates for each method. Results indicated that FIML outperformed the other methods under most conditions. Zero replacement yielded accurate ability estimates when missing proportions were very high. The performances of MICE-CART and MICE-RFI were quite similar, but these two methods appeared to be affected differently by the missing data mechanism. As the number of items increased and missing proportions decreased, all the methods performed better. In addition, the information on missing data could improve the performance of MICE-RFI and MICE-CART when the data set is sparse and the missing data mechanism is missing at random.
Missing data are a common issue in educational assessments, and they may occur for a variety of reasons (Shi et al., 2019). For example, respondents may inadvertently forget to answer some items, or they may not have enough time to answer some items at the end of an examination due to test speededness. It is also possible that respondents prefer to omit some items because they are unsure about the right answer. Since the presence of missing data could yield negative consequences such as bias in parameter estimates and a decrease in statistical power (Roth, 1994), researchers have developed several techniques to handle missing responses. In practice, omitted items are often regarded as wrong answers (i.e., zero replacement), based on the logic that respondents would have answered the item to get a score if they had really known the right answer (De Ayala et al., 2001). In addition to zero replacement, full-information maximum likelihood (FIML) and multiple imputation (MI) methods have also been highly recommended for dealing with missing responses (Schafer & Graham, 2002). Recently, the capabilities of MI have been further improved as MI and recursive partitioning methods were combined within a multivariate imputation by chained equations (MICE) framework (Van Buuren, 2007).
In item response theory (IRT), respondents’ ability levels are estimated based on their responses to a set of items. The presence of missing responses can have a detrimental influence on the parameter estimates when the methods to handle missing data are not suitable (Andreis & Ferrari, 2012). Therefore, previous studies have compared the performance of different methods including FIML, MI, and zero replacement for dealing with missing data within the IRT framework (e.g., Culbertson, 2011; De Ayala et al., 2001; Finch, 2008). Recently, there has been a growing interest in employing data mining methods (e.g., random forest, classification and regression trees) to handle missing data. Previous research showed that data mining methods outperformed traditional methods for handling missing data when estimating item parameters for IRT models (e.g., Andreis & Ferrari, 2012; Edwards & Finch, 2018); however, no studies made a thorough comparison of data mining methods and traditional missing data techniques for ability estimation in the context of IRT.
The purpose of the current study is twofold. First, this study aims to investigate the performance of different methods for handling missing data when estimating IRT ability parameters under several conditions. In the first simulation study, we compared the accuracy of ability estimates when missing responses were handled by FIML, zero replacement, MICE with classification and regression trees (MICE-CART), and MICE with random forest imputation (MICE-RFI). The simulation conditions were sample size, test length, missing data proportion, and missing data mechanism. Depending on the missing data mechanism, the missingness itself can also carry useful information, because it is related either to other observed values or to the values of the missing data themselves. Therefore, the second purpose of the current study is to investigate whether incorporating missing responses into the imputation process with MICE could yield more precise ability estimates. The second simulation study was an extension of the first, in which we recoded missing responses as a new response category and used them in the imputation to improve the performance of MICE-CART and MICE-RFI. As in the first simulation study, the accuracy of the resulting ability estimates was evaluated under different test lengths, missing data mechanisms, and missing proportions.
Literature Review
Missing Data Mechanisms
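Handling missing data properly is an important task when estimating item and person parameters from a sparse response matrix. The decision regarding how missing data should be dealt with depends on several factors, one of which is the missing data mechanism. Previous studies have provided comprehensive discussions of three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR); interested readers are encouraged to see Rubin (1976) and Graham (2012) for a comprehensive treatment of missing data analysis. To formulate the missing data mechanisms, let $Y = (Y_{obs}, Y_{mis})$ denote the complete response data, partitioned into its observed and missing parts, and let $R$ be the indicator matrix marking which responses are missing. Under MCAR, missingness depends on neither part of the data, $P(R \mid Y_{obs}, Y_{mis}) = P(R)$. Under MAR, missingness depends only on the observed part, $P(R \mid Y_{obs}, Y_{mis}) = P(R \mid Y_{obs})$. Under NMAR, $P(R \mid Y_{obs}, Y_{mis})$ still depends on $Y_{mis}$ after conditioning on $Y_{obs}$.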
In practice, when researchers assume that missingness in response data occurs by chance, this condition represents MCAR (Huisman & Molenaar, 2001). If a respondent skips some items due to not knowing the correct answer, this condition may be an example of either MAR or NMAR, depending on the assumption regarding the underlying missingness mechanism (Finch, 2008). Specifically, the condition is considered MAR when missing values are related to other measured variables such as a respondent’s responses to other items (Sulis & Porcu, 2017). For example, respondents’ ability levels could be reflected by the number of correct answers if estimated ability levels were partly related to missing responses (De Ayala et al., 2001). De Ayala et al. (2001) argued that highly proficient respondents only omitted items for which they did not know the answer. However, less proficient respondents were unable to distinguish items well and, in addition to skipping unknown items, they might also skip items that they could have answered correctly if they had spent enough time on them. Unlike MAR, NMAR occurs when missing data are directly related to the value of the missing variable itself (Edwards & Finch, 2018). For example, items that respondents are expected to answer incorrectly are more likely to be skipped. In large-scale assessments, both MAR and NMAR are commonly seen. When the missing data mechanism is either MCAR or MAR, both FIML and MI could produce highly accurate parameter estimates (Little & Rubin, 2002). In addition, the zero replacement method assumes unanswered items are unknown to respondents, which might be either MAR or NMAR. Therefore, all these missing data mechanisms (MCAR, MAR, and NMAR) were investigated in the current study.
Methods to Handle Missing Responses
As previously mentioned, omitted items are often regarded as wrong answers (i.e., zero replacement) in large-scale educational assessments. For example, in PISA (Programme for International Student Assessment), TIMSS (Trends in International Mathematics and Science Study), and PIRLS (Progress in International Reading Literacy Study), missing responses are typically scored as incorrect when estimating abilities for students (Martin et al., 2007; Organisation for Economic Co-operation and Development, 2009). This approach assumes that respondents who skip some items on the test do not have adequate proficiency to find the correct answer and thus their missing responses should be considered incorrect. For a dichotomous item, the correct answer would be recoded as 1, the wrong or omitted answers would be recoded as 0 by using the zero-replacement method. However, previous research suggests that when the zero replacement method is used to handle missing data, this could result in highly biased estimates (Finch, 2008; Mislevy & Wu, 1996) because respondents may have to skip some items for different reasons, such as lack of test-taking engagement, anxiety, and test speededness. Therefore, treating missing responses as incorrect could lead to the underestimation of respondents’ true ability levels.
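As a minimal sketch of the zero-replacement rule described above, the Python snippet below scores omitted responses as 0 and contrasts the result with the proportion correct among answered items. The function names, the use of `None` for an omitted response, and the example respondent are assumptions made for illustration; they are not taken from the scoring procedures of PISA, TIMSS, or PIRLS.

```python
from typing import List, Optional

Response = Optional[int]  # 1 = correct, 0 = incorrect, None = omitted (assumed coding)


def zero_replace(responses: List[List[Response]]) -> List[List[int]]:
    """Score every omitted response (None) as incorrect (0)."""
    return [[0 if r is None else r for r in row] for row in responses]


def proportion_correct(row: List[Response]) -> float:
    """Proportion correct among the items the respondent actually answered."""
    answered = [r for r in row if r is not None]
    return sum(answered) / len(answered)


# Hypothetical respondent: 3 of the 4 answered items are correct, 2 items omitted.
raw = [[1, 1, None, 0, 1, None]]
scored = zero_replace(raw)

before = proportion_correct(raw[0])          # based on answered items only
after = sum(scored[0]) / len(scored[0])      # omissions now count as wrong
```

For this hypothetical respondent the proportion correct drops from 0.75 among the answered items to 0.50 after zero replacement, which illustrates how the method can pull ability estimates downward when omissions do not reflect a lack of knowledge.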
FIML is one of the most commonly used approaches to deal with missing data. It uses the maximum likelihood algorithm with all available data to estimate parameters, instead of replacing or imputing missing values (Eekhout et al., 2015). With FIML, respondents who have missing values in item A would be ignored when estimating item A’s parameters. But, if they respond to item B, their information in item B would still be used for item B’s parameter estimates. FIML could handle the estimation of parameters and their standard errors in a single step, which is more efficient and effective compared with data imputation methods (Graham, 2009). Furthermore, previous research showed that FIML tends to yield unbiased parameter estimates when the type of missingness is either MCAR or MAR (Enders & Bandalos, 2001). Finally, FIML is the default missing data technique in most IRT software programs, making this method convenient to use in practice (Edwards & Finch, 2018).
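The core idea above, that only the responses a person actually gave enter the likelihood, can be sketched in a few lines of Python. This is an illustration under assumptions, not the marginal maximum likelihood routine that IRT software runs: the item parameters are treated as known, the 3PL response function is used without a scaling constant, and all names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Item = Tuple[float, float, float]  # (a, b, c): discrimination, difficulty, guessing


def p_correct(theta: float, item: Item) -> float:
    """3PL probability of a correct response (no scaling constant assumed)."""
    a, b, c = item
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def observed_loglik(theta: float, responses: List[Optional[int]],
                    items: List[Item]) -> float:
    """Log-likelihood of one response pattern using observed responses only."""
    total = 0.0
    for r, item in zip(responses, items):
        if r is None:
            continue  # a missing response contributes nothing to the likelihood
        p = p_correct(theta, item)
        total += math.log(p) if r == 1 else math.log(1.0 - p)
    return total


# Hypothetical three-item test; the second item was skipped.
items = [(1.0, 0.0, 0.2), (1.2, 0.5, 0.2), (0.8, -0.5, 0.2)]
with_gap = observed_loglik(0.3, [1, None, 0], items)
two_items = observed_loglik(0.3, [1, 0], [items[0], items[2]])
```

Here `with_gap` and `two_items` coincide: dropping the skipped item from the test gives the same likelihood as leaving it in as missing, so no value ever has to be replaced or imputed.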
To date, MI has been widely used for handling missing data (e.g., Leacy et al., 2017; Rezvan et al., 2015). Following a Bayesian approach, MI creates multiple plausible data sets, with missing values replaced by imputed values, and then appropriately combines results from each of the plausible data sets (Sterne et al., 2009). Although several adaptations of MI have been proposed in the literature (e.g., Sulis & Porcu, 2008; Van Buuren & Oudshoorn, 1999), in this study we only discuss MICE-based methods. The MICE framework assumes that data are drawn from a multivariate distribution, and each incomplete variable can be imputed by iteratively sampling from a conditional distribution (Van Buuren & Oudshoorn, 1999). Specifically, let I = (I1, I2, . . ., Ik) be a set of k items where each of the items may be partially observed. When the missing data mechanism is MAR, Ik is imputed from the conditional distribution of Ik given the remaining items, P(Ik | I1, . . ., Ik−1).
This approach provides greater flexibility in large data sets when there are different data types to impute. In addition, various estimation algorithms can be adopted within the MICE framework, such as MICE-CART and MICE-RFI. CART refers to classification and regression trees. It constructs a predictive model in the form of a decision tree by using a set of predictors and cut points to split the sample into several subgroups (Friedman et al., 2001). The splitting process is repeated several times to produce the most accurate model with the best splits and the most homogeneous subgroups. In each subgroup (also called a leaf), the outcome variable can be either categorical or continuous, and imputed values are drawn from the conditional distribution of the outcome for the set of predictor values that satisfy the cut points (Akande et al., 2017). Van Buuren (2012) mentioned that the CART method shows its strengths when dealing with outliers, nonlinear relationships, and nonnormal distributions. These strengths also make CART a suitable method for data imputation. RFI refers to random forest imputation. Unlike CART, which creates only a single decision tree and therefore often yields high variance across samples (Hastie et al., 2009), RFI creates a number of decision trees. RFI fits each tree to a bootstrap sample of the data, considers a random subset of predictors for the splits, and aggregates the results across trees to obtain a more stable predictive model. If there is a collinearity issue due to high correlations among predictors, RFI can address this issue by using highly correlated predictors in different iterations (Hayes et al., 2015).
Previous studies suggest that using either CART or RFI within the MICE framework could get unbiased parameter estimates with appropriate confidence intervals (Doove et al., 2014; Shah et al., 2014). In the context of IRT, MICE-RFI and MICE-CART were found to produce more accurate estimates for item parameters (Edwards & Finch, 2018), but no literature discussed their performance in the context of ability estimation. In this study, we assumed that respondents skipped items because they did not know the correct answers. Under this assumption, the presence of missing responses for a given item can inform the imputation of other items with missing responses. To utilize missing responses in the imputation process, missing responses can be recoded as a separate response category and included as predictors when replacing missing responses with imputed values. In the second simulation study, we applied this strategy to the imputation procedures with MICE-CART and MICE-RFI, which were denoted as MICE-CART2 and MICE-RFI2.
Simulation Study 1
Data Generation
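The first Monte Carlo simulation study was conducted using the mirt (Chalmers, 2012) and mice (Van Buuren & Groothuis-Oudshoorn, 2010) packages in R (R Core Team, 2019). Data were generated under the 3PL (three-parameter logistic) IRT model. According to the 3PL model, the probability of a respondent with the ability parameter $\theta_j$ answering item $i$ correctly is
$$P(X_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]},$$
where $a_i$ is the discrimination parameter, $b_i$ is the difficulty parameter, and $c_i$ is the pseudo-guessing (lower asymptote) parameter of item $i$.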
The first study generated three types of missing data: MCAR, MAR, and NMAR. For MCAR, the desired proportions of missingness (5%, 15%, 30%, or 40%) were created by randomly replacing original responses with missing responses. The approach for generating MAR and NMAR was borrowed from previous research (Edwards & Finch, 2018; Enders, 2004; Finch, 2008). For MAR, we divided the respondents into two groups based on whether their number of correct responses fell above or below the mean value. Missing responses were then generated with a lower missing proportion in the high-ability group and a higher missing proportion in the low-ability group, such that the average missing proportion equaled the desired missing proportion. For example, under the 40% missing data condition, about 30% of the responses were set to missing in the high-ability group and around 50% in the low-ability group, so that the average missing proportion was 40%. These missing data were MAR because the missing values were related to an observed variable, namely the number of correct responses to the nonmissing items. It should be noted that we did not use ability parameters for the MAR condition since missing data related to the latent traits could also be considered as NMAR cases (e.g., Rose et al., 2010; Sulis & Porcu, 2017).
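The MCAR and MAR generation steps described above can be sketched as follows. This is a simplified Python illustration, not the R code used in the study: the helper names, the `gap` parameter (the distance of each group's rate from the target rate, set to 0.10 to mirror the 30%/50% example), the use of the complete-data number-correct score for grouping, and the toy response matrix are all assumptions.

```python
import random
from typing import List, Optional

Matrix = List[List[Optional[int]]]


def mask_cells(data: Matrix, rows: List[int], rate: float, rng: random.Random) -> None:
    """Set a share `rate` of the cells in the given rows to None, chosen at random."""
    cells = [(i, j) for i in rows for j in range(len(data[0]))]
    for i, j in rng.sample(cells, round(rate * len(cells))):
        data[i][j] = None


def missing_rate(data: Matrix, rows: List[int]) -> float:
    """Share of missing cells among the given rows."""
    cells = [data[i][j] for i in rows for j in range(len(data[0]))]
    return sum(c is None for c in cells) / len(cells)


def make_mcar(complete: List[List[int]], rate: float, rng: random.Random) -> Matrix:
    """MCAR: every cell has the same chance of being masked."""
    data: Matrix = [list(row) for row in complete]
    mask_cells(data, list(range(len(data))), rate, rng)
    return data


def split_by_mean_score(complete: List[List[int]]):
    """Split respondents at the mean number-correct score (assumed grouping rule)."""
    scores = [sum(row) for row in complete]
    mean_score = sum(scores) / len(scores)
    high = [i for i, s in enumerate(scores) if s > mean_score]
    low = [i for i, s in enumerate(scores) if s <= mean_score]
    return high, low


def make_mar(complete: List[List[int]], rate: float, gap: float,
             rng: random.Random) -> Matrix:
    """MAR: lower missing rate for the high-score group, higher for the low-score group."""
    data: Matrix = [list(row) for row in complete]
    high, low = split_by_mean_score(complete)
    mask_cells(data, high, rate - gap, rng)
    mask_cells(data, low, rate + gap, rng)
    return data


# Toy complete data: 200 respondents, 20 dichotomous items (not the 3PL design).
rng = random.Random(2020)
probs = [rng.uniform(0.2, 0.9) for _ in range(200)]
complete = [[int(rng.random() < p) for _ in range(20)] for p in probs]

mcar = make_mcar(complete, 0.40, rng)
mar = make_mar(complete, 0.40, 0.10, rng)
high, low = split_by_mean_score(complete)
```

Note that the overall MAR missing proportion equals the target only when the two groups are of similar size; a split at the mean does not guarantee this, so a production version would need to rebalance the two group rates.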
Data generation with NMAR was similar to that of MAR. For NMAR, each examinee’s incorrect responses in the initial data set were assigned a higher probability of being missing, while correct responses were assigned a lower probability of being missing. The mean missing proportions of the entire data set were equal to the desired proportions (i.e., 5%, 15%, 30%, or 40%). Again, we use the 40% missing data condition to describe the simulation process. For each item, around 50% of the incorrect responses and around 30% of the correct responses were set to missing, so that the average missing proportion was equal to 40%. This was considered NMAR because missing responses were directly related to their true values.
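A comparable sketch for the NMAR step is given below. It masks, item by item, a larger share of the incorrect responses than of the correct ones, using the 50%/30% rates from the example above. The function names and the toy data are assumptions, and the code is a simplified Python illustration rather than the study's R implementation.

```python
import random
from typing import List, Optional

Matrix = List[List[Optional[int]]]


def make_nmar(complete: List[List[int]], rate_incorrect: float,
              rate_correct: float, rng: random.Random) -> Matrix:
    """NMAR: the chance of a response being masked depends on its own value."""
    data: Matrix = [list(row) for row in complete]
    for j in range(len(complete[0])):
        wrong = [i for i, row in enumerate(complete) if row[j] == 0]
        right = [i for i, row in enumerate(complete) if row[j] == 1]
        for i in rng.sample(wrong, round(rate_incorrect * len(wrong))):
            data[i][j] = None
        for i in rng.sample(right, round(rate_correct * len(right))):
            data[i][j] = None
    return data


def missing_share(complete: List[List[int]], data: Matrix, value: int) -> float:
    """Share of masked cells among cells whose true response equals `value`."""
    cells = [(i, j) for i, row in enumerate(complete)
             for j, r in enumerate(row) if r == value]
    return sum(data[i][j] is None for i, j in cells) / len(cells)


# Toy complete data: 300 respondents, 10 items, about half the responses correct.
rng = random.Random(7)
complete = [[int(rng.random() < 0.5) for _ in range(10)] for _ in range(300)]
nmar = make_nmar(complete, rate_incorrect=0.50, rate_correct=0.30, rng=rng)

share_wrong = missing_share(complete, nmar, 0)
share_right = missing_share(complete, nmar, 1)
```

Because roughly half of the toy responses are correct, the overall missing proportion lands near 40%; with a different proportion of correct responses the two rates would have to be adjusted to hit the same target.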
Data Analysis
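For the MICE-CART and MICE-RFI methods, the current study conducted 20 iterations to impute five data sets in each replication. MICE-RFI grew 10 trees, which was the default number in the mice package. This study selected expected a posteriori (EAP) to estimate ability parameters. EAP is a noniterative technique based on the numerical evaluation of the mean and variance of the posterior distribution of ability (Bock & Mislevy, 1982). The formula for the EAP estimation can be expressed as
$$\hat{\theta}_j = \frac{\sum_{k=1}^{q} X_k\, L_j(X_k)\, W(X_k)}{\sum_{k=1}^{q} L_j(X_k)\, W(X_k)},$$
where $X_k$ is the $k$th of $q$ quadrature points on the ability scale, $L_j(X_k)$ is the likelihood of respondent $j$’s response pattern evaluated at $X_k$, and $W(X_k)$ is the weight of the prior ability distribution at $X_k$.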
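The outcomes of interest in this study were the comparisons between true ability parameters (based on the complete data sets) and estimated ability parameters (based on the data sets obtained by using different methods to handle missing values). Bias, root mean square error (RMSE), and the Pearson correlation between true ability parameters and estimated ability parameters were used to evaluate the precision of ability estimates for each condition. Bias was computed as
$$\text{Bias} = \frac{1}{N}\sum_{j=1}^{N} (\hat{\theta}_j - \theta_j),$$
and RMSE was computed as
$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{j=1}^{N} (\hat{\theta}_j - \theta_j)^2},$$
where $N$ is the sample size, $\hat{\theta}_j$ is the estimated ability parameter of respondent $j$ after the missing responses were handled, and $\theta_j$ is the true ability parameter of respondent $j$. These values were averaged across replications within each condition.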
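The scoring and evaluation steps in this section can be illustrated with a short Python sketch. It computes an EAP estimate by summing over an equally spaced grid of quadrature points with a standard normal prior, skips missing responses in the likelihood, and then evaluates bias, RMSE, and the Pearson correlation. The grid (61 points on [−4, 4]), the prior, the sign convention for bias (estimate minus true value), and all names are assumptions; the study itself relied on R packages.

```python
import math
from typing import List, Optional, Sequence, Tuple

Item = Tuple[float, float, float]  # (a, b, c): discrimination, difficulty, guessing


def p_correct(theta: float, item: Item) -> float:
    """3PL probability of a correct response (no scaling constant assumed)."""
    a, b, c = item
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def eap(responses: Sequence[Optional[int]], items: Sequence[Item],
        n_nodes: int = 61) -> float:
    """Posterior mean of ability over a grid, with a standard normal prior."""
    nodes = [-4.0 + 8.0 * k / (n_nodes - 1) for k in range(n_nodes)]
    num = den = 0.0
    for x in nodes:
        weight = math.exp(-0.5 * x * x)  # unnormalized N(0, 1) density
        like = 1.0
        for r, item in zip(responses, items):
            if r is None:
                continue  # missing responses are skipped
            p = p_correct(x, item)
            like *= p if r == 1 else 1.0 - p
        num += x * like * weight
        den += like * weight
    return num / den


def bias(est: Sequence[float], true: Sequence[float]) -> float:
    """Mean signed error, estimate minus true value (assumed sign convention)."""
    return sum(e - t for e, t in zip(est, true)) / len(true)


def rmse(est: Sequence[float], true: Sequence[float]) -> float:
    """Root mean square error between estimated and true values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(est, true)) / len(true))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical three-item test used only to exercise the functions.
items: List[Item] = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
theta_all_correct = eap([1, 1, 1], items)
theta_all_wrong = eap([0, 0, 0], items)
theta_no_data = eap([None, None, None], items)
```

When every response is missing, the estimate falls back to the prior mean of zero, which is one reason EAP estimates shrink toward the centre of the scale as the response matrix becomes sparser.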
Results of Simulation Study 1
Missing Completely at Random Results
Figure 1 shows the RMSE results of ability estimates for MCAR across different simulation conditions. As the missing proportion increased, RMSE increased for all methods. Zero replacement was the least accurate method with respect to RMSE. It produced the largest RMSE value when the sample size was 500, the test length was 20, and the missing proportion was 40% (RMSE = 0.583). In general, FIML produced much smaller RMSE values, but when the sample size was 1,000, the test length was 60, and the missing proportion was 5%, the performance of MICE-RFI (RMSE = 0.078) was slightly better than FIML (RMSE = 0.079), followed by MICE-CART (RMSE = 0.088) and zero replacement (RMSE = 0.177). Although MICE-CART performed slightly better than MICE-RFI under most conditions, differences in the RMSE results were quite small. However, MICE-RFI always produced lower RMSE values than MICE-CART when the test length was 20, the sample size was 500, regardless of missing proportions. In general, increasing the test length improved the performance of missing data handling methods, but the sample size did not appear to affect RMSE results.

Figure 1. Average RMSE values across the different simulation conditions when missing data type is MCAR.
Figure 2 shows the correlation values between true and estimated ability parameters. Similar to the RMSE findings, zero replacement performed the worst across all simulation conditions. It produced the smallest correlation value when the sample size was 500, the test length was 20, and the missing proportion was 40% (

Figure 2. Average correlation values across the different simulation conditions when missing data type is MCAR.
Missing at Random Results
Figures 3 and 4 present the average RMSE and correlation results for MAR. In MAR, the performance gap between zero replacement and the other three methods narrowed down. Even though zero replacement performed the worst under most conditions, it was able to outperform MICE-RFI when the missing proportion was 40%, the test length was 20, and the sample size was either 500 or 1,000. Especially when the sample size was 500, the missing proportion was 40%, the test length was 20, MICE-RFI produced the largest RMSE and the lowest correlation values (RMSE = 0.470,

Figure 3. Average RMSE values across the different simulation conditions when missing data type is MAR.

Figure 4. Average correlation values across the different simulation conditions when missing data type is MAR.
Not Missing at Random Results
Figures 5 and 6 show the average RMSE and correlation results for NMAR. In NMAR, zero replacement outperformed the other three methods when missing proportions were 5%, regardless of the sample size and test length. Furthermore, it was the only unbiased method since it produced bias and RMSE values of 0 and the correlation result of 1. Under other conditions, FIML always yielded smaller RMSE and larger correlation values, followed by MICE-CART, MICE-RFI, and last by zero replacement. Similarly, only the missing proportion and test length conditions had a noteworthy impact on the accuracy of ability estimates.

Figure 5. Average RMSE values across the different simulation conditions when missing data type is NMAR.

Figure 6. Average correlation values across the different simulation conditions when missing data type is NMAR.
Simulation Study 2
Data Generation and Analysis
The second Monte Carlo simulation study was designed based on the findings from the first study. Under the MAR and NMAR conditions in the first simulation study, lower ability respondents were more likely to skip items, and items were more likely to be skipped if respondents were not able to answer them correctly. Therefore, the assumption of zero replacement was met more closely and its performance in handling missing data improved substantially. This finding inspired us to improve the performance of the MICE-based methods by using missingness as auxiliary information in the imputation process. Therefore, the second simulation study included two new methods, MICE-CART2 and MICE-RFI2, which considered missing responses as a separate response category. In addition, we removed the sample size conditions, considering that sample size had little to no impact on the results in the first simulation study. To test the performances of MICE-CART2 and MICE-RFI2 more properly, we focused only on higher missing proportions: 30%, 40%, and 70%. The missing proportion of 70% was added to create a new condition in which the response data set is highly sparse and thus handling missing data properly would have significant consequences in terms of ability estimation. The three test length conditions were also included in the second simulation study. However, the second simulation study only focused on the MAR and NMAR conditions because missingness in the response data carried useful information under these two mechanisms. We adopted the same method as in the first simulation study when generating response data with the MAR and NMAR patterns.
For both MICE-CART2 and MICE-RFI2, we followed an iterative process. First, all missing values in dichotomous responses were recoded as a new category “2” except for Item 1, from which the imputation process was initiated. Second, the recoded items were used in the imputation process to replace the missing values of Item 1. Next, the missing values (i.e., Category 2) of Item 2 were turned back into their original status (not available) and imputed by using Item 1 (which now had no missing values) and the other items on the test. This process was repeated until missing data for all items were imputed. For each condition, this study ran 20 iterations to obtain five imputed data sets, whose results were then combined. Ten trees were grown for the MICE-RFI methods. The second simulation study followed the same data analysis and evaluation criteria as the first study.
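The recoding procedure described above can be sketched as a single pass over the items. In this Python illustration, missing values in the predictor items are recoded as category 2, while the item currently being imputed keeps its missing values until they are filled. The `impute_item` hook stands in for the CART or random forest model, and `mode_imputer` is a deliberately trivial placeholder that ignores the predictors; both, like the other names, are hypothetical and are not part of the mice package.

```python
from typing import Callable, List, Optional

Matrix = List[List[Optional[int]]]
MISSING_CODE = 2  # extra response category that marks a skipped item


def predictors_for(data: Matrix, target: int) -> List[List[int]]:
    """All items except `target`, with missing values recoded as category 2."""
    return [[MISSING_CODE if r is None else r
             for j, r in enumerate(row) if j != target]
            for row in data]


def mode_imputer(X: List[List[int]], y: List[Optional[int]]) -> List[int]:
    """Placeholder model: fill missing values with the most frequent observed value.

    A tree-based imputer would use X; this stand-in ignores it on purpose.
    """
    observed = [v for v in y if v is not None]
    fill = 1 if 2 * sum(observed) >= len(observed) else 0
    return [fill if v is None else v for v in y]


def impute_sequentially(
    data: Matrix,
    impute_item: Callable[[List[List[int]], List[Optional[int]]], List[int]],
) -> List[List[int]]:
    """Impute items one at a time; items already imputed enter later steps complete."""
    completed: Matrix = [list(row) for row in data]
    for target in range(len(data[0])):
        X = predictors_for(completed, target)
        y = [row[target] for row in completed]
        for i, value in enumerate(impute_item(X, y)):
            completed[i][target] = value
    return completed  # every cell is now 0 or 1


# Hypothetical 4 x 3 response matrix with skipped items.
data: Matrix = [[1, None, 0], [None, 1, 1], [1, 1, None], [0, None, 1]]
first_step_predictors = predictors_for(data, 0)
completed = impute_sequentially(data, mode_imputer)
```

Passing the recoded matrix `X` to a classifier lets a skipped item act as a predictor in its own right, which is the auxiliary information that MICE-CART2 and MICE-RFI2 are meant to exploit. A full MICE run would repeat such passes over several iterations and several imputed data sets.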
Results of Simulation Study 2
Missing at Random Results
Table 1 shows the mean bias, RMSE, and correlation results across different simulation conditions for MAR. As the test length increased, RMSE decreased and the correlation increased for all missing data handling methods. On the contrary, increasing missing proportions resulted in larger RMSE and smaller correlation values. These findings were consistent with the results of the first simulation study. One new finding was that zero replacement outperformed the other missing data handling techniques when missing proportions were 70%. However, FIML generally performed the best whereas zero replacement performed the worst, compared with the other methods. For most conditions, utilizing missing values as a separate response category improved the performance of MICE-RFI, but it had a negative impact on MICE-CART except when the test had 20 items with 40% missing values or data sets were highly sparse (70% missing data). When the missing proportions were 70% and the test had 60 items, MICE-CART produced the largest bias value (Bias = −0.005). Under 70% missing conditions, only FIML and zero-replacement methods retained the mean bias value close to zero to three decimal places, regardless of the test length (either 20, 40, or 60 items). Under sparse data scenarios, MICE-CART2 always produced lower RMSE and higher correlation values, followed by MICE-CART, MICE-RFI2, and MICE-RFI except for the 20-item condition. In this case, MICE-RFI2 outperformed MICE-CART.
Table 1. Average Bias, RMSE, and Correlation Values for MAR.
Note. RMSE = root mean square error; MAR = missing at random; FIML = full-information maximum likelihood; MICE-CART = multiple imputation with chained equations utilizing classification and regression trees; MICE-RFI = multiple imputation with chained equations utilizing random forest imputation.
Not Missing at Random Results
Mean bias, RMSE, and correlation results across different simulation conditions for NMAR are presented in Table 2. Consistent with previous findings, increasing the number of items and decreasing missing proportions improved the performance of each missing data handling method. FIML outperformed the other missing data handling methods across all conditions, but it yielded a negative bias when the test had 70% missing values and the test length was 60 items (Bias = 0.001). While zero replacement kept the mean bias value close to zero to three decimal places under all conditions, it performed the worst for most conditions except when missing proportions were 70% and the test had 40 or 60 items. Under these two conditions, MICE-RFI2 and MICE-CART2 produced the largest RMSE and the smallest correlation values. In addition, using missing values as a separate category always improved the performance of MICE-RFI but weakened the performance of MICE-CART. When the test had 20 items and 40% missing data, MICE-CART2 produced lower RMSE value (0.456) and higher correlation value (
Table 2. Average Bias, RMSE, and Correlation Values for NMAR.
Note. RMSE = root mean square error; NMAR = not missing at random; FIML = full-information maximum likelihood; MICE-CART = multiple imputation with chained equations utilizing classification and regression trees; MICE-RFI = multiple imputation with chained equations utilizing random forest imputation.
Discussion
A highly important task in educational assessments utilizing IRT is to obtain accurate item and ability parameter estimates; but the existence of missing responses is inevitable, and it would have a detrimental influence on the accuracy of estimated parameters when missing data are not handled properly (Andreis & Ferrari, 2012). While previous studies focused on the effects of missing data handling methods on item parameter estimates (Edwards & Finch, 2018; Finch, 2008) and ability estimates (Culbertson, 2011), no studies made a thorough comparison of traditional missing data handling methods and data mining methods for ability estimates. In this study, we conducted two Monte Carlo simulation studies to compare the performances of missing data handling methods when estimating ability parameters from dichotomous item responses with missing values. In the first simulation study, we selected four missing data handling methods, namely, zero replacement, FIML, MICE-CART, and MICE-RFI. The first two are commonly used methods in the missing data literature and the latter two are the two new methods utilizing the CART and RFI algorithms within the MICE framework. To evaluate the performances of these methods under different data conditions, missing data mechanisms (MCAR, MAR, and NMAR), missing proportions (5%, 15%, 30%, and 40%), test lengths (20, 40, and 60 items), and sample sizes (500, 1,000, and 3,000) were manipulated. The relative performances of the missing data handling methods were evaluated based on RMSE and correlation values between estimated and true ability parameters.
The first simulation study indicated that the missing data mechanism, the proportion of missing data, and test length could influence the accuracy of ability parameters obtained from a sparse response data set. However, the sample size had almost no impact on the results. This result ties well with previous studies wherein sample size has a negligible effect on ability estimation in IRT (Bulut et al., 2017; de la Torre & Song, 2009). Among the four methods for handling missing data, the zero replacement method performed the worst under most conditions. A similar conclusion was reached by previous research on this method (De Ayala et al., 2001). However, the performance of the zero replacement method appeared to improve under the MAR and NMAR conditions. With MAR and NMAR, it was relatively more reasonable to treat omitted responses as incorrect answers, which yielded more accurate ability estimates especially when the missing proportion was very small.
Another important finding from the first simulation study is that MICE-CART outperformed MICE-RFI under most conditions but the difference between the two methods was negligible. A similar finding was also reported by Edwards and Finch (2018) who compared the performance of MICE-CART and MICE-RFI in estimating IRT item parameters from a sparse response data set. A possible explanation for this finding might be that the CART and random forest (RF) algorithms work quite similarly as recursive partitioning methods. CART builds a prediction model based on a single decision tree, while RF creates multiple decision trees based on bootstrapped samples of data and combines the predictions from all decision trees to build a final prediction model. Furthermore, RF inherits most properties of CART, such as outlier handling and the ability to utilize nonlinear relationships in the data. Therefore, the two algorithms appeared to function very similarly within the MICE framework as they created imputed values for missing responses. However, when dealing with large volumes of data, MICE-CART might be a more desirable method for handling missing responses compared with MICE-RFI, due to its lower computational cost.
To further evaluate the effect of using missing values as auxiliary information in the imputation process, a second simulation study was conducted as an extension of the first one. Based on the results of the first simulation study, the second study ignored sample size due to its negligible impact on the accuracy of ability estimates. Also, the second study only focused on the MAR and NMAR conditions, which enabled incorporating valuable information from missing values into the imputation process. Findings indicated that utilizing missing values in the imputation process worked well for MICE-RFI but weakened the accuracy of MICE-CART when missing proportions were either 30% or 40%. One possible reason was that under the MAR and NMAR conditions, systematic missingness in the data appeared to provide valuable information for each decision tree created by MICE-RFI2. Consequently, the decision trees in MICE-RFI2 were more informative and they performed better than those in MICE-RFI. In addition, because MICE-CART only created a single decision tree, this approach could possibly treat low rates of missing responses as noise when constructing a decision tree model. This could explain why the performance of MICE-CART2 tended to improve when the missing proportion went up to 70%. When missing proportions were 70%, the new approach improved MICE-based methods under MAR conditions, but it was not suitable for NMAR conditions with large numbers of items, which might be related to their underlying assumptions. MAR assumed that missing values were related to other observed variables, and thus incorporating the missing information from other variables would be meaningful. However, NMAR assumed missing values were only related to themselves, and thus the missing information from other variables might not be as valuable as it was for MAR, especially for highly sparse data. Consequently, adding irrelevant information to the model appeared to be detrimental to the accuracy of ability estimation.
The results of both simulation studies also indicated that using FIML to deal with missing data could result in higher accuracy in the estimation of ability parameters, compared with the zero replacement and MICE methods under most conditions. One possible reason for FIML performing better than the MICE methods was insufficient numbers of repeated imputations. Previous research examined how many imputations could make MI and FIML equivalent, and at least 20 imputations for each data set were recommended (Graham et al., 2007). In this study, we only conducted five imputations for each data set and obtained very similar results for FIML, MICE-CART, and MICE-RFI. Therefore, it is possible that increasing the number of imputations could yield better results for MICE-CART and MICE-RFI. In addition, this study adopted EAP to estimate ability parameters, but previous research indicated that different missing data conditions could influence the performance of EAP in ability estimation (De Ayala et al., 2001). Therefore, the potential interaction between missing data conditions and the ability estimation procedure might have affected the performance of the methods used for handling missing data in the current study.
The current study has several implications. In educational assessments with high proportions of missing responses (e.g., formative assessments, low-stakes tests), increasing the number of items could improve the accuracy of respondents’ ability estimates. FIML is recommended to handle missing responses when estimating ability parameters from sparse response data sets due to its convenience and high accuracy in handling missing values. However, if researchers intend to obtain a complete data set for further analysis, FIML would be inadequate. Instead, MICE-based methods can provide a complete data set with imputed missing values to conduct further data analysis. When the response data set is highly sparse and the missing data mechanism is not clear, FIML and zero-replacement methods can be adopted to handle missing values. However, when the missing values follow the MCAR mechanism and missing proportions are not high, researchers and practitioners should avoid the zero-replacement method. When the response data set includes relatively high missing proportions and researchers attempt to use MICE-RFI, including missing values as a separate response category in the imputation process could provide more precise results when estimating ability parameters in IRT.
While the current study sought to make a comprehensive comparison of traditional missing data handling methods and data mining imputation methods for ability estimates, there remains room for improvement. First, the current study only discussed CART and RFI as they were readily available in the mice package. Future research should adopt alternative data mining approaches such as stochastic gradient tree boosting and C5.0 (Ramosaj & Pauly, 2017) within the MICE framework and evaluate their performance. Second, the current results were based on the fixed-length assessment scenario in which all test takers responded to the same set of items and test speededness was not considered. Other scenarios, including computerized adaptive tests and the presence of test speededness, should also be investigated in the future. Third, this study only focused on dichotomous items in a unidimensional test scenario. Future studies should examine the performance of the missing data handling methods for ability estimates for polytomous items as well as under a multidimensional test scenario.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
