Abstract
The semi-generalized partial credit model (Semi-GPCM) has been proposed as a unidimensional method for modeling not-applicable and neutral scale responses, and it has been suggested that the model may also be useful for handling missing data in scale items. The purpose of this study is to evaluate how well the unidimensional Semi-GPCM recovers person parameters from item response data in the presence of item-level missingness, and to compare its performance with two other methods for handling such missingness: a multidimensional modeling approach and full information maximum likelihood estimation. The results indicate that the Semi-GPCM performs acceptably in an absolute sense when 30% or less of the item data are missing but does not outperform the other two methods under any of the studied conditions. We conclude with a discussion of when practitioners may or may not want to use the Semi-GPCM to recover person parameters from item response data with missingness.
Introduction and Proposed Model
Missing data are common in quantitative studies and can bias parameter estimates when not handled appropriately. A core difficulty in dealing with missing data in social science research is that the underlying missingness mechanism is often unknown (Schafer & Graham, 2002). Depending on the underlying mechanism, missing data are divided into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR; Little & Rubin, 2014). MCAR data are missing data that are independent of any observed or unobserved variables. In this case, parameter estimates are expected to be unbiased. MAR data are missing data that depend on one or more of the observed variables in the data set. In this case, by fully accounting for the relationship between the missing data and the observed variables, researchers should still be able to obtain unbiased parameter estimates. MNAR data, also called nonignorable missing data, are missing data that are related to one or more unobserved variables.
When the three missingness mechanisms are applied to scale item data intended to measure a unidimensional latent trait, the definitions can be made more specific. MCAR remains consistent with the above definition; that is, any item-level missingness is considered MCAR if the missingness is random. MAR is defined as missingness within a scale item that is related to an observed variable external to the scale itself (e.g., a demographic variable). MNAR can be defined in two ways. First, item-level missingness that is related to a variable external to the scale (e.g., a demographic variable) that was not measured in the study can be classified as MNAR. Second, item-level missingness that is related to the latent trait that the scale is attempting to measure is also classified as MNAR, since the latent trait is unobserved. In this study, we consider the second type of MNAR item-level data.
One simplified way of handling missing item-level data is to calculate summated scores (rather than latent trait scores) from the scale data, while handling the missingness within the summation in some particular way, such as listwise deletion. However, such approaches are generally not preferred because they can bias parameter estimates (Enders, 2010) and discard available information, for example, when full cases are deleted. In a National Center for Education Research report, Ruby and Doolittle (2010) found that less than 5% of individual survey item data were missing, yet listwise deletion at the scale level would have excluded almost one third of the respondent sample. Broadly speaking, researchers should aim to handle missing data at the item level rather than at the total scale level.
Huggins-Manley et al. (2018) proposed a semi-generalized partial credit model (Semi-GPCM) to deal with not-applicable (NA) scale responses. They considered NA as a valid response option and analyzed it together with the other common Likert-type scale data using the Semi-GPCM, noting very specific cases of unidimensionality that would have to hold for such a modeling approach. This model was developed based on the relationship between the generalized partial credit model (GPCM; Muraki, 1992) and the nominal response model (NRM; Bock, 1972; e.g., Huggins-Manley & Algina, 2015; Thissen & Steinberg, 1986).
Let yij represent a response to item i by participant j; k represent a particular ordered response category in a polytomously scored scale item; and K represent the total number of ordered response options;
for the first response category in an item with more than two ordered response options,
for an ordered response category k = g, where g is an ordered response category other than the first category selected by a respondent, and
for the nominal response category h, which is the category for missing data in this study.
One notable property of this model is its assumption of unidimensionality underlying the scale item responses: the Semi-GPCM assumes that the nominal category (here, the missing data on the item) is a function of the underlying latent trait (θ).
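For readers who want the structure of the model in symbols, one parameterization consistent with the description above treats the Semi-GPCM as a nominal response model in which the ordered categories carry GPCM constraints and the nominal category h has its own slope and intercept. The symbols a_i, b_iv, a_ih, and c_ih are notation assumed here for illustration and may differ from the notation of the original equations.

```latex
% Sketch under assumed notation: item i, participant j,
% ordered categories k = 1, ..., K, nominal (missing) category h.
\begin{align*}
z_{i1}(\theta_j) &= 0, \\
z_{ik}(\theta_j) &= \sum_{v=2}^{k} a_i(\theta_j - b_{iv}), \qquad k = 2, \dots, K, \\
z_{ih}(\theta_j) &= a_{ih}\theta_j + c_{ih}, \\
D_i(\theta_j) &= 1 + \sum_{k=2}^{K} \exp\{z_{ik}(\theta_j)\} + \exp\{z_{ih}(\theta_j)\}, \\
P(y_{ij} = 1 \mid \theta_j) &= \frac{1}{D_i(\theta_j)}, \\
P(y_{ij} = g \mid \theta_j) &= \frac{\exp\{z_{ig}(\theta_j)\}}{D_i(\theta_j)}, \qquad g = 2, \dots, K, \\
P(y_{ij} = h \mid \theta_j) &= \frac{\exp\{z_{ih}(\theta_j)\}}{D_i(\theta_j)}.
\end{align*}
```

Because the same trait value appears in every linear predictor, the probability of the nominal category is a function of the trait being measured, which is the unidimensionality property described above.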
Purpose
The purpose of this study is to evaluate the ability of the Semi-GPCM to estimate unbiased person traits from polytomous scale data in which some item responses are missing. We evaluated the model-based parameter estimates with both absolute and relative bias criteria under a variety of data conditions that may influence the criteria in social science research studies that make use of scale data. We achieved this purpose through two simulation studies.
The first simulation study (Study 1) examined whether the Semi-GPCM can perform well under three different missingness mechanism conditions: MCAR, MAR, and MNAR. We hypothesized that the model is particularly suited to handling MNAR item-level data, which would meet a critical need in applied measurement because this type of missing item data tends to be the most challenging to handle (Finch, 2008; Rose et al., 2010). We evaluated the model performance in light of various test lengths, sample sizes, proportions of missing data, and magnitudes of correlational relationships of the missing data to observed and unobserved variables. The research questions are as follows:
In Study 2, we compared person parameter recovery between the Semi-GPCM and two other methods for estimating person parameters from scale data with missing item responses: fitting a multidimensional item response theory (MIRT) model that contains a secondary latent trait intended to represent the propensity to not respond to items, and implementing full-information maximum likelihood (FIML) in the presence of the missing data. Hence, the second simulation addresses the following research question:
Background on FIML and MIRT
Holman and Glas (2005) proposed four general MIRT models for handling nonignorable item-level missing data, partially based on previous studies (e.g., Bock et al., 1988; Moustaki & Knott, 2000; O’Muircheartaigh & Moustaki, 1999). Unlike the Semi-GPCM, MIRT requires the creation of a missing data indicator matrix in which the elements (rij) are defined as
Assuming local independence, the four models from Holman and Glas (2005) are
and
where
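As a small illustration of the missing data indicator matrix that the MIRT approach requires, the sketch below builds one from a response matrix. The coding direction (1 = observed, 0 = missing), the use of `None` for a missing response, and the row-per-participant layout are assumptions made here; the source definition of rij may use the opposite coding.

```python
def missing_indicator(responses):
    """Build an indicator matrix from a response matrix.

    Assumed coding: 1 if the response is observed, 0 if it is missing (None).
    Rows are participants and columns are items (a layout chosen for this sketch).
    """
    return [[0 if value is None else 1 for value in row] for row in responses]


# Hypothetical 5-point responses for three participants on three items.
responses = [
    [3, None, 5],
    [None, None, 2],
    [1, 4, 4],
]
indicator = missing_indicator(responses)
```

In a MIRT analysis of this kind, the indicator matrix is modeled alongside the observed responses, so it has the same dimensions as the response matrix.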
Unlike MIRT, using FIML estimation to handle missing data does not require researchers to compare different models and does not explicitly require researchers to have a good understanding of their missing data. Moreover, FIML is the default estimation approach for analyzing measurement data in the structural equation modeling (SEM) software Mplus (L. K. Muthén & Muthén, 2015), which arguably makes it an easy method to apply. Unlike standard maximum likelihood (ML) estimation, FIML first computes a casewise likelihood function using all the information available for the particular case, and then accumulates the casewise likelihood functions across the entire sample (Arbuckle, 1996), such that
and
where Cj is the constant based on the number of observations for participant j;
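The casewise accumulation described above can be sketched in a few lines. This is a minimal illustration that assumes multivariate normal variables, as in the usual normal-theory presentation of FIML, with a known mean vector `mu` and covariance matrix `sigma`; the function names and the `None` coding for missing values are choices made for this sketch, not part of the study's Mplus analyses.

```python
import math


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def casewise_loglik(y, mu, sigma):
    """Normal log-likelihood for one case, using only its observed variables."""
    observed = [i for i, value in enumerate(y) if value is not None]
    if not observed:
        return 0.0
    resid = [y[i] - mu[i] for i in observed]
    sub_sigma = [[sigma[i][j] for j in observed] for i in observed]
    lower = cholesky(sub_sigma)
    # Forward substitution: solve lower * z = resid, so z'z is the quadratic form.
    z = []
    for i in range(len(observed)):
        partial = sum(lower[i][k] * z[k] for k in range(i))
        z.append((resid[i] - partial) / lower[i][i])
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(len(observed)))
    quad = sum(value * value for value in z)
    # The constant depends on how many variables this case actually observed.
    constant = -0.5 * len(observed) * math.log(2.0 * math.pi)
    return constant - 0.5 * log_det - 0.5 * quad


def fiml_loglik(data, mu, sigma):
    """Sum the casewise log-likelihoods across the sample."""
    return sum(casewise_loglik(row, mu, sigma) for row in data)


mu = [0.0, 0.0]
sigma = [[1.0, 0.5], [0.5, 1.0]]
data = [[1.0, 1.0], [1.0, None]]
total = fiml_loglik(data, mu, sigma)
```

The second case contributes a likelihood based on its single observed variable, so no case is deleted and no value is imputed.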
General Method
Overview
We used the same approach to simulate the missing data for both Study 1 and Study 2, but applied different models to analyze them. Below we detail the data generation method that applies to both studies.
Data Generation
All data sets were generated in R (R Core Team, 2013) and were varied on the basis of the four factors mentioned in Research Question 2 (test length, sample size, the proportion of missing data, and correlation magnitudes between the missing data and particular variables). The crossing of the four factors resulted in 432 unique simulation conditions (see Table 1). For the missing correlation factor, when the correlation was set to zero, we considered the missing data as MCAR. In MAR conditions, the correlation was not fixed to zero and the missing data were correlated at a specific magnitude with a secondary latent trait.
The Count of Different Combinations of the Conditions for the Data Simulation.
Note. MCAR = missing completely at random; MAR = missing at random; MNAR = missing not at random.
We used the GPCM (see Equations 1-3) as the data generating model for creating 100 complete data sets under each condition. Each item was generated to have five ordered categories, mimicking a common 5-point Likert-type scale measurement scenario. First, we generated the person parameters (θ).
Next, a binary missing indicator matrix
and
where
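To make the three mechanisms concrete, the following sketch imposes missingness on a complete response matrix through a continuous propensity that is correlated with a chosen driver variable. It is an illustration under stated assumptions and not the generation procedure used in the study: the threshold rule, the function `impose_missingness`, the placeholder responses (random integers rather than GPCM-generated data), and the interpretation of `rho` as the correlation between the continuous propensity and the driver are all hypothetical choices.

```python
import math
import random
from statistics import NormalDist


def impose_missingness(responses, driver, rho, proportion, rng):
    """Set cells to None so that about `proportion` of them are missing.

    Each cell gets a standard normal propensity that correlates `rho` with the
    person-level `driver`; the cell is missing when the propensity exceeds the
    normal quantile matching the target proportion (assumed rule).
    """
    threshold = NormalDist().inv_cdf(1.0 - proportion)
    noise_weight = math.sqrt(1.0 - rho ** 2)
    result = []
    for row, person_driver in zip(responses, driver):
        new_row = []
        for value in row:
            propensity = rho * person_driver + noise_weight * rng.gauss(0.0, 1.0)
            new_row.append(None if propensity > threshold else value)
        result.append(new_row)
    return result


def missing_rate(data):
    """Proportion of missing cells in a response matrix."""
    cells = [value for row in data for value in row]
    return sum(value is None for value in cells) / len(cells)


rng = random.Random(2024)
n_persons, n_items = 2000, 10
theta = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]      # primary trait
secondary = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]  # secondary trait
# Placeholder 5-point responses; a real study would generate these from the GPCM.
complete = [[rng.randint(1, 5) for _ in range(n_items)] for _ in range(n_persons)]

mcar = impose_missingness(complete, theta, 0.0, 0.3, rng)      # unrelated to anything
mar = impose_missingness(complete, secondary, 0.5, 0.3, rng)   # driven by secondary trait
mnar = impose_missingness(complete, theta, 0.5, 0.3, rng)      # driven by primary trait
```

Under this rule, the only difference between the MAR and MNAR data sets is which variable drives the propensity, which mirrors the distinction drawn in the text between a secondary latent trait and the trait being measured.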
Study 1
Data Analysis
We first fit the GPCM to the complete data sets to obtain baseline results. For the incomplete data sets, we fit the Semi-GPCM in Equations 1, 2, and 3, treating the missing data as a nominal category. For all conditions, we ran the analyses in batch in Mplus (L. K. Muthén & Muthén, 2015), called from R with the MplusAutomation package (Hallquist & Wiley, 2018).
Person parameter recovery was evaluated using three criteria: bias, root mean square error (RMSE), and the Pearson correlation between the estimated latent trait (θ̂) and the true latent trait (θ).
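The three recovery criteria can be sketched under their conventional definitions: bias as the mean of the estimate minus the true value, RMSE as the square root of the mean squared difference, and the Pearson correlation between the two vectors. Whether the study averaged within or across replications is not shown here, so the single-data-set form below is an assumption, and the trait values are made-up numbers.

```python
import math


def recovery_metrics(theta_true, theta_hat):
    """Return (bias, rmse, correlation) for one set of trait estimates."""
    n = len(theta_true)
    errors = [est - true for est, true in zip(theta_hat, theta_true)]
    bias = sum(errors) / n
    rmse = math.sqrt(sum(err * err for err in errors) / n)
    mean_true = sum(theta_true) / n
    mean_hat = sum(theta_hat) / n
    cross = sum((t - mean_true) * (h - mean_hat)
                for t, h in zip(theta_true, theta_hat))
    ss_true = sum((t - mean_true) ** 2 for t in theta_true)
    ss_hat = sum((h - mean_hat) ** 2 for h in theta_hat)
    correlation = cross / math.sqrt(ss_true * ss_hat)
    return bias, rmse, correlation


# Hypothetical true and estimated traits for four respondents.
theta_true = [-1.0, 0.0, 1.0, 2.0]
theta_hat = [-0.8, 0.1, 0.9, 1.8]
bias, rmse, correlation = recovery_metrics(theta_true, theta_hat)
```

The example also shows why the criteria are reported together: the estimates are shrunken toward the mean, so bias is essentially zero while RMSE is not.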
A multiple regression analysis was conducted for each criterion to explore which simulation factors affected it. Missing type (MCAR, MAR, MNAR) was treated as a categorical factor and all other simulation factors as continuous variables. All possible interactions among the factors were included in the analysis. We omitted the main effect of the missing correlation factor because its meaning varied across MAR, MNAR, and MCAR conditions. In MAR conditions, the missing data were correlated with the secondary latent trait.
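The results tables report effect sizes as the percentage of variance explained (η² × 100). As a simplified illustration of that quantity, the sketch below computes η² for a single categorical factor as the between-group sum of squares divided by the total sum of squares. The study's regressions included several factors and their interactions, so this one-factor version is an assumption made for clarity, and the RMSE values are made-up numbers rather than study results.

```python
def eta_squared(groups):
    """Eta squared for one categorical factor.

    `groups` maps each factor level to a list of outcome values.
    """
    values = [value for level_values in groups.values() for value in level_values]
    grand_mean = sum(values) / len(values)
    ss_total = sum((value - grand_mean) ** 2 for value in values)
    ss_between = sum(
        len(level_values) * (sum(level_values) / len(level_values) - grand_mean) ** 2
        for level_values in groups.values()
    )
    return ss_between / ss_total


# Made-up RMSE values by missing type, for illustration only.
rmse_by_type = {
    "MCAR": [0.30, 0.32, 0.31],
    "MAR": [0.40, 0.42, 0.41],
    "MNAR": [0.36, 0.37, 0.35],
}
effect = eta_squared(rmse_by_type)
percent_explained = 100.0 * effect
```

With very large numbers of replications, almost any effect reaches statistical significance, which is why an effect size of this kind is useful alongside the p value.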
Results
Complete Data Sets
Table 2 shows the descriptive statistics of person parameter recovery when the GPCM was fit to the complete data sets. The bias was small for all conditions, with the largest absolute bias no greater than .008. Neither the number of items nor the sample size meaningfully affected the bias: F(1, 1196) = 0.64, p = .42,
Descriptive Statistics of the Bias, RMSE, and Correlation Under Baseline (Complete Data) Conditions in Study 1.
Note. RMSE = root mean square error.
Data Sets With Missingness
Table 3 shows a summary of the multiple regression analyses evaluating the person parameter recovery when using the Semi-GPCM, including only the main and interaction effects associated with statistically significant and practically meaningful results, as defined above. Bias across all the conditions ranged from −.28 to .26. None of the model predictors had a meaningful impact on bias. RMSE was affected by the interaction of the missing proportion and the missing type, F(2, 41960) = 4988.44, p < .001,
Percentage of Variance Explained (η² × 100) and Statistical Significance Results of the Multiple Regression Predictors Under Missing Data Conditions in Study 1.
Note. RMSE = root mean square error.
*p < .05. **p < .01. ***p < .001.
Figure 1 shows the meaningful interaction effects from Table 3 for the RMSE and correlation criteria only, as the simulation factors were not meaningfully predictive of bias. The Semi-GPCM performed well under most conditions except when missingness was MAR with a missing proportion of .5. Overall, performance worsened as the missing proportion increased. Person parameter recovery under MCAR was better than under MAR and MNAR at all levels of missing proportion, and recovery under MAR was worse than under MNAR when the missing proportion was .5 (Figure 1A and C). Longer test lengths were associated with better person parameter recovery according to the RMSE and correlation criteria (see Figure 1B and D).

Person parameter recovery using the Semi-GPCM in Study 1.
Discussion
The Semi-GPCM provided acceptable person parameter recovery for all the conditions except for those with 50% of the data missing. Bias was small across all conditions and was not affected by any simulation factor. The maximum absolute value of bias was .28 when using the Semi-GPCM and all bias was in a confined range, which was consistent with the results shown in previous studies (e.g., B. Muthén et al., 1987; Reise & Yu, 1990). In terms of the other two criteria, larger missing proportions of data and fewer numbers of items led to worse person parameter recovery, whereas the sample size did not show a meaningful impact.
The magnitudes of the person parameter recovery results in this study aligned with previous research fitting the graded response model (GRM) to complete data sets. For example, Reise and Yu (1990) conducted a simulation study of parameter recovery in the GRM and found that the correlations between trait estimates and true traits were over .85 across all conditions when the sample size was greater than 500. We found similar results in this study, even with sample sizes as low as 200; the sample size had little influence on trait parameter recovery.
We found interaction effects between study factors and the missingness type. As the missing proportion increased, MNAR and MAR conditions had increasingly worse person parameter recovery as compared with MCAR conditions. When the missing proportion increased to .5, person parameter recovery under MNAR was better than under MAR. This aligned with our hypothesis that, due to the underlying unidimensionality assumption of the Semi-GPCM, the studied approach to missing data is better suited to MNAR data conditions than to MAR data conditions. Study 1 supports the notion that missing data that are a function of the trait of interest can be handled well by fitting the Semi-GPCM. As MNAR can be one of the more difficult forms of missing item data to handle, this is an advantage of the proposed modeling approach when MNAR is the theorized and empirically supported missing data mechanism.
Study 2
In Study 2, we compared the person parameter recovery across three different methods: fitting the Semi-GPCM, fitting a multidimensional IRT model, and using FIML while fitting the GPCM.
Method
Data were generated using the same method as Study 1. However, we excluded the sample size of 200 due to convergence issues when fitting multidimensional IRT models under small sample sizes. Multidimensional IRT and FIML were used to analyze the data. Compared with multidimensional IRT, using FIML while fitting the GPCM was a straightforward approach in which we used all the available information in the data set (see Equations 9 and 10). However, applying multidimensional IRT required us to make some decisions before interpreting the results, such as which model to choose and how to identify the model. In our case, we chose model G2 from Holman and Glas (2005; see Equation 6) based on the match it has to our data generation process. Specifically, our missing propensity is conditioned on the latent trait of the missing data (
Just as in Study 1, all the data were generated in R and analyzed in Mplus (L. K. Muthén & Muthén, 2015), with 100 data sets per condition. Our evaluation criteria included bias, RMSE, and the correlation between the estimated latent trait (θ̂) and the true latent trait (θ).
Percentage of Variance Explained (η² × 100) by Simulation Factors in the Multiple Regression Analysis for Study 2.
Note. RMSE = root mean square error.
*p < .05. **p < .01. ***p < .001.
Results
Similar to the Semi-GPCM, the bias was relatively small across all the conditions using multidimensional IRT and FIML, with a range from −.18 to .16, and −.16 to .30, respectively, and not impacted by the test length. The RMSE and correlation evaluation criteria, on the other hand, were both heavily influenced by the main effect of test length (i.e., number of items), with F(1, 94380) = 158584.9, p < .001,

Person parameter recovery comparison using Semi-GPCM, FIML, and multidimensional-IRT.
Of core interest to Research Question 3, the analysis models (i.e., Semi-GPCM, multidimensional IRT, and GPCM with FIML) performed differently across different missing proportions of data, with F(2, 94380) = 44279.32, p < .001,

The correlation between the true latent trait and the estimated latent trait using Semi-GPCM, FIML and multidimensional-IRT when MCAR (A), MAR (B), and MNAR (C).
Discussion
Similar to Study 1, bias was small across all the conditions and not affected by test length. RMSE and correlation performance, on the other hand, were both affected by the test length (i.e., number of items). Among all factors, the test length played a relatively prominent role, whereas the sample size did not heavily influence the primary latent trait estimates. This finding may be due to the sufficient sample sizes we chose, but it is noteworthy that research on handling missing data with multidimensional IRT or FIML shows similar findings. Specifically, Holman and Glas (2005) used multidimensional IRT to deal with nonignorable missing data, and their results showed that the mean square error of parameter estimates was influenced more by the number of items than by the sample size. Enders and Bandalos (2001) explored how sample size, missing proportion, and factor loadings influenced parameter estimate bias when using FIML in SEM, and their results indicated that the sample size heavily influences bias when it is small but much less so when it is over 500.
Returning to Study 2, we explored how the choice of analysis model affects person parameter recovery. We found a significant interaction between the analysis model and the proportion of missing data for both bias and RMSE, and a three-way interaction among analysis model, missing proportion, and missing data type. For bias, only FIML showed slightly more positive bias when the missing proportion was .5 (Figure 2D). However, the maximum absolute value of the FIML bias was .03, which is often tolerable when the true person parameter ranges from −3 to 3. Overall, for RMSE and correlation, MIRT and FIML performed as well as or better than the Semi-GPCM, especially when the proportion of missing data was large. When data were MCAR, all three models performed equally well on the correlation criterion: the average correlation between true theta and estimated theta stayed near .9 for all three models even when the missing proportion was .5 (see Figure 3A). This aligns with overarching theories on MCAR data and parameter recovery in statistical models (Rubin, 1976). FIML and multidimensional IRT often showed similar performance across the MCAR, MAR, and MNAR conditions. However, the performance of the Semi-GPCM was noticeably worse than that of the other approaches when the data were MAR. This aligns with the unidimensional assumption of the Semi-GPCM, which relates the missing category only to the primary latent trait. In both Study 1 and Study 2, the Semi-GPCM results were largely acceptable in an absolute sense when the missing proportion was .3 or less. However, the relative performance did not support the use of the Semi-GPCM. Next, we discuss instances in which practitioners may want to use the Semi-GPCM, although in many instances other methods for handling missing data may be preferred.
Conclusion
Among all three analysis methods that we studied, using FIML while fitting the GPCM to polytomous item response data with missingness seems to be the most user-friendly method that can produce acceptable trait recovery results both in an absolute and relative sense. No missing data preprocessing is required for FIML and it is readily available in many statistical packages, including Mplus. However, there are times when practitioners not only want to recover parameters but also want to have information on the missing data itself. FIML does not provide substantive information about the missing data, which is in contrast to the other two studied methods. Based on the definition of multidimensional IRT, the relationship between the nuisance latent trait (i.e., the latent trait that is correlated with the missing data) and the primary latent trait of interest can inform researchers about the missing data mechanism and, indirectly, the relationship between missing data and the primary trait of measurement.
In terms of the Semi-GPCM, the item parameters for the missing data category (coded as a nominal category in the model) can directly inform researchers of the relationship between the missing data and the latent trait of interest. Figure 4 presents item characteristic curves (ICCs) from three iterations of our simulation study, one under each of the MCAR, MAR, and MNAR mechanisms. The gray lines represent the nominal category (i.e., the missing data category), showing the probability of missing data as a function of the underlying trait (i.e., theta on the x-axis). As shown in the figure, the missing data category is not strongly related to the trait under the MCAR and MAR conditions, which aligns with the simulated conditions. In contrast, when the data were generated as MNAR, with missingness related to the trait of measurement, the category response function shows a clear relationship between the missing data and the trait. In this case, respondents with higher trait levels were more likely to have missing data on the item. These example ICCs demonstrate that applying the Semi-GPCM for missing data in polytomous scale items can directly inform researchers about this critical relationship between the trait they are measuring and the propensity for respondents to provide missing data.
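A curve of this kind can be traced numerically. The sketch below evaluates the missing-category probability over a grid of theta values under an assumed parameterization in which the ordered categories follow GPCM constraints and the nominal category has a free slope and intercept. The function name and all parameter values are hypothetical and are not estimates from the study.

```python
import math


def semi_gpcm_probs(theta, slope, thresholds, nominal_slope, nominal_intercept):
    """Category probabilities: ordered categories first, nominal category last.

    Assumed form: the first ordered category has linear predictor 0, each later
    ordered category adds slope * (theta - threshold), and the nominal category
    has predictor nominal_slope * theta + nominal_intercept.
    """
    predictors = [0.0]
    for threshold in thresholds:
        predictors.append(predictors[-1] + slope * (theta - threshold))
    predictors.append(nominal_slope * theta + nominal_intercept)
    peak = max(predictors)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in predictors]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical item: five ordered categories plus one missing-data category.
thresholds = [-1.5, -0.5, 0.5, 1.5]
grid = [step / 2 for step in range(-6, 7)]  # theta from -3 to 3
missing_curve = [
    semi_gpcm_probs(theta, 1.0, thresholds, 5.0, -3.0)[-1] for theta in grid
]
```

Under this parameterization, the missing-category curve increases across the whole trait range whenever the nominal slope exceeds the cumulative slope of the highest ordered category (here 5.0 versus 4.0), which is the pattern described for the MNAR example.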

Item characteristic curves based on item parameters estimated from the Semi-GPCM for three different simulation iterations.
While multidimensional IRT models can also be used to evaluate indirectly the degree to which missing data are related to the trait of interest, we argue that doing so poses more practical challenges than fitting the Semi-GPCM. Specifically, one must first choose an appropriate multidimensional IRT model out of a family of models, which would likely require model comparisons (Rose et al., 2010). Deciding among the available multidimensional IRT models can be time-consuming and, sometimes, subjective. In our simulations, we were able to match the multidimensional model choice to the data generation choice, but practitioners do not have this luxury. Hence, we view our study results as the best-case scenario for the multidimensional IRT modeling approach, and had we chosen a more mismatched multidimensional IRT model for Study 2, we hypothesize that the parameter recovery would have been worse for that method. This hypothesis is supported by the fact that even when we used the appropriately specified multidimensional IRT model, we ran into some issues. Specifically, when we freely estimated the correlation between the two traits for the MCAR data, the analysis hit a saddle point and the estimates for some items were unrealistically large. We therefore fixed the correlation to zero for MCAR conditions but allowed it to be freely estimated for MAR and MNAR conditions. These choices are much easier to make in a simulation, where we know when the data are MCAR, than in practice. In addition, the sample size needed for multidimensional IRT was relatively large. Due to the complexity of the model, the number of parameters estimated using multidimensional IRT is almost twice the number estimated when fitting the GPCM or Semi-GPCM. Practitioners may not be able to consider MIRT methods for missing item data under particular sample sizes.
Overall, applying the Semi-GPCM to polytomous scale data with missingness can produce unbiased person parameter estimates in some measurement conditions, and the model results can provide useful information for practitioners trying to evaluate the missing data mechanism. However, FIML or multidimensional IRT may be preferred in particular measurement conditions and can provide a similar or better degree of accuracy in person estimates.
All our study results should be interpreted in light of limitations. First, multiple imputation (MI) is a known, viable option for handling missing data both inside (Finch, 2008) and outside (Enders, 2010) of unidimensional latent trait measurement, and it was not considered in this study. MI is a very general term for a process that involves many particular decisions, and hence there was no single comparison arm that we could justify as representing MI methods in our study. Second, readers should be very attuned to the fact that we generated our MNAR data based on the assumption that the latent trait itself was the unobserved variable that explained missingness patterns. The performance of the Semi-GPCM in recovering trait parameters under MNAR conditions may be quite different if the MNAR mechanism were due to some other unobserved variable (e.g., test-taker motivation that was not measured by the researcher).
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
