Abstract
Current research on multiple imputation suggests that including auxiliary variables in the imputation model may increase the accuracy and efficiency of coefficient estimation, yet few studies have actually tested this principle for regression analysis. This article uses data from the 2008 General Social Survey to present results from simulations that vary in three respects: (a) three types of auxiliary variables (variables related to the mechanism of missingness, variables related to the variable or variables being imputed, and extraneous variables); (b) three levels of missing data (10 percent, 20 percent, and 30 percent missing); and (c) two missingness assumptions (missing completely at random and missing at random). Results show that the inclusion of any type of auxiliary variable does not appreciably affect coefficient bias or efficiency in this simulation, regardless of the amount of missing data or the missingness assumption. Hence, the inclusion of auxiliary variables may not be necessary in many analytic situations.
Keywords
Statistical methods to deal with missing data have greatly increased in both number and sophistication over the past 20 years. Multiple imputation (MI) is now considered the leading method for dealing with item-missing data that are missing completely at random (MCAR) or missing at random (MAR; Rubin 1987, 1996; Schafer 1997). Because standard statistical software packages now have built-in routines for implementing MI, its usage is making its way into the analyses of ordinary (but statistically savvy) researchers. Research emphasizes the importance of including in the imputation model all variables from the intended analytic model as well as variables correlated with the potential mechanisms of missingness and variables correlated with the variables to be imputed (hereafter referred to as auxiliary variables) as they improve the capability of the model to predict the missing values (Collins, Schafer, and Kam 2001; Schafer 2003).
Few studies have examined in detail the effect of auxiliary variables on coefficient bias and efficiency, and it is unclear to what extent they are being incorporated into sociological analyses. A scan of recent articles using MI in three sociological journals reveals that few researchers mention either what variables were included in their imputation model or how well their imputation model predicted their variables with missing data in their descriptions of MI implementation. To assess the impact of including auxiliary variables, this article presents results from MI simulations that compare the incorporation of different types of auxiliary variables to the accuracy and efficiency of coefficient estimation with varying levels of missing cases using data from the 2008 General Social Survey (GSS).
Overview
Almost invariably, quantitative social science data contain missing values on one or more variables, but most standard statistical methods have been designed to analyze complete-case data sets. The most widely used method of handling missing data is listwise deletion, that is, deleting cases with missing data. This method is generally easy to implement and satisfactory with small amounts of missing data. Bias and generalizability become key problems when the missing data are not MCAR. The MCAR assumption is met if the probability of missing data on Y is unrelated to the value of Y itself or to the values of any other variable in the data set (Allison 2001). Listwise deletion can reduce statistical power for hypothesis testing due to the reduced sample size but yields acceptable results when the amount of missing data is small and the missing data mechanism is MCAR (Little and Rubin 1987; see also Baraldi and Enders 2010).
In imputation-based approaches, missing values are replaced by estimates based on nonmissing values of other variables. There are several strengths of imputation approaches to handling missing data (for a review, see Graham 2009). First, because cases are not deleted, imputation methods preserve more information than listwise deletion. The imputed data preserve deviation from the mean as well as the shape of the distribution. Imputation approaches also tend to produce less biased estimates than listwise deletion. Explicit model-based procedures base inference on the likelihood under a defined model for the missing data. Commonly used explicit model-based imputation approaches include regression imputation and MI. While the former method fills in a single value for each missing value, MI replaces each missing value with plausible sets of values across multiple data sets that represent the uncertainty about the missing value. Compared to other explicit model-based imputation methods, MI better represents the uncertainty of missing data by maintaining the natural variability and incorporating the uncertainty introduced by the imputation itself. Studies show that MI strategies produce unbiased parameter estimates and are robust to departures from normality assumptions. Furthermore, even with high rates of missing data or a small sample size, MI provides adequate results and valid variance estimates using standard complete data procedures (Rubin 1987). In addition to being applicable when data are assumed to be MCAR, MI can also be used when data are assumed to be MAR. MAR is the assumption that the probability of missing data on a particular variable Y can depend on other observed variables in the data set but not on Y itself (Rubin 1996).
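The way MI carries imputation uncertainty into the final standard errors can be illustrated with the combining rules of Rubin (1987). The following is a minimal sketch, not the procedure of any particular package; the function name pool_rubin and its arguments are hypothetical, and it assumes the analyst already has one coefficient estimate and one squared standard error from each of the M completed data sets.

```python
import math


def pool_rubin(estimates, variances):
    """Pool M completed-data estimates with Rubin's (1987) combining rules.

    estimates: coefficient estimate from each imputed data set
    variances: squared standard error from each imputed data set
    (Hypothetical helper written for illustration only.)
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                                # pooled point estimate
    within = sum(variances) / m                               # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between                    # total variance
    return q_bar, math.sqrt(total)


# Example with three imputations of the same coefficient
estimate, std_error = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what distinguishes MI from single regression imputation: when the imputed data sets disagree, the pooled standard error grows accordingly.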
Despite some practical complexities and seemingly tedious analysis and pooling phases, MI has gained popularity in sociological research as the methodological literature increasingly provides useful and practical guidance for MI, and implementation has become more convenient (Rubin 1987, 1996; Schafer 1997). The number of publications per year using MI began to increase after the 1987 publication of Little and Rubin’s Statistical Analysis With Missing Data and Rubin’s Multiple Imputation for Nonresponse in Surveys and increased substantially following key software advances (see Graham 2009 for review).
To explore whether sociologists are including auxiliary variables or assessing the predictive power of their imputation models, I conducted a review of all quantitative articles which appeared between 2000 and 2008 in three major sociology journals (American Journal of Sociology [AJS], American Sociological Review [ASR], and Social Forces, in alphabetical order). I found that 13 percent of the total articles explicitly addressing missing data (184 articles) utilized MI to deal with missing data, while 46 percent used listwise deletion. This review revealed that few authors provided details about their implementation of MI, such as viability of assumptions, possible missing data mechanisms, variables included in the imputation model or imputation model fit,1,2 although some authors did compare the estimates and/or descriptive statistics obtained from different missing data approaches (e.g., Hipp et al. 2004; Alon and Tienda 2007; Staff and Kreager 2008). In general, it appears sociologists are not including auxiliary variables in their imputation models or paying much attention to how well the imputation model predicts the variables with missing data. This may be because the research supporting the inclusion of auxiliary variables using real data and/or practical analytic situations is relatively scant and not widely disseminated. Given the paucity of research on this topic in realistic contexts, together with the lack of information on implementation of MI in these sociology articles and the apparent lack of standard for reporting MI procedures, this article aims to test the effects of including auxiliary variables in MI models on coefficient bias and efficiency in subsequent analytic models. This article first demonstrates the effects and then suggests guidelines for reporting MI procedures.
Comparable results have been found in two psychology journals (Roth 1994) for 1989–1991 and in political science journals between the period of 1993 and 1997 (King et al. 2001). More specifically, Roth (1994:548) stated that “almost 42 percent of the articles in the Journal of Applied Psychology sample and 77 percent of the Personnel Psychology analyses involving survey data did not explicitly mention if there was any lack of response to individual items or methods to deal with the issue.” One conservative estimate from Roth’s results is that at least 40 percent of the articles coded required attention to missing data (Fichman and Cummings 2003). In political science journals, King et al. (2001) found that only 19 percent of authors explicitly mentioned how they dealt with missing data. Overall, insufficient information on missing data seems to be prevalent.
Literature Review
A key feature of MI is that the imputation phase is operationally separated from subsequent analyses, which is related to the issue of compatibility between imputation model and analysis model. The impact of model compatibility has been investigated by Fay (1992), Meng (1994), and Rubin (1996). Meng found that if the imputation model included all the variables and information from the intended analytic model, no bias was introduced and nominal confidence interval coverage was as great as actual coverage (Fay 1992). Because the bigger risk is leaving out variables rather than including too many, Rubin (1996) suggests including all possible relevant predictors (as many variables as possible) when implementing MI.
In terms of leaving out variables from the analytic model, if a variable Y is imputed under a model that includes the variable X1 but the analytic model contains X1 and X2, the estimated coefficient for X2 in the analytic model will be biased toward zero, because the imputed Y values have no relationship with X2. As such, interactions, squared terms, and other transformed variables need to be included in the imputation model just like other X variables (King et al. 2001; von Hippel 2009). For example, if the missing values of a variable are imputed from a regression model with no interaction, and the analytic model investigates a potential interaction, then the MI estimate of the interaction will be biased toward zero. For compatibility between imputation and subsequent analytic models, variables used in the imputation phase should preserve the association among variables that will be used for postimputation analyses. The converse of this rule, that any variables included in the imputation model should also be included in the analytic model, is not necessary (Schafer and Olsen 1998). Therefore, it is recommended and widely accepted to maintain congeniality between the imputation model and the analytic model at a minimum, which is referred to as a restrictive model (Collins et al. 2001).
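The attenuation described above can be reproduced with a small synthetic experiment. The sketch below is illustrative only: it uses made-up normal data rather than the GSS, a single stochastic regression imputation rather than full MI, and a hypothetical function name. Y depends equally on X1 and X2, half of the Y values are set to missing completely at random, and Y is then imputed either with or without X2 in the imputation model.

```python
import numpy as np


def x2_coefficient_after_imputation(include_x2, n=20000, miss_rate=0.5, seed=0):
    """Return the estimated X2 coefficient after imputing Y (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # true X2 coefficient is 1
    missing = rng.random(n) < miss_rate            # MCAR missingness on Y
    observed = ~missing

    # Imputation model: always X1, optionally X2
    columns = [np.ones(n), x1] + ([x2] if include_x2 else [])
    z = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(z[observed], y[observed], rcond=None)
    resid = y[observed] - z[observed] @ beta
    sigma = resid.std(ddof=z.shape[1])

    # Stochastic regression imputation: prediction plus random noise
    y_imputed = y.copy()
    y_imputed[missing] = z[missing] @ beta + rng.normal(scale=sigma, size=missing.sum())

    # Analytic model: Y on X1 and X2
    x = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(x, y_imputed, rcond=None)
    return coef[2]


without_x2 = x2_coefficient_after_imputation(include_x2=False)
with_x2 = x2_coefficient_after_imputation(include_x2=True)
```

Under these assumptions, the X2 coefficient is pulled toward zero roughly in proportion to the share of imputed cases when X2 is omitted from the imputation model, and it is recovered when X2 is included.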
While MI was originally developed for large-scale survey data in which the imputer and the analyst were different individuals, the inclusion of MI capabilities in standard statistical software has made congeniality straightforward and easy to implement. Along with all variables in the analytic model, missing data scholars also suggest that we should include two types of auxiliary variables in imputation models: (a) extra variables that are correlated with the potential mechanisms of missingness and the variable to be imputed and (b) variables that are correlated with the variable to be imputed but not mechanisms of missingness (Collins et al. 2001). Further, imputation models can include (c) other extraneous variables (Collins et al. 2001). Types A and B are recommended since these variables are generally believed to restore some of the missing information and carry information about missing values. Proponents claim that we should include as many of these types of auxiliary variables as possible in the imputation model for possible improvements in efficiency and bias even though they may be unrelated to one’s substantive hypotheses. Including these types of variables is referred to as an inclusive strategy (Collins et al. 2001). Extraneous variables (Type C) are typically included inadvertently, that is, they are assumed to be Type A or B. Hence, they are a potential by-product of being more inclusive (Collins et al. 2001). The inclusion of extraneous variables could decrease efficiency, but few have tested this possibility (see review below).
Given that it is not feasible to include all variables due to the potential multicollinearity problems and computational difficulties, it is critical to systematically examine whether or not auxiliary variables are necessary for MI accuracy (van Buuren, Boshuizen, and Knook 1999). This article tests the hypothesis that the coefficients obtained from imputed data sets based on a restrictive strategy are less accurate and efficient compared to coefficients obtained from imputed data sets based on an inclusive strategy. Choosing auxiliary variables ideally is based on strong substantive knowledge, information about the data collection process, and an idea about potential mechanisms of missingness (Enders 2008).
Empirical or Simulation Studies on Inclusion of Auxiliary Variables
Several studies comparing MI to traditional methods for handling missing data have incorporated auxiliary variables in the imputation procedure (e.g., Horton and Lipsitz 2001; Fichman and Cummings 2003; Peugh and Enders 2004; von Hippel 2007). Several simulation studies have also been conducted, facilitating ready comparisons between results with and without auxiliary variables in the imputation phase, but only one study explicitly tested the restrictive versus inclusive strategies in estimating regression parameters in real regression models, despite the fact that prominent researchers recommend this approach (e.g., Meng 1994; Rubin 1996; Allison 2001; Collins et al. 2001).
The conditions under which inclusion of auxiliary variables can be beneficial are outlined by several simulation studies. In a seminal study on this topic, Collins et al. (2001) conducted several simulations with all three types of auxiliary variables along with different levels of missingness (25 percent and 50 percent) in several different contexts, such as linear MAR and convex MAR. Results showed that inclusion of Type A auxiliary variables (variables correlated with the potential mechanisms of missingness) in the linear MAR condition made a difference in the regression coefficient of Y on X only when their association with missingness was high (e.g., .90) and the amount of missing data was large (e.g., >50 percent). In the convex MAR condition, an improvement in bias and efficiency on the regression coefficient of Y on X was seen when the association was high (e.g., .90) and the amount of missing data was 25 percent, and also when the association was moderate or high and the amount of missing data was 50 percent. Results further showed that including Type B variables (variables associated with the variable to be imputed) only improved efficiency when they were highly correlated with the outcome under MAR. Finally, there was trivial cost but no benefit when including Type C variables (extraneous variables). The study only examined the simple regression of Y on X though and did not explore the multivariate imputation context. Additionally, with only 2 levels of missing and 2 levels of correlation, no conclusions could be drawn about the usefulness of auxiliary variables whose correlation with Y falls between .40 and .90.
In another recent simulation study, von Hippel (2007) compared the effect of auxiliary variables between MI then deletion (MID) and MI. MID is an extension to traditional MI in which cases with missing on the dependent variable from the intended analytic model are included in the imputation model but then dropped in the analytic phase. His main interest was in exploring whether MID performed better than MI, but in doing so, he also explored the potential benefits of auxiliary variables. Comparing the nominal confidence interval coverage between the simulations with no auxiliary variables and the simulations with auxiliary variables seemed to indicate a very slight improvement in coverage with MI when including auxiliary variables at 50 percent missing when the correlation between the auxiliary variable and the intended dependent variable was .90 with 10 imputations (CI coverage: 94.7 vs. 95.8), but, again, the study was not designed to test this difference.
Beyond those two studies using MI, several studies have examined auxiliary variables in the context of maximum likelihood in the structural equation modeling framework. For example, Graham (2003) showed that modeling auxiliary variables with a saturated correlates model and an extra dependent variable (DV) model were equally effective in reducing parameter bias. Considering incomplete auxiliary variables in practice, Enders (2008) examined the impact of including an incomplete auxiliary variable in latent variable regression models. In this study, an auxiliary variable determined missingness and had a strong correlation with the incomplete outcome variables. Exclusion of this auxiliary variable from the model produced biased estimates. With 50 percent missing, the auxiliary variable substantially reduced bias in the model, but biased values were observed when one of the analysis variables and the auxiliary variable had approximately 10 percent concurrently missing, indicating that when the same cases are missing on both the auxiliary variable and variables in the analytic model, the inclusion of auxiliary variables may be of little benefit.
Another recent study on auxiliary variables in confirmatory factor analysis (CFA) by Yoo (2009) examined the effects of including auxiliary variables in MI with Type A and Type B auxiliary variables that were highly correlated with the research variable (.48–.72). Pertinent to the present study, Yoo (2009) included these variables under MCAR and MAR conditions as well as linear and convex types of missingness and concluded that the inclusion of auxiliary variables improved estimates when data were MAR convex but offered little improvement for MCAR or MAR linear.
While most of the studies reviewed above include auxiliary variables with high correlations, several studies also include variables with lower correlations. For example, Baraldi and Enders (2010) incorporated auxiliary variables in a maximum likelihood analysis with the longitudinal study of American youth data. They included three auxiliary variables; one with a very high correlation at .86 and the other two with modest-to-moderate correlations at .19 and .42. A comparison of restrictive and inclusive strategies showed that the inclusive strategy improved efficiency by decreasing the standard errors (SEs) for every parameter in their analytic model. The authors did not draw any definitive conclusions about auxiliary variables and bias though, because, in their study, any differences noted between coefficients estimated using MI with and without auxiliary variables “. . . could result from a reduction in bias or a reduction in random error” (Baraldi and Enders 2010:29). Finally, a simulation study by Enders and Peugh (2004) found no improvement when using six Type B auxiliary variables in a CFA model under MCAR and low correlation (.10–.30).
In summary, the studies reviewed above varied on whether they found the inclusive strategy beneficial and the differences appear to be based on the magnitude of the correlation, the proportion of missingness, missingness pattern/type of missingness, the nature of auxiliary variables, and more importantly, combinations of these conditions. Few of these studies have done simulations with realistic analytic situations common to social scientists. For example, while instructive theoretically, the finding that Type A auxiliary variables are beneficial when the proportion of missing is above 50 percent, the correlation with the imputed variable is above .9, and the type of missing is MAR linear is not entirely relevant when common-use guidelines recommend using MI only when the proportion of missing data is less than 35 percent, variables with such high correlations are frequently unavailable, and researchers may not know whether their missing is linear or convex. Also, most of these studies use a single-variable imputation rather than a multivariate empirical model that is commonly used in social scientific research. Even the seminal Collins et al. (2001) study mainly relied on a univariate imputation model and a simple regression analytic model rather than a multivariate imputation model and a multiple regression analytic model. Baraldi and Enders (2010) utilized a real data set and regression analysis in the ML context, but it was a comparative study between imputation models with and without auxiliary variables rather than a simulation study. Therefore, their study could not address the performance of auxiliary variables on the accuracy of estimate but focused on efficiency.
Overall, the evidence on the usefulness of auxiliary variables in the MI framework remains unclear, particularly in realistic conditions. This may in turn explain why few researchers employ auxiliary variables in practice when implementing MI. To fill this gap, this article explicitly compares restrictive and inclusive strategies in MI with both a univariate and a multivariate imputation and a multiple regression analysis closer to realistic research contexts and covers a wide range of scenarios.
Note that this article does not intend to contradict extant theoretical explanations and rationales for inclusion of auxiliary variables in the imputation model. The starting point of this article is to pose a question of how practical it is and how necessary it is. Specifically, real data sets contain few potential auxiliary variables that are highly correlated with variables to be imputed and mechanisms of missingness, which has major implications for real analyses. Furthermore, even if one finds useful and strong auxiliary variables, the same cases that were missing on the variable to be imputed might also be missing on the auxiliary variables as well, offering little improvement in the imputation (see von Hippel 2007). Enders (2008) clearly noted that serious bias was found when both the auxiliary variable and the imputed variable had 10 percent missing cases, for example. In practice, the restrictive model is most convenient for researchers, and so, as a practical matter, this article therefore explicitly investigates the effectiveness of auxiliary variables in comparing restrictive versus inclusive strategies in real analytic situations.
Study Approach
As to why auxiliary variables might improve imputations, we seek imputation models that are statistically valid. Statistical validity implies approximately unbiased point estimates as well as confidence intervals, achieving their nominal coverages when averaged over the randomization distributions generated by the known sampling mechanism and the posited missing data mechanisms (Rubin 1996; Schafer 1997).
Assume
To achieve inferences that are statistically valid from multiply imputed data (1) the MIs from the imputation model must be proper and (2) the completed case inference based on (
To examine whether the inclusion of auxiliary variables in MI improves the accuracy and efficiency of parameter estimation in analytic models, these analyses are based on three levels of missing data from variables in a commonly used public use data set, first under the assumption of MCAR and then under the assumption of MAR. The models include different types and combinations of auxiliary variables in the univariate context and then expand to the multivariate context. Conducting the univariate analyses will allow comparisons between these results and previous results, while the multivariate analyses will be more applicable to realistic analytic situations.
Method
Data
To explore the implications of auxiliary variables for valid statistical inferences on multiply imputed data, this article attempts to mimic a realistic sociological data analysis, based on a review of how imputation is used by sociologists in AJS, ASR, and Social Forces. First, these analyses utilize the most recent cross-sectional wave of a popular public use data set, the GSS (2008, N = 2,023). The GSS is one of the most frequently used data sets in the three leading sociology journals. 4 The GSS employs a multistage area probability sample at the block level with quota sampling based on several block-level factors. More information about the design and features of the GSS can be found at the website of the National Opinion Research Center (NORC). After selecting a data set, I chose a commonly used variable that had little actual missing data, years of education.
Simulation Procedure
To simulate MCAR, I first dropped the negligible number of actual missing cases not only on education but also on the covariates included in the imputation models (n = 35) and then randomly selected 10 percent, 20 percent, and 30 percent of cases and set those values equal to missing. Hence, there were four education variables: a “completed case” variable (n = 1,810), one with 10 percent of cases MCAR (n = 1,629), one with 20 percent of cases MCAR (n = 1,448), and one with 30 percent of cases MCAR (n = 1,267). To simulate MAR, this procedure was repeated with the probability of missingness dependent upon another X variable.
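The deletion step can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the article: the function names, the binary driver variable, and the selection weight are hypothetical, and the article does not report how strongly missingness depended on the X variable in the MAR condition.

```python
import numpy as np


def set_mcar(values, fraction, rng):
    """Set a fixed fraction of cases to missing, chosen uniformly at random."""
    out = np.asarray(values, dtype=float).copy()
    n_missing = int(round(fraction * len(out)))
    idx = rng.choice(len(out), size=n_missing, replace=False)
    out[idx] = np.nan
    return out


def set_mar(values, driver, fraction, rng, weight=3.0):
    """Set a fixed fraction of cases to missing, favoring cases with driver == 1.

    `weight` is a hypothetical selection weight for driver == 1 cases
    relative to driver == 0 cases; missingness depends only on the
    observed driver, not on the values being deleted.
    """
    out = np.asarray(values, dtype=float).copy()
    n_missing = int(round(fraction * len(out)))
    w = np.where(np.asarray(driver) == 1, weight, 1.0)
    idx = rng.choice(len(out), size=n_missing, replace=False, p=w / w.sum())
    out[idx] = np.nan
    return out


rng = np.random.default_rng(2008)
education = rng.integers(0, 21, size=1810)          # placeholder values, not GSS data
education_10 = set_mcar(education, 0.10, rng)       # 181 missing, 1,629 observed
```

With 1,810 complete cases, deleting 10, 20, and 30 percent leaves 1,629, 1,448, and 1,267 observed values, which matches the sample sizes reported above.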
Next, I developed several imputation models, starting with the variables from the restrictive model (i.e., variables from the analytic model). For the analytic model, I chose both the dependent variable and the independent variables, with a focus on trying to mimic a typical sociological model in which years of education could be expected to be a significant predictor but also trying to include only variables that had very little actual missing data since all real missing cases were dropped. The DV was a financial satisfaction variable originally coded on a 3-point scale in the GSS but was dichotomized into “satisfied” versus the other two categories. The choice of a categorical outcome was dictated by not being able to find a continuous outcome that met the criteria (i.e., little actual missing data and associated with years of education). The independent variables from the analytic model included income, sex, age, race/ethnicity (black and other race/ethnicity compared to white), number of children, and marital status (married and never married compared to divorced/widowed). The restrictive imputation model had an R2 = .14.
After running the restrictive imputation model, for the MCAR data, the analyses focused on the inclusion of correlates of the variable to be imputed (Type B auxiliary variables). In this case, an index of socioeconomic status (SEI), number of siblings, working full time, and perceived social class were significantly associated with years of education, and adding them to the imputation model increased the R2 to .45. Their correlations with education ranged from r = .12 to r = .60; hence, these became the weak and moderate correlation Type B variables. Because there were no variables in the data set correlated with education higher than .60, I generated variables with r = .65 and r = .85 in order to examine a fuller range of situations. These two variables became the high-correlation Type B variables. Adding those generated variables to the imputation model increased the R2 to .62. Variables correlated with the mechanisms of missingness and the variable to be imputed (Type A auxiliary variables) do not apply in the MCAR situation and consequently were not included in the first set of models.
An additional set of models included both correlates of the variable to be imputed (Type B) and extraneous variables not associated with the variable being imputed (Type C auxiliary variables). Extraneous variables (Type C) must not be associated with years of education. This model included two variables that met this criterion: an indicator of which form the respondent filled out and the number of persons in the respondent’s household. The purpose of including extraneous variables (Type C) is not to see whether they improve the imputation model but rather to see whether they decrease efficiency. The partial correlations of all auxiliary variables with education, controlling for the variables in the analytic model, were incremental at .05, .26, .30, .55, .62, and .83. The change in R2 further showed that each auxiliary variable added explanatory power to the imputation model.
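A partial correlation of the kind reported above can be computed by residualizing both the variable to be imputed and the candidate auxiliary variable on the analytic-model covariates and then correlating the residuals. The sketch below is a generic illustration with a hypothetical function name; it is not the routine used to produce the values in the article.

```python
import numpy as np


def partial_corr(target, auxiliary, controls):
    """Correlation between target and auxiliary after removing the linear
    effect of the control variables from both (illustrative helper)."""
    target = np.asarray(target, dtype=float)
    auxiliary = np.asarray(auxiliary, dtype=float)
    z = np.column_stack([np.ones(len(target)), controls])   # add an intercept
    resid_target = target - z @ np.linalg.lstsq(z, target, rcond=None)[0]
    resid_aux = auxiliary - z @ np.linalg.lstsq(z, auxiliary, rcond=None)[0]
    return np.corrcoef(resid_target, resid_aux)[0, 1]


# Synthetic example: two variables related only through a shared control
rng = np.random.default_rng(0)
control = rng.normal(size=5000)
target = control + rng.normal(size=5000)
auxiliary = control + rng.normal(size=5000)
raw = np.corrcoef(target, auxiliary)[0, 1]
partial = partial_corr(target, auxiliary, control)
```

In this synthetic example the raw correlation is substantial while the partial correlation is close to zero, which is why partial rather than zero-order correlations indicate what an auxiliary variable adds beyond the analytic-model covariates.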
After completing the above steps for the univariate context, the procedure was repeated in the multivariate context. I set equal to missing 10 percent, 20 percent and 30 percent of 2 additional variables, income and number of children, to test the more realistic situation of imputing several variables with missing data at once. The analyses used the same auxiliary variables as in the MCAR univariate model, the variables chosen for their association with education, and therefore I expected the imputations of income and number of children to be impacted less by the auxiliary variables. In general, the auxiliary variables had higher partial correlations with income than number of children, thus I expected the imputation of education to be impacted the most, followed by income and number of children.
The MAR models included one set with no auxiliary variables, one set with mechanism of missingness (Type A) only, and one set with both mechanism of missingness (Type A) and correlates of the variable to be imputed (Type B). A true Type A variable would be associated with education but not with financial satisfaction so that it could be included in the imputation model as auxiliary but not in the analytic model. Therefore, missingness on education was associated with no religious preference, such that respondents who reported no religious preference were significantly less likely to be missing on education compared to those who indicated a particular religion, as no religious preference was associated with education but not financial satisfaction.
For the MAR models, the first model included the dependent variable from the intended analytic model, as well as the other right-hand side variables: income, age, sex, black, other race/ethnicity, number of children, married, and never married. For mechanism of missingness (Type A) only, the model included all of the above plus the dummy variable for no religious preference. For mechanism of missingness (Type A) and correlates of the variable to be imputed (Type B), the model included the same variables from the mechanism of missingness (Type A) only model, plus SEI, number of siblings, working full time, and social class. The last model included a correlate of the cause of missingness (i.e., a correlate of no religious preference), frequency of religious service attendance, to mimic the situation in which the true mechanism of missingness may not be available in the data set or may be unknown.
In total, the MCAR models included 4 univariate imputation models (restrictive, weak + moderate Type B, high Type B, and Types B and C) on the education variables with 10 percent, 20 percent, and 30 percent missing for a total of 12 models and then the same 4 multivariate normal imputation models on data in which education, income, and number of children all had 10 percent, 20 percent, and 30 percent missing for a total of 12 additional models. The MAR models included 6 imputation models (restrictive, Type A only, weak + moderate Type B, high Type B, Types A and B, and Type A correlate) on education with 10 percent, 20 percent, and 30 percent missing for a total of 18 more models.
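The resulting design can be enumerated programmatically to confirm the counts. The labels below are taken from the paragraph above; the tuple layout and function name are assumptions made for illustration.

```python
from itertools import product

MISSING_LEVELS = (0.10, 0.20, 0.30)
MCAR_MODELS = (
    "restrictive",
    "weak + moderate Type B",
    "high Type B",
    "Types B and C",
)
MAR_MODELS = (
    "restrictive",
    "Type A only",
    "weak + moderate Type B",
    "high Type B",
    "Types A and B",
    "Type A correlate",
)


def design_cells():
    """List every (context, imputation model, missing level) combination."""
    cells = []
    for context in ("MCAR univariate", "MCAR multivariate"):
        cells += [(context, model, level)
                  for model, level in product(MCAR_MODELS, MISSING_LEVELS)]
    cells += [("MAR univariate", model, level)
              for model, level in product(MAR_MODELS, MISSING_LEVELS)]
    return cells


cells = design_cells()   # 12 + 12 + 18 = 42 imputation scenarios
```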
The MI itself used the mi impute command in Stata 11, which imputes by simulating from a Bayesian posterior predictive distribution for the missing data using a noniterative technique (StataCorp, LP 2009). This procedure replaces the missing values on years of education with M sets of plausible values based on the specified imputation model (Rubin 1987; Schafer 1997). The analyses were performed with the mi estimate command to obtain a set of complete-case estimates (
The MCAR multivariate models used—mi impute mvn—which implements an iterative Morkov Chain Monte Carlo (MCMC) method (data augmentation; Schafer 1997). As noted above, the procedure first runs a univariate imputation of missing values on education based on a multiple regression of education on the other variables specified with random draws from the conditional distribution of the missing observations. All the specified variables are used for the imputation of all the other variables, thus preserving missing values and correlations between the imputed variables.
As for how many imputations (M) are necessary to obtain valid inferences, the theory underlying MI is based on an infinite number of imputations. In practice, M = 5 appears to be the standard used, yet the actual number needed depends on both the amount of missing data and the analytic model itself (StataCorp, LP 2009). Rubin (1987) suggests that the relative efficiency of MI with finite M is approximately 90 percent compared to infinite M, but Graham et al. (2007) recommend a minimum of 20 imputations. A recent study has demonstrated that the actual number needed to achieve optimal precision is a multiple of the fraction of missing information (Bodner 2008). Because the computational burden of additional imputations is small and the fraction of missing information varied from model to model, these imputations use a conservative M = 50.
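As a rough illustration of this trade-off, the relative efficiency of M imputations compared with an infinite number is commonly written, following Rubin (1987), as RE = 1 / (1 + γ/M), where γ is the fraction of missing information. The sketch below evaluates that expression; the γ values are illustrative assumptions, not quantities estimated in this study.

```python
def relative_efficiency(gamma: float, m: int) -> float:
    """Approximate efficiency of m imputations relative to infinitely many.

    gamma is the fraction of missing information (between 0 and 1).
    """
    return 1.0 / (1.0 + gamma / m)


# Illustrative (assumed) fractions of missing information.
for gamma in (0.1, 0.3, 0.5):
    for m in (5, 20, 50):
        re = relative_efficiency(gamma, m)
        print(f"gamma={gamma:.1f}  M={m:>2}  relative efficiency={re:.3f}")
```

Under this approximation, the gain from moving beyond a modest M shrinks quickly, which is why a large M such as 50 is conservative rather than strictly required.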
After imputing the variables as described above, for MCAR in the univariate context, I entered each imputed years-of-education variable (12 variables total: 3 levels of missingness times 4 different imputation models) into the analytic model. This made it possible to compare the coefficients of the multiply imputed education variables, with varying amounts of missingness and varying imputation models, to those from a model that included education with no missing data. This procedure was repeated for the multivariate context and for the MAR context.
Although the missing data were randomly selected for each scenario (10 percent, 20 percent, and 30 percent), a single example may not adequately represent the distribution of possible outcomes. Therefore, I simulated the above procedure 100 times in each of the 12 imputations (3 levels of missingness times 4 levels of auxiliary variables) for the MCAR univariate and MCAR multivariate contexts, and in each of the 18 imputations (3 levels of missingness times 6 auxiliary variable situations) in the MAR context, changing the random number seed each time to change which cases were set to missing. Simulating the procedure 100 times reduces the likelihood that the results reflect a single chance draw of missing cases. The tables present the mean coefficients from the logistic models summarized across the 100 simulations.
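The simulation design described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the 3-to-1 selection weight for the MAR mechanism, and the simulated 0/1 driver variable are all assumptions introduced for the example, and the imputation and model-fitting steps are left as a comment.

```python
import numpy as np


def mcar_mask(n, rate, rng):
    """Flag round(rate * n) cases as missing, completely at random."""
    k = int(round(rate * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    return mask


def mar_mask(driver, rate, rng, weight=3.0):
    """Flag round(rate * n) cases as missing, with selection weights that
    depend only on an observed 0/1 driver (assumed weight: 3 to 1)."""
    driver = np.asarray(driver)
    n = driver.size
    k = int(round(rate * n))
    w = np.where(driver == 1, weight, 1.0)
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False, p=w / w.sum())] = True
    return mask


n = 1988  # sample size mentioned in the Discussion
# Hypothetical stand-in for the "no religious preference" dummy.
driver = np.random.default_rng(0).binomial(1, 0.17, size=n)

for rate in (0.10, 0.20, 0.30):
    for rep in range(100):
        rng = np.random.default_rng(rep)  # new seed for each replication
        miss_mcar = mcar_mask(n, rate, rng)
        miss_mar = mar_mask(driver, rate, rng)
        # Here one would set education to missing for the flagged cases,
        # impute, fit the analytic model, and store the pooled estimates.
```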
Evaluation Criteria
The relative performance was evaluated using bias, standardized bias, and root mean square error (RMSE) as well as the minimum and maximum coefficients from the 100 simulations. The bias indicates the percentage the pooled mean coefficient is discrepant from the true coefficient [bias= (E(
Results
Table 1 presents pooled coefficients and SEs from analytic models which include the imputed education variable from the first set of MCAR univariate simulations. The true coefficient for education in the model with no missing data is .125, while the true SE is .020. Compared to this, none of the coefficients from the MCAR univariate models show appreciable bias. Bias is less than 2 percent in all models. In general, there is slight improvement when adding the auxiliary variables, but there is no consistent pattern of improvement by type of auxiliary variable. For example, the inclusion of low/moderate correlation auxiliary variables in the 20 percent and 30 percent models actually increases standardized bias very slightly, but in the 10 percent model the same variables decrease the standardized bias. All levels of bias and standardized bias are extremely low and within the acceptable range, with the lowest being .03 percent bias and .19 percent standardized bias in the high correlation, 20 percent missing model and the highest being 1.95 percent bias and 10.69 percent standardized bias in the low/moderate correlation, 30 percent missing model.
Mean Coefficients, Standard Errors, and Evaluation Measures From 100 Simulations of Univariate Multiple Imputation Under Missing Completely at Random Condition
Notes. True value: educ = .1256; SE = .0200. No Aux. (restrictive model): not including any auxiliary variables in the imputation model. R²: full time (.2036), sibs (.2009), class (.2253), SEI (.4281), generated 1 (.4789), generated 2 (.7404). Low/moderate correlation (Type B): correlated with educ, r = .12 to .60. High correlation (Type B): correlated with educ, r = .65 and r = .85.
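The evaluation measures reported in the tables can be sketched as below. The definitions are assumptions made for illustration: bias as the percentage deviation of the mean pooled coefficient from the true coefficient, standardized bias as that deviation expressed as a percentage of the standard error of the estimate, and RMSE as the square root of squared bias plus squared SE. The input values in the example are hypothetical.

```python
import math


def evaluation_measures(mean_coef, mean_se, true_coef):
    """Bias (%), standardized bias (%), and RMSE under the assumed definitions."""
    raw_bias = mean_coef - true_coef
    return {
        "bias_pct": 100.0 * raw_bias / true_coef,
        "std_bias_pct": 100.0 * raw_bias / mean_se,
        "rmse": math.sqrt(raw_bias ** 2 + mean_se ** 2),
    }


# Hypothetical pooled results compared against the true educ coefficient.
measures = evaluation_measures(mean_coef=0.1280, mean_se=0.0229, true_coef=0.1256)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

Because the SE enters the denominator of standardized bias, a coefficient with a small SE can show a large standardized bias even when its percentage bias is modest.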
In terms of efficiency, the RMSE with no auxiliary variables was .021 in the 10 percent missing models, .023 in the 20 percent missing models, and .024 in the 30 percent missing models. The RMSE increased as the percentage missing increased, and within each level of missing it decreased most with the high-correlation auxiliary variables (e.g., MSE ratio = 1.255 in the high-correlation, 30 percent missing model). Even so, the RMSE was the same to two decimal places across all MCAR univariate models; hence auxiliary variables do not appear to have substantially impacted efficiency. There was substantial variability in the minimum and maximum coefficient values across simulations, with the widest range occurring at 30 percent missing with no auxiliary variables, at a minimum b = .091 and maximum b = .168 (Table 2).
Mean Coefficients, Standard Errors, and Evaluation Measures From 100 Simulations of Multivariate Normal Multiple Imputation Under Missing Completely at Random Condition
Notes. True values: edu (b = .1256, SE = .0200); income (b = .1021, SE = .0340); children (b = −.0713, SE = .0400). No Aux. (restrictive model): not including any auxiliary variables in the imputation model. R²: full time (.2036), sibs (.2009), class (.2253), SEI (.4281), generated 1 (.4789), generated 2 (.7404). Low/moderate correlation (Type B): correlated with educ, r = .12 to .60. High correlation (Type B): correlated with educ, r = .65 and r = .85.
The MCAR multivariate models imputed 3 different variables at the 10 percent, 20 percent, and 30 percent levels: education, income, and number of children. The true values were education, b = .125 and SE = .020; income, b = .102 and SE = .034; and number of children, b = −.071 and SE = .040. For education, the findings were similar to those in the univariate MCAR models; specifically, none of the coefficients from the MCAR multivariate models demonstrated appreciable bias. Bias was less than 2 percent in all models and less than 1 percent in every model except the 20 percent, no auxiliary variables model. Looking across and within levels of missing, there was no consistent pattern of standardized bias for education. Within levels of missing, bias and standardized bias were reduced in each successive model at the 10 percent level, but all models with auxiliary variables had higher bias and standardized bias at the 30 percent level compared to the model with no auxiliary variables. Across levels of missing, the largest standardized bias among imputation models was found in the no auxiliary variable models with 10 percent and 20 percent missing, but the lowest was in the no auxiliary variable model with 30 percent missing (standardized bias = 0.33 percent).
Contrary to expectations, income had the highest levels of bias and standardized bias across models, but again all values were still below acceptable limits. The values at the 10 percent missing level were inconsistent, in that the coefficient for income in the model with highly correlated auxiliary variables had more bias and standardized bias than the coefficient from the model with no auxiliary variables. At the 20 percent and 30 percent levels, there was no clear pattern of results across models except that both bias and standardized bias were higher than in the 10 percent models. The maximum standardized bias was 29.92 percent for income in the no auxiliary variable, 30 percent missing model, closely followed by 28.45 percent in the high-correlation model and 28.43 percent in the 30 percent missing model with low-/medium-correlation auxiliary variables and extraneous variables (Types B + C). Bias and standardized bias values for number of children were lower than the corresponding values for income and similar in magnitude to values for education. These values also had no clear pattern. All values were substantially below cut point values.
As for efficiency, the RMSE for education closely resembled the RMSE from the MCAR univariate models at .021, .023, and .024 for the 10 percent, 20 percent, and 30 percent no auxiliary variable models, respectively. In the 30 percent missing model, there was a noticeable reduction in MSE for education in the model with high-correlation auxiliary variables (MSE ratio = 1.294), while there was much less improvement with other conditions and variables. The RMSE values for income and number of children were higher at around .04, which largely reflected the higher SEs of those coefficients. Note that when the coefficient estimate is unbiased, the RMSE equals the SE, because the RMSE combines the squared bias and the variance of an estimator. All efficiency values were well within the acceptable limits for all MCAR multivariate models.
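The relation between RMSE and SE noted here follows from the standard decomposition of mean squared error. The notation is introduced for illustration only, with \hat{b} denoting the estimated coefficient and b the true coefficient:

```latex
\mathrm{MSE}(\hat{b})
  = \mathrm{E}\!\left[(\hat{b} - b)^{2}\right]
  = \left(\mathrm{E}[\hat{b}] - b\right)^{2} + \mathrm{Var}(\hat{b}),
\qquad
\mathrm{RMSE}(\hat{b}) = \sqrt{\mathrm{MSE}(\hat{b})}.
```

When the bias term \mathrm{E}[\hat{b}] - b is zero, the RMSE reduces to \sqrt{\mathrm{Var}(\hat{b})}, that is, the SE.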
In the MAR univariate models, there were some slight decreases in bias between models with and without auxiliary variables within each level of missing, but the results were again inconsistent. At the 10 percent and 30 percent levels, the models with no auxiliary variables had lower bias and standardized bias than several of the auxiliary variable models. For example, at the 30 percent missing level, the analytic model that included education imputed without any auxiliary variables had bias = 1.12 percent and standardized bias = 5.84 percent, but the model that included the true mechanism of missingness (i.e., no religious preference, Type A) had bias = 1.50 percent and standardized bias = 7.65 percent. The model that included a variable correlated with the true mechanism of missingness had bias = .42 percent and standardized bias = 2.22 percent. Despite the inconsistencies, all bias and standardized bias numbers were extremely low and fell well within acceptable limits. As with the MCAR univariate models, the range of coefficient estimates varied greatly across simulations, with the greatest range being in the 30 percent missing models. For example, in the no auxiliary variables model, the minimum b = .1085 while the maximum b = .1566.
In terms of efficiency, the RMSE tended to increase as levels of missingness increased, from .021 in 10 percent missing models to .024 in 30 percent missing models with no auxiliary variables. Including the artificially generated high-correlation auxiliary variables (High Type B) had the strongest impact on RMSE, particularly at the 30 percent missing level (MSE ratio = 1.262). The RMSE and SEs both within and across models were nonetheless the same to two decimal places, at .02. Hence, at that level of precision, efficiency did not appear to be affected by the inclusion or exclusion of auxiliary variables.
Discussion
Of the 13 percent of recent articles from three top sociology journals that used MI, only a few detailed the specifics of their imputation model, addressed the potential mechanisms of missing data, or mentioned what variables were included in their imputation models. While recent studies have concluded that the inclusion of auxiliary variables in MI models is beneficial, the actual findings have been few, inconsistent, and difficult to generalize. In these analyses, I attempted to demonstrate the effect of including various types of auxiliary variables in imputation models on coefficient bias and efficiency in analytic regression models, in the univariate MCAR context, the multivariate MCAR context, and the univariate MAR context with different levels of missing data. I approached this from a practical standpoint, rather than a theoretical standpoint, attempting to mimic a typical sociological analysis in terms of variables, model, and amount of missing data.
In relation to the main question, there was little evidence to necessitate the inclusion of auxiliary variables of any type, in any context, or with any amount of missing data (up to 30 percent). All bias and efficiency values were within acceptable limits and including auxiliary variables did not appear to systematically reduce bias or increase efficiency within each level of missing data. This finding is consistent with the studies that show that MI is robust to model misspecification when the amount of missing data is not large (Schafer 1997; Barzi and Woodward 2004). Additionally, several clinical studies have shown that, given low rates of missingness, results derived from different imputation approaches were similar (van Buuren et al. 1999; Arnold and Kronmal 2003; Barzi and Woodward 2004).
Therefore, in terms of bias and efficiency, I might conclude that auxiliary variables are unnecessary. An alternative argument, however, can be made from the perspective of power. As Baraldi and Enders (2010:30) note, while the reduction in SEs may appear small, one can view the reduction in terms of an increase in power. For example, in the MAR models with 30 percent missing, including both the mechanism of missingness (Type A) and low/moderate correlation (Type B) variables reduced the SE from .0240 to .0226, which, in terms of power, is equivalent to increasing the sample size from 1988 to 2242—quite a substantial gain. The inconsistent results make the power gain tenuous though, in that some of the models would lead to a gain in power and some to a loss in power. For example, the increase in SE to .0246 in the Type A only model would be the equivalent to reducing sample size to 1892. Perhaps the best way to maximize gain and minimize loss is to try different imputation models with different sets of auxiliary variables and see if any result in reduced SEs.
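The sample-size equivalents quoted above can be reproduced with a short calculation, assuming that the SE of a coefficient scales with one over the square root of the sample size. The function name is introduced here for illustration.

```python
def equivalent_sample_size(n: float, se_old: float, se_new: float) -> float:
    """Sample size that would produce se_new if SE is proportional to 1/sqrt(n)."""
    return n * (se_old / se_new) ** 2


# SE reduced from .0240 to .0226 (Types A and B model, 30 percent missing).
print(round(equivalent_sample_size(1988, 0.0240, 0.0226)))  # 2242

# SE increased from .0240 to .0246 (Type A only model).
print(round(equivalent_sample_size(1988, 0.0240, 0.0246)))  # 1892
```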
One additional finding worth noting is the high range of coefficient estimates across simulations. With MI, each full data set will have slightly different values for the imputed data and thus each set of parameter estimates will differ slightly. According to these results, a high range of estimated coefficients is seen in the case of large amounts of missing data (30 percent). This pattern carries some implications for MI: in the situation where the proportion of missing data is large, the inferences based on the observed-data likelihood may be unstable; and the inferences based on MI may have large variation (Zhang 2005). A further exploration of the number of imputations required to obtain consistent estimates may be called for.
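The variation across imputed data sets enters the pooled standard error through Rubin's (1987) combining rules, in which the total variance is the average within-imputation variance plus an inflated between-imputation variance. The sketch below is a generic illustration of those rules for a single coefficient, not the Stata implementation, and the input values are hypothetical.

```python
import numpy as np


def pool_rubin(coefs, ses):
    """Pool one coefficient across M imputed data sets with Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    coefs = np.asarray(coefs, dtype=float)
    ses = np.asarray(ses, dtype=float)
    m = coefs.size
    pooled = coefs.mean()              # pooled point estimate
    within = np.mean(ses ** 2)         # average within-imputation variance
    between = coefs.var(ddof=1)        # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return pooled, np.sqrt(total)


# Hypothetical estimates from M = 5 imputed data sets.
est, se = pool_rubin([0.121, 0.128, 0.125, 0.131, 0.119],
                     [0.020, 0.021, 0.020, 0.021, 0.020])
print(f"pooled b = {est:.4f}, pooled SE = {se:.4f}")
```

A larger spread of coefficients across imputations raises the between-imputation term and therefore the pooled SE, which is one way the instability at high rates of missingness shows up in practice.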
Based on these findings, this study provides the following suggestions. First and foremost, basing the imputation model on the analytic model appears to be sufficient, with up to 30 percent missing data and likely higher, given the low bias at even the 30 percent level. Second, contrary to expectations, failing to include the variable that was the cause of missingness or associated with the cause of missingness in the MAR models (Type A) had little effect on bias and efficiency. Failing to account for variables associated with the potential mechanisms of missingness may not be as problematic as previously thought, at least when the percentage of missing is 30 percent or less and the pattern of missing data is straightforward. The auxiliary variables that appeared to have the strongest impact were the high correlation Type B variables, but, as noted above, those were artificially generated because no such variables actually existed in the data set. Hence, suggesting that they be included in MI models may be impractical.
In the multivariate MCAR models, the auxiliary variables had roughly the same impact on number of children and on education, even though they were correlated much more highly with education. Along the same lines, the auxiliary variables had much less of an impact on income, despite the partial correlations being higher for income than for number of children. The missing data on all three variables were MCAR, yet the amount of bias, standardized bias, and RMSE differed substantially, with income having the greatest amount. Note that the SE of the true estimate was higher for income compared to education and SEI and that higher SE was reproduced in the simulations. Consequently, the higher SE necessarily resulted in greater values when computing standardized bias and RMSE, although all values were within acceptable limits. Including highly correlated auxiliary variables at the 30 percent level was associated with a reduction in RMSE. Hence, power may be improved with the inclusion of this kind of variable despite the fact that the RMSE was already within acceptable limits.
While these findings add to the literature on model specification in MI, there are several weaknesses of this approach that limit generalizability. First, because the missing data were created, they were perfectly MCAR in the one context and dependent only upon 1 X variable in the MAR context. Both the MCAR and MAR assumptions are typically untestable, and therefore many researchers make an assumption without having examined whether it is feasible in their data. Even when researchers assume their data are MAR, the data may or may not be. As such, these analyses test the perfect situations of MCAR and MAR, while real analyses likely do not meet either assumption perfectly. Second, the variables included in the imputation here could all be imputed with a linear model. MI routines are available for imputing other types of variables, such as binary and count, and results may be different with other types of variables. Finally, this study capped the percentage of missing data at 30 percent, as 30–35 percent is the typical maximum in sociological analyses. Auxiliary variables may have more of an effect at higher levels of missing data (as found by Collins et al. 2001 and others).
Despite these limitations, these results demonstrate that the current guidelines about including auxiliary variables may be overly conservative for typical sociological applications. The results differ only marginally between models with and without auxiliary variables, which suggests that applied researchers may be able to obtain valid statistical inference by adhering to congeniality, without auxiliary variables in the imputation models, when using MI to handle missing data. It is difficult to evaluate the validity of previous studies using MI without knowing about the specification of their imputation models and whether they included all of the variables from the analytic models. It should be noted that these results are applicable to the situation in which the imputer and the analyst are the same person, rather than to the case of large, public use data sets in which imputation may be carried out without knowledge of the final analyses. In that situation, this study suggests that an analyst's own implementation of MI for his or her analysis may be superior to using already imputed data.
This study can be extended to simulations with different types of missing data. In this article, I examined the effects of including auxiliary variables in two different situations (MCAR and MAR) and three different amounts of missing data in both the univariate and multivariate contexts. In future work, I will consider more complex missing data mechanisms and missingness patterns, mixed variable types, different sample sizes, and model misspecifications/specifications (including misspecification of the mean structure as well as that of the error distribution).
In conclusion, these findings show that restrictive models themselves may perform adequately under the MAR and MCAR assumptions when the amount of missing data is less than or equal to 30 percent. The coefficients for the imputed variables were significant in all substantive models and the amount of bias never exceeded 12 percent and standardized bias never exceeded 30 percent. Conversely, neither did the presence of auxiliary variables, even Type C variables, appear to be associated with a decrease in efficiency. Hence, no benefit or harm was associated with including extra variables. In terms of making valid inferences the restrictive model appears adequate and the addition of auxiliary variables does no harm. When power is an issue though, researchers may wish to try different sets of auxiliary variables in an attempt to increase efficiency and effectively increase sample size. Finally, I recommend that researchers provide more information in published work about the specifications of their imputation model, in addition to the tenability of the assumptions of MI, just as they would with their analytic models.
Acknowledgements
The author wishes to thank Soyoung Kwon for research assistance, Richard Williams for valuable suggestions and discussion, and the anonymous reviewers for their helpful feedback.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
