Abstract
Researchers often impute continuous variables under an assumption of normality, yet many incomplete variables are skewed. We find that imputing skewed continuous variables under a normal model can lead to bias. The bias is usually mild for popular estimands such as means, standard deviations, and linear regression coefficients, but the bias can be severe for more shape-dependent estimands such as percentiles or the coefficient of skewness. We test several methods for adapting a normal imputation model to accommodate skewness, including methods that transform, truncate, or censor (round) normally imputed values as well as methods that impute values from a quadratic or truncated regression. None of these modifications reliably reduces the biases of the normal model, and some modifications can make the biases much worse. We conclude that, if one has to impute a skewed variable under a normal model, it is usually safest to do so without modifications, unless you are more interested in estimating percentiles and shape than in estimating means, variances, and regressions. In the conclusion, we briefly discuss promising developments in the area of continuous imputation models that do not assume normality.
Introduction
Imputation is an increasingly popular method for handling data with missing values. When using imputation, analysts fill in missing values with random draws from an imputation model and then fit the imputed data to an analysis model. In multiple imputation (MI), the process of imputation and analysis is repeated several times and the results of the several analyses are combined (Rubin 1987; Allison 2002; Kenward and Carpenter 2007).
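The combining step in MI can be sketched for a single scalar estimand using the standard pooling formulas known as Rubin's rules. This is a minimal illustration, not the procedure of any particular software package; the function name `combine_mi` and the five slope estimates and squared standard errors are hypothetical values chosen only to show the arithmetic.

```python
from statistics import mean, variance


def combine_mi(estimates, variances):
    """Pool one scalar estimand across m imputed data sets (Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)        # MI point estimate: average of the m estimates
    within = mean(variances)        # average squared standard error
    between = variance(estimates)   # sample variance of the m estimates
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from m = 5 imputed data sets (illustrative numbers only)
slopes = [0.98, 1.03, 1.01, 0.97, 1.02]
squared_ses = [0.010, 0.011, 0.009, 0.010, 0.012]

pooled_slope, total_variance = combine_mi(slopes, squared_ses)
print(pooled_slope, total_variance)
```

The between-imputation term is what distinguishes MI from analyzing a single imputed data set: it carries the extra uncertainty due to the missing values.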
In an ideal world, the imputation model would perfectly represent the distribution of the data. But perfect fidelity can be very difficult to achieve and in practice it is often unnecessary. All that is necessary is that the imputation model preserves those aspects of the distribution that are relevant to the analysis model. For example, if the analyst plans only to estimate means, standard deviations, and the parameters of a linear regression model, then the imputation model needs only to preserve the means, variances, and covariances among the variables that will be analyzed.
When the goals of analysis are limited in this way, as they often are, a crude imputation model can yield usable results. For example, in some settings, it can be acceptable to impute dummy variables, squared terms, and interactions as though they were normal when conditioned on other variables (Horton, Lipsitz, and Parzen 2003; Allison 2005; Bernaards, Belin, and Schafer 2007; von Hippel 2009). The resulting imputed values will look implausible if inspected closely but that often has little effect on the analytic results. In fact, attempts to edit the imputed values to improve their plausibility can introduce bias by changing the variables’ means, variances, and covariances (Horton et al. 2003; Allison 2005; Bernaards et al. 2007; von Hippel 2009).
The point of imputation is not that the imputed values should look like observed values. The point is that the imputed variable should act like the observed variable when used in analysis.
This article considers methods for imputing a different class of nonnormal variables—namely, variables with skew. Like other nonnormal variables, skewed variables are often imputed as though they were conditionally normal. The word conditionally is important here; later, we will encounter situations where a variable is skewed and yet the residuals are approximately normal when the skewed variable is conditioned on other variables. In that situation, a conditionally normal imputation model may be perfectly specified even though the imputed variable is skewed. In many other situations, though, imputing a skewed variable from a normal model entails some degree of misspecification.
In calculations and simulations, we find that conditionally normal imputation of a skewed variable can often produce acceptable estimates if the quantities being estimated are means, variances, and regressions. The estimates do have biases under some circumstances but the biases are typically small. However, the biases grow much larger if we estimate quantities that depend more strongly on distributional shape—quantities such as percentiles or the coefficient of skewness.
To increase the skew of a normally imputed variable, popular references recommend modifications to the normal imputation model. Modifications include rounding (censoring) the imputed values, truncating the imputed values, imputing from a truncated regression model, or transforming the incomplete variable to better approximate normality. We evaluate all these methods as well as a new method that adds quadratic terms to the imputation model.
In univariate data, we find that such modifications can work well if very carefully applied. But in bivariate and trivariate data, we find that even careful modification of the normal imputation model is problematic. At best, modifications reduce the biases only a little; at worst, modifications make the biases much worse. We conclude that, if you have to impute a skewed variable as though it were normal, it is safest to do so without modification.
In the conclusion, we discuss nonnormal imputation models that are currently in development. Early evaluations of these methods look promising and we hope that they will soon become more widely available.
Our presentation proceeds in order of complexity, progressing from univariate to bivariate to multivariate data.
Univariate Data
Imputation is rarely useful in univariate data since, if univariate values are missing at random (MAR), the observed values are a random sample from the population and can be analyzed just as though the data were complete. Yet the simplicity of the univariate setting can help clarify the advantages and disadvantages of different imputation methods.
Suppose we have a simple random sample of n values from a nonnormal variable X with finite mean μ and finite variance σ². A randomly selected nmis of the X values are missing, so that nobs = n − nmis values are observed. We assume that values are MAR (Rubin 1976; Heitjan and Basu 1996), which in the univariate setting means that the nobs observed X values are a random sample from the n sampled cases and therefore a random sample from the population.
How well can normal imputation, with or without modification, impute this nonnormal variable?
Fully Normal (FN) Imputation
The first and simplest technique is fully normal (FN) imputation (Rubin and Schenker 1986), under which a nonnormal variable X is imputed as though it were normal.
Since the observed values are a random sample from the population, we can obtain consistent estimates
The consistency of the observed-value estimators
The PD estimators, like the MVU estimators, are consistent. However, the PD estimators are not particularly efficient, and
Small-sample properties of the MVU and PD estimators are discussed elsewhere (von Hippel 2012). We can avoid some detail by assuming that we have a very large sample where
Figure 1 illustrates the large-sample properties of FN imputation in a setting where the observed data are skewed. Here the observed values come from a standard exponential density

Figure 1. Distribution of a standard exponential observed variable and a normal imputed variable that has the same mean and variance.
Although the observed and imputed variables have the same mean and variance, they do not have the same distributional shape, and that implies that, in a mix of observed and imputed values, some quantities will be estimated with bias. For example, estimates of skewness will be biased toward zero since the imputed values have no skew. And the first sixteen percentiles, at least, will be negatively biased since 16 percent of the imputed distribution is negative while the observed distribution is strictly nonnegative. The lower panel of Figure 1 examines all the percentiles by comparing the cumulative distribution function (CDF) of the observed and imputed variables. Percentiles 24–89.5 are positively biased since the imputed CDF is right of the observed CDF, but all the other percentiles are negatively biased since the imputed CDF is left of the observed CDF. In some distributions, FN imputation of nonnormal variables causes more bias in the extreme percentiles than in percentiles near the median (Demirtas, Freels, and Yucel 2008) but in this skewed example the 50th percentile has substantial bias while the 90th percentile has almost none.
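The large-sample behavior described above can be checked with a small simulation. The sketch below is only an illustration under simplifying assumptions: the normal imputation parameters are set to the observed mean and standard deviation rather than drawn from a posterior, the sample sizes and seed are arbitrary, and the helper `skewness` is a plain moment-based coefficient of skewness.

```python
import random
from statistics import fmean, pstdev


def skewness(values):
    """Moment-based coefficient of skewness."""
    m = fmean(values)
    s = pstdev(values, mu=m)
    return fmean(((v - m) / s) ** 3 for v in values)


rng = random.Random(2013)       # arbitrary seed
n_obs = n_mis = 50_000          # arbitrary, large enough to approximate the limit

# Observed values: standard exponential (mean 1, variance 1, skewness 2)
observed = [rng.expovariate(1.0) for _ in range(n_obs)]

# Fully normal imputation: normal draws with the observed mean and SD
mu_hat, sd_hat = fmean(observed), pstdev(observed)
imputed = [rng.gauss(mu_hat, sd_hat) for _ in range(n_mis)]
completed = observed + imputed

share_negative = sum(v < 0 for v in imputed) / n_mis

print("mean, SD of completed data:", fmean(completed), pstdev(completed))
print("share of negative imputations:", share_negative)
print("skewness, observed vs. completed:", skewness(observed), skewness(completed))
```

In runs of this sketch the completed-data mean and standard deviation stay near 1, roughly 16 percent of the imputed values fall below zero, and the skewness of the completed data is about half that of the observed values.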
In sum, when nonnormal data are imputed under a normal model, estimates of the mean and variance will be consistent, but there can be considerable bias in estimating the shape and percentiles. Simulations confirm that FN imputation yields consistent estimates for
Imputing Within Bounds: Censoring and Truncation
What can make the observed and imputed distributions more similar? A popular approach is to bound normally imputed values within a plausible range. For example, in imputing a positively skewed variable like body weight, we might require the imputed values to be larger than the smallest observed body weight, or not so small as to be “biologically implausible” (Centers for Disease Control and Prevention 2011). Options for bounding imputed variables are available in most popular imputation software, including IVEware, the MI procedure in SAS 9.2, the mi impute command in Stata 12, and the Missing Values Analysis package in SPSS 16.0.
There are two general approaches to bounding: censoring and truncation. For example, in Figure 1, where we imputed a standard exponential variable as though it were normal, we could have used censoring or truncation to ensure that the normally imputed values were nonnegative. To do this, we would first generate normal imputations
Bounding imputed values changes the mean and variance of the imputed variable. After bounding, the mean of the imputed variable is higher than
Bias can occur if we fail to anticipate the effect of bounding on the mean and variance of the imputed variable. To return to our running example, if the observed variable is standard exponential
These biases can be avoided, or at least reduced, if we anticipate the effect of censoring on the mean and variance of the imputed variable. By inverting equations (2) and (3), we can choose
Similarly, if we want to get a similar mean and variance from a normal variable that is truncated on the left at c = 0, we can get close by choosing
Notice that, although the censored normal imputations have the same mean as the observed variable, the truncated normal imputations do not. In fact, with c = 0, it is impossible to get the mean and variance of a truncated normal variable to equal
We have just seen that, even in a simple univariate setting, a truncated normal model can have trouble matching the moments of a variable that is not in fact truncated normal. When the truncated normal model is used for this purpose, the parameter estimates can be sensitive or even infinite. Later, we will encounter similar problems when we use the truncated model in a multivariate setting.
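The effect of bounding on the first two moments can be made concrete with the standard closed-form moments of a left-censored and a left-truncated normal variable. The sketch below is an illustration only; the function names are hypothetical, and the formulas are the textbook expressions rather than a transcription of the article's own equations. It evaluates the naive case, where a normal variable with mean 1 and variance 1 is bounded at c = 0, and then shows why a normal variable truncated at zero struggles to match an exponential variable, whose standard deviation equals its mean.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal


def censored_moments(mu, sigma, c=0.0):
    """Mean and variance of max(X, c) for X ~ N(mu, sigma^2)."""
    a = (c - mu) / sigma
    p_below, p_above, dens = STD.cdf(a), STD.cdf(-a), STD.pdf(a)
    mean = c * p_below + mu * p_above + sigma * dens
    second = (c * c * p_below + mu * mu * p_above + 2 * mu * sigma * dens
              + sigma * sigma * (a * dens + p_above))
    return mean, second - mean * mean


def truncated_moments(mu, sigma, c=0.0):
    """Mean and variance of X given X > c for X ~ N(mu, sigma^2)."""
    a = (c - mu) / sigma
    lam = STD.pdf(a) / STD.cdf(-a)  # inverse Mills ratio
    return mu + sigma * lam, sigma * sigma * (1 + a * lam - lam * lam)


# Naive bounding: parameters match the exponential (mean 1, variance 1) before bounding
cens_mean, cens_var = censored_moments(1.0, 1.0)
trunc_mean, trunc_var = truncated_moments(1.0, 1.0)
print("censored at 0: ", cens_mean, cens_var)
print("truncated at 0:", trunc_mean, trunc_var)

# Ratio SD / mean of a normal truncated at 0, for several untruncated means
cv_by_mu = {}
for mu in (1.0, 0.0, -2.0, -5.0):
    m, v = truncated_moments(mu, 1.0)
    cv_by_mu[mu] = sqrt(v) / m
print(cv_by_mu)
```

In both naive cases the mean rises above 1 and the variance falls below 1. For the truncated variable the ratio of standard deviation to mean stays below 1 at every value tried and creeps toward 1 only as the untruncated mean becomes more negative, which is consistent with the remark that matching the exponential moments can push the parameter estimates toward extreme values.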
Figure 2 illustrates our attempts to match the distribution of a standard exponential variable with a truncated or censored normal variable. The figure illustrates a naïve approach, where the mean and variance of the imputed variable are unbiased before censoring or truncation, but biased after. The figure also illustrates less-biased approaches, where the effect of truncation or censoring is anticipated before the imputations are drawn. Notice that, if the effects of truncation are anticipated, the truncated normal variable achieves a close match to the exponential distribution (lower left)—notwithstanding some bias in the mean.

Figure 2. Biased and less-biased ways of imputing a censored or truncated normal variable to match the distribution of an observed exponential variable.
Transformation
Instead of bounding imputed X values, some experts recommend transforming an incomplete skewed X variable to better approximate normality (Schafer and Olsen 1998:550; Schafer and Graham 2002:167; Allison 2002:39; Raghunathan et al. 2001:82-83). Following this advice, we would subject the observed values
In the univariate setting, transformation works well if the transformed variable is in fact normal or close to it. For example, if we apply a log transformation to a lognormal variable, the transformed variable is exactly normal.
But transformation can yield substantial bias if the transformed variable is not close to normal. To return to our earlier example, if we sample an exponential variable
The challenge of transformation is that skew-reducing transformations are nonlinear, and nonlinear transformations do complicated things to the mean and variance. Although on the transformed scale the observed variable
The transformation method can yield better results if the transformation is carefully chosen. In imputing an exponential variable
Figure 3 illustrates two transformations that do a better job of normalizing an exponential variable

Figure 3. Normal imputation of an observed exponential variable under square-root and fourth-root transformation.
Both the square-root and fourth-root transformations yield an imputed variable whose mean is unbiased, or nearly unbiased, when compared to the mean of an exponential variable. The variance of the imputed variable is biased, though less biased for the fourth root than for the square root.
More specifically, if we subject an observed standard exponential variable
We get better results with a fourth-root transformation, although there is still a small bias. If the observed variable
In short, in univariate data, the use of normalizing transformation to impute skewed variables can sometimes yield good results if the transformation is carefully chosen. However, we still cannot recommend transformation since it has disadvantages in bivariate settings, which we will discuss next.
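The comparison between the square-root and fourth-root transformations can be worked out in closed form under an idealized setup. The sketch below assumes a very large sample, so that the normal imputation model on the transformed scale has exactly the mean and variance of the transformed standard exponential variable, and it assumes that every imputed value is back-transformed by raising it to the corresponding power, including negative draws (which, as noted elsewhere in the article, SAS declines to inverse-transform). The function names are hypothetical.

```python
from math import comb, gamma


def normal_raw_moment(mu, var, k):
    """E[W**k] for W ~ N(mu, var), via the binomial expansion of (mu + sigma*Z)**k."""
    total = 0.0
    for j in range(0, k + 1, 2):          # odd powers of Z have zero expectation
        odd_product = 1
        for i in range(1, j, 2):
            odd_product *= i              # E[Z**j] = (j - 1)!! = 1 * 3 * ... * (j - 1)
        total += comb(k, j) * mu ** (k - j) * var ** (j // 2) * odd_product
    return total


def back_transformed_moments(root):
    """Mean and variance of W**root, where W is normal with the mean and
    variance of X**(1/root) and X is standard exponential."""
    mu_w = gamma(1 + 1 / root)                    # E[X**p] = Gamma(1 + p) for Exp(1)
    var_w = gamma(1 + 2 / root) - mu_w ** 2
    mean_x = normal_raw_moment(mu_w, var_w, root)
    var_x = normal_raw_moment(mu_w, var_w, 2 * root) - mean_x ** 2
    return mean_x, var_x


mean_sqrt, var_sqrt = back_transformed_moments(2)
mean_fourth, var_fourth = back_transformed_moments(4)
print("square root:", mean_sqrt, var_sqrt)
print("fourth root:", mean_fourth, var_fourth)
```

Under these assumptions the square-root route reproduces the mean of 1 exactly but understates the variance, while the fourth-root route leaves a small positive bias in the mean and a smaller bias in the variance, in line with the qualitative pattern described above.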
Bivariate Data
Bivariate data present new challenges for the imputation of skewed variables. In bivariate data, the imputation model should preserve not just the marginal distribution of the skewed variable; the imputation must also preserve the relationship between the skewed variable and other variables. This can be difficult.
To set the stage, suppose that we have, again, a standard exponential variable

Figure 4. X is standard exponential and Y is conditionally normal. The left panel shows a scatterplot for the regression of Y on X, with an OLS line. The right panel shows a scatterplot for the regression of X on Y, with an OLS line and an OLS quadratic.
In evaluating imputation methods, it will be important to realize that the regression of Y on X satisfies the OLS assumptions, but the regression of X on Y does not. In particular, the regression of X on Y violates the OLS assumption that the conditional expectation of the residuals
In fact, the regression of X on Y is not even linear. When X is nonnormal, a linear regression of Y on X does not necessarily imply a linear regression of X on Y. As Figure 4 shows, the fit of the regression of X on Y can be improved, though not perfected, by adding a quadratic term
Although the OLS estimates are not perfect fits for the regression of X on Y, over most of the range of X the linear fit is a decent approximation, and the quadratic fit is very good. We will find that OLS estimates, though approximate, can often be serviceable when we impute X conditionally on Y.
Having looked at complete data, we now suppose that some values are MAR. In the bivariate setting, the MAR assumption means the probability that a value is missing depends only on values that are observed. That is, the probability that Y is missing cannot depend on Y, but can depend on X in cases where X is observed. Likewise, the probability that X is missing can depend on observed Y values in cases where Y is observed (Rubin 1976; Heitjan and Basu 1996).
We can impute missing values using several different methods.
Linear Regression Imputation
The simplest parametric imputation model assumes that the missing values fit a linear regression with normal residuals. We call this linear regression imputation.
To understand the technique, it is helpful to start with a situation where the assumptions of the imputation model are met. Imagine that X is complete and Y is MAR. Then the missing values fit the linear regression in equation (4), and the MAR assumption means that the probability Y is missing is independent of
Like the OLS estimates, the PD estimates are consistent, and if Y is regressed on X in the imputed data, the resulting regression estimates are consistent as well, although
In sum, if Y is the incomplete variable, linear regression imputation yields consistent estimates.
If X is the incomplete variable, however, linear regression imputation can be inconsistent. With X incomplete, linear regression imputation assumes that X fits the following equation:
Despite its inconsistency for incomplete X, linear regression imputation is a convenient approximation, and we will find out later that its biases are fairly small.
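A stripped-down version of linear regression imputation for an incomplete X can be sketched as follows. Several things here are assumptions made for illustration and are not taken from the article: the generating model Y = 1 + X + e with standard normal e, the sample size, the seed, and the MCAR deletion of half the X values. The sketch also plugs in the OLS estimates directly instead of drawing the parameters from their posterior distribution, so it is a single deterministic-parameter imputation rather than full MI.

```python
import random
from math import sqrt
from statistics import fmean


def ols(predictor, outcome):
    """Simple OLS: returns intercept, slope, and residual variance."""
    mp, mo = fmean(predictor), fmean(outcome)
    sxy = sum((p - mp) * (o - mo) for p, o in zip(predictor, outcome))
    sxx = sum((p - mp) ** 2 for p in predictor)
    slope = sxy / sxx
    intercept = mo - slope * mp
    rss = sum((o - intercept - slope * p) ** 2 for p, o in zip(predictor, outcome))
    return intercept, slope, rss / (len(predictor) - 2)


rng = random.Random(7)  # arbitrary seed
n = 20_000

# Hypothetical generating model: X standard exponential, Y = 1 + X + normal error
x = [rng.expovariate(1.0) for _ in range(n)]
y = [1.0 + 1.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Delete half the X values completely at random
missing = [rng.random() < 0.5 for _ in range(n)]
x_obs = [xi for xi, m in zip(x, missing) if not m]
y_obs = [yi for yi, m in zip(y, missing) if not m]

# Imputation model: regress X on Y in the complete cases, assume normal residuals
a, b, s2 = ols(y_obs, x_obs)
x_completed = [a + b * yi + rng.gauss(0.0, sqrt(s2)) if m else xi
               for xi, yi, m in zip(x, y, missing)]

# Analysis model: regress Y on X in the completed data
_, slope_completed, _ = ols(x_completed, y)
print("slope of Y on X in completed data:", slope_completed)
```

In this MCAR setup the completed-data slope lands close to the generating slope of 1, even though some imputed X values are negative. The sketch does not cover the MAR deletion patterns examined later in the simulation experiment.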
Quadratic Regression Imputation
Quadratic regression imputation is an attempt to improve on linear regression imputation by adding a squared term
Imputing Within Bounds: Censoring, Truncation, and Truncated Regression
As in the univariate case, in the bivariate case, we can increase the skew of a normally imputed X by keeping the imputed values within bounds.
The simplest way to bound imputed values is censoring. We impute X values under a linear regression imputation model, and then round (censor) out-of-bounds X values up to the boundary value c. In our running example, where we impute an exponential variable as though it were conditionally normal, we would round negative imputed X values up to zero, which is the lower bound of the exponential distribution.
An alternative approach is truncation. Under a simple truncation algorithm, we impute X values under a linear regression imputation model, then reject any out-of-bounds (e.g., negative) X values and reimpute them until an in-bounds (e.g., positive) X value randomly occurs. This is the approach used by the MI procedure in SAS 9.2, which will terminate if the wait for an in-bounds value is too long. A faster approach is to impute from a truncated distribution using sampling importance resampling (Robert 1995; Raghunathan et al. 2001).
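The two bounding rules just described can be sketched in a few lines. This is a generic illustration, not the code used by SAS or any other package; the function names are hypothetical, the cap of 100 redraws simply mirrors the iteration limit reported later in the article, and the conditional mean of -0.5 with unit standard deviation is an arbitrary example of a case where many normal draws fall below the bound.

```python
import random


def censor(value, lower=0.0):
    """Round an out-of-bounds imputed value up to the boundary."""
    return max(value, lower)


def truncate_by_rejection(draw, lower=0.0, max_tries=100):
    """Redraw until an in-bounds value occurs; give up after max_tries draws."""
    for _ in range(max_tries):
        value = draw()
        if value >= lower:
            return value
    raise RuntimeError("no in-bounds value after %d draws" % max_tries)


rng = random.Random(11)  # arbitrary seed


def draw_imputation():
    # Hypothetical conditional distribution for one case: N(-0.5, 1)
    return rng.gauss(-0.5, 1.0)


censored = [censor(draw_imputation()) for _ in range(1_000)]
truncated = [truncate_by_rejection(draw_imputation) for _ in range(1_000)]

print("share piled up at the bound (censoring):", censored.count(0.0) / len(censored))
print("smallest truncated imputation:", min(truncated))
```

Censoring piles the out-of-bounds draws onto the boundary value itself, whereas rejection spreads them over the in-bounds region, so the two rules shift the mean and variance of the imputed values by different amounts.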
As in the univariate setting, in bivariate data censoring and truncation can improve the shape of the imputed distribution, but censoring and truncation also bias the mean and variance of X. And in the bivariate setting, censoring and truncation do not solve the problem that X has been imputed from a misspecified regression model whose parameters are inconsistently estimated. So bounding linear regression imputations has several sources of bias: bias from bounding imputed values, bias from a misspecified regression model, and bias from inconsistent parameter estimates. If we are lucky, these biases will offset each other; if we are unlucky, they will reinforce one another.
A more sophisticated approach is to impute values with a truncated regression model (Goldberger 1981), which is implemented by IVEware’s BOUNDS statement and by Stata’s mi impute truncreg command (Raghunathan, Solenberger, and Van Hoewyk 2002; Stata Corp. 2011). Under truncated regression imputation, we do not just truncate the imputed X values. We also estimate the regression parameters under the assumption that the observed X values come from a conditionally normal distribution that has been truncated in the same way. This is akin to the univariate approach that we discussed earlier, where the effect of truncation was anticipated in estimating the model parameters.
Like the normal regression model, the truncated regression model makes an assumption about the conditional distribution of X, and the assumption is not correct for many skewed variables. In Figure 4, for example, X follows an exponential distribution, not a truncated conditional normal distribution. The truncated regression model is only an approximation, and the approximation is not necessarily better than the approximation offered by a model without truncation. In fact, our discussion of univariate data showed that, when the assumptions of a truncated model are violated, the model parameters can be infinite and estimates can be very sensitive. Later, a simulation will show that similar problems can occur in bivariate data.
Transformation
As in the univariate setting, in the bivariate setting it is commonly recommended that skewed incomplete variables be transformed to approximate normality before imputation (Schafer and Olsen 1998:550; Raghunathan et al. 2001:82-83; Allison 2002:39; Schafer and Graham 2002:167). But transformation of incomplete variables is often unnecessary and can lead to bias.
Again, it is helpful to return to a simple example where X is standard exponential and Y is conditionally normal (see equation (4)). What should we do if Y is MAR? Y has some skew which it inherits from X, so we might be tempted to transform Y to a better approximation of normality. Yet transformation could only hurt the imputations, since the optimal imputation model is a linear regression that imputes Y on its original scale. Linear regression imputation is appropriate for Y because, although the marginal distribution of Y is skewed, the X−Y relationship is linear and the conditional distribution of Y is normal—that is, the residual
So transformation can only hurt the imputation of Y. Transformation can also hurt the imputation of X, because transformation introduces curvature into the relationship between X and Y. In Figure 4, for example, we have an exponentially distributed X, which, as we saw earlier, can be transformed to a very good approximation of normality by a fourth root transformation
As an alternative to transforming X alone, we can transform all the variables in the imputation model. For example, if X is exponential, we can transform X to
Simulation Experiment
In this section, we carry out a simulation experiment to test the bias and efficiency of different methods for imputing a skewed X variable in a bivariate (X, Y) data set. We then extend the simulation to accommodate a third variable Z.
Bivariate Design
In each simulated bivariate data set, Y fits a normal linear regression on X:
The experiment independently manipulates four factors:
The first factor is the distribution of X. We let
The second factor is the strength of the X–Y relationship as measured by the regression’s coefficient of determination
The third factor is the location of the missing X values. From each complete data set, we delete half the X values in three MAR patterns:
To let values be missing completely at random (MCAR), we delete X values with a constant probability of ½.
To favor deletion of large X values (that is, values in the tail of the X distribution), we delete X values with probability
To favor deletion of small X values (that is, values near the peak of the X distribution), we delete X values with probability
The fourth factor is the method of imputation. We use the seven different methods described earlier:
Linear regression imputation where X is regressed on Y.
Linear regression imputation with imputed X values censored below
Linear regression imputation with imputed X values truncated below
Quadratic regression imputation, where X is regressed on Y and
The transform X method: linear regression imputation where
The transform all method: linear regression imputation where
Truncated regression imputation, with the lower truncation point at
For each imputation method, we imputed each incomplete data set five times. In each of the five imputed data sets, we calculated the mean, standard deviation, and skew of X, and we fit an OLS regression of Y on X. We then averaged estimates from the five imputed data sets to yield a single set of MI estimates for each incomplete data set and imputation method.
Note that we do not vary the fraction of missing X values, since it is clear that changing the fraction of missing values would not change the relative performance of the imputation methods. Deleting a larger or smaller fraction of values would just make the differences between the imputation methods more or less consequential.
All the imputation methods used the MI procedure in SAS 9.2, except for the truncated regression method, which used the mi impute truncreg command in Stata 12. Both SAS and Stata displayed some technical limitations. These limitations are not central to the results, but are worth describing briefly, as follows:
In truncating normal imputations, SAS’s MI procedure uses a simple rejection algorithm, rejecting negative values and reimputing them iteratively until a positive value randomly occurs. The wait for a positive imputation can be long, and in about 8 percent of data sets SAS terminated the attempt after 100 iterations. We reimputed these data sets with a lower truncation point of
In transforming imputed variables, SAS’s MI procedure refuses to inverse-transform any values that could not have been obtained by transformation. In imputing X, for example, we transformed the observed X values to
Stata’s implementation of truncated regression imputation is quite slow, taking about twenty-four hours to impute all the experimental data sets (SAS imputed them in minutes). It is not clear whether this is an inherent problem with truncated regression or a problem in Stata’s implementation of it. In addition to being slow, the truncated regression model failed to converge in about 1 percent of incomplete data sets—typically data sets with low
Illustrative Results
Figure 5 uses all the imputation methods to impute a simulated data set with

Figure 5. Different imputation methods when the incomplete variable X is standard exponential with values missing completely at random, and the complete variable Y is conditionally normal.
Clearly, some methods handle these data better than others. Under linear regression imputation, the imputed values, although more symmetrically distributed than the observed values, nevertheless fit well around the true regression line. Censoring or truncating the linear regression imputations does not change the fit very much; censoring and truncating simply move the leftmost values a little to the right, and this has little influence since the points with the most influence on the regression line are on the far right, in the tail of the X distribution. Quadratic regression imputation produces fairly similar results, although the imputed X values are a little more dispersed than they are under linear regression imputation.
The transform X method and the transform all method have serious trouble with these data, because transformation curves the relationship between X and Y. In the tail, which has the most influence on the regression, all the imputed points lie below the true regression line. These points will negatively bias the estimated slope.
The truncated regression has even worse problems, with nearly all the imputed points in influential positions and below the true regression line. Evidently, the truncated regression model can be a poor fit to observed data that are not actually truncated, and very unrealistic parameter estimates can result. Recall that the truncated model can produce infinite parameter values even in a simple univariate setting.
Figure 6 gives even more serious examples of bad imputations obtained from the truncated regression model. In one example, the imputed X values have a negative relationship to Y, even though the relationship in the observed data is clearly positive. Some imputed X values have values of 200 or greater, even though all the observed X values are less than 11. In one example, nearly all the imputed values are negative, even though the purpose of truncated regression here is to produce positive imputations. The presence of negative imputations, though extremely rare, suggests implementation problems that go beyond the basic difficulty of fitting a truncated model to nontruncated data.

Figure 6. Anomalous imputed data sets generated by truncated regression imputation in Stata 12. Note: Main effect of imputation method, averaged across other factors.
Comprehensive Results
The full bivariate simulation experiment yielded 36,000 sets of MI estimates—100 estimates for each of 360 different experimental conditions (4
Table 1 summarizes the relative bias and relative RMSE of each imputation method, averaged across all the experimental conditions. Results are shown for seven different estimands. The first three estimands describe the marginal distribution of X in terms of the mean, standard deviation, and coefficient of skewness. The remaining four estimands describe the regression of Y on X in terms of the intercept, slope, residual standard deviation, and coefficient of determination.
Table 1. Results of Two-Variable Simulation: Main effect of imputation method.
Note: RMSE= root mean square error.
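Relative bias and relative RMSE can be computed from a set of simulated estimates as sketched below. The definitions used here are the common ones (bias and RMSE divided by the true value) and are an assumption, since the exact formulas are not spelled out in this section; the function names and the four example estimates are hypothetical.

```python
from math import sqrt
from statistics import fmean


def relative_bias(estimates, truth):
    """Average error as a fraction of the true value."""
    return fmean(e - truth for e in estimates) / truth


def relative_rmse(estimates, truth):
    """Root mean square error as a fraction of the true value's magnitude."""
    return sqrt(fmean((e - truth) ** 2 for e in estimates)) / abs(truth)


# Hypothetical MI estimates of a parameter whose true value is 1.0
example_estimates = [0.9, 1.1, 1.0, 0.8]
example_bias = relative_bias(example_estimates, 1.0)
example_rmse = relative_rmse(example_estimates, 1.0)
print(example_bias, example_rmse)
```

Because RMSE combines bias and sampling variability, a method can have small relative bias and still show a large relative RMSE when its estimates are highly variable.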
For all but one of the estimands, linear regression imputation yields estimates whose average relative bias is fairly small, often close to zero. Censoring or truncating the linear regression imputations yields similar results, with a little more bias for some estimands and a little less for others. All the other methods have more serious biases. The transformation methods have the worst biases for the regression slope and intercept, and the truncated regression method has the worst biases for the mean and standard deviation of X. The biases of the truncated regression method can be enormous, exceeding 400,000 percent for some parameters, because the method occasionally imputes wild outliers like those in Figure 6.
In short, for the estimands that are emphasized in most social research—means, standard deviations, and regression parameters—linear regression imputation, with or without censoring or truncation, can often produce reasonable results. The other imputation methods are no better for these estimands, and sometimes much worse.
This is not to say that linear regression imputation is good for every estimand—it is not. In fact, linear regression imputation does a very poor job of estimating parameters that reflect distributional shape. In estimating the coefficient of skewness, for example, linear regression imputation had an average relative bias of –52 percent—that is, the estimated skew was less than half the true skew, on average. With one exception, the other methods do not estimate skew very well either; they are less biased than linear regression imputation but still have positive or negative biases exceeding 20 percent. The one method that estimates skew with little bias is the transform all method, but we cannot recommend that method because it has serious biases in estimating the regression parameters.
The online Appendix Table A1 (which can be found at http://smr.sagepub.com/supplemental/) summarizes the simulation results in more detail, breaking them down across different levels of the manipulated factors. The most striking result is that the differences among the methods are largest when missing values are in the tail. This makes sense because the tail values have the most influence in estimating the regression line. Another striking result is that the biases of the transformation methods are worst when X is highly skewed, which is exactly when one would be most tempted to use transformation.
The biases of the different methods vary from one simulated condition to another, and there are circumstances where transformation or quadratic regression or truncated regression gives the best results. It is tempting to imagine that you could obtain good results by picking and choosing different methods to suit different circumstances, but that is a dangerous game with only small benefits for guessing right, and large penalties for guessing wrong.
The safest approach is to use a method that tends to have small biases in a wide variety of settings, and by that criterion the best choice for bivariate data is linear regression imputation, with or without truncation or censoring.
Trivariate Extension
We now extend the simulation to a regression where there are three variables: a complete dependent variable Z and two independent variables, one complete (Y), and one incomplete (X). To do this, we simply keep the X and Y variables from the two-variable simulation, and add a Z variable that fits a linear regression on X and Y, with normal residuals:
The first factor is the distribution of X. Again
The second factor is the squared correlation
The third factor is the pattern of missing X values. These patterns are defined in exactly the same way as in the bivariate setting. In the MCAR scenario, X values are deleted with probability ½; in the MAR tail pattern, X values are deleted with probability
The fourth factor is the imputation method. The methods are the same as in the bivariate simulation, except that X is imputed conditionally on Z as well as Y. In the transform all method, X is transformed to
To avoid further complicating the experiment, we hold constant certain parameters of the regression of Z on X and Y. The constant parameters are the slopes and intercept, which are held constant at
Table 2 summarizes the relative bias and relative RMSE of each imputation method for estimands relating to the marginal distribution of X and the regression of Z on X and Y. Table 2 presents average results; in the online appendix (which can be found at http://smr.sagepub.com/supplemental/) Table A2 breaks the results down by each manipulated factor.
Table 2. Results of Three-Variable Simulation: Main Effect of Imputation Method.
Note: RMSE = root mean square error.
The basic results of the three-variable simulation are similar to those of the two-variable simulation.
In estimating the regression parameters, and in estimating the mean and standard deviation of X, linear regression imputation gives perhaps the best results overall, with relative biases of 5 percent or less. Quadratic regression imputation gives similar results, as does censoring or truncating the linear regression imputations. The transformation methods give more biased regression parameters, and the truncated regression method occasionally imputes very large outliers and so gives seriously biased estimates of the mean and standard deviation.
In estimating the skew, all the methods give highly biased results except for the transform all method, which cannot be recommended since it gives the most biased estimates for the regression parameters. Again, the differences among the imputation methods are much more consequential if values are imputed in the tail rather than the peak of the X distribution (see Table A2b, which can be found in the online Appendix at http://smr.sagepub.com/supplemental/).
In the three-variable simulation, the slope of Z on Y has bias, which is opposite in direction to the bias of the slope of Z on X. Bias in the slope of Z on Y occurs despite the fact that neither Z nor Y has any imputed values. The reason for this bias is that Y is correlated with X, so any bias in the slope of X engenders a compensating bias in the slope of Y. Table A2b in the online appendix (which can be found at http://smr.sagepub.com/supplemental/) confirms that the bias in the slope of Y is greatest when the squared correlation
An initially surprising result of the three-variable simulation is that the regression slopes estimated by the transformation methods, despite having the most bias, sometimes have smaller RMSE than the regression slopes estimated by other methods. The presumed reason for this is that the transformation methods, because they use misspecified imputation models, reduce the correlation between Y and the imputed X, and this reduced correlation yields smaller standard errors when X and Y are used to predict Z. Under some circumstances, the reduction in standard error can more than make up for the increase in bias.
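The trade-off between bias and standard error can be made concrete with the usual decomposition of squared RMSE into squared bias plus sampling variance. The numbers for "method A" and "method B" below are hypothetical, chosen only to show the arithmetic; they are not taken from Table 2 or the online appendix. The last line restates the standard result that, other things equal, the sampling variance of a slope in a two-predictor regression grows as the squared correlation between the predictors approaches 1.

```latex
\begin{align*}
\mathrm{RMSE}(\hat\beta)^2 &= \mathrm{Bias}(\hat\beta)^2 + \mathrm{SE}(\hat\beta)^2 \\[4pt]
% Hypothetical method A: small bias, larger standard error
\text{A: } \sqrt{.02^2 + .10^2} &= \sqrt{.0104} \approx .102 \\
% Hypothetical method B: three times the bias, smaller standard error
\text{B: } \sqrt{.06^2 + .07^2} &= \sqrt{.0085} \approx .092 \\[4pt]
\mathrm{SE}(\hat\beta_X)^2 &\propto \frac{1}{1 - r_{XY}^2}
\end{align*}
```

In this hypothetical case method B has three times the bias of method A yet a smaller RMSE, which is the pattern described above.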
Is the potential for reduction in RMSE a reason to use the transformation methods? No. First, reduced RMSE in the regression slope is limited to the simulations where the correlation between X and Y is very, very high. Table A2b in the online appendix (which can be found at http://smr.sagepub.com/supplemental/) shows reduced RMSE in the simulation where
Applied Examples
Authors often use incomplete skewed variables in applied research. In this section, we discuss two examples and evaluate the decisions that were made by the authors.
Analysis of Female Legislative Candidates
In a cross-national study, Kunovich and Paxton (2005) pointed out a strong bivariate relationship between two percentages that vary across the world’s n = 171 countries: the percentage of parliamentary legislators who are female (Y) and the percentage of parliamentary candidates who were female (X). The regression of Y on X is important because a slope of less than one suggests that female candidates lose more often than they win. Y was complete and skewed to the right, with a coefficient of skewness of 1.4. The skew of X is similar but harder to estimate from the observed data, since X was missing for more than half (ninety-nine) of the countries.
Following my advice, the authors imputed missing X values using the transform X method. It now appears, in light of results in this article, that my advice was misguided and could have caused substantial bias if missing X values had been primarily in the tail. Fortunately, X was missing primarily in the peak, and in this situation—with a skew of 1.4, values missing from the peak, and
Reanalyzing the authors’ data without control variables, I found that the estimated slope of Y on X was .60 with the authors’ transformation, and .55 without transformation. My simulations suggest that the authors’ estimate is closer to the truth, but the difference is not large either way.
Analysis of Body Mass Index (BMI)
In a longitudinal study of n = 358 children, von Hippel, Nahhas, and Czerwinski (2012) estimated percentiles for change in BMI from age 3½ to age 18 years. BMI was measured every six months, but some measurements were missing. Our longitudinal imputation model was more complicated than the models considered in this article, but the model still assumed that the incomplete variables were conditionally normal. Skew in BMI was an important challenge, especially since the skew of the BMI distribution increases as children grow older.
As one of the authors, I knew that linear regression imputation could estimate the conditional mean of a skewed variable with little bias, but this was not reassuring since in this study we sought to estimate the percentiles instead of the mean. With or without transformation, it proved difficult to find an imputation model that gave plausible estimates for extreme tail percentiles such as the 90th or 95th.
In the end, we sidestepped the issue by imputing BMI increments—that is, changes in BMI from one measurement occasion to the next. Although BMI is skewed, the distribution of BMI increments is approximately symmetric and has only slightly more kurtosis than a normal distribution.
Conclusion
On the whole, the simulation results suggest that when an incomplete variable has skew, linear regression imputation often gives reasonable, though not unbiased, estimates for the quantities most commonly estimated in social science, namely means, standard deviations, and linear regressions (see also Demirtas et al. 2008). Ad hoc modifications of linear regression imputation (censoring, truncation, or transformation) rarely do much to improve the estimates and can in fact make them much worse.
Although the normal regression method is fairly good for estimating means, variances, and regressions, it can do a poor job of estimating quantities that depend on distributional shape. Such quantities include the coefficient of skewness, the percentiles, and quantities based on percentiles such as the Gini coefficient.
Although our discussion has been confined to normal imputation, we note that similar biases would be expected if estimates were obtained, without imputation, by applying maximum likelihood to the incomplete data under an assumption of normality. Maximum likelihood can be asymptotically biased when its distributional assumptions are violated (Yuan, Wallentin, and Bentler, in press). For example, in the Bivariate Data section, we discussed a situation where OLS estimates for the regression of X on Y (which are maximum likelihood estimates if X is conditionally normal) are biased for a skewed X.
To improve estimation from incomplete skewed variables, it would be helpful to have imputation methods that do not assume normality. One option is the distribution-free approach of imputing missing values by resampling values from similar cases. Variants of this idea are called hot-deck imputation, the approximate Bayesian bootstrap, or predictive mean matching (for a review, see Andridge and Little 2010). For example, in bivariate data (X, Y) with Y complete and X MAR, one would fill in missing X values with observed X values resampled from cases with similar values of Y.
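The bivariate resampling idea can be sketched as follows. This is a minimal nearest-donor hot deck written for illustration, not any particular published algorithm: the function name `impute_nearest_donors`, the choice of k = 5 donors, and the simulated data are all assumptions. With a single predictor, matching on Y gives the same donor ordering as matching on a linear predicted mean, so the sketch is close in spirit to predictive mean matching, although it omits the parameter draws that a full implementation would include.

```python
import numpy as np

def impute_nearest_donors(x, y, missing, k=5, rng=None):
    """Fill each missing x with an observed x drawn at random from the
    k donors whose y values are closest to the recipient's y."""
    rng = np.random.default_rng() if rng is None else rng
    donors_x, donors_y = x[~missing], y[~missing]
    x_imp = x.copy()
    for i in np.flatnonzero(missing):
        nearest = np.argsort(np.abs(donors_y - y[i]))[:k]
        x_imp[i] = donors_x[rng.choice(nearest)]
    return x_imp

# Hypothetical data: skewed X, Y linear in X, X missing completely at random.
rng = np.random.default_rng(2)
n = 2000
x_true = rng.exponential(1.0, n)
y = x_true + rng.normal(0.0, 1.0, n)
missing = rng.random(n) < 0.5
x_obs = np.where(missing, np.nan, x_true)

x_imp = impute_nearest_donors(x_obs, y, missing, k=5, rng=rng)
```

Because every imputed value is an actual observed value of X, the imputations cannot fall outside the observed range, and no distributional shape is imposed on them.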
While imputation by resampling is initially attractive, it can work poorly when observed values are sparse in one part of the distribution. For example, suppose that X is missing if and only if Y < 0. If we wish to impute the missing X values in cases with Y < 0, we have to resample X values from cases with Y ≥ 0—and this inevitably leads to bias. (See Allison 2000 for a similar but slightly more complicated example.) To take a less-artificial example from applied research, in our study of BMI growth among children, the very heaviest children had a number of missing measurements. These measurements could not be imputed by resampling, unless we were willing to resample BMIs from much lighter children.
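The artificial example above, with X missing if and only if Y < 0, can be simulated directly. In the sketch below the standard normal Y, the unit slope of X on Y, the residual standard deviation of 0.5, and the choice of k = 20 donors are assumptions made for illustration. The true mean of the missing X values is available only because the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
y = rng.normal(0.0, 1.0, n)
x = y + rng.normal(0.0, 0.5, n)       # assumed: X rises linearly with Y
missing = y < 0                        # X is missing if and only if Y < 0

# Every recipient has Y < 0, so its closest donors are always the same
# handful of observed cases with the smallest nonnegative Y.
k = 20
donor_idx = np.flatnonzero(~missing)
closest = donor_idx[np.argsort(y[donor_idx])[:k]]

x_hot = x.copy()
x_hot[missing] = rng.choice(x[closest], size=missing.sum())

true_mean = x[missing].mean()          # known only because data are simulated
imputed_mean = x_hot[missing].mean()
print(f"true mean of missing X: {true_mean:.2f}")
print(f"mean of resampled X:    {imputed_mean:.2f}")
```

Under these assumptions all recipients draw from the same few donors near Y = 0, so the resampled values should sit well above the true missing values, which lie further into the lower tail.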
A perhaps more promising idea is to model and impute nonnormal variables using flexible nonnormal distributions that can take a variety of shapes—such as Tukey’s gh distribution (He and Raghunathan 2006), the Weibull or beta density (Demirtas and Hedeker 2008), the generalized lambda distribution, or Fleishman power polynomials (Demirtas 2010). These approaches are currently in the development stage. Initial evaluations suggest that they can mimic very well the shape of many nonnormal distributions and that they preserve the relationships among variables at least as well as normal imputation methods (He and Raghunathan 2006, 2009, 2012; Bondarenko and Raghunathan 2007; Demirtas 2010; Demirtas and Hedeker 2008). We look forward to these flexible methods being available in software so that they can be used in applied research and evaluated further.
Acknowledgments
I thank Ray Koopman for help with the calculations in the subsection on Imputing Within Bounds: Censoring and Truncation. I also thank Pamela Paxton and Sheri Kunovich for sharing the data in the subsection on Analysis of Female Legislative Candidates.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
