Abstract
Missing observations are pervasive throughout empirical research, especially in the social sciences. Despite multiple approaches to dealing adequately with missing data, many scholars still fail to address this vital issue. In this article, we present a simple-to-use method for generating multiple imputations (MIs) using a Gaussian copula. The Gaussian copula for MI allows scholars to attain estimation results that have good coverage and small bias. The use of copulas to model the dependence among variables will enable researchers to construct valid joint distributions of the data, even without knowledge of the actual underlying marginal distributions. MIs are then generated by drawing observations from the resulting posterior joint distribution and replacing the missing values. Using simulated and observational data from published social science research, we compare imputation via Gaussian copulas with two other widely used imputation methods: multiple imputation via chained equations and Amelia II. Our results suggest that the Gaussian copula approach has a slightly smaller bias, higher coverage rates, and narrower confidence intervals compared to the other methods. This is especially true when the variables with missing data are not normally distributed. These results, combined with theoretical guarantees and ease of use, suggest that the approach examined provides an attractive alternative for applied researchers undertaking MIs.
Missing data problems are ubiquitous in observational data and common among social science applications. Statistical inference that does not adequately account for the missing data is widely known to lead to biased results and inflated (or deflated) variance estimates (King et al. 2001; Molenberghs et al. 2014; Rubin 1976; White and Carlin 2010). Even though most statistical software platforms provide methods that adequately handle missing data (the most popular of these is multiple imputations [MI]), they are often ignored by applied researchers. 1
In Figure 1, we illustrate the number of articles published in five top Sociology and Political Science journals since 1990 that contain “multiple imputations” in the body of the article. 2 Our survey of the literature shows the rapid growth of the use of MIs in the social sciences. Nevertheless, as missing data is a feature of almost any observational data set, the annual counts of articles mentioning MIs per year still point to significant underutilization of this method in the social sciences. This may be due to a lack of understanding of the benefits (and assumptions) of common imputation methods.

Number of references to “multiple imputation” in articles from five top Sociology and Political Science journals since 1990.
This article has two aims. First, we introduce applied researchers in the social sciences to a specific copula method for imputation and discuss its advantages over other methods. The method discussed is easy to implement using the sbgcop package (Holf 2010) in
Copulas are often used for the estimation of dependency between variables and are particularly useful in the generation of imputations as they allow for the construction of valid joint distributions of the data, even if the researcher has little knowledge about the actual joint distribution. Given the joint distribution of the data, we can generate imputations by sampling from the conditional distribution of the missing data given the observed data. We highlight a semiparametric Gaussian copula approach to missing data imputation. The Gaussian copula is one particular way of constructing a joint distribution from which missing values can be easily drawn. The method was initially developed by Hoff (2007) to estimate empirical models on multivariate data.
In particular, the Gaussian copula defines the dependence among the distributions of a set of variables which may contain missing values. These variables can include normal, ordinal, and binary variables. Rather than using the distributions themselves, a rank likelihood approximation is used. As a result, the technique does not require the specification of marginal or conditional distributions. This is in stark contrast to other imputation methods using copulas that either require knowledge of the marginals or correlation structure (Käärik 2006, Käärik and Käärik 2009; Robbins, Ghosh, and Habiger 2013) or target different copula parameters via pseudolikelihood methods (Di Lascio, Giannerini, and Reale 2015). The proposed approach allows applied researchers to undertake imputations of their data without relying on prespecification or ad hoc decisions.
The potential use of copulas for MI applications has not been thoroughly discussed within the social sciences. The copula methods we describe are easy to use and are more likely to provide a good representation of the joint distribution of the data than existing methods. Moreover, provided the Markov chain Monte Carlo (MCMC) converges, the output from the copula model represents a valid posterior density. Simply put, this means that we have theoretical guarantees about the posterior distribution from which the imputations are generated that other methods can not provide. Based on an extensive simulation exercise, we show that the method presented here is generally at least as accurate as other commonly used methods—it is often better. It also provides better uncertainty estimates for the imputations. Lastly, as is shown in Bojinov, Pillai, and Rubin (2017), the copula method can also be used to test some of the underlying assumptions about the appropriateness of imputations for a given data set.
Common Approaches to MI
The standard techniques employed to deal with missing data require an assumption regarding the missing data pattern; these were first formalized in Rubin (1976). 4 To briefly summarize these terms, missing data are missing completely at random (MCAR) when the probability of the observed missing data pattern is unchanged regardless of what values both the observed and missing data take (Marini, Olsen, and Rubin 1980). The missing data are missing at random (MAR) when the probability of observing the missing data pattern is unchanged no matter what values the missing data take. Finally, the missing data are missing not at random (MNAR) when the probability of observing the missing data pattern changes for some values of the missing data.
These definitions are important both from a theoretical and a practical point of view. The most basic methods, such as listwise deletion, generally lead to biased regression coefficients if the missingness process is not MCAR (Graham 2009). To achieve valid inference under the Bayesian and likelihood paradigms, while ignoring the missing data mechanism, we require the weaker MAR assumption. 5
The most common appropriate approach to dealing with missing data is MI, which refers to any method that replaces the set of missing values with various plausible values, thus obtaining
MI with expectation maximization [EM]: This approach uses iterative EM to create complete data sets based on assuming a particular joint distribution. A widely used method for imputation in the social sciences is the Amelia II
Conditional approaches to MI: An alternative method is to model each variable’s imputation via its conditional distribution based on all other variables in the data. One such approach is developed in MICE (Van Buuren 2012), another was developed as the MI package in
One of the main drawbacks of the FCS is that only under certain conditions, do the individual conditional models define a valid joint distribution. This often leads to pathologies in the convergence of the algorithms (Li, Yu, and Rubin 2012; Chen and Ip 2015). For example, if Y | X is specified to be an exponential random variable with rate
One of the advantages of conditional model specification is that it allows each variable to be modeled based on its specific distribution, which is specified by the researcher. However, this also means the imputation model for each variable in the data has to be correctly specified, which can be “labor-intensive and challenging with even a moderate number of variables” (Murray 2013:41). Moreover, coefficients estimates in the conditional models can suffer significantly when the number of missing observations is large, especially for categorical variables (Murray 2013).
A Copula Approach to Missing Data Imputation
One of the key issues with conditional approaches to imputation, such as MICE, is that they do not necessarily specify a valid joint distribution (such as the example in the previous section). 7 When a valid joint distribution does not exist, then there are no guarantees that the MI procedure is proper (as defined in Rubin (2004)). A natural approach to overcoming a possibly incompatible conditional specification is by specifying the joint distribution directly. For example, this is done in most EM approaches, such as Amelia II, by simply assuming a multivariate normal distribution. However, while an approximation, most social science data include binary and ordinal variables and thus cannot have a multivariate normal joint distribution. As a result, this misspecification of the joint distribution is problematic. Moreover, specifying the correct joint distribution becomes increasingly complicated as the number of covariates in the model increase.
It is therefore valuable to decouple the specification of the marginal distribution of each covariate from the function that describes the joint behavior of all covariates together. One of the main advantages of using copulas for imputation is that they allow us to do exactly that. Sklar’s (1959) theorem guarantees that every joint distribution can be decomposed in this way:
Sklar’s theorem guarantees that the function
Much work has been done studying the class of Gaussian copulas where the multivariate dependence is defined by
As previously noted, the specification of marginal distributions is difficult in applied settings and so of particular interest is the setting where the researcher does not need to specify the marginal distributions for
In this flexible setting, the estimation procedures described below provide consistent and likely asymptotically efficient estimates of the dependence parameters in the Gaussian copula, that is,
The rank likelihood (Pettitt 1982) is a type of marginal likelihood that bases inference on the ranks of data rather than the full data. In a univariate setting, it is defined as follows: consider
where
Hoff (2007) extends the rank likelhood to the multivariate setting by considering the semiparametric Gaussian copula. Let
It is easy to see that all
The aforementioned results guarantee that inference about
This means that regardless of the marginal distributions of the individual variables, all we need is their ordering to facilitate the use of the Gaussian copula model to make inferences about the dependence between these variables, that is, the correlation matrix
Let us summarize and paraphrase the method in less technical terms. Assume we have two vectors
When values of
Comparing Amelia II, sbgcop, and MICE
In this section, we compare the working properties of the copula-based imputation with those of Amelia II and MICE packages. We evaluate each method based on an extensive simulation study as well as an empirical example from the social sciences, discussed in the next section.
Evaluating Imputations
MI procedures are specifically designed to yield valid statistical inference (meaning, asymptotically unbiased with correct standard errors and coverage) for population quantities of interest. Since correct estimation of the coefficients and standard errors is critical for obtaining valid statistical inference, any analysis of MI procedures must focus on studying its frequentist properties. Properties such as empirical coverage, average bias, and average interval length of the estimate of the scientific estimand over repeat samples will be of cardinal interest.
We therefore use the following approach to assess the validity of a MI procedure through simulation: Define a full data quantity of interest, Generate a complete data set and apply a prespecified missing data mechanism to remove some observations. Use the MI procedure to create Use each of the Report the bias of
We repeat steps 2–5
Simulation Study
In regression settings, an outcome
In this situation, complete case analysis (or listwise deletion) provides an unbiased estimate of the regression coefficients; however, the reduced sample size often leads to losses in efficiency, through higher standard errors. Another disadvantage of using complete case analysis whenever the number of explanatory variables
For our simulation study, we set
The distributions we consider for the elements of the design matrix are Gaussian, Bernoulli, Poisson, and ordinal. To make imputation feasible, we require the variables to be correlated. To generate correlated variables, we first construct a matrix of correlated Gaussian random variables and then transform the variables to have the appropriate marginals. For example, to generate a pair of correlated Poisson random variables
where
Simulation Study Configurations.
We consider two missing data mechanisms for
Results
We performed 1,000 simulations under each of the possible combinations of the correlation and missingness coefficient, as detailed in Table 1, under both MAR and MNAR missing data mechanisms. For MICE, we specified the correct marginal distributions (e.g., ordered logit model for the ordinal variables). For Amelia II, we used the appropriate variable transformation in accordance with the package help files. For the copula approach, we did not need to specify any distributions/transformations. Using each of the three procedures, we created 20 completed data sets that were used to estimate the regression coefficients and a corresponding 95 percent CI. 10 None of the simulations had enough complete cases to estimate the regression coefficients using listwise deletion.
The most significant source of variation in the simulation was due to the different classes of variables, followed by the correlation and the missingness coefficient. There is only a small difference in the results obtained from the MAR and MNAR data; therefore, our discussion will focus on the former, with the figures for the latter included in Online Appendix C. Figures 2 and 3 illustrate how the bias, coverage, and interval length vary across the interaction of the different variable classes, the correlation, and the missingness coefficient, respectively. Overall, the copula method achieved an empirical coverage rate of 93 percent which was much higher than that of MICE, 87 percent, and Amelia II, 83 percent. Less adversarial regimes were previously studied in White and Carlin (2010), by reducing the number of covariates in our simulation, we can recover similar coverage rates for the MI procedures as are reported there. Both the copula and MICE methods had an absolute average bias of .17. Amelia II performed worse and had a bias of .25. On average, all three methods had approximately the same interval length.

Simulation study results for the missing at random data as a function of the missingness coefficient, averaging over the correlation. The plot is split by the different variable types (normal, binomial, Poisson, and ordinal) and the three outcomes of interest (bias, coverage, and interval length). The rightmost panel shows the result averaging over the different variable types.

Simulation study results for the missing at random data as a function of the correlation, averaging over the missingness coefficient. The plot is split by the different variable types (normal, binomial, Poisson, and ordinal) and the three outcomes of interest (bias, coverage, and interval length). The rightmost panel shows the result averaging over the different variable types.
The copula imputations were obtained using 10,000 iterations from Hoff’s (2010) package whose convergence was checked on a subset of simulations. The lag-10 autocorrelation for the thinned chains was less than .18 in absolute value for each of the elements of the latent correlation matrix, and the effective sample size was always above 200 (97.6 percent of the entries were above 500). Since the copula method is sampling from the posterior distribution which requires the MCMC algorithm to converge to the stationary distribution, its computation time depends on the rate of convergence as well as the desired number of imputations. Running multiple MCMC chains in parallel to generate independent imputations can reduce the computation time. This approach is slightly slower than Amelia II but is substantially faster than the standard MICE algorithm where all
Since the MICE procedure is iterative, users need to check that the model parameters fully explore the parameter space. Unlike the Bayesian copula method, there are no explicit convergence criteria that can be tracked. We performed a visual check that revealed no abnormalities and also ran each MICE chain for 20 iterations as recommended in Van Buuren and Groothuis-Oudshoorn (2011). The MICE method performed almost as well as the copula method but had a slightly lower coverage rate, meaning the estimated standard errors were too small. MICE also had the smallest average bias for the normal and Poisson variables. Again, however, these results are contingent on specifying the correct conditional distribution which can often be challenging.
Amelia II had the lowest coverage and highest bias both on average and in most scenarios that we considered. It had the smallest average interval length of 1.23, which shows that it was systematically underestimating the variance: leading to the low coverage rates.
Figure 2 shows that the average bias and the interval length increase as a function of the proportion of missing values. This leads to a decrease in the empirical coverage as the bias increases at a faster rate than the interval length. One notable exception was the correct coverage of the copula approach for the regression parameters of the ordinal and binomial variables, both Amelia II and MICE undercovered. Given that these types of variables are frequently encountered in social science applications, these results especially suggest that using a copula approach can lead to better statistical conclusions. Moreover, the overall simulation results indicate that when a normal distribution does not well approximate the data, then the copula approach will consistently outperform both Amelia II and MICE.
Somewhat surprisingly, there seems to be less variation in the bias and the interval length as a function of the correlation, as is shown in Figure 3. Except for the normally distributed variables, the bias decreases as the correlation increases due to the reduction in the relative loss of information from the missing data.
Breaking the MAR assumption did not lead to drastically worse results. We observe a decrease of about 3 percent in the coverage of all three methods and a slight decrease in the average bias. This shows that the methods are somewhat robust to violations of MAR assumption when it is not too severe. Figures C1 and C2 in the Online Appendix C show the results of the simulations when the MAR assumption is violated.
Application Study
In this section, we provide a comparison of the three imputation methods using an application from political science. The empirical example shows how copula methods can be used to generate imputations in a large data set with a variety of variable types.
Inequality and Democratic Support
As we have elaborated above, imputation methods are still underused, especially in the social sciences. There is, however, some visible progress. One example where scholars have taken advantage of one of the imputation methods currently available is “Economic Inequality and Democratic Support” by Krieckhaus et al. (2014) published in the Journal of Politics. Krieckhaus et al. (2014) explore whether the support for democracy within countries is affected by the level of inequality. The authors combine country-level variables (such as inequality) with individual-level survey data from 40 democracies around the world. For multiple countries, several survey waves are included, resulting in 57 country-years and a total of 77,642 observations (Krieckhaus et al. 2014:144). For this replication exercise, we replicate model 1 in Table 1 in Krieckhaus et al. (2014). The dependent variable is a “13-point additive index (ranging from 0 to 12) of democratic support,” which the authors treat as a continuous variable (Krieckhaus et al. 2014:144). The main independent variables of interest are Inequality at the country level and an ordinal Income scale at the individual level (ranging from 1 to 10). Additionally, the authors control for Age, Gender, Institutional Confidence, Interest in Politics, Interpersonal Trust, Education, Prior Regime Evaluation, and Leftist Ideology all drawn from the World Values Survey (2012). As in the original article, all individual-level variables are demeaned “using group-mean centering” after the imputation (Krieckhaus et al. 2014:145). The data are analyzed using a random coefficients model.
Most importantly for this study, the original data suffer from a relatively high number of missing observations. Table 2 shows the share of missing observations for variables included in the replication exercise. We can see that many of the variables have a large share of missing observations. If instead of MIs, the authors used listwise deletion then the number of observations in the regression model would have been approximately halved. Instead, Krieckhaus et al. (2014) use Amelia II to multiple impute five data sets which they analyze. Estimates are then combined using Rubin’s rule.
Share of Missingness in Variables of Interest.
This is an excellent setting for our comparison of MI techniques. The number of missing observations is quite large, and the data set includes different types of variables, continuous, binary, and ordinal. We created 20 multiple imputed data sets using each of the imputation techniques: Amelia II, MICE, and sbgcop. We then reestimate model 1 in table 1 in Krieckhaus et al. (2014:147) and combine the estimation results for each method’s multiple imputed data sets via Rubin’s rule.
For Amelia II, we specify the type of each variable and then generate 20 imputed data sets using the full original data. Similarly, we declare each variable’s type for MICE and estimate the default model for each. We use all variables except the one to be imputed as independent variables in the chained equations. Again, we create 20 multiple imputed data sets and set the maximum number of iterations to 20.
Lastly, we use our preferred method, imputation via the semiparametric Gaussian copula, to generate 20 imputed data sets. We run the MCMC chain for 2,100 iterations and randomly draw 20 data sets from the posterior. Note that, again, we do not have to declare any of the variable types or make any other specification or transformation of the data.
Figure 4 shows the coefficient estimates and 95 percent intervals for the replicated model based on each of the imputation techniques, as well as when listwise deletion is used. First, the results are quite similar for the Inequality, Income, and Age variables. Even for the models based on listwise deletion. For the two main variables of interest, inequality, and income, the results based on different imputation techniques are virtually the same.

Coefficient estimates and confidence intervals for model 1 in table 1 in Kriechkhaus et al. (2014) based on three imputation techniques and listwise deletion.
On the other hand, there are several significant differences for the other variables included in the model. First, the effect of gender is essentially zero according to the models estimated on the copula imputed data. Based on the data imputed using MICE or Amelia II, females have higher ratings of democracy satisfaction (though the CIs just cover zero). According to the nonimputed data, the effect of gender is quite strong.
Based on the data imputed with the copula method, the estimated association of Institutional Confidence with Democracy Satisfaction is significantly stronger compared to the models based on listwise deletion or other imputation methods. Similarly, the estimated effect of Leftist Ideology is also substantially larger according to the copula imputed data. On the other hand, the association of Education levels with Democracy Satisfaction is significantly smaller. Based on the copula, the relationships of Interest in Politics and Prior Regime Evaluation with the dependent variable of Democracy Satisfaction are all modeled to be weaker, compared to the other methods (and the nonimputed data), though the CIs overlap.
It is interesting to note that, except for one variable (Interpersonal Trust), whenever the estimated coefficient for the copula imputed data differs from the coefficients based on the other imputation methods, it is in the opposite direction of the difference to the listwise deletion coefficient. This is especially easy to see for the Gender and Leftist Ideology variables, where the effect is strongest (weakest) according to the model estimated on the listwise deleted data and weakest (strongest) for the copula-based models.
Based on the simulation results, especially with respect to binary and ordinal variables, and the theoretical properties, we are confident in the accuracy of the copula imputation method. These results suggest that Gender is not associated with people’s satisfaction with democracy, whereas Institutional Confidence and Left ideology both have much stronger effects.
Conclusion
What practical lessons can we learn about how to deal with missing data? In this article, we reemphasize the importance of dealing with missing data and present a copula based approach, developed by Hoff (2007), that is elegant and requires little prespecification of the data. With the rank-based approach introduced by Hoff (2007), the Gaussian copula can be used to impute binary, ordinal, and continuous variables. We discuss the theoretical properties of the copula method and its theoretical attractiveness compared to other commonly employed techniques. In particular, the Gaussian copula introduced here enables researchers to make imputation via draws from a valid posterior of the joint distribution without specifying the distributions of the individual variables. Moreover, we present evidence from simulations that it performs better than either Amelia II or MICE, especially when it comes to nonnormally distributed data.
While the three imputation methods perform relatively similarly, throughout the simulation, the Copula method does have the lowest average bias (tied with MICE) and the highest coverage rate (93 percent). More so, MICE requires specification of the conditional distributions whereas the copula method does not. Recent theoretical results for MICE suggest that good performance heavily relies on being approximately correct in the choice of conditionals (Li et al. 2012). On the other hand, theoretical guarantees for good behavior of copula methods are available. In particular, information bounds for rank-based estimators are the same as the information bounds for estimators based on the full (scale and rank) data (Hoff et al. 2014). Under MAR and MCAR, we inherit all the properties of the full data, and by introducing structure to the imputation, we are likely to have good behavior even under MNAR.
One aspect that we have not addressed herein is the validity and sensitivity to the unassessable assumptions made when analyzing data with missing values (Molenberghs et al. 2014), that is, the type of missingness mechanism. Bojinov et al. (2017) show that the Gaussian copula approach can be used to assess the validity of the missing always at random assumption (a slightly stronger assumption that implies MAR). Their results suggest that by using a Gaussian copula for generating imputations, the analysts can also easily diagnose the assumptions they made and quickly identify variables which are likely to break these assumptions. This adds another benefit to using the method discussed in this article.
Consideration must also be given to the computational cost of any procedure. As indicated by Graham (2009), the disadvantages of EM approaches are especially large when imputing databases with many variables or applications of “big data.” MICE can be computationally less expensive but suffers when the number of variables increases as the correct choice for each of the conditionals becomes increasingly unlikely. The semiparametric copula approach described here relies on MCMC, its speed does not depend on the fraction of missing data and scales nicely in the dimension of the data set. This makes it possible to impute even large database in a relatively timely manner and no prespecification of the data. Moreover, using the copula model to multiply impute missing values provides some of the advantages (such as a proper posterior distribution of the data) but is less burdensome on scholars than imputing values in a fully Bayesian approach (Erler et al. 2016).
Finally, the copula approach is quite flexible and can be employed at different stages of the analysis process. First, it can be used to generate a single estimate of the missing data or the mean of a large number of draws, which is exactly what might be needed in some situations. Second, per the recommendation of Rubin, it can be used to construct multiple databases. As with Amelia II, the copula imputations can be analyzed separately and the results combined using either mitools or Zelig (Imai, King, and Lau 2008) in
Supplemental Material
Supplemental Material, Suppl_SMR - Multiple Imputation Using Gaussian Copulas
Supplemental Material, Suppl_SMR for Multiple Imputation Using Gaussian Copulas by Florian M. Hollenbach, Iavor Bojinov, Shahryar Minhas, Nils W. Metternich, Michael D. Ward and Alexander Volfovsky in Sociological Methods & Research
Footnotes
Acknowledgments
For helpful insights, we thank Philippe Loustaunau, among the first of our colleagues to encourage this effort. Stephen Shellman was a strong critic who deserves our thanks too: His criticisms helped us to improve our approach. John Ahlquist, Matt Blackwell, Andreas Beger, Cassy Dorff, Gary King, and Jacob Montgomery provided helpful comments on previous versions of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project was partially supported by the Office of Naval Research (holding grants to the Lockheed Martin Corporation, Contract N00014-12-C-0066). Nils W. Metternich acknowledges support from the Economic and Social Research Council (ES/L011506/1). The work was completed while Alexander Volfovsky was supported by an NSF MSPRF under DMS-1402235.
Supplemental Material
Supplemental material for this article is available online.
Notes
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
