Abstract
Many researchers encounter the missing data problem. Missing values may arise from data omission, non-response, death of respondents, or recording errors, among other causes. It is therefore important to find an appropriate imputation technique to fill in the missing positions. In this study, the Expectation Maximization (EM) algorithm and two of its stochastic variants, stochastic EM (SEM) and Monte Carlo EM (MCEM), are employed in missing data imputation and parameter estimation in multivariate t distributions with unknown degrees of freedom.
Introduction
In research studies, data may be characterized by missing values. Since most statistical methods cannot be applied directly to such datasets, the data analyst has to pre-treat the data. This may be done by deleting the rows or columns with missing values. However, deletion methods may lead to inadvertent loss of crucial information, which may have negative effects on the inferences. Additionally, the complete cases may not constitute a representative sample of the original dataset (Pigott, 2001; Raghunathan, 2004). For these reasons, model-based techniques are preferred for remedying missing data problems: in addition to using all the available information, they preserve the distribution of the original data.
The expectation maximization (EM) algorithm is a model-based iterative technique popularly used for parameter estimation in the presence of missing values. This deterministic method is implemented in two parts, namely the expectation (E) step and the maximization (M) step (McKnight et al., 2007), and it alternates between the two steps until convergence is achieved. One of the major drawbacks of EM is that it may be trapped at local maxima or saddle points of the likelihood, preventing it from reaching the global maximum. Over the years, a number of EM variants aimed at converging to the global maximum and simplifying the EM computations have been devised. The variants fall into two groups: deterministic and stochastic.
The deterministic variants include Expectation Conditional Maximization (ECM), Expectation Conditional Maximization Extension (ECME), Alternating ECM (AECM), and Parameter-Expanded EM (PX-EM) (Liu & Rubin, 1995; Wahlström et al., 2018; Diffey et al., 2017). The stochastic variants include stochastic EM (SEM), stochastic approximation EM (SAEM), and Monte Carlo EM (MCEM) (Celeux & Diebolt, 1985; Zhu et al., 2007; Wei & Tanner, 1990). In this paper, EM and two of its stochastic variants, SEM and MCEM, are considered for data imputation and parameter estimation in multivariate t distributions.
The rest of the paper is organized as follows: Section 2 presents the materials and methods employed in the paper. In Section 3, the results for simulated as well as real data are given. In Section 4, the results are discussed. A conclusion is given in Section 5.
Materials and methods
Multivariate t distribution
Given that
Formally, the pdf for the multivariate
It is a three-parameter model, that is
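For reference, the density of a p-variate t distribution is usually written as below. The notation is an assumption, since the original symbols did not survive extraction: x is a p-dimensional observation, μ the location vector, Σ the scatter matrix, ν the degrees of freedom, and τ the latent Gamma weight of the scale-mixture representation on which the E-step weights rely.

```latex
f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu)
  = \frac{\Gamma\!\left(\frac{\nu + p}{2}\right)}
         {\Gamma\!\left(\frac{\nu}{2}\right) (\nu\pi)^{p/2} \, |\boldsymbol{\Sigma}|^{1/2}}
    \left[ 1 + \frac{\delta(\mathbf{x})}{\nu} \right]^{-\frac{\nu + p}{2}},
\qquad
\delta(\mathbf{x}) = (\mathbf{x} - \boldsymbol{\mu})^{\top}
                     \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})

% Equivalent scale-mixture form (rate parameterization of the Gamma):
\mathbf{x} \mid \tau \sim N_p\!\left(\boldsymbol{\mu}, \boldsymbol{\Sigma} / \tau\right),
\qquad
\tau \sim \mathrm{Gamma}\!\left(\tfrac{\nu}{2}, \tfrac{\nu}{2}\right)
```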
Developed by Dempster et al. (1977), the EM algorithm operates in two main steps namely expectation (E-step) and maximization (M-step).
The E-step makes use of the log-likelihood function of
where
From Eq. (2);
Ignoring the constant terms, Eq. (2) can be split into:
where
where
Equation (3) can be differentiated with respect to
Equation (4) can be differentiated with respect to
The E-step for the EM algorithm is carried out in a similar fashion as in the case when the degrees of freedom are known. The missing values are imputed as follows:
For any
where
Additionally, the EM algorithm computes the conditional expectation of the sufficient statistics for
where
that is, the weight of the observed values for any
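The two E-step quantities, the imputed value and the observation weight, can be illustrated for a single bivariate observation whose second component is missing. This is a minimal sketch that assumes the standard conditional-mean imputation and the standard weight (ν + p_obs) / (ν + δ_obs), where δ_obs is the squared Mahalanobis distance of the observed part; the function name and the numbers are hypothetical.

```python
def e_step_bivariate(x_obs, mu, sigma, nu):
    """E-step for one bivariate observation with the second component missing.

    x_obs : observed first component
    mu    : (mu_obs, mu_mis) location vector
    sigma : 2x2 scatter matrix as nested lists
    nu    : current degrees of freedom
    """
    mu_obs, mu_mis = mu
    s_oo = sigma[0][0]   # scatter of the observed component
    s_mo = sigma[1][0]   # cross term between missing and observed components

    # Squared Mahalanobis distance computed from the observed component only.
    delta_obs = (x_obs - mu_obs) ** 2 / s_oo
    p_obs = 1            # number of observed components in this row

    # Weight: expected latent Gamma scale given the observed part (assumed form).
    weight = (nu + p_obs) / (nu + delta_obs)

    # Conditional-mean imputation of the missing component.
    x_imputed = mu_mis + s_mo / s_oo * (x_obs - mu_obs)
    return x_imputed, weight


mu = (1.0, 2.0)
sigma = [[4.0, 2.0], [2.0, 9.0]]
typical = e_step_bivariate(3.0, mu, sigma, nu=3.0)
outlier = e_step_bivariate(11.0, mu, sigma, nu=3.0)
```

An observation far from the location (here 11.0) receives a much smaller weight than a typical one, which is how the t model limits the influence of outliers.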
Proof
A family of densities
where
The Gamma density with some fixed
The density can be expressed as an exponential family as follows:
This density is an exponential family in
where
In this case,
Thus;
Based on Eq. (11),
Consider;
So that
Therefore, Eq. (5) becomes
The solution to Eq. (2.2) provides the estimate for the value of
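Since the equation itself is not reproduced above, the following sketch uses the commonly cited form of the degrees-of-freedom update (as in Liu & Rubin, 1995), which may differ in detail from the paper's own equation. It solves log(ν/2) − ψ(ν/2) + 1 + c = 0 by bisection, where c is the average of E[log τ_i] − w_i under the current estimates. The helper names, search bounds, and example values are assumptions.

```python
import math


def digamma(x):
    """Digamma function via upward recurrence and an asymptotic series (x > 0)."""
    result = 0.0
    while x < 6.0:                  # shift x upward until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def nu_score(nu, c):
    """Left-hand side of the (assumed) estimating equation for nu."""
    return math.log(nu / 2.0) - digamma(nu / 2.0) + 1.0 + c


def update_nu(nu_old, deltas, p_obs, lo=0.01, hi=200.0, tol=1e-8):
    """One degrees-of-freedom update given squared distances and observed dims."""
    n = len(deltas)
    c = 0.0
    for delta, p in zip(deltas, p_obs):
        weight = (nu_old + p) / (nu_old + delta)
        e_log_tau = digamma((nu_old + p) / 2.0) - math.log((nu_old + delta) / 2.0)
        c += (e_log_tau - weight) / n
    if nu_score(hi, c) > 0.0:       # no root below the upper bound: cap the estimate
        return hi
    while hi - lo > tol:            # nu_score is decreasing in nu, so bisect
        mid = 0.5 * (lo + hi)
        if nu_score(mid, c) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


nu_new = update_nu(3.0, deltas=[0.5, 2.0, 10.0, 1.0], p_obs=[3, 3, 3, 3])
```

Bisection is chosen here only because it needs no derivatives; any one-dimensional root finder would serve.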
During the M-step, EM updates the current parameter estimates given the complete dataset. The estimates are given by:
where
The EM technique iterates between the E-step and the M-step until the imputed values and the parameter estimates converge. The numerical stability of the imputation procedure is guaranteed since the log likelihood function does not decrease from one iteration to the next (Varadhan & Roland, 2008). In addition, working on the log scale simplifies numerical approximations in most models, especially those in the exponential family.
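The alternation between the two steps, and the fact that the log likelihood never decreases, can be seen in a deliberately simplified setting. The sketch below is not the paper's multivariate procedure: it assumes a univariate t model, complete data, and fixed degrees of freedom, and all names and data values are hypothetical.

```python
import math


def t_loglik(xs, mu, sigma2, nu):
    """Log likelihood of a univariate t model with location mu and scale sigma2."""
    const = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi * sigma2))
    return sum(const - (nu + 1) / 2 * math.log(1 + (x - mu) ** 2 / (nu * sigma2))
               for x in xs)


def em_univariate_t(xs, nu, n_iter=50):
    """Run EM for the location and scale of a univariate t with known nu."""
    n = len(xs)
    mu = sum(xs) / n                                  # start from sample moments
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    history = [t_loglik(xs, mu, sigma2, nu)]
    for _ in range(n_iter):
        # E-step: expected latent weights given the current estimates.
        w = [(nu + 1) / (nu + (x - mu) ** 2 / sigma2) for x in xs]
        # M-step: weighted location, then weighted scale.
        mu = sum(wi * x for wi, x in zip(w, xs)) / sum(w)
        sigma2 = sum(wi * (x - mu) ** 2 for wi, x in zip(w, xs)) / n
        history.append(t_loglik(xs, mu, sigma2, nu))
    return mu, sigma2, history


data = [1.2, 0.8, 1.1, 0.9, 1.0, 8.0]                # one clear outlier
mu_hat, sigma2_hat, loglik_history = em_univariate_t(data, nu=3.0)
```

The recorded `loglik_history` is non-decreasing, and the estimated location is pulled less toward the outlier than the sample mean is.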
The E-step for the SEM algorithm with unknown degrees of freedom is similar to the one with known degrees of freedom. It involves drawing a single value from the conditional distribution of the missing values given the observed values and the current parameter estimates (Tregouet et al., 2004; Gilks et al., 1995). In addition, however, the weights are treated as random variables from the Gamma distribution. Therefore, a single value is simulated from this distribution and used as the weight for the respective observation.
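A single stochastic weight draw might look as follows. The Gamma parameters are not shown in the text, so the sketch assumes the standard conditional distribution of the latent weight, with shape (ν + p_obs)/2 and rate (ν + δ_obs)/2; the function name and inputs are hypothetical.

```python
import random


def sem_weight(nu, p_obs, delta_obs, rng):
    """Draw one weight from the assumed conditional Gamma distribution.

    random.gammavariate takes (shape, scale), so the rate is inverted.
    """
    shape = (nu + p_obs) / 2.0
    rate = (nu + delta_obs) / 2.0
    return rng.gammavariate(shape, 1.0 / rate)


rng = random.Random(0)
single_draw = sem_weight(nu=3.0, p_obs=2, delta_obs=1.5, rng=rng)

# Over many draws the average approaches the deterministic EM weight.
many_draws = [sem_weight(3.0, 2, 1.5, rng) for _ in range(20000)]
em_weight = (3.0 + 2) / (3.0 + 1.5)
```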
In MCEM, the E-step involves drawing multiple samples from the conditional distribution of the missing values given the observed values and the current parameter estimates (Karimi et al., 2019; Levine & Casella, 2001; Jank, 2005). Additionally, the weights are treated as random variables from the Gamma distribution. The average of the values simulated from the Gamma distribution is then taken and used as the weight for the respective observation.
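The Monte Carlo version of the weight can be sketched by averaging several draws. As before, the Gamma shape (ν + p_obs)/2 and rate (ν + δ_obs)/2 are assumed rather than taken from the text, and the function name and the number of draws are illustrative.

```python
import random


def mcem_weight(nu, p_obs, delta_obs, n_draws, rng):
    """Average n_draws Gamma draws to approximate the expected weight."""
    shape = (nu + p_obs) / 2.0
    scale = 2.0 / (nu + delta_obs)       # scale is the inverse of the rate
    draws = [rng.gammavariate(shape, scale) for _ in range(n_draws)]
    return sum(draws) / n_draws


rng = random.Random(42)
mc_weight = mcem_weight(nu=3.0, p_obs=2, delta_obs=1.5, n_draws=20000, rng=rng)
exact_weight = (3.0 + 2) / (3.0 + 1.5)   # closed-form expectation used by EM
```

With a large number of draws the Monte Carlo average is close to the closed-form expectation, so the E-step approaches that of deterministic EM.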
Imputed and updated parameter values (EM)
In the simulation study, the parameters of interest, that is, the location vector, the scatter matrix, and the degrees of freedom, are fixed a priori and used to simulate a trivariate dataset of size
and
and
The EM, SEM, and MCEM methods are also used to impute missing values and estimate multivariate t parameters in a real dataset.
Results
Simulation study
Tables 1–3 below give a summary of the imputed values, updated parameter estimates, and other metrics for EM, SEM, and MCEM respectively.
Table 1 shows that using the EM method, most of the missing values are efficiently imputed. The overall MSE for the imputed values is 415.6322. The recovered value for the degrees of freedom is 2.1567 against the ML estimate of 1.9230 and the true value of 3. The recovered location vector and scatter matrix are also relatively close to their corresponding ML estimates.
From Table 2, it can be observed that most of the missing positions are efficiently imputed, with relatively small deviations from their true values. The overall MSE for the SEM technique is 372.0655. Additionally, the 95% confidence intervals for the missing values include the true values. The recovered value for the degrees of freedom is 2.8970, against the ML estimate of 1.9230 and the true value of 3. The values for the location vector are also close to their corresponding ML estimates. However, the deviations between the updated values and the ML estimates for the scatter matrix are relatively large, indicating higher variability.
Imputed, updated parameter values, and confidence intervals (SEM)
Imputed, updated parameter values, and confidence intervals (MCEM)
Table 3 shows that most of the missing values are efficiently imputed by the MCEM method. The overall MSE for the method is 415.9571. Notably, the 95% confidence intervals for the missing values are narrow, so they exclude the true values in most instances. The recovered value for the degrees of freedom is 2.1398, against the ML estimate of 1.9230 and the true value of 3. The values for the location vector and the scatter matrix are also close to their corresponding ML estimates, as indicated by their relatively small absolute differences.
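The overall MSE values reported above can be computed along the following lines. This assumes the MSE is the mean squared difference between imputed and true values over the missing positions; the exact definition is not stated in the text, and the function name and numbers are hypothetical.

```python
def imputation_mse(true_values, imputed_values):
    """Mean squared error between true and imputed values at missing positions."""
    if len(true_values) != len(imputed_values):
        raise ValueError("both sequences must cover the same missing positions")
    squared_errors = [(t - v) ** 2 for t, v in zip(true_values, imputed_values)]
    return sum(squared_errors) / len(squared_errors)


# Two missing positions with errors of -0.5 and 1.0.
mse_example = imputation_mse([1.0, 2.0], [1.5, 1.0])
```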
The parameter estimates for the real data realized by the three imputation procedures are displayed in Table 4.
It can be observed that the EM, SEM, and MCEM techniques estimate the degrees of freedom for the data as 4.3318, 4.8803, and 4.2701, respectively. As observed earlier in the simulation study, the imputed values and parameter updates from EM are nearly identical to those from the MCEM technique.
Parameter estimates of real data for EM, SEM, and MCEM
Discussion
The EM algorithm is a popular procedure for data imputation and parameter estimation. The technique is relatively easy to implement in exponential family models because the expectation of the complete data log likelihood reduces to the expectations of the complete data sufficient statistics (Ng et al., 2012). In addition, the technique exhibits monotonic convergence: the log likelihood does not decrease at any iteration. However, EM does not work well in models whose log likelihoods are intractable (Louis, 1982). Additionally, even when the model is in the exponential family, the algorithm provides no guarantee that it converges to the global maximum of the log likelihood. Where the likelihood has several local maxima or saddle points, the point of convergence depends heavily on the starting point (Rubin & Thayer, 1982; Gupta & Chen, 2011), and it is difficult to determine which starting point leads to the global maximum.
Stochastic variants of EM have been devised to address some of these shortcomings. The variants are built on the idea of global maximization. Irrespective of the starting point, the stochastic versions explore a wide region of the log likelihood function, which helps them escape local maxima (Delyon et al., 1999). Furthermore, in situations where a log likelihood function has no closed form, the techniques estimate the expectation of the complete data log likelihood through simulation.
From the results displayed in Tables 1–3, SEM is the most efficient of the three imputation procedures considered in this study. It yields the lowest MSE (372.0655) and the best estimate of the unknown degrees of freedom, recovering 2.8970 against the true value of 3. The EM and MCEM methods produce nearly identical MSE values (415.6322 and 415.9571), and the degrees of freedom they recover (2.1567 and 2.1398) are close as well.
The 95% confidence intervals from the SEM technique include the true missing values. This could be explained by its stochastic nature. Unlike the deterministic EM, SEM does not converge to the same value on each repeated sampling, which enables it to explore a wide range of values (Jank, 2006). In our results, however, the 95% intervals from MCEM exclude most of the true values. Indeed, the method behaves much like the deterministic EM. It is worth noting that when the number of samples drawn in the E-step is relatively large, MCEM yields results similar to those of the EM algorithm (Biscarat et al., 1992; Nielsen, 2000). The confidence intervals end up being too narrow to include the true values.
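The narrowing effect can be illustrated on the weights alone. The sketch below repeats the weight computation many times and compares the run-to-run spread when one draw is used (as in SEM) with the spread when 500 draws are averaged (as in MCEM). The Gamma parameterization, the input values, and the number of draws are assumptions, and the sketch does not reproduce the paper's confidence intervals.

```python
import random
import statistics


def averaged_weight(nu, p_obs, delta_obs, n_draws, rng):
    """Average of n_draws weights drawn from the assumed conditional Gamma."""
    shape = (nu + p_obs) / 2.0
    scale = 2.0 / (nu + delta_obs)
    return sum(rng.gammavariate(shape, scale) for _ in range(n_draws)) / n_draws


def spread_over_runs(n_draws, n_runs=200, seed=1):
    """Standard deviation of the averaged weight across independent runs."""
    rng = random.Random(seed)
    runs = [averaged_weight(3.0, 2, 1.5, n_draws, rng) for _ in range(n_runs)]
    return statistics.pstdev(runs)


spread_sem = spread_over_runs(n_draws=1)      # one draw per E-step
spread_mcem = spread_over_runs(n_draws=500)   # many draws averaged per E-step
```

The spread shrinks roughly with the square root of the number of draws, which is consistent with MCEM behaving like deterministic EM when many samples are used.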
In the case of EM and MCEM, Tables 1 and 3 show that the ML estimates for the parameters of the originally simulated dataset are relatively close to the parameter estimates recovered after imputing the incomplete dataset. The deterministic nature of both the EM parameter updates and the ML estimates obtained from the originally simulated dataset explains their relatively small absolute differences. Table 2 shows that the updated parameter estimates for SEM are close to the parameter values used in the simulation. However, the absolute differences between the ML estimates and the updated parameter estimates for SEM are relatively large, which could be associated with the random nature of the algorithm.
Conclusion
The study has focused on data imputation and parameter estimation in the multivariate t distribution.
Footnotes
Conflict of interest
The authors declare that they have no competing interests.
Acknowledgements
This work was supported partly by funds from Egerton University Council Best Student Scholarship Award.
