Missing data imputation in multivariate t distribution with unknown degrees of freedom using expectation maximization algorithm and its stochastic variants

Abstract

Many researchers encounter the missing data problem. It may be caused by data omission, non-response, death of respondents, or recording errors, among other factors. It is important to find an appropriate data imputation technique to fill in the missing positions. In this study, the Expectation Maximization (EM) algorithm and two of its stochastic variants, stochastic EM (SEM) and Monte Carlo EM (MCEM), are employed in missing data imputation and parameter estimation in the multivariate $t$ distribution with unknown degrees of freedom. The imputation efficiencies of the three methods are then compared using the mean square error (MSE) criterion. SEM yields the lowest MSE, making it the most efficient method in data imputation when the data follow the multivariate $t$ distribution. The algorithm's stochastic nature helps it avoid saddle points and reach the global maximum, which ultimately increases its efficiency. The EM and MCEM techniques yield very similar results: with large sample draws in the E-step, MCEM yields more or less the same results as the deterministic EM. In parameter estimation, the parameter estimates for EM and MCEM are relatively close to the maximum likelihood (ML) estimates of the simulated data. This is not the case for SEM, owing to the random nature of the algorithm.

Keywords

Expectation maximization (EM); stochastic EM; Monte Carlo EM; unknown degrees of freedom

1. Introduction

In research studies, data may be characterized by missing values. Since most of the statistical methods cannot be applied directly on such datasets, the data analyst has to pre-treat the data. This may be done by deleting the rows or columns with missing values. However, deletion methods may lead to inadvertent loss of crucial information, which may have negative effects on the inferences. Additionally, the complete cases may not constitute a representative sample of the original dataset (Pigott, 2001; Raghunathan, 2004). Due to such uncertainties, model-based techniques are preferred in remedying missing data problems since in addition to using all the available information, they also preserve the distribution of the original data.

The expectation maximization (EM) algorithm is a model-based iterative technique popularly used for parameter estimation in the presence of missing values. The deterministic method is implemented in two parts namely the expectation (E) step and the maximization (M) step (McKnight et al., 2007). The algorithm iteratively alternates between the two steps until convergence is achieved. One of the major drawbacks of EM is that it may be trapped in local saddle points, preventing it from achieving the desired output. Over the years, a number of EM variants aimed at converging at the global maximum and simplifying the EM computations have been devised. The variants can be split into two versions: deterministic and stochastic.

The deterministic variants include Expectation Conditional Maximization (ECM), Expectation Conditional Maximization Extension (ECME), Alternating ECM (AECM), and Parameter-Expanded EM (PX-EM) (Liu & Rubin, 1995; Wahlström et al., 2018; Diffey et al., 2017). The stochastic variants include stochastic EM (SEM), stochastic approximation EM (SAEM), and Monte Carlo EM (MCEM) (Celeux & Diebolt, 1985; Zhu et al., 2007; Wei & Tanner, 1990). In this paper, EM and two of its stochastic variants, SEM and MCEM, are considered for data imputation and parameter estimation in the multivariate $t$ distribution with unknown degrees of freedom. The distribution is commonly used as an alternative to the multivariate normal distribution due to its robustness, particularly when dealing with heavy-tailed datasets (Lange et al., 1989; Liu, 1995). Many research works have focused on parameter estimation in the multivariate $t$ distribution using EM and its variants when the degrees of freedom are known. It would be interesting to see how the techniques perform when the degrees of freedom are unknown. In this study, the missing data mechanism is assumed to be missing at random (MAR).

The rest of the paper is organized as follows: Section 2 presents the materials and methods employed in the paper. In Section 3, the results for simulated as well as real data are given. Section 4 discusses the results. A conclusion is given in Section 5.

2. Materials and methods

2.1 Multivariate $t$ distribution

Given that $Y|\mu,\Sigma,v,\tau\sim N_{p}(\mu,\Sigma/\tau)$, and that $\tau|\mu,\Sigma,v\sim\textit{Gamma}(\frac{v}{2},\frac{v}{2})$, the integration of the joint distribution for $\left(Y,\tau\right)$ with respect to $\tau$ yields the marginal density for $Y$, which is $t_{p}\left(\mu,\Sigma,v\right)$ (Liu & Rubin, 1995).
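The scale-mixture representation above also gives a direct way to simulate from $t_{p}(\mu,\Sigma,v)$: draw $\tau$ from the Gamma distribution, then draw $Y$ from the normal distribution with scatter $\Sigma/\tau$. The following is a minimal sketch using only the Python standard library; the function names `cholesky` and `sample_mvt` are illustrative and are not taken from the paper, which uses the mvtnorm package in R for its simulations.

```python
import math
import random


def cholesky(a):
    """Return lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_mvt(mu, sigma, v, rng):
    """Draw one (y, tau) pair with y ~ t_p(mu, sigma, v) via the Gamma-normal mixture."""
    p = len(mu)
    # random.gammavariate takes (shape, scale); a rate of v/2 is a scale of 2/v.
    tau = rng.gammavariate(v / 2.0, 2.0 / v)
    low = cholesky(sigma)
    z = [rng.gauss(0.0, 1.0) for _ in range(p)]
    # Y | tau ~ N_p(mu, sigma / tau): scale the correlated normal draw by 1/sqrt(tau).
    y = [
        mu[i] + sum(low[i][k] * z[k] for k in range(i + 1)) / math.sqrt(tau)
        for i in range(p)
    ]
    return y, tau
```

Small values of $\tau$ inflate the scatter of the corresponding draw, which is how the heavy tails of the $t$ distribution arise in this construction.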

Formally, the pdf for the multivariate $t$ distribution of dimension $p$ takes the form:

$\displaystyle f\left(Y;\theta\right)=\frac{\Gamma\left(\frac{v+p}{2}\right)}{\Gamma\left(\frac{v}{2}\right)}\frac{1}{\left(v\pi\right)^{\frac{p}{2}}}\frac{1}{|\Sigma|^{\frac{1}{2}}}\left(1+\frac{1}{v}(Y-\mu)^{T}\Sigma^{-1}\left(Y-\mu\right)\right)^{-\frac{v+p}{2}}$ (1)

It is a three-parameter model, that is, $\theta=(\mu,\Sigma,v)$, where $\mu$ is the mean or location vector, $\Sigma$ is a positive definite scatter matrix, and $v$ is a scalar for the degrees of freedom (Roth, 2013). The parameter $v$, also referred to as the shape parameter, determines the peakedness of the distribution.
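As an illustration of Eq. (1), the log-density can be evaluated numerically as sketched below. This is a hedged example rather than the authors' implementation: the function name `mvt_logpdf` is hypothetical, and the Cholesky factorization is simply one convenient way to obtain $|\Sigma|$ and the quadratic form $(Y-\mu)^{T}\Sigma^{-1}(Y-\mu)$.

```python
import math


def mvt_logpdf(y, mu, sigma, v):
    """Log of the multivariate t density in Eq. (1) at the point y."""
    p = len(mu)
    # Cholesky factor of sigma (lower triangular).
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                low[i][j] = (sigma[i][j] - s) / low[j][j]
    # Forward substitution: solve L z = y - mu, so that z^T z is the quadratic form.
    z = []
    for i in range(p):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((y[i] - mu[i] - s) / low[i][i])
    delta = sum(zi * zi for zi in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(p))
    return (
        math.lgamma((v + p) / 2.0)
        - math.lgamma(v / 2.0)
        - 0.5 * p * math.log(v * math.pi)
        - 0.5 * log_det
        - 0.5 * (v + p) * math.log1p(delta / v)
    )
```

For $p=1$ and $v=1$ the density reduces to the standard Cauchy density, so `math.exp(mvt_logpdf([0.0], [0.0], [[1.0]], 1.0))` should equal $1/\pi$ up to rounding.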

2.2 Imputation using EM, SEM, and MCEM in multivariate $t$ with unknown degrees of freedom

Developed by Dempster et al. (1977), the EM algorithm operates in two main steps namely expectation (E-step) and maximization (M-step).

The E-step makes use of the log-likelihood function of $\theta=(\mu,\Sigma,v)$ given $Y$ and $\tau$ which is given by:

$\displaystyle\ell\left(\mu,\Sigma,v\,\big|\,Y_{i},\tau_{i}\right)=\frac{p}{2}\sum_{i=1}^{n}\ln(\tau_{i})-\frac{np}{2}\ln\left(2\pi\right)-\frac{n}{2}\ln|\Sigma|-\frac{1}{2}tr\left(\Sigma^{-1}\left\{\sum_{i=1}^{n}\tau_{i}(Y_{i}-\mu)(Y_{i}-\mu)^{T}\right\}\right)+\frac{nv}{2}\ln\left(\frac{v}{2}\right)-n\ln\Gamma\left(\frac{v}{2}\right)+\left(\frac{v}{2}-1\right)\sum_{i=1}^{n}\ln(\tau_{i})-\frac{v}{2}\sum_{i=1}^{n}\tau_{i}$ (2)

where

$\displaystyle Y_{i}|\mu,\Sigma,v,\tau_{i}\sim N_{p}\left(\mu,\Sigma/\tau_{i}\right),\quad\text{for }i=1,\ldots,n$

$\displaystyle\tau_{i}|\mu,\Sigma,v\sim\textit{Gamma}\left(\frac{v}{2},\frac{v}{2}\right),\quad\text{for }i=1,\ldots,n$

$\displaystyle\tau_{i}\big|Y_{i},\mu,\Sigma,v\sim\textit{Gamma}\left(\frac{v+p_{i}}{2},\frac{v+\delta_{i}}{2}\right);\quad\delta_{i}=(Y_{i}-\mu)^{T}\Sigma^{-1}\left(Y_{i}-\mu\right)$

From Eq. (2);

$\displaystyle-\frac{1}{2}tr\left(\Sigma^{-1}\left\{\sum_{i=1}^{n}\tau_{i}(Y_{i}-\mu)(Y_{i}-\mu)^{T}\right\}\right)=-\frac{1}{2}tr\left(\Sigma^{-1}\sum_{i=1}^{n}\tau_{i}Y_{i}Y_{i}^{T}\right)+\mu^{T}\Sigma^{-1}\sum_{i=1}^{n}\tau_{i}Y_{i}-\frac{1}{2}\mu^{T}\Sigma^{-1}\mu\sum_{i=1}^{n}\tau_{i}$

Ignoring the constant terms, Eq. (2) can be split into:

$\displaystyle\ell_{N}\left(\mu,\Sigma|Y_{i},\tau_{i}\right)=-\frac{n}{2}\ln|\Sigma|-\frac{1}{2}tr\left(\Sigma^{-1}\sum_{i=1}^{n}\tau_{i}Y_{i}Y_{i}^{T}\right)+\mu^{T}\Sigma^{-1}\sum_{i=1}^{n}\tau_{i}Y_{i}-\frac{1}{2}\mu^{T}\Sigma^{-1}\mu\sum_{i=1}^{n}\tau_{i}$ (3)

where $S_{TY}=\sum^{n}_{i=1}{{\tau}_{i}}Y_{i}$ , $S_{TYY}=\sum^{n}_{i=1}{{\tau}_{i}}Y_{i}Y^{T}_{i}$ , and $S_{T}=\sum^{n}_{i=1}{{\tau}_{i}}$ are the sufficient statistics for $\mu$ and $\Sigma$ and

$\displaystyle\ell_{G}\left(v\mid\tau\right)=-n\ln\Gamma\left(\frac{v}{2}\right)+\frac{nv}{2}\ln\left(\frac{v}{2}\right)+\frac{v}{2}\left[\sum_{i=1}^{n}\left(\ln\left(\tau_{i}\right)-\tau_{i}\right)\right]$ (4)

where $S_{TT}=\sum^{n}_{i=1}\left(\ln\left({\tau}_{i}\right)-{\tau}_{i}\right)$ is the sufficient statistic for $v$ (Liu & Rubin, 1995).

Equation (3) can be differentiated with respect to $\mu$ and $\Sigma$ respectively to yield

$\displaystyle\widehat{\mu}=\frac{\sum_{i=1}^{n}\tau_{i}Y_{i}}{\sum_{i=1}^{n}\tau_{i}}$

$\displaystyle\widehat{\Sigma}=\frac{1}{n}\sum_{i=1}^{n}\tau_{i}(Y_{i}-\mu)(Y_{i}-\mu)^{T}$

Equation (4) can be differentiated with respect to $v$ to yield:

$\displaystyle\ln\left(\frac{v}{2}\right)+1-\varphi\left(\frac{v}{2}\right)+\frac{1}{n}\left[\sum_{i=1}^{n}\left(\ln\left(\tau_{i}\right)-\tau_{i}\right)\right]=0$ (5)

The E-step for the EM algorithm is carried out in a similar fashion as in the case when the degrees of freedom are known. The missing values are imputed as follows:

For any $i^{\rm th}$ unit and at the $t^{\rm th}$ iteration;

$\displaystyle y_{ij}^{(t)}=\begin{cases}y_{ij},&\text{if the value is observed,}\\ E\left(y^{(t)}_{i,mis}\mid y^{(t)}_{i,obs};\theta^{(t)}\right),&\text{if the value is missing}\end{cases}$ (6)

where $E(y^{(t)}_{i,mis}\mid y^{(t)}_{i,obs};\theta^{(t)})=\mu^{(t)}_{i,mis}+\Sigma^{(t)}_{i,mis,obs}(\Sigma^{(t)}_{i,obs})^{-1}(y^{(t)}_{i,obs}-\mu^{(t)}_{i,obs})$, which is analogous to the location parameter for the conditional multivariate $t$ distribution.
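The imputation rule in Eq. (6) can be sketched for a single unit as follows. This is only an illustration under stated assumptions: missing entries are encoded as `None`, the helper names `solve` and `impute_row` are hypothetical, and a unit with no observed entries falls back to the current location vector, a choice the paper does not discuss.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def impute_row(y, mu, sigma):
    """Replace None entries of y by mu_mis + Sigma_mis,obs Sigma_obs^{-1} (y_obs - mu_obs)."""
    obs = [j for j, val in enumerate(y) if val is not None]
    mis = [j for j, val in enumerate(y) if val is None]
    if not mis:
        return list(y)
    if not obs:
        # Assumption of this sketch: with nothing observed, use the location vector.
        return list(mu)
    sigma_obs = [[sigma[a][b] for b in obs] for a in obs]
    resid = [y[j] - mu[j] for j in obs]
    coef = solve(sigma_obs, resid)  # Sigma_obs^{-1} (y_obs - mu_obs)
    out = list(y)
    for j in mis:
        out[j] = mu[j] + sum(sigma[j][o] * c for o, c in zip(obs, coef))
    return out
```

For example, with $\mu=(0,0)$, unit variances, and covariance 0.5, an observed first entry of 2 gives an imputed second entry of $0+0.5\times 2=1$.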

Additionally, the EM algorithm computes the conditional expectation of the sufficient statistics for $v$ given the current parameter estimates as shown below.

$\displaystyle E(S_{TT}\mid\theta^{(t)})=\sum_{i=1}^{n}\left[\varphi\left(\frac{v^{(t)}+p_{i}}{2}\right)-\ln\left(\frac{v^{(t)}+p_{i}}{2}\right)\right]+\sum_{i=1}^{n}\left(\ln\left(w_{i,obs}^{(t+1)}\right)-w_{i,obs}^{(t+1)}\right)$ (7)

where

$\displaystyle w^{(t+1)}_{i,obs}=E(\tau_{i}\mid\theta^{(t)},y_{i,obs})=\frac{v^{(t)}+p_{i}}{v^{(t)}+\delta^{(t)}_{i,obs}},$

that is, the weight of the observed values for any $i^{\rm th}$ unit. We note that $p_{i}$, ${\mu}_{i,obs}$, and $\Sigma_{i,obs}$ are the respective dimension, location vector, and scatter matrix for $y_{i,obs}$.
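A small numerical sketch of the weight formula may help show its role. The function name `weight` is illustrative only; the inputs are the current degrees of freedom, the number of observed components $p_{i}$, and the quantity $\delta_{i,obs}$ of the unit.

```python
def weight(v, p_obs, delta_obs):
    """Conditional expectation of tau_i given the observed part: (v + p_i) / (v + delta_i,obs)."""
    return (v + p_obs) / (v + delta_obs)


# With v = 3 and three observed components, a unit whose delta equals p_i gets weight 1,
# while units farther from the location vector are progressively down-weighted.
for delta in (3.0, 9.0, 27.0):
    print(delta, weight(3.0, 3, delta))
```

This down-weighting of units with large $\delta_{i,obs}$ is what gives the $t$ model its robustness to outlying observations.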

Proof

Theorem

A family of densities $\{f_{\theta}:\theta\in\Theta\}$ is said to be of the exponential family if, with respect to some measure $z$, it has the density

$\displaystyle p(x|\theta)=A(x)\exp(T(x)\theta-F(\theta))$ (8)

where $T$ and $A$ are some fixed functions characterizing the exponential family. $\theta$ belongs to some natural parameter space $\mathrm{\Theta}$ . $T(x)$ is the sufficient statistic. $F(\theta)=\log(\int A(x)\exp[T(x)\theta]dz(x))$ is referred to as the normalization function (Korda et al., 2013; Lindsay, 1983).

The Gamma density with some fixed $\beta$ can be expressed as

$\displaystyle f_{\theta}(x)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}$ (9)

The density can be expressed as an exponential family as follows:

$\displaystyle f_{\theta}(x)=x^{-1}e^{-\beta x}\exp\{\ln(x)\alpha+\alpha\ln\left(\beta\right)-\ln\left(\Gamma(\alpha)\right)\}$ (10)

This density is an exponential family in $\alpha$ only with $T(x)=\ln(x)$ and $F(\alpha)=\ln(\Gamma(\alpha))-\alpha\ln(\beta)$ . Computing $\frac{d}{d\alpha}F(\alpha)$ yields

$\displaystyle E(\ln(x))=-\ln(\beta)+\varphi(\alpha)$ (11)

where $\varphi(\alpha)$ is the digamma function.
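The identity in Eq. (11) can be checked numerically. The sketch below is an illustration, not part of the paper: `digamma` is a hand-written approximation (recurrence plus an asymptotic series) because the Python standard library does not provide one, and `mean_log_gamma` is a hypothetical helper that averages $\ln(x)$ over simulated Gamma draws.

```python
import math
import random


def digamma(x):
    """Approximate the digamma function for x > 0."""
    result = 0.0
    # Recurrence psi(x) = psi(x + 1) - 1/x shifts the argument to x >= 6.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6).
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def mean_log_gamma(alpha, beta, n_draws, seed):
    """Monte Carlo average of ln(x) for x ~ Gamma(shape alpha, rate beta)."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale), so the scale is 1 / beta.
    total = sum(math.log(rng.gammavariate(alpha, 1.0 / beta)) for _ in range(n_draws))
    return total / n_draws
```

Comparing `mean_log_gamma(3.0, 2.0, 100000, 42)` with `digamma(3.0) - math.log(2.0)` should show agreement up to Monte Carlo error.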

In this case,

$\displaystyle\tau_{i}\mid y_{i,obs},\theta^{(t)}\sim\textit{Gamma}\left(\frac{v^{(t)}+p_{i}}{2},\frac{v^{(t)}+\delta_{i,obs}}{2}\right)$

Thus;

$\displaystyle w_{i,obs}^{(t+1)}=E(\tau_{i}\mid y_{i,obs},\theta^{(t)})=\frac{v^{(t)}+p_{i}}{v^{(t)}+\delta_{i,obs}}$ (12)

Based on Eq. (11),

$\displaystyle E(\ln(\tau_{i})\mid y_{i,obs},\theta^{(t)})=\varphi\left(\frac{v^{(t)}+p_{i}}{2}\right)-\ln\left(\frac{v^{(t)}+\delta_{i,obs}}{2}\right)$ (13)

Consider;

$\displaystyle-\ln\left(\frac{v^{(t)}+\delta_{i,obs}}{2}\right)=-\ln\left[\frac{1}{2}\left(\frac{v^{(t)}+\delta_{i,obs}}{v^{(t)}+p_{i}}\right)(v^{(t)}+p_{i})\right]=-\ln\left[\frac{1}{w_{i,obs}^{(t+1)}}\frac{1}{2}(v^{(t)}+p_{i})\right]=\ln(w_{i,obs}^{(t+1)})-\ln\left(\frac{v^{(t)}+p_{i}}{2}\right)$

So that

$\displaystyle E(S_{TT}\mid\theta^{(t)})=E\left(\sum_{i=1}^{n}\left(\ln\left(\tau_{i}\right)-\tau_{i}\right)\right)=\sum_{i=1}^{n}\left[\varphi\left(\frac{v^{(t)}+p_{i}}{2}\right)+\ln(w_{i,obs}^{(t+1)})-\ln\left(\frac{v^{(t)}+p_{i}}{2}\right)-w_{i,obs}^{(t+1)}\right]=\sum_{i=1}^{n}\left[\varphi\left(\frac{v^{(t)}+p_{i}}{2}\right)-\ln\left(\frac{v^{(t)}+p_{i}}{2}\right)\right]+\sum_{i=1}^{n}\left(\ln\left(w_{i,obs}^{(t+1)}\right)-w_{i,obs}^{(t+1)}\right)$

Therefore, Eq. (5) becomes

$\displaystyle\ln\left(\frac{v}{2}\right)+1-\varphi\left(\frac{v}{2}\right)+\frac{1}{n}\sum_{i=1}^{n}\left[\varphi\left(\frac{v^{(t)}+p_{i}}{2}\right)-\ln\left(\frac{v^{(t)}+p_{i}}{2}\right)\right]+\frac{1}{n}\sum_{i=1}^{n}\left(\ln\left(w_{i,obs}^{(t+1)}\right)-w_{i,obs}^{(t+1)}\right)=0$ (14)

The solution to Eq. (14) provides the estimate for the value of $v$. Since the equation does not have a closed-form solution, numerical approximation is employed (Doğru et al., 2018). In this study, the bisection method is used to solve for $v$ in every iteration until convergence is achieved.
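A bisection solver for the degrees-of-freedom equation can be sketched as below. This is a hedged illustration and not the authors' code: `s_bar` stands for the data-dependent part of the equation, that is, $\frac{1}{n}E(S_{TT}\mid\theta^{(t)})$; the bracket `[lo, hi]` and the tolerance are arbitrary assumptions; and `digamma` is a hand-written approximation since the standard library has none.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def solve_v(s_bar, lo=1e-3, hi=1e3, tol=1e-8):
    """Find v with ln(v/2) + 1 - digamma(v/2) + s_bar = 0 by bisection."""

    def g(v):
        return math.log(v / 2.0) + 1.0 - digamma(v / 2.0) + s_bar

    # ln(x) - digamma(x) is decreasing in x, so g is positive at lo and negative at hi
    # whenever a root lies inside the bracket.
    if g(lo) <= 0.0 or g(hi) >= 0.0:
        raise ValueError("no sign change on the bracket; widen [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As a sanity check, setting `s_bar = -(math.log(1.5) + 1 - digamma(1.5))` makes $v=3$ the exact root, and `solve_v(s_bar)` should return a value very close to 3.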

During the M-step, EM updates the current parameter estimates given the complete dataset. The estimates are given by:

$\displaystyle\widehat{\mu}^{(t+1)}=\frac{\sum_{i=1}^{n}w^{(t+1)}_{i,obs}Y^{(t)}_{i}}{\sum_{i=1}^{n}w^{(t+1)}_{i,obs}}$ (15)

with $i=1,\ldots,n$ indexing the units, and

$\displaystyle\widehat{\Sigma}^{(t+1)}=\frac{1}{n}\sum_{i=1}^{n}w^{(t+1)}_{i,obs}\left(Y^{(t)}_{i}-\widehat{\mu}^{(t)}\right)\left(Y^{(t)}_{i}-\widehat{\mu}^{(t)}\right)^{T}+\frac{1}{n}\sum_{i=1}^{n}\psi^{(t)}_{i}$ (16)

where

$\displaystyle\psi^{(t)}_{i}=\begin{cases}0,&\text{if }y_{ij}\text{ or }y_{ik}\text{ is observed}\\ \Sigma^{(t)}_{i,mis}-\Sigma^{(t)}_{i,mis,obs}\left(\Sigma^{(t)}_{i,obs}\right)^{-1}\Sigma^{(t)}_{i,obs,mis},&\text{if }y_{ij}\text{ and }y_{ik}\text{ are missing}\end{cases}$

$\delta^{(t)}_{i,obs}=\delta_{i,obs}=\left(y_{i,obs}-\mu_{i,obs}\right)^{T}\Sigma^{-1}_{i,obs}\left(y_{i,obs}-\mu_{i,obs}\right)$, that is, the squared Mahalanobis distance.
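The weighted updates of the M-step can be sketched as follows for rows that have already been completed by imputation. This is a simplified illustration under stated assumptions: the function name `m_step` is hypothetical, the optional `correction` argument stands for the summed $\psi$ terms, and, unlike Eq. (16) as printed, the scatter is centred on the newly updated location vector unless a different `center` is supplied.

```python
def m_step(rows, weights, correction=None, center=None):
    """Weighted location and scatter update for completed rows (lists of floats)."""
    n = len(rows)
    p = len(rows[0])
    w_sum = sum(weights)
    # Eq. (15): weighted average of the completed rows.
    mu = [sum(w * r[j] for w, r in zip(weights, rows)) / w_sum for j in range(p)]
    # Assumption of this sketch: centre on the updated mu unless told otherwise.
    c = mu if center is None else center
    # Eq. (16): weighted outer products divided by n.
    sigma = [
        [sum(w * (r[j] - c[j]) * (r[k] - c[k]) for w, r in zip(weights, rows)) / n
         for k in range(p)]
        for j in range(p)
    ]
    if correction is not None:
        # Add the summed conditional-scatter correction for jointly missing entries.
        for j in range(p):
            for k in range(p):
                sigma[j][k] += correction[j][k] / n
    return mu, sigma
```

With equal weights the update reduces to the ordinary mean and the maximum likelihood covariance of the completed data, which is a quick way to sanity-check an implementation.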

The EM technique alternates between the E-step and the M-step until the imputed values and the parameter estimates converge. The numerical stability of the imputation procedure is guaranteed since the log likelihood function does not decrease at any iteration (Varadhan & Roland, 2008). In addition, operating on the log scale simplifies the numerical approximations in most models, especially those in the exponential family.

The E-step for the SEM algorithm with unknown degrees of freedom is similar to the one with known degrees of freedom. It involves drawing a single value from the conditional distribution of the missing values given the observed value and the current parameter estimates (Tregouet et al., 2004; Gilks et al., 1995). In addition, however, the weights are treated as random variables from the

$\displaystyle\textit{Gamma}\left(\frac{v^{(t)}+p_{i}}{2},\frac{v^{(t)}+\delta_{i,obs}}{2}\right)$

distribution. Therefore, a single value is simulated from this distribution and used as the respective weight for observation $Y_{i}$ (Nielsen, 2000). During the M-step, the obtained weights are substituted in Eq. (14) to solve for $v$ using the bisection method. The estimates for $\mu$ and $\Sigma$ are computed as given in Eq. (15) and Eq. (16) respectively.

In MCEM, the E-step involves drawing multiple samples from the conditional distribution of the missing values given the observed values and the current parameter estimates (Karimi et al., 2019; Levine & Casella, 2001; Jank, 2005). Additionally, the weights are treated as random variables from

$\displaystyle\textit{Gamma}\left(\frac{v^{(t)}+p_{i}}{2},\frac{v^{(t)}+\delta_{i,obs}}{2}\right).$

The average of the values simulated from the Gamma distribution is then taken and used as the respective weight for any observation $Y_{i}$. During the M-step, this mean value is substituted in Eq. (14) to facilitate the numerical approximation of $v$ using the bisection method.
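The difference between the SEM and MCEM treatment of the weights can be sketched as follows. The function names `sem_weight` and `mcem_weight` are illustrative and the default of 1,000 draws is only an example value, so this should be read as a sketch rather than the authors' implementation.

```python
import random


def sem_weight(v, p_obs, delta_obs, rng):
    """SEM: a single draw from Gamma((v + p_i)/2, rate (v + delta_i,obs)/2)."""
    shape = (v + p_obs) / 2.0
    rate = (v + delta_obs) / 2.0
    # random.gammavariate takes (shape, scale), so the scale is 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)


def mcem_weight(v, p_obs, delta_obs, rng, draws=1000):
    """MCEM: the average of many draws from the same Gamma distribution."""
    total = sum(sem_weight(v, p_obs, delta_obs, rng) for _ in range(draws))
    return total / draws
```

As the number of draws grows, the MCEM average approaches the deterministic EM weight $(v+p_{i})/(v+\delta_{i,obs})$, whereas the single SEM draw keeps its full sampling variability.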

Table 1

Imputed and updated parameter values (EM)

Row	Column	Imputed value	True value	Deviation	Deviation ${}^{2}$
8	1	101.2048	107.2248	6.02	36.2402
19	1	83.0524	84.0179	0.9655	0.9322
13	2	138.9717	90.1431	$-$ 48.8286	2384.2283
19	2	130.5666	124.6941	$-$ 5.8725	34.486
21	2	141.3439	139.8360	$-$ 1.5078	2.2735
24	2	186.3955	180.4211	$-$ 5.9744	35.6932
MSE				415.6422
Parameter	True	ML	Updated	Absolute difference (true/updated)	Absolute difference (ML/updated)
${\mu}_{1}$	93	90.0538	89.0988	3.9012	0.9550
${\mu}_{2}$	139	134.4720	135.3952	3.6048	0.9231
${\mu}_{3}$	188	183.9852	183.3785	4.6215	0.6067
$v$	3	1.9230	2.1576	0.8424	0.2346
${\sigma}_{11}$	435	292.8585	313.8047	121.1953	20.9462
${\sigma}_{12}$	326.4583	197.5128	220.2766	106.1820	22.7638
${\sigma}_{13}$	291.2623	217.8191	224.6422	66.6201	6.8231
${\sigma}_{22}$	500	244.7940	291.2198	208.7802	46.4259
${\sigma}_{23}$	312.2659	179.1945	179.3957	132.8703	0.2012
${\sigma}_{33}$	398	285.0174	314.6457	83.3543	29.6283

2.3 Numerical studies and real data application

In the simulation study, the parameters of interest, that is, the location vector, the scatter matrix, and the degrees of freedom, are fixed a priori and used to simulate a trivariate dataset of size $n=25$ using the mvtnorm package in the R statistical software. The parameters used to simulate the data for analysis are

$\displaystyle\mu=(93,139,188),\quad\Sigma=\begin{bmatrix}435&326.4583&291.2623\\ 326.4583&500&312.2659\\ 291.2623&312.2659&398\end{bmatrix},$

and $v=3$. The scatter matrix is chosen such that the pairwise correlations among the variables are all 0.7. Values in 20% of the observations (5 of the 25 units) are then eliminated at random, giving rise to 6 missing values. The study then uses the EM, SEM, and MCEM algorithms to recover the lost information and to estimate the three parameters of interest. Arbitrary starting values

$\displaystyle\mu_{\textit{stat}}=(10,20,30)\quad\text{and}\quad\Sigma_{\textit{stat}}=\begin{bmatrix}14&10&12\\ 10&13&9\\ 12&9&18\end{bmatrix}$

and $v_{\textit{stat}}=5$ are used to initiate the three algorithms. The EM algorithm is taken to have converged once the difference $L({\theta}^{(t+1)})-L({\theta}^{(t)})\leqslant 0.00001$. The first 100 iterations of both the SEM and the MCEM methods are discarded as burn-in to reduce the effect of the initial values. In MCEM's E-step, 1,000 sample values are drawn from the conditional trivariate $t$ distribution for each missing position. The mean values of the drawn samples are then used for imputation. The maximum likelihood (ML) parameter estimates for the originally simulated dataset are computed deterministically using the EM algorithm to examine how close they are to the parameter updates realized by the three procedures upon data imputation.

The EM, SEM, and MCEM methods are also used to impute missing values and estimate multivariate $t$ parameters using real data. Clinical data on creatinine clearance by Shih and Weisberg (1986) are used as the real data. The dataset has previously been used in the multivariate $t$ context by Liu and Rubin (1995). The data, of size $n=34$ and concerning clinical trials on male patients, have three explanatory variables, namely serum creatinine (SC) concentration in mg/deciliter, body weight (WT) in kg, and age in years. There is one dependent variable, the endogenous creatinine (CR) clearance.

3. Results

3.1 Simulation study

Tables 1–3 below give a summary of the imputed values, updated parameter estimates, and other metrics for EM, SEM, and MCEM respectively.

Table 1 shows that using the EM method, most of the missing values are efficiently imputed. The overall MSE for the imputed values is 415.6422. The recovered value for the degrees of freedom is 2.1576 against the ML estimate of 1.9230 and the true value of 3. The recovered location vector and scatter matrix are also relatively close to their corresponding ML estimates.
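The MSE reported for EM is simply the average of the squared deviations listed in Table 1. The short sketch below reproduces that arithmetic; the variable names are illustrative, and the squared deviations are copied from the table rather than recomputed from the rounded imputed and true values.

```python
# Squared deviations for the six imputed positions, as listed in Table 1 (EM).
squared_deviations = [36.2402, 0.9322, 2384.2283, 34.486, 2.2735, 35.6932]

# MSE is the mean of the squared deviations over the imputed positions.
mse = sum(squared_deviations) / len(squared_deviations)
print(round(mse, 4))  # 415.6422
```

Most of this value is contributed by the single position in row 13, column 2, whose squared deviation is far larger than the other five.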

From Table 2, it can be observed that most of the missing positions are efficiently imputed, with relatively small deviations from their true values. The overall MSE for the SEM technique is 372.0655. Additionally, the 95% confidence intervals for the missing values include the true values. The recovered value for the degrees of freedom is 2.8970, against the ML estimate of 1.9230 and the true value of 3. The values for the location vector are also close to their corresponding ML estimates. However, the deviations between the updated values and the ML estimates for the scatter matrix are relatively large, indicating higher variability.

Table 2
Imputed, updated parameter values, and confidence intervals (SEM)

Row	Column	Imputed value	True value	Deviation	Deviation ${}^{2}$	95% C.I.
8	1	102.075	107.2248	5.1497	26.5199	[70.8545, 134.7684]
19	1	84.6125	84.0179	$-$ 0.5946	0.3535	[47.4675, 124.5025]
13	2	136.0165	90.1431	$-$ 45.8734	2104.3672	[45.0419, 227.0140]
19	2	131.3249	124.6941	$-$ 6.6308	43.9677	[84.4070, 183.3386]
21	2	141.2318	139.836	$-$ 1.3958	1.9482	[104.8106, 178.8097]
24	2	187.8532	180.4211	$-$ 7.4321	55.2364	[97.3942, 276.6947]
MSE				372.0655

Parameter	True	ML	Updated	Absolute difference (true/updated)	Absolute difference (ML/updated)
$\mu_{1}$	93	90.0538	89.5476	3.4524	0.5061
$\mu_{2}$	139	134.4720	135.6327	3.3673	1.1607
$\mu_{3}$	188	183.9852	182.7794	5.2206	1.2058
$v$	3	1.9230	2.8970	0.1030	0.9740
$\sigma_{11}$	435	292.8585	450.0698	15.0698	157.2114
$\sigma_{12}$	326.4583	197.5128	319.4644	6.9938	121.9517
$\sigma_{13}$	291.2623	217.8191	298.7804	7.5182	80.9614
$\sigma_{22}$	500	244.7940	579.6714	79.6714	334.8775
$\sigma_{23}$	312.2659	179.1945	258.1136	54.1524	78.9191
$\sigma_{33}$	398	285.0174	422.1006	24.1006	137.0832

Table 3

Imputed, updated parameter values, and confidence intervals (MCEM)

Row	Column	Imputed value	True value	Deviation	Deviation ${}^{2}$	95% C.I.
8	1	101.2108	107.2248	6.0139	36.1676	[100.3704, 102.0863]
19	1	83.0403	84.0179	0.9776	0.9556	[81.8890, 84.2780]
13	2	138.9945	90.1431	$-$ 48.8513	2386.4531	[136.1975, 141.7524]
19	2	130.5659	124.6941	$-$ 5.8718	34.4774	[129.2697, 131.8128]
21	2	141.3410	139.8360	$-$ 1.5050	2.265	[140.4646, 142.1778]
24	2	186.3729	180.4211	$-$ 5.9518	35.4237	[183.8448, 188.8242]
MSE				415.9571

Parameter	True	ML	Updated	Absolute difference (true/updated)	Absolute difference (ML/updated)
$\mu_{1}$	93	90.0538	89.0929	3.9071	0.9609
$\mu_{2}$	139	134.4720	135.3909	3.6091	0.9189
$\mu_{3}$	188	183.9852	183.3848	4.6152	0.6004
$v$	3	1.9230	2.1398	0.8602	0.2168
$\sigma_{11}$	435	292.8585	312.6941	122.3059	19.8357
$\sigma_{12}$	326.4583	197.5128	219.3672	107.0911	21.8544
$\sigma_{13}$	291.2623	217.8191	223.7708	67.4915	5.9517
$\sigma_{22}$	500	244.7940	289.8785	210.1215	45.0845
$\sigma_{23}$	312.2659	179.1945	178.5619	133.7040	0.6326
$\sigma_{33}$	398	285.0174	313.3401	84.6599	28.3228

Table 3 shows that most of the missing values are efficiently imputed by the MCEM method. The overall MSE for the method is 415.9571. Notably, the 95% confidence intervals for the missing values are narrow and therefore exclude the true values in most instances. The recovered value for the degrees of freedom is 2.1398, against the ML estimate of 1.9230 and the true value of 3. The values for the location vector and the scatter matrix are also close to their corresponding ML estimates, as indicated by their relatively small absolute differences.

3.2 Real data application

The parameter estimates for the real data realized by the three imputation procedures are displayed in Table 4.

It can be observed that the EM, SEM, and MCEM techniques estimate the degrees of freedom for the data as 4.3318, 4.8803, and 4.2701 respectively. As observed earlier in the simulation study, the results for imputed values and parameter updates in EM are very similar to those of the MCEM technique.

Table 4
Parameter estimates of real data for EM, SEM, and MCEM

Parameter	EM	SEM	MCEM
${\mu}_{WT}$	72.8695	72.9183	72.8718
${\mu}_{SC}$	1.1896	1.2085	1.1888
${\mu}_{Age}$	54.2758	54.3832	54.2833
$v$	4.3318	4.8803	4.2701
${\sigma}_{WT,WT}$	127.6682	149.9311	127.3282
${\sigma}_{WT,SC}$	0.1085	0.0835	0.1074
${\sigma}_{WT,Age}$	20.4667	21.3779	20.4460
${\sigma}_{SC,SC}$	0.2191	0.3204	0.2179
${\sigma}_{SC,Age}$	2.5414	2.8829	2.5305
${\sigma}_{Age,\ Age}$	289.4638	301.1791	289.1255

4. Discussion

The EM algorithm is a popular procedure often used in data imputation and parameter estimation problems. The technique is relatively easy to implement in exponential family models because the expectation of the complete data log likelihood function can be reduced to finding the expectations for the complete data sufficient statistics (Ng et al., 2012). In addition, the technique exhibits monotonic convergence, a feature which ensures that throughout the iterations, the log likelihood does not decrease. However, EM does not work well in models whose log likelihoods are intractable (Louis, 1982). Additionally, even in cases where the model is in the exponential family, the algorithm provides no guarantee that the complete data log likelihood function eventually obtained converges at the global maximum. In situations where the likelihood contains several saddle points, the point of convergence significantly relies on the starting point (Rubin & Thayer, 1982; Gupta & Chen, 2011). It is difficult to determine which starting point yields global maximal convergence.

Stochastic variants of EM have been devised to address some of its shortcomings. The variants are developed on the idea of global maximization. Irrespective of the starting point, the stochastic versions explore a wide region of the log likelihood function, thereby evading local maxima (Delyon et al., 1999). Furthermore, in situations where the log likelihood function has no closed form, the techniques efficiently estimate the expectation of the complete data log likelihood through simulation.

From the results displayed in Tables 1–3, SEM is the most efficient method among the three imputation procedures considered in this study. The method yields the lowest MSE value. It also yields the best value for the unknown degrees of freedom, recovering 2.8970 against the true value of 3. The EM and MCEM methods produce very similar MSE values. The degrees of freedom recovered by the two methods are very similar as well.

The SEM technique includes the true missing values in the 95% confidence intervals. This could be explained by its stochastic nature. Unlike the deterministic EM, SEM does not converge to the same value with each repeated sampling, which enables it to explore a wide range of values (Jank, 2006). In our results, however, MCEM does not include most of the true values in the 95% intervals. Indeed, the method behaves more or less like the deterministic EM. It is worth noting that when the number of samples drawn in the E-step is relatively large, MCEM yields results similar to those of the EM algorithm (Biscarat et al., 1992; Nielsen, 2000). The confidence intervals end up being too narrow to include the true values.

In the case of EM and MCEM, it can be observed in Tables 1 and 3 that the ML estimates for parameters of the originally simulated dataset are relatively close to the recovered parameter estimates upon imputation of the incomplete dataset. The deterministic nature of both the EM’s parameter updates and the ML parameter estimates obtained from the originally simulated datasets explains their relatively small absolute differences. It can also be observed in Table 2 that the updated parameter estimates in the case of SEM are close to the parameter values used in simulation owing to their relatively small absolute differences. However, the absolute difference between the ML estimates and the updated parameter estimates for SEM is relatively large, which could be associated with the random nature of the algorithm.

5. Conclusion

The study has focused on data imputation and parameter estimation in multivariate $t$ distribution with unknown degrees of freedom using EM, SEM, and MCEM algorithms. Among the three imputation procedures considered, SEM is the most efficient method in the imputation of missing data and in the recovery of the degrees of freedom. In case the log likelihood is likely to have many saddle points, SEM would be an appropriate imputation procedure since it can avoid the local maximal traps and ensure convergence at the global maxima. EM and MCEM yield almost similar results. In situations where the conditional expectation of a log likelihood is intractable, however, MCEM would be preferred to the EM method since it can estimate the function through simulation.

Footnotes

Conflict of interest

The authors declare that they have no competing interests.

Acknowledgements

This work was supported partly by funds from Egerton University Council Best Student Scholarship Award.

References

Biscarat, J. C., Celeux, & Diebolt (1992). Stochastic versions of the EM algorithm (No. TR-227). Washington University Seattle Department of Statistics.

Celeux, & Diebolt (1985). The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73-82.

Delyon, Lavielle, & Moulines (1999). Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 94-128.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 1-38.

Diffey, S. M., Smith, A. B., Welsh, A. H., & Cullis, B. R. (2017). A new REML (parameter expanded) EM algorithm for linear mixed models. Australian & New Zealand Journal of Statistics, 59(4), 433-448.

Doğru, F. Z., Bulut, Y. M., & Arslan (2018). Doubly reweighted estimators for the parameters of the multivariate t-distribution. Communications in Statistics-Theory and Methods, 47(19), 4751-4771.

Gilks, W. R., Richardson, & Spiegelhalter (1995). Markov chain Monte Carlo in practice. Chapman and Hall/CRC.

Gupta, M. R., & Chen (2011). Theory and Use of the EM Algorithm. Foundations and Trends® in Signal Processing, 4(3), 223-296.

Jank (2005, August). Stochastic Variants of EM: Monte Carlo, Quasi-Monte Carlo and More. In: Proceedings of the American Statistical Association.

Jank (2006). The EM algorithm, its randomized implementation and global optimization: Some challenges and opportunities for operations research. In: Perspectives in Operations Research, Springer, Boston, MA. 367-392.

Karimi, Lavielle, & Moulines (2019). On the convergence properties of the mini-Batch EM and MCEM algorithms.

Korda, Kaufmann, & Munos (2013). Thompson sampling for 1-dimensional exponential family bandits. In: Advances in Neural Information Processing Systems, 1448-1456.

Lange, K. L., Little, R. J., & Taylor, J. M. (1989). Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408), 881-896.

Levine, R. A., & Casella (2001). Implementations of the Monte Carlo EM algorithm. Journal of Computational and Graphical Statistics, 10(3), 422-439.

Lindsay, B. G. (1983). The geometry of mixture likelihoods, part II: the exponential family. The Annals of Statistics, 11(3), 783-792.

Liu (1995). Missing data imputation using the multivariate t distribution. Journal of Multivariate Analysis, 53(1), 139-158.

Liu, & Rubin, D. B. (1995). ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statistica Sinica, 19-39.

Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 44(2), 226-233.

McKnight, P. E., McKnight, K. M., Sidani, & Figueredo, A. J. (2007). Missing data: A gentle introduction. Guilford Press.

Ng, S. K., Krishnan, & McLachlan, G. J. (2012). The EM algorithm. In: Handbook of Computational Statistics. Springer, Berlin, Heidelberg. 139-172.

Nielsen, S. F. (2000). The stochastic EM algorithm: Estimation and asymptotic results. Bernoulli, 6(3), 457-489.

Pigott, T. D. (2001). A review of methods for missing data. Educational Research and Evaluation, 7(4), 353-383.

Raghunathan, T. E. (2004). What do we do with missing data? Some options for analysis of incomplete data. Annu Rev Public Health, 25, 99-117.

Roth (2013). On the multivariate t-distribution, Tech Rep.

Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47(1), 69-76.

Shih, W. J., & Weisberg (1986). Assessing influence in multiple linear regression with incomplete data. Technometrics, 28(3), 231-239.

Tregouet, D. A., Escolano, Tiret, Mallet, & Golmard, J. L. (2004). A new algorithm for haplotype-based association analysis: The Stochastic-EM algorithm. Annals of Human Genetics, 68(2), 165-177.

Varadhan, & Roland (2008). Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scandinavian Journal of Statistics, 35(2), 335-353.

Wahlström, Jalden, Skog, & Händel (2018). Alternative EM algorithms for nonlinear state-space models. In: 2018 21st International Conference on Information Fusion (FUSION). IEEE. 1260-1267.

Wei, G. C., & Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411), 699-704.

Zhu, & Peterson (2007). Maximum likelihood from spatial random effects models via the stochastic approximation expectation maximization algorithm. Statistics and Computing, 17(2), 163-177.
