Model based clustering using finite mixtures of multivariate geometric skew normal distribution

Abstract

Finite mixture models with non-normal component distributions are well developed and widely used for clustering. For multivariate data, mixture models with skewed component distributions allow the clustering approach to account for asymmetric behavior and heavy tails. This paper discusses clustering using multivariate geometric skew normal mixture models. The Expectation Maximization (EM) algorithm is used to compute maximum likelihood estimates for finite mixtures of multivariate geometric skew normal distributions. The Bayesian Information Criterion and the Akaike Information Criterion are used for model selection. Several eigenvalue decompositions of the covariance matrix are considered and compared with each other. The clustering approach is illustrated with simulated and real-life datasets, and comparisons are drawn with other mixture models.

Keywords

Finite mixture models, model based clustering, multivariate geometric skew normal distribution, EM algorithm, Bayesian Information Criterion, Akaike Information Criterion

1. Introduction

Finite mixture models find a large number of applications in clustering because they allow standard statistical modeling tools to be used to assess and evaluate a clustering. A finite mixture model represents the population density as a convex combination of a finite number of density functions. In model based clustering, each observation is assumed to come from one of the mixture components. Model based clustering fits a finite mixture model to the data and identifies each cluster with one of its components. Broad details of finite mixture models and their clustering applications are given by Everitt and Hand (1981), Titterington et al. (1985) and McLachlan and Peel (2000). McLachlan and Basford (1988), McNicholas and Murphy (2008, 2010a, 2010b) and Baek and McLachlan (2010) have studied applications of finite mixtures of multivariate Gaussian distributions in model based clustering. Semhar and Melnykov (2016) have discussed the challenges of model based clustering, such as initialization techniques, dimension reduction and variable selection. However, Gaussian mixture models cannot provide reasonable fits to heterogeneous data with heavy tails, asymmetry and outliers.

An alternative approach to finite mixture modeling uses skewed component densities. Several works have dealt with skewed distributions such as the skew normal (Azzalini, 1985) and the univariate and multivariate skew-t. The multivariate skew normal distribution has been studied in detail by Azzalini and Valle (1996). In recent years, increasing attention has been paid to non-normal mixture models for skewed data, such as the multivariate skew-normal model and the multivariate skew-$t$ distribution, which provide improved modeling and clustering of data that exhibit asymmetric behavior and outliers. Lin et al. (2007) proposed finite mixtures of skew normal distributions, which deal with population heterogeneity and skewness. Finite mixture models based on the Student-t, skew normal and normal distributions can be viewed as special cases of skew-t mixture modeling. Cabral et al. (2008) developed skew student-t normal mixture modeling using a Bayesian approach. Lee and McLachlan (2013a, 2013b) proposed finite mixture models with skew normal and skew-t distributions and compared the clustering performance of finite mixtures of multivariate skew normal and skew-t distributions with that of other non-normal mixture distributions. Lee and McLachlan (2014) reviewed recent developments in mixtures of multivariate skew-t distributions and discussed several characterizations of the multivariate skew-t distribution. Some other non-normal distributions have also received attention. Sanjeena et al. (2014) proposed parameter estimation for finite mixtures of multivariate normal inverse Gaussian distributions using a variational Bayes approximation. Multivariate generalized hyperbolic mixture models and a parameterization of the covariance matrix were proposed by Browne and McNicholas (2015). Adrian et al. (2016) proposed clustering using the multivariate normal inverse Gaussian distribution for heavy-tailed and asymmetric data. Melnykov et al. (2018) developed finite mixture modeling with components that can handle skewness in matrix-valued data.

In past years, much research has been done on non-normal mixture models for clustering, yielding robust clustering procedures. The Azzalini skew normal distribution, however, is well known to be thin-tailed and cannot be used to model heavy-tailed data. Recently, Debasis (2014) proposed a new three-parameter geometric skew normal distribution as an alternative to the Azzalini skew normal distribution. The geometric skew normal distribution can be obtained as a geometric sum of independent identically distributed (i.i.d.) normal random variables. This paper focuses on model based clustering using finite mixtures of the multivariate geometric skew normal distribution (Debasis, 2017) with $G$ components. The Multivariate Geometric Skew Normal (MGSN) mixture is an alternative to finite mixtures of the multivariate skew normal distribution for dealing with skewness in heterogeneous data.

Model selection is carried out using the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC). Clustering performance is measured by both the Adjusted Rand Index (ARI) and the Misclassification Rate (MR). Finite mixtures of multivariate geometric skew normal distributions provide an effective mathematical basis for clustering. The usefulness of multivariate geometric skew normal mixture models in clustering is illustrated using real and simulated datasets. MGSN mixture models are also compared with existing mixture models based on the Multivariate Skew Normal (MSN) and Multivariate Normal (MN) distributions.

The rest of this article is organized as follows. Section 2 presents model based clustering using finite mixtures of the multivariate geometric skew normal distribution. Parameterization of the covariance matrix is discussed in Section 3. Section 4 describes the model selection procedures and presents the clustering results obtained using simulated and real datasets. The conclusion is given in Section 5.

2. Model based clustering using multivariate geometric skew normal distribution

A $d$-variate MGSN distribution can be defined as follows. Suppose $N\sim\textit{GE}(p)$, $\left\{X_{i};i=1,2,\ldots\right\}$ are i.i.d. $N_{d}(\underline{\mu},\Sigma)$ random vectors, and all the random variables are independently distributed. Define

$\displaystyle X\stackrel{\textit{dist}}{=}\sum_{i=1}^{N}X_{i}$

then $X$ is said to have a $d$ -variate geometric skew-normal distribution with parameters $p,\underline{\mu}$ and $\Sigma$ and the probability density function is given by

$\displaystyle f_{X}\left(x;\underline{\mu},\Sigma,p\right)=\sum^{\infty}_{k=1}\frac{p\left(1-p\right)^{k-1}}{(2\pi)^{d/2}\left|\Sigma\right|^{1/2}k^{d/2}}\exp\left(-\frac{1}{2k}\left(x-k\underline{\mu}\right)^{T}\Sigma^{-1}\left(x-k\underline{\mu}\right)\right)$ (1)

The MGSN distribution will be denoted by $\textit{MGSN}_{d}(p,\underline{\mu},\Sigma)$ .
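The definition and Eq. (1) can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the function names `mgsn_pdf` and `mgsn_sample` and the truncation point `k_max` are assumptions introduced here. It evaluates the density by truncating the infinite series and draws observations using the fact that, given $N=k$, the sum of $k$ i.i.d. $N_{d}(\underline{\mu},\Sigma)$ vectors is $N_{d}(k\underline{\mu},k\Sigma)$.

```python
import numpy as np

def mgsn_pdf(x, mu, sigma, p, k_max=500):
    """Truncated-series evaluation of the MGSN density in Eq. (1) at one point x.

    k_max is an assumed truncation point; the omitted terms carry total
    geometric weight (1 - p) ** k_max.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = x.shape[0]
    sigma_inv = np.linalg.inv(sigma)
    norm_const = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    total = 0.0
    for k in range(1, k_max + 1):
        weight = p * (1.0 - p) ** (k - 1)          # P(N = k)
        diff = x - k * mu
        quad = diff @ sigma_inv @ diff
        total += weight * np.exp(-quad / (2.0 * k)) / (norm_const * k ** (d / 2))
    return total

def mgsn_sample(n, mu, sigma, p, rng):
    """Draw n observations as geometric sums of i.i.d. N_d(mu, sigma) vectors."""
    mu = np.asarray(mu, dtype=float)
    counts = rng.geometric(p, size=n)              # support {1, 2, ...}
    # Given N = k, the sum is distributed as N_d(k * mu, k * sigma).
    z = rng.multivariate_normal(np.zeros(mu.shape[0]), sigma, size=n)
    return counts[:, None] * mu + np.sqrt(counts)[:, None] * z

# Illustrative parameter values only.
rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
draws = mgsn_sample(1000, mu, sigma, p=0.5, rng=rng)
density_at_first_draw = mgsn_pdf(draws[0], mu, sigma, p=0.5)
```

For $p=1$ only the $k=1$ term survives, so the sketch reduces to the ordinary multivariate normal density, which gives a quick sanity check.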

Let $x_{1},x_{2},\ldots,x_{n}$ be a $d$-dimensional random sample from a $G$-component mixture of multivariate geometric skew normal distributions. The probability density function of the $G$-component finite mixture model is given by

$\displaystyle f_{X}\left(x|\Theta\right)=\sum^{G}_{g=1}{\pi}_{g}f(x;\underline{\mu}_{g},\Sigma_{g},p_{g})$ (2)

$\displaystyle f_{X}\left(x|\Theta\right)=\sum^{G}_{g=1}{\pi}_{g}\sum^{\infty}_{k=1}\frac{p_{g}\left(1-p_{g}\right)^{k-1}}{(2\pi)^{d/2}\left|\Sigma_{g}\right|^{1/2}k^{d/2}}\exp\left(-\frac{1}{2k}\left(x-k\underline{\mu}_{g}\right)^{T}\Sigma_{g}^{-1}\left(x-k\underline{\mu}_{g}\right)\right)$ (3)

where ${\pi}_{g}$ represents the mixing proportion with $\sum^{G}_{g=1}{{\pi}_{g}=1}$ , $0<{\pi}_{g}<1$ .

Clustering using finite mixture models is carried out with the EM algorithm (Dempster et al., 1977), an iterative procedure for finding maximum likelihood estimates when the data are incomplete. The algorithm alternates between two steps, the Expectation step (E-step) and the Maximization step (M-step). The E-step computes the expected value of the complete data log-likelihood. In the M-step, the maximum likelihood estimates of the model parameters are computed. These two steps are repeated until convergence is reached.

The complete data in the EM algorithm are ($X$, $Z$), where $X$ is the observed data, regarded as incomplete, and $Z$ is the missing data indicating the mixture component label of each observation. Writing $m_{i}$ for the value of the geometric variable $N$ associated with $x_{i}$, the likelihood function of the complete data is given by

$\displaystyle L(p,\underline{\mu},\Sigma)=\prod^{n}_{i=1}\prod^{G}_{g=1}\left[{\pi}_{g}f(x_{i};\underline{\mu}_{g},\Sigma_{g},p_{g})\right]^{Z_{ig}}=\prod^{n}_{i=1}\prod^{G}_{g=1}\left[{\pi}_{g}\frac{p_{g}\left(1-p_{g}\right)^{m_{i}-1}}{(2\pi)^{d/2}\left|\Sigma_{g}\right|^{1/2}m_{i}^{d/2}}\exp\left(-\frac{1}{2m_{i}}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)^{T}\Sigma_{g}^{-1}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)\right)\right]^{Z_{ig}}$ (4)

The complete data log-likelihood function for the multivariate geometric skew normal mixture model is given by

$\displaystyle l(p,\underline{\mu},\Sigma)=\sum^{n}_{i=1}\sum^{G}_{g=1}Z_{ig}\left\{\log{\pi}_{g}+\log\left(f(x_{i};\underline{\mu}_{g},\Sigma_{g},p_{g})\right)\right\}=\sum^{n}_{i=1}\sum^{G}_{g=1}Z_{ig}\left\{\log{\pi}_{g}+\log\left(\frac{p_{g}\left(1-p_{g}\right)^{m_{i}-1}}{(2\pi)^{d/2}\left|\Sigma_{g}\right|^{1/2}m_{i}^{d/2}}\exp\left(-\frac{1}{2m_{i}}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)^{T}\Sigma_{g}^{-1}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)\right)\right)\right\}$ (5)

EM algorithm for a mixture of MGSN distributions: the E-step.

The EM algorithm is simplified by introducing the latent variable $z_{ig}$, where $z_{ig}=1$ if observation $i$ belongs to component $g$ and 0 otherwise. The conditional expectation of the log-likelihood is as follows

$\displaystyle l(p,\underline{\mu},\Sigma)=\sum^{n}_{i=1}\sum^{G}_{g=1}E_{Z|X}\left(Z_{ig}\right)\left\{\log{\pi}_{g}+\log\left(f(x_{i};\underline{\mu}_{g},\Sigma_{g},p_{g})\right)\right\}$

$\displaystyle l(p,\underline{\mu},\Sigma)=\sum^{n}_{i=1}\sum^{G}_{g=1}{\tau}_{ig}\left\{\log{\pi}_{g}+\log\left(f(x_{i};\underline{\mu}_{g},\Sigma_{g},p_{g})\right)\right\}$

where ${\tau}_{ig}$ is the probability that observation $i$ belongs to group $g$. Substituting the component density gives

$\displaystyle l(p,\underline{\mu},\Sigma)=\sum^{n}_{i=1}\sum^{G}_{g=1}{\tau}_{ig}\left\{\log{\pi}_{g}+\log\left(\frac{p_{g}\left(1-p_{g}\right)^{m_{i}-1}}{(2\pi)^{d/2}\left|\Sigma_{g}\right|^{1/2}m_{i}^{d/2}}\exp\left(-\frac{1}{2m_{i}}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)^{T}\Sigma_{g}^{-1}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)\right)\right)\right\}$ (6)

$\displaystyle =\sum^{n}_{i=1}\sum^{G}_{g=1}{\tau}_{ig}\log{\pi}_{g}+\sum^{n}_{i=1}\sum^{G}_{g=1}{\tau}_{ig}\left\{\log\left(\frac{p_{g}\left(1-p_{g}\right)^{m_{i}-1}}{(2\pi)^{d/2}\left|\Sigma_{g}\right|^{1/2}m_{i}^{d/2}}\exp\left(-\frac{1}{2m_{i}}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)^{T}\Sigma_{g}^{-1}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)\right)\right)\right\}$ (7)

$\displaystyle =\sum^{n}_{i=1}\sum^{G}_{g=1}{\tau}_{ig}\log{\pi}_{g}+\sum^{n}_{i=1}\sum^{G}_{g=1}{\tau}_{ig}\left\{\log\left(p_{g}\left(1-p_{g}\right)^{m_{i}-1}\right)-\log\left((2\pi)^{d/2}\right)-\log\left(\left|\Sigma_{g}\right|^{1/2}\right)-\log\left(m_{i}^{d/2}\right)\right\}+\sum^{n}_{i=1}\sum^{G}_{g=1}{\tau}_{ig}\left(-\frac{1}{2m_{i}}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)^{T}\Sigma_{g}^{-1}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)\right)$ (8)

At the E-step, the posterior probability that the $i$-th observation belongs to the $g$-th mixture component is estimated by

$\displaystyle \widehat{\tau}_{ig}=\frac{{\pi}_{g}f(x_{i};\underline{\mu}_{g},\Sigma_{g},p_{g})}{\sum^{G}_{j=1}{\pi}_{j}f(x_{i};\underline{\mu}_{j},\Sigma_{j},p_{j})}$ (9)
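The E-step in Eq. (9) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's code: the names `mgsn_pdf` and `e_step` and the series truncation `k_max` are introduced here, and the component density is the truncated series of Eq. (1). No safeguard against numerical underflow is included, so points far from every component would need a log-domain variant.

```python
import numpy as np

def mgsn_pdf(X, mu, sigma, p, k_max=200):
    """Eq. (1) evaluated row-wise on X (n x d), truncating the series at k_max."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    mu = np.asarray(mu, dtype=float)
    d = X.shape[1]
    sigma_inv = np.linalg.inv(sigma)
    norm_const = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    dens = np.zeros(X.shape[0])
    for k in range(1, k_max + 1):
        diff = X - k * mu
        # Quadratic form (x - k mu)^T Sigma^{-1} (x - k mu) for every row.
        quad = np.einsum("ij,jk,ik->i", diff, sigma_inv, diff)
        dens += p * (1.0 - p) ** (k - 1) * np.exp(-quad / (2.0 * k)) / (norm_const * k ** (d / 2))
    return dens

def e_step(X, pis, mus, sigmas, ps):
    """Posterior membership probabilities tau[i, g] as in Eq. (9)."""
    weighted = np.column_stack([
        pi_g * mgsn_pdf(X, mu_g, sigma_g, p_g)
        for pi_g, mu_g, sigma_g, p_g in zip(pis, mus, sigmas, ps)
    ])
    # Normalise over all G components so that each row sums to one.
    return weighted / weighted.sum(axis=1, keepdims=True)

# Illustrative two-component example with made-up parameter values.
X = np.array([[-2.0, -2.0], [3.0, 3.0]])
tau = e_step(
    X,
    pis=[0.5, 0.5],
    mus=[np.array([-2.0, -2.0]), np.array([3.0, 3.0])],
    sigmas=[np.eye(2), np.eye(2)],
    ps=[0.8, 0.8],
)
```

Each row of `tau` sums to one because the denominator runs over all $G$ components, including component $g$ itself.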

EM algorithm for a mixture of MGSN distributions: the M-step.

In the M-step, the expected complete data log-likelihood is maximized with respect to the parameters. Differentiating Eq. (8) with respect to $\underline{\mu}_{g}$, $\Sigma_{g}$ and ${\pi}_{g}$ and equating the derivatives to zero gives

$\displaystyle \widehat{\underline{\mu}}_{g}=\frac{\sum^{n}_{i=1}\widehat{\tau}_{ig}x_{i}}{\sum^{n}_{i=1}\widehat{\tau}_{ig}m_{i}}$ (10)

$\displaystyle \widehat{\Sigma}_{g}=\frac{\sum^{n}_{i=1}\widehat{\tau}_{ig}\left(x_{i}-m_{i}\widehat{\underline{\mu}}_{g}\right)\left(x_{i}-m_{i}\widehat{\underline{\mu}}_{g}\right)^{T}}{\sum^{n}_{i=1}\widehat{\tau}_{ig}m_{i}}$ (11)

$\displaystyle \widehat{\pi}_{g}=\frac{\sum^{n}_{i=1}\widehat{\tau}_{ig}}{n}$ (12)
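As a worked sketch of how Eq. (10) and Eq. (12) follow from Eq. (8), treating the $m_{i}$ as known: only the quadratic term of Eq. (8) depends on $\underline{\mu}_{g}$, and the mixing proportions are maximized subject to $\sum_{g}{\pi}_{g}=1$. The Lagrange multiplier $\eta$ is a symbol introduced here for illustration and does not appear elsewhere in the paper.

```latex
\begin{align*}
\frac{\partial l}{\partial \underline{\mu}_{g}}
  &= \sum_{i=1}^{n} \widehat{\tau}_{ig}\,\frac{1}{m_{i}}\,m_{i}\,
     \Sigma_{g}^{-1}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)
   = \Sigma_{g}^{-1}\sum_{i=1}^{n}\widehat{\tau}_{ig}
     \left(x_{i}-m_{i}\underline{\mu}_{g}\right) = 0
  \;\Longrightarrow\;
  \widehat{\underline{\mu}}_{g}
   = \frac{\sum_{i=1}^{n}\widehat{\tau}_{ig}\,x_{i}}
          {\sum_{i=1}^{n}\widehat{\tau}_{ig}\,m_{i}}, \\[6pt]
\frac{\partial}{\partial \pi_{g}}
  \left[\sum_{i=1}^{n}\sum_{h=1}^{G}\widehat{\tau}_{ih}\log\pi_{h}
        -\eta\left(\sum_{h=1}^{G}\pi_{h}-1\right)\right]
  &= \frac{\sum_{i=1}^{n}\widehat{\tau}_{ig}}{\pi_{g}}-\eta = 0
  \;\Longrightarrow\;
  \pi_{g} = \frac{1}{\eta}\sum_{i=1}^{n}\widehat{\tau}_{ig}, \qquad
  \eta = \sum_{g=1}^{G}\sum_{i=1}^{n}\widehat{\tau}_{ig} = n .
\end{align*}
```

The last equality uses the fact that the posterior probabilities of each observation sum to one over the $G$ components.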

The problem of initializing the EM algorithm is not fully resolved, and no single solution exists. Many researchers have developed initialization techniques for model based clustering. McLachlan (1988) proposed the use of principal component analysis for choosing initial values for multivariate mixture models. A standard technique for tackling EM initialization is the Multiple Restart approach (MREM), in which the EM algorithm is run many times, each run being started from different random initial values (McLachlan et al., 2000). The run that attains the highest log-likelihood value is retained as the best result.

Initialization of the EM algorithm using k-means with the Euclidean distance is widely used in model based clustering. The Euclidean distance is suited to homogeneous, spherical clusters. In this paper, the Mahalanobis distance is used in the initialization procedure because it captures the covariance structure of the clusters and helps to identify and correctly classify non-spherical clusters in non-homogeneous data.
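The difference between the two distance measures can be seen in a short sketch. The code below is only an illustration of a Mahalanobis-based assignment step as used in k-means-style initialization; it is not the sample weighted procedure cited later in this section, and the function names are assumptions introduced here.

```python
import numpy as np

def mahalanobis_distances(X, center, cov):
    """Mahalanobis distance of each row of X (n x d) from a cluster center."""
    diff = np.asarray(X, dtype=float) - np.asarray(center, dtype=float)
    # Solve cov @ y = diff.T rather than forming the explicit inverse.
    solved = np.linalg.solve(cov, diff.T).T
    return np.sqrt(np.einsum("ij,ij->i", diff, solved))

def assign_to_nearest(X, centers, covs):
    """Assign each observation to the center with the smallest Mahalanobis distance."""
    dists = np.column_stack([
        mahalanobis_distances(X, c, s) for c, s in zip(centers, covs)
    ])
    return dists.argmin(axis=1)

# A point lying along the long axis of an elongated cluster (made-up values).
point = np.array([[3.0, 0.0]])
centers = [np.array([0.0, 0.0]), np.array([3.0, 2.0])]
covs = [np.diag([9.0, 1.0]), np.eye(2)]
labels = assign_to_nearest(point, centers, covs)
```

In this example the point is at Euclidean distance 3 from the first center and 2 from the second, but at Mahalanobis distance 1 and 2 respectively, so it is assigned to the elongated first cluster.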

The model-based clustering with a finite mixture of multivariate geometric skew normal distributions using the EM algorithm proceeds as follows.

1. Fix $\varepsilon>0$. The initial value of the mixing proportion ${\pi}_{g}^{(0)}$ is obtained using the formula

$\displaystyle {\pi}_{g}^{(0)}=\frac{w_{g}}{\sum^{G}_{g=1}w_{g}};\quad w_{g}=\sum^{n}_{i=1}\widehat{\tau}_{ig}\,d_{ig};\quad g=1,2,\ldots,G$

where $\widehat{\tau}_{ig}=\frac{{\pi}_{g}f(x_{i};\underline{\mu}_{g},\Sigma_{g},p_{g})}{\sum^{G}_{j=1}{\pi}_{j}f(x_{i};\underline{\mu}_{j},\Sigma_{j},p_{j})}$ and $d_{ig}$ is the Mahalanobis distance measure.

2. The initial values of the parameters $\underline{\mu}_{g}^{(0)}$ are obtained using sample weighted k-means with the Mahalanobis distance measure (Deepana, 2017).

3. Compute the covariance matrix $\Sigma_{g}^{(0)}$ for each covariance structure using the algorithm from Section 3.

4. E-step: Compute $\widehat{\tau}^{(a)}_{ig}$ with ${\pi}_{g}^{(0)}$, $\underline{\mu}_{g}^{(0)}$ and $\Sigma_{g}^{(0)}$.

5. Set $a=1$; compute $\widehat{\underline{\mu}}_{g}^{(a)}$ with $\widehat{\tau}^{(a)}_{ig}$.

6. Compute $\widehat{\Sigma}_{g}^{(a)}$ with $\widehat{\tau}^{(a)}_{ig}$.

7. Compute $\widehat{\pi}_{g}^{(a)}$ with $\widehat{\tau}^{(a)}_{ig}$.

8. M-step: Update $\widehat{\tau}^{(a)}_{ig}$ with $\widehat{\pi}_{g}^{(a)}$, $\widehat{\underline{\mu}}_{g}^{(a)}$ and $\widehat{\Sigma}_{g}^{(a)}$.

9. Compute BIC and AIC using Eq. (14).

10. Compute the adjusted Rand index and the misclassification rate.

11. Compare $\widehat{\underline{\mu}}_{g}^{(a)}$ and $\widehat{\underline{\mu}}_{g}^{(a-1)}$. If $\left\|\widehat{\underline{\mu}}_{g}^{(a)}-\widehat{\underline{\mu}}_{g}^{(a-1)}\right\|<\varepsilon$, stop. Otherwise set $a=a+1$ and return to step 4.

3. Parameterization of the covariance matrix

The covariance matrix represents geometric features of the clusters such as volume, shape and orientation. Several techniques have been developed for structuring the covariance matrices of Gaussian mixture models in clustering. To provide simple and easily interpretable models, Banfield and Raftery (1993) reparameterized the covariance matrices in terms of the eigenvalue decomposition. Celeux and Govaert (1995) classified the covariance models into three families, namely the spherical, diagonal and general families. The covariance matrix $\Sigma_{g}$ is parameterized through the eigenvalue decomposition, which is given by

$\displaystyle\Sigma_{g}={\lambda}_{g}D_{g}A_{g}D_{g}^{T}$ (13)

where ${\lambda}_{g}$ is a scalar, $D_{g}$ is an orthogonal matrix of eigenvectors and $A_{g}$ is a diagonal matrix whose elements are proportional to the eigenvalues of $\Sigma_{g}$. $D_{g}$ determines the orientation of the principal components of $\Sigma_{g}$, $A_{g}$ determines the shape of the density contours and ${\lambda}_{g}$ determines the volume of the ellipsoid, which is proportional to $\lambda_{g}^{d}\left|A_{g}\right|$, where $d$ is the dimension of the data. Fraley and Raftery (1998, 2002) developed an eigenvalue decomposition of the cluster covariance matrices to provide a wide range of parsimonious covariance structures. This work is implemented in the MCLUST family of models, which consists of 10 mixture models that arise from imposing constraints on the group covariance matrix. The cluster structure leads to different sizes and shapes: some clusters are spherical and some are elliptical. The eigenvalue decomposition of the component covariance matrices provides different shapes, volumes and orientations. EII and VII indicate spherical components without and with varying volume, respectively. In the EEE model, the clusters are elliptical and the same covariance structure is applied to all clusters. In the EEV model, the clusters are elliptical with the same volume and shape but different orientations. In the VVV model, the clusters are elliptical with different volume, shape and orientation. A summary of the eigenvalue decomposition covariance structures is given in Table 1.
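The decomposition in Eq. (13) can be illustrated with a short numerical sketch. It assumes the common normalisation $\left|A_{g}\right|=1$, so that ${\lambda}_{g}=\left|\Sigma_{g}\right|^{1/d}$; this normalisation and the function name `decompose_covariance` are assumptions made here for illustration.

```python
import numpy as np

def decompose_covariance(sigma):
    """Split sigma into (lam, D, A) with sigma = lam * D @ A @ D.T and det(A) = 1."""
    sigma = np.asarray(sigma, dtype=float)
    d = sigma.shape[0]
    eigvals, eigvecs = np.linalg.eigh(sigma)   # ascending eigenvalues, orthonormal columns
    order = np.argsort(eigvals)[::-1]          # reorder to descending eigenvalues
    eigvals, D = eigvals[order], eigvecs[:, order]
    lam = np.prod(eigvals) ** (1.0 / d)        # volume factor |sigma|^(1/d)
    A = np.diag(eigvals / lam)                 # shape matrix, normalised so det(A) = 1
    return lam, D, A

# Illustrative covariance matrix (made-up values).
sigma = np.array([[4.0, 1.0], [1.0, 2.0]])
lam, D, A = decompose_covariance(sigma)
reconstructed = lam * D @ A @ D.T
```

Constraining `lam`, `A` or `D` to be equal across components, or letting them vary, yields the covariance structures described above.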

Table 1

Nomenclature, scale matrix structure and the number of free scale parameters for the eigen-decomposed family of models

Model	${\lambda}_{g}$	$A_{g}$	$D_{g}$	$\Sigma_{g}$	No. of covariance parameters
EII	Equal	Spherical	–	$\lambda I$	1
VII	Variable	Spherical	–	${\lambda}_{g}I$	G
EEV	Equal	Equal	Variable	$\lambda D_{g}AD^{\prime}_{g}$	$Gd(d+1)/2-(G-1)d$
EEE	Equal	Equal	Equal	$\lambda DAD^{\prime}$	$d(d+1)/2$
VVV	Variable	Variable	Variable	${\lambda}_{g}D_{g}A_{g}D^{\prime}_{g}$	$Gd(d+1)/2$
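The parameter counts in the last column of Table 1 can be evaluated for a given number of components $G$ and dimension $d$. The helper below is a small sketch that transcribes that column; the function name is an assumption introduced here.

```python
def covariance_parameter_count(model, G, d):
    """Number of free covariance parameters for a model, as listed in Table 1."""
    full = d * (d + 1) // 2                    # free entries of one symmetric d x d matrix
    counts = {
        "EII": 1,
        "VII": G,
        "EEV": G * full - (G - 1) * d,
        "EEE": full,
        "VVV": G * full,
    }
    return counts[model]

# Example: G = 2 components in d = 6 dimensions.
for name in ("EII", "VII", "EEV", "EEE", "VVV"):
    print(name, covariance_parameter_count(name, G=2, d=6))
# EII 1
# VII 2
# EEV 36
# EEE 21
# VVV 42
```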

3.1 Covariance estimation

An alternative estimation method for the covariance matrix is presented in this paper. The decomposed elements of the covariance matrix are updated according to the following algorithm. $\widehat{\tau}_{ig}$ represents the probability that observation $i$ belongs to group $g$ given the current component parameters

$\displaystyle n_{g}=\widehat{\tau}_{ig}=\frac{{\pi}_{g}f(x_{i};\underline{\mu}_{g},\Sigma_{g},p_{g})}{\sum^{G}_{j=1}{\pi}_{j}f(x_{i};\underline{\mu}_{j},\Sigma_{j},p_{j})}$

The M-step involves conditionally maximizing the complete data log-likelihood with respect to the parameters. The estimated mixing proportion and sample cross-product matrix for the $g$-th component are given by

$\displaystyle \widehat{\pi}_{g}=\frac{n_{g}}{n};\ g=1,2,\ldots,G$

$\displaystyle W_{g}=\sum^{n}_{i=1}n_{g}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)\left(x_{i}-m_{i}\underline{\mu}_{g}\right)^{T};\ g=1,2,\ldots,G$

1) Set the iteration counter $t=1$.

2) Update

$\displaystyle {\lambda}_{g}=\frac{\sum^{G}_{g=1}\text{tr}(n_{g}W_{g})}{nd}$

where $n$ is the number of observations and $d$ is the dimension.

3) Update

$\displaystyle A_{g}=\frac{\textit{diag}(n_{g}W_{g})}{\left|n_{g}W_{g}\right|^{1/d}}$

4) Update

$\displaystyle D_{g}=n_{g}W_{g}a_{g}$

where $a_{g}$ is the largest eigenvalue of $W_{g}$.

5) Update ${\lambda}_{g}$, $D_{g}$ and $A_{g}$ in $\Sigma_{g}$.

6) Calculate $E_{t}=\frac{1}{\lambda}\sum^{G}_{g=1}\text{tr}(n_{g}{\lambda}_{g}D_{g}A_{g}D_{g}^{T})+nd\log(\lambda)$.

7) For $t>1$, check whether $E_{t}-E_{t-1}>\epsilon$. If true, set $t=t+1$ and return to step 2; otherwise stop.

The five types of covariance structures discussed in Celeux and Govaert (1995) are considered for clustering with finite mixtures of multivariate geometric skew normal distributions.

1) EII – $\lambda I$

$\displaystyle W=\sum^{n}_{i=1}n_{g}\left(x_{i}-m_{i}\underline{\mu}\right)\left(x_{i}-m_{i}\underline{\mu}\right)^{T};\quad \lambda=\frac{\text{tr}(W)}{nd}$

2) VII – $\lambda_{g}I$

$\displaystyle W_{g}=\sum^{n}_{i=1}n_{g}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)\left(x_{i}-m_{i}\underline{\mu}_{g}\right)^{T};\quad \lambda_{g}=\frac{\sum^{G}_{g=1}\text{tr}(n_{g}W_{g})}{n_{g}d}$

3) EEE – $\lambda DAD^{T}$

$\displaystyle \lambda=\frac{\text{tr}(W)}{nd};\quad A=\frac{\textit{diag}(W)}{\left|W\right|^{1/d}};\quad D=Wa$

where $a$ is the largest eigenvalue of $W$.

4) EEV – $\lambda D_{g}AD_{g}^{T}$

$\displaystyle W_{g}=\sum^{n}_{i=1}n_{g}\left(x_{i}-m_{i}\underline{\mu}_{g}\right)\left(x_{i}-m_{i}\underline{\mu}_{g}\right)^{T};\quad D_{g}=n_{g}W_{g}a_{g}$

5) VVV – $\lambda_{g}D_{g}A_{g}D_{g}^{T}$

$\displaystyle \lambda_{g}=\frac{\sum^{G}_{g=1}\text{tr}(n_{g}W_{g})}{n_{g}d};\quad A_{g}=\frac{\textit{diag}(n_{g}W_{g})}{\left|n_{g}W_{g}\right|^{1/d}};\quad D_{g}=n_{g}W_{g}a_{g}$

4. Model selection and clustering performance

In model based clustering, model selection criteria are generally used to choose the best model and to select the number of groups. In this paper, the Bayesian Information Criterion (Schwarz, 1978) and the Akaike Information Criterion (Akaike, 1973) are used for model selection.

$\displaystyle \text{BIC}=m\log\left(n\right)-2l\quad\text{and}\quad\text{AIC}=2m-2l$ (14)

where $l$ is the maximized observed-data log-likelihood, $m$ is the number of parameters in the model and $n$ is the number of observations. The Adjusted Rand Index (Hubert & Arabie, 1985) is used for evaluating the clustering performance; it takes its maximum value of 1 under perfect agreement and has an expected value of 0 under random labelling. The misclassification rate is also used to check the clustering results.
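Eq. (14) translates directly into code. The following sketch uses hypothetical input values chosen only for illustration; the function name is an assumption introduced here.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Return (BIC, AIC) as defined in Eq. (14)."""
    bic = n_params * math.log(n_obs) - 2.0 * log_likelihood
    aic = 2.0 * n_params - 2.0 * log_likelihood
    return bic, aic

# Hypothetical values: l = -100, m = 5 parameters, n = 200 observations.
bic, aic = information_criteria(log_likelihood=-100.0, n_params=5, n_obs=200)
print(aic)  # 210.0
```

The two criteria differ only in the penalty per parameter, $\log(n)$ for BIC against 2 for AIC.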

4.1 Experimental results

This section provides experimental validation and illustrative examples for model based clustering using finite mixtures of multivariate geometric skew normal distribution. The performance of the model based clustering using multivariate geometric skew normal mixture models is illustrated with real and simulated datasets.

4.2 Banknote dataset

The Swiss banknote dataset is considered for the analysis. The dataset consists of 200 samples and 6 variables, and contains 100 counterfeit notes and 100 genuine notes. The variables are length of bill, width of left edge, width of right edge, bottom margin width, top margin width and length of the diagonal. All measurements are in millimeters. All variables are considered for this study. This dataset was recorded by Flury et al. (1988) and is available in the mclust package (Fraley et al., 2006).

Mardia’s test has been used to check the skewness of the multivariate dataset. The $p$-value is 0 and the skewness is 6.982568, which indicates a departure from multivariate normality.

The effectiveness of different covariance structures for clustering based on the finite mixture of multivariate geometric skew normal distributions is investigated. Initial values for the finite mixture models are obtained from the procedure described in the algorithm in Section 2. A summary of the clustering results for the multivariate mixture models is given in Table 2.
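For reference, Mardia's multivariate skewness can be computed as in the sketch below. This is an illustration of the standard statistic and is not claimed to reproduce the figure quoted above, which may depend on the software and conventions used; the function name is an assumption introduced here. Under multivariate normality, $n\,b_{1,d}/6$ is asymptotically chi-square distributed with $d(d+1)(d+2)/6$ degrees of freedom.

```python
import numpy as np

def mardia_skewness(X):
    """Return (b1, statistic, dof) for Mardia's multivariate skewness test."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / n                        # covariance estimate with divisor n
    # gram[i, j] = (x_i - xbar)^T S^{-1} (x_j - xbar)
    gram = centered @ np.linalg.solve(S, centered.T)
    b1 = np.sum(gram ** 3) / n ** 2                      # sample multivariate skewness
    statistic = n * b1 / 6.0                             # compare with a chi-square quantile
    dof = d * (d + 1) * (d + 2) // 6
    return b1, statistic, dof

# Illustrative skewed sample (independent exponential variables).
rng = np.random.default_rng(0)
b1, statistic, dof = mardia_skewness(rng.exponential(size=(500, 3)))
```

A large value of `statistic` relative to the chi-square distribution with `dof` degrees of freedom indicates skewness.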

Table 2
Clustering performance of various multivariate mixture models on the Banknote dataset

	Models	BIC	AIC	MR	ARI	Log likelihood
MGSN mixture	EII	1886.912	1834.562	0.09	0.8249	$-$ 796.53
	VII	1897.072	1856.582	0.05	0.7903	$-$ 769.94
	EEE	1899.649	1883.544	0.28	0.9499	$-$ 868.27
	EEV	1925.195	1892.052	0.005	0.9893	$-$ 893.82
	VVV	1895.839	1869.302	0.12	0.8093	$-$ 788.28
EMMIXskew		1892.361	1825.427	0.22	0.8692	$-$ 834.25
Mclust		1852.597	1852.597	0.13	0.8383	$-$ 819.29

Table 3

Classification table for the MSN, MN and MGSN mixture models using five covariance models (banknote dataset)

EII			VII			EEE
Actual	Clusters		Actual	Clusters		Actual	Clusters
	Cluster1	Cluster2		Cluster1	Cluster2		Cluster1	Cluster2
Counterfeit	3	97	Counterfeit	2	98	Counterfeit	13	87
Genuine	85	15	Genuine	92	8	Genuine	57	43

EEV			VVV
Actual	Clusters		Actual	Clusters
	Cluster1	Cluster2		Cluster1	Cluster2
Counterfeit	0	100	Counterfeit	15	85
Genuine	99	1	Genuine	91	9

EMMIXskew (MSN)			MCLUST (MN)
Actual	Clusters		Actual	Clusters
	Cluster1	Cluster2		Cluster1	Cluster2
Counterfeit	81	19	Counterfeit	86	14
Genuine	25	75	Genuine	12	88
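The misclassification rates in Table 2 can be recovered from the classification tables in Table 3. The sketch below assumes that clusters are matched to the true classes by the one-to-one mapping that maximizes the number of correctly assigned observations; this matching rule and the function name are assumptions made here, since the matching rule is not stated in the text.

```python
from itertools import permutations

def misclassification_rate(table):
    """Misclassification rate from a square table.

    table[i][j] is the number of observations of true class i assigned to
    cluster j. Clusters are matched to classes by the best one-to-one mapping.
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    best_correct = max(
        sum(table[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return (total - best_correct) / total

eev_table = [[0, 100],   # counterfeit: Cluster1, Cluster2
             [99, 1]]    # genuine:     Cluster1, Cluster2
print(misclassification_rate(eev_table))  # 0.005
```

Enumerating all permutations is adequate for the small numbers of clusters considered here; for many clusters an assignment algorithm would be preferable.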

Figure 1.

Clustering plot for banknote dataset using different covariance models for MGSN mixture models.

The model with the highest BIC and AIC values is selected among the covariance models. Among the five covariance structures of the MGSN mixture models, the EEV model attains the highest values of BIC (1925.195) and AIC (1892.052). Table 2 shows that the EEV model achieved the lowest misclassification error (0.005) and the highest ARI (0.9893), indicating a close match to the true labels. The other covariance models also provide reasonable clustering results. The EEV model is also compared with the Multivariate Skew Normal (MSN) and Multivariate Normal (MN) mixture models. Table 3 presents the classification results for the banknote dataset from mclust, EMMIXskew and the finite mixture of MGSN distributions. Clustering using the finite mixture of MGSN distributions attains higher values of BIC and AIC than the multivariate skew normal and multivariate normal mixture models.

The above results indicate that the MGSN mixture model outperforms the other mixture models. Figure 1 shows the scatter plots for the banknote dataset with five different covariance models. Figure 2 shows the scatter plots of the MSN and MN mixture models.

4.3 Simulated dataset

In this simulation study, a dataset of $n=1000$ observations is generated with $d=3$ and $G=5$ groups. Each group contains 200 observations. The scatter plot of the original data is given in Fig. 3. This dataset is clustered using finite mixtures of multivariate geometric skew normal distributions with different covariance models. Initial parameter values for the MGSN mixture model are obtained using the procedure described in Section 2.

Different covariance structures in MGSN mixture models are considered. The best results of model based clustering using MGSN mixture models are also compared with other multivariate mixture models. The clustering results of the simulated dataset are provided in Table 4.

From Table 4, it is observed that the VVV model gives the lowest misclassification rate (0.018). Its ARI is 83% with BIC (18912.02) and AIC (19382.39). Among the five covariance structures of the MGSN mixture models, the VVV model achieved the highest ARI. Table 5 gives the classification tables, and hence the number of misallocated observations, for the simulated dataset, comparing the results from mclust, EMMIXskew and the finite mixture of MGSN distributions with five different covariance models.

Table 4
Clustering performance of various multivariate mixture models on the simulated dataset

	Models	BIC	AIC	MR	ARI	Log likelihood
MGSN mixture models	EII	18713.03	18781.74	0.112	0.7812	$-$ 9404.868
	VII	17891.34	17734.42	0.297	0.7231	$-$ 9201.637
	EEE	18213.47	18281.74	0.132	0.7650	$-$ 9443.168
	EEV	18813.02	18981.94	0.087	0.7330	$-$ 9413.817
	VVV	18912.05	19382.39	0.018	0.8312	$-$ 9351.023
MSN		17753.23	18292.42	0.20	0.7478	$-$ 9218.768
MN		18623.03	18581.74	0.149	0.7150	$-$ 9124.668

Figure 2.

a) Clustering plot using multivariate normal mixture models; b) Clustering plot using multivariate skew normal mixture models.

Table 5

Classification table for the MSN, MN and MGSN mixture models using five covariance models (simulated dataset)

EII						VII
Actual						Actual
Cluster	Group 1	Group 2	Group 3	Group 4	Group 5	Cluster	Group 1	Group 2	Group 3	Group 4	Group 5
Cluster 1	0	200	0	199	0	Cluster 1	0	200	0	195	0
Cluster 2	0	0	1	0	197	Cluster 2	0	0	1	0	197
Cluster 3	1	0	190	1	3	Cluster 3	1	0	193	0	3
Cluster 4	97	0	8	0	0	Cluster 4	97	0	5	5	0
Cluster 5	102	0	1	0	0	Cluster 5	102	0	1	0	0

EEE							EEV
Actual							Actual
Cluster	Group 1	Group 2	Group 3	Group 4	Group 5	Cluster	Group 1	Group 2	Group 3	Group 4	Group 5
Cluster 1	200	0	26	0	0	Cluster 1	189	0	0	0	0
Cluster 2	0	0	159	0	200	Cluster 2	0	0	128	0	0
Cluster 3	0	87	0	0	0	Cluster 3	11	0	71	1	3
Cluster 4	0	113	0	4	0	Cluster 4	0	200	0	199	0
Cluster 5	0	0	15	196	0	Cluster 5	0	0	1	0	197

VVV							MSN
Actual							Actual
Cluster	Group 1	Group 2	Group 3	Group 4	Group 5	Cluster	Group 1	Group 2	Group 3	Group 4	Group 5
Cluster 1	198	0	0	5	0	Cluster 1	193	0	36	0	0
Cluster 2	0	200	5	0	0	Cluster 2	0	10	139	0	179
Cluster 3	0	0	0	193	0	Cluster 3	0	87	0	10	0
Cluster 4	0	0	1	0	197	Cluster 4	0	103	0	4	21
Cluster 5	2	0	194	2	3	Cluster 5	7	0	25	186	0

MN
	Actual
Cluster	Group 1	Group 2	Group 3	Group 4	Group 5
Cluster 1	175	0	0	0	0
Cluster 2	0	10	110	0	0
Cluster 3	11	0	71	11	13
Cluster 4	0	190	18	189	0
Cluster 5	14	0	1	0	187

Figure 3.

Scatter plot of the original dataset.

Figure 4 shows the scatter plots for the five different covariance structures using the MGSN mixture model. The scatter plots for the multivariate skew normal and normal mixture models are depicted in Fig. 5. The goal of this study is to check whether multivariate geometric skew normal mixture models can be fitted using the model based clustering approach.

Figure 4.

Clustering plot for simulated dataset using different covariance structures for MGSN mixture models.

Figure 5.

a) Clustering plot using multivariate normal mixture models; b) Clustering plot using multivariate skew normal mixture models.

5. Conclusion

This paper presents non-Gaussian model based clustering using multivariate geometric skew normal mixture models for skewed data. Parameter estimation using the EM algorithm is outlined. Different covariance structures are considered for the multivariate geometric skew normal mixture models. Model based clustering using finite mixtures of the multivariate geometric skew normal distribution is evaluated using both simulated and real-life datasets. The clustering results based on finite mixtures of MGSN distributions produced the lowest misclassification error and the highest ARI values when compared with the MSN and MN mixture models.

References

1. Adrian, O. H., Thomas, B. M., Isobel, C. G., Paul, D. M., & Dimitris (2016). Clustering with the multivariate normal inverse Gaussian distribution, Computational Statistics and Data Analysis, 93, 18-30.

2. Akaike (1973). Information theory and an extension of the maximum likelihood principle. Proceedings of the Second International Symposium on Information Theory, B. N. Petrov and F. Csaki, eds., Akademiai Kiado, Budapest, pp. 267-281.

3. Azzalini (1985). A class of distributions which includes the normal ones, Scandinavian Journal of Statistics (Theory and Applications), 12(2), 171-178.

4. Azzalini, & Dalla, V. A. (1996). The multivariate skew-normal distribution. Biometrika, 83(4), 715-726.

5. Banfield, J. D., & Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3), 803-821.

6. Baek, & McLachlan, G. J. (2010). Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualization of high dimensional data, IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7), 1298-1309.

7. Biernacki, Celeux, & Govaert (2003). Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models, Computational Statistics and Data Analysis, 41(3), 561-575.

8. Bouveyron, & Brunet (2012). Simultaneous model-based clustering and visualization in the Fisher discriminative subspace, Statistics and Computing, 22(1), 301-324.

9. Browne, R. P., & McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions, Canadian Journal of Statistics, 43(2), 176-198.

10. Cabral, C. R. B., Bolfarine, & Pereira, J. R. G. (2008). Bayesian density estimation using skew student-t normal mixture, Computational Statistics and Data Analysis, 52(12), 5075-5090.

11. Cabral, C. R. B., Lachos, V. H., & Prates, O. M. (2012). Multivariate mixture modeling using skew-normal independent distributions. Computational Statistics and Data Analysis, 56, 126-142.

12. Celeux, G., & Govaert (1995). Gaussian parsimonious clustering models, Pattern Recognition, 28(5), 781-793.

13. Debasis (2014). Geometric Skew Normal Distribution, Sankhya: The Indian Journal of Statistics, 76-B(2), 167-189.

14. Debasis (2017). Multivariate geometric skew normal distribution, Journal of Theoretical and Applied Statistics, 51(6), 1377-1397.

15. Deepana (2017). On sample weighted clustering algorithm using Euclidean and Mahalanobis distance measures, International Journal of Statistics and Systems, 12(3), 421-430.

16. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood for incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B), 39, 1-38.

17. Everitt, B. S., & Hand, D. J. (1981). Finite Mixture Distributions. Chapman and Hall, London.

18. Flury, & Riedwyl (1998). Multivariate Statistics: A Practical Approach. Cambridge University Press, Cambridge.

19. Fraley, & Raftery, A. E. (1998). How many clusters? Which clustering methods? Answers via model-based cluster analysis, The Computer Journal, 41(8), 578-588.

20. Fraley, & Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association, 97(458), 611-631.

21. Fraley, & Raftery, A. E., MCLUST version 3 for R: Normal mixture modeling and model-based clustering, Technical Report 504, Department of Statistics, University of Washington. First Published September 2006. Minor revisions January 2007 and November.

22. Hubert, & Arabie (1985). Comparing partitions. Journal of Classification, 2(1), 193-218.

23. Karlis (2002). An EM type algorithm for maximum likelihood estimation of the normal-inverse Gaussian distribution, Statistics and Probability Letters, 57(1), 43-52.

24. Karlis, & Xekalaki (2003). Choosing initial values for the EM algorithm for finite mixtures, Computational Statistics and Data Analysis, 41(3-4), 577-590.

25. Lin, T. I., Lee, J. C., & Yen, S. Y. (2007). Finite Mixture Modeling using Skew Normal Distribution, Statistica Sinica, 17(3), 909-927.

26. Lee, X. L., & McLachlan, G. J. (2013a). On mixtures of skew normal and skew t-distributions. Advances in Data Analysis and Classification, 7(3), 241-266.

27. Lee, X. L., & McLachlan, G. J. (2013b). Model-based clustering and classification with non-normal mixture distributions. Statistical Methods and Applications, 22(4), 427-454.

28. Lee, X. L., & McLachlan, G. J. (2013c). EMMIXuskew: An R Package for fitting mixtures of multivariate skew t distributions via the EM algorithm. Journal of Statistical Software, 55(12), 1-22.

29. Lee, X. L., & McLachlan, G. J. (2014). Finite mixtures of multivariate skew t-distributions: some recent and new results. Statistics and Computing, 24(2), 181-202.

30. Melnykov, & Xuwen (2018). On Model-based clustering of skewed matrix data, Journal of Multivariate Analysis, 167(c), 181-194.

31. McLachlan, G. J., & Basford, K. E. (1988). Mixture Models: Inference and Applications. Marcel Dekker, New York.

32. McLachlan, G. J., & Peel (2000). Finite Mixture Models. John Wiley and Sons, Inc, New York.

33. McNicholas, P. D., & Murphy, T. B. (2008). Parsimonious Gaussian mixture models, Statistics and Computing, 18(3), 285-296.

34. McNicholas, P. D., & Murphy, T. B. (2010a). Model based clustering for longitudinal data, Canadian Journal of Statistics, 38(1), 153-168.

35. McNicholas, P. D., & Murphy, T. B. (2010b). Model based clustering of microarray expression data via latent Gaussian mixture models, Bioinformatics, 26(21), 2705-2712.

36. Semhar, & Volodymyr (2016). Studying Complexity of Model-based Clustering. Communications in Statistics-Simulation and Computation, 45(6), 2051-2069.

37. Sanjeena, & McNicholas, P. D. (2014). Variational Bayes approximations for clustering via mixtures of normal inverse Gaussian distributions. Advances in Data Analysis and Classification, 8, 167-193.

38. Schwarz (1978). Estimating the dimension of the model. Annals of Statistics, 6(2), 461-464.

39. Titterington, Smith, & Makov (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley and Sons, New York.
