Abstract
Finite mixture models with non-normal component distributions are well developed and widely used for clustering. In particular, skewed component distributions allow finite mixture models to capture asymmetry and heavy tails in multivariate data. This paper discusses clustering using multivariate geometric skew normal (MGSN) mixture models. The Expectation-Maximization (EM) algorithm is used to compute maximum likelihood estimates for finite mixtures of MGSN distributions. The Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) are used for model selection. Several covariance structures based on the eigenvalue decomposition of the covariance matrix are considered and compared. The approach is illustrated on simulated and real datasets, with comparisons against other mixture models.
Keywords
Introduction
Finite mixture models are widely applied in clustering because they allow standard statistical modeling tools to be used to assess and evaluate a clustering. A finite mixture model assumes that the population density is a convex combination of a finite number of density functions. In model-based clustering, each observation is assumed to arise from one of the mixture components; a finite mixture model is fitted to the data, and each cluster is identified with one of its components. Comprehensive accounts of finite mixture models and their clustering applications are given by Everitt and Hand (1981), Titterington et al. (1985) and McLachlan and Peel (2000). McLachlan and Basford (1988), McNicholas and Murphy (2008, 2010a, 2010b) and Baek and McLachlan (2010) studied applications of finite mixtures of multivariate Gaussian distributions in model-based clustering. Semhar and Melnykov (2016) discussed challenges of model-based clustering such as initialization, dimension reduction and variable selection. However, Gaussian mixture models cannot provide reasonable fits to heterogeneous data that exhibit heavy tails, asymmetry or outliers.
An alternative is to use finite mixture models whose component densities are skewed. Skewed distributions that have been studied include the skew normal (Azzalini, 1985) and the univariate and multivariate skew-t. The multivariate skew normal distribution was studied in detail by Azzalini and Valle (1996). In recent years, increasing attention has been paid to non-normal mixture models for skewed data, such as the multivariate skew-normal model and the multivariate skew-
In recent years, much research has been devoted to non-normal mixture models for clustering, which yield robust clustering procedures. The Azzalini skew normal distribution, however, is well known to be thin-tailed and therefore cannot model heavy-tailed data. Debasis (2014) proposed a three-parameter geometric skew normal distribution as an alternative to the Azzalini skew normal distribution. The geometric skew normal distribution can be obtained as a geometric sum of independent and identically distributed (i.i.d.) normal random variables. This paper focuses on model-based clustering using finite mixtures of the multivariate geometric skew normal distribution (Debasis, 2017) with
Model selection is carried out using the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC). Clustering performance is measured by the Adjusted Rand Index (ARI) and the Misclassification Rate (MR). Finite mixtures of multivariate geometric skew normal (MGSN) distributions provide an effective mathematical basis for clustering, and their usefulness is illustrated using real and simulated datasets. MGSN mixture models are also compared with existing mixture models based on the Multivariate Skew Normal (MSN) and Multivariate Normal (MN) distributions.
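The geometric-sum construction of the geometric skew normal distribution can be made concrete with a small simulation. The sketch below is illustrative only: the function name `sample_mgsn` and its arguments are hypothetical, and it assumes that the number of summands N follows a geometric distribution on {1, 2, ...} with success probability p and that the summands are i.i.d. multivariate normal with mean `mu` and covariance `sigma`. Under those assumptions, conditional on N = n the sum is normal with mean n·mu and covariance n·sigma, which avoids drawing each summand separately.

```python
import numpy as np


def sample_mgsn(n_samples, mu, sigma, p, seed=None):
    """Draw samples as a geometric sum of i.i.d. multivariate normal vectors.

    Assumption (not taken from the text): N ~ Geometric(p) on {1, 2, ...}
    and each summand is N_d(mu, sigma).
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    # Number of summands per sample; NumPy's geometric has support {1, 2, ...}.
    counts = rng.geometric(p, size=n_samples)

    # Given N = n, the sum of n i.i.d. N(mu, sigma) vectors is N(n*mu, n*sigma),
    # so one correlated normal draw per sample is enough.
    chol = np.linalg.cholesky(sigma)
    noise = rng.standard_normal((n_samples, mu.size)) @ chol.T
    return counts[:, None] * mu + np.sqrt(counts)[:, None] * noise


if __name__ == "__main__":
    x = sample_mgsn(10_000, mu=[1.0, -0.5], sigma=[[1.0, 0.3], [0.3, 2.0]], p=0.5, seed=1)
    print(x.mean(axis=0))  # should be close to mu / p under the stated assumptions
```

With p = 1 every sample has exactly one summand, so the sketch reduces to an ordinary multivariate normal sampler; smaller values of p produce heavier tails and, for nonzero `mu`, skewness in the direction of `mu`.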
The rest of this article is organized as follows. Section 2 presents model-based clustering using finite mixtures of the multivariate geometric skew normal distribution. The parameterization of the covariance matrix is discussed in Section 3. Section 4 describes the model selection procedures. Section 5 presents the clustering results obtained on simulated and real datasets. Conclusions are given in Section 6.
Model based clustering using multivariate geometric skew normal distribution
A
then
The MGSN distribution will be denoted by
Let
where
Clustering with finite mixture models is carried out using the EM algorithm (Dempster et al., 1977), an iterative procedure for finding maximum likelihood estimates when the data are incomplete. Each iteration consists of two steps, the Expectation step (E-step) and the Maximization step (M-step). The E-step computes the expected value of the complete-data log-likelihood given the observed data and the current parameter estimates. The M-step computes updated maximum likelihood estimates of the model parameters by maximizing this expectation. The two steps are repeated until convergence.
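The alternation between the two steps can be illustrated with a deliberately simple stand-in. The sketch below runs EM for a univariate Gaussian mixture, not for the MGSN mixture studied in this paper; the function name, the quantile-based starting values and the variance floor are all assumptions made for illustration. Only the loop structure (E-step responsibilities, M-step weighted updates, log-likelihood convergence check) carries over to other component densities.

```python
import numpy as np


def em_gaussian_mixture_1d(x, k, n_iter=200, tol=1e-8):
    """EM for a univariate Gaussian mixture (illustrative stand-in only)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Hypothetical starting values: evenly spaced quantiles as initial means.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    prev_loglik = -np.inf

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation,
        # computed in log space for numerical stability.
        log_pdf = -0.5 * (np.log(2.0 * np.pi * variances)
                          + (x[:, None] - means) ** 2 / variances)
        log_joint = np.log(weights) + log_pdf
        row_max = log_joint.max(axis=1, keepdims=True)
        log_mix = row_max + np.log(np.exp(log_joint - row_max).sum(axis=1, keepdims=True))
        resp = np.exp(log_joint - log_mix)
        loglik = float(log_mix.sum())

        # M-step: weighted maximum likelihood updates.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        variances = np.maximum(variances, 1e-6)  # floor to avoid degenerate components

        # Stop when the observed-data log-likelihood no longer improves.
        if abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik

    # loglik is evaluated at the parameters used in the last E-step.
    return weights, means, variances, loglik
```

Hard cluster labels follow by assigning each observation to the component with the largest posterior probability from the final E-step.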
The complete data in EM algorithm are considered to be (
The complete data log-likelihood function for multivariate geometric skew normal mixture model is given by
EM algorithm for a mixture of MGSN distributions: the E-step.
The EM algorithm is simplified by introducing the latent variable
where
At the E-step, the posterior probabilities that the
EM algorithm for a mixture of MGSN distributions: the M-step.
Maximize the complete data log likelihood with respect to the parameters. Maximize the Eq. (8) with respect to
Initialization of the EM algorithm is not well studied, and no single approach is universally accepted. Many researchers have developed initialization techniques for model-based clustering. McLachlan (1988) proposed using principal component analysis to choose initial values for multivariate mixture models. The standard technique for tackling EM initialization is the Multiple Restart approach (MREM), in which the EM algorithm is run many times, each run starting from different random initial values (McLachlan et al., 2000). The run that attains the highest log-likelihood value is retained.
Initialization of the EM algorithm by k-means with the Euclidean distance is widely used in model-based clustering. The Euclidean distance, however, is suited to homogeneous, spherical clusters. In this paper, the Mahalanobis distance is used in the initialization procedure instead, because it captures the covariance structure of the clusters and can therefore identify and correctly classify non-spherical clusters in non-homogeneous data.
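One way to use the Mahalanobis distance in a k-means style initialization is sketched below. This is an assumption-laden illustration rather than the procedure used in this paper: the function names, the deterministic farthest-point choice of starting centers, the ridge term added to each covariance estimate and the rule for skipping very small clusters are all hypothetical. The first pass uses identity covariance matrices, so it coincides with Euclidean k-means; later passes measure distance with each cluster's own estimated covariance.

```python
import numpy as np


def maximin_centers(x, k):
    """Pick k starting centers: the first point, then repeatedly the farthest point."""
    centers = [x[0]]
    for _ in range(k - 1):
        d2 = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d2)])
    return np.array(centers, dtype=float)


def mahalanobis_kmeans(x, k, n_iter=50, ridge=1e-6):
    """k-means style partitioning with cluster-specific Mahalanobis distances."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    means = maximin_centers(x, k)
    covs = np.stack([np.eye(d)] * k)  # identity matrices: first pass is Euclidean
    labels = np.full(n, -1)

    for _ in range(n_iter):
        # Assignment step: squared Mahalanobis distance to every cluster.
        dist2 = np.empty((n, k))
        for j in range(k):
            diff = x - means[j]
            dist2[:, j] = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(covs[j]), diff)
        new_labels = dist2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Update step: re-estimate the mean and covariance of each cluster.
        for j in range(k):
            members = x[labels == j]
            if len(members) > d:  # keep old values if the cluster is too small
                means[j] = members.mean(axis=0)
                covs[j] = np.cov(members, rowvar=False) + ridge * np.eye(d)

    return labels, means, covs
```

The returned labels, means and covariance matrices can serve as starting values for an EM run. Because this distance has no penalty for large covariance matrices, one cluster can in principle absorb its neighbors on poorly separated data, so the result is best checked before use.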
The model-based clustering with a finite mixture of multivariate geometric skew normal distribution using EM algorithm is as follows
Fix
where
The initial values of parameters
Compute the different covariance matrix
E-step: Compute
Set
Compute
Compute
M-step: Update
Compute BIC and AIC using Eq. (14).
Compute the adjusted Rand index and misclassification rate.
Compare
Else
The covariance matrix represents geometric features of the clusters such as volume, shape and orientation. Several covariance structures have been developed for Gaussian mixture models in clustering. To obtain simple and easily interpretable models, Banfield and Raftery (1993) reparameterized the covariance matrices in terms of their eigenvalue decomposition. Celeux and Govaert (1995) classified the resulting covariance models into three families: spherical, diagonal and general. Random generation of the covariance matrix
where
Nomenclature, scale matrix structure and the number of free scale parameters for the eigen-decomposed family of models
An alternative estimation method for the covariance matrix is presented in this paper. The decomposed elements of the covariance matrix are updated according to the following algorithm.
The M-step conditionally maximizes the complete-data log-likelihood with respect to the parameters. The estimated mixing proportion and sample cross-product matrix for the
Iteration Update
where Update
Update
where Update Calculate If
Five of the covariance structures discussed in Celeux and Govaert (1995) are considered for clustering with finite mixtures of multivariate geometric skew normal distributions.
EII –
VII –
EEE –
where EEV –
VVV –
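The eigenvalue decomposition behind these model names can be sketched as follows. The code assumes the standard Banfield and Raftery form, in which a covariance matrix is written as a volume scalar times an orthogonal orientation matrix, a diagonal shape matrix with unit determinant, and the transpose of the orientation matrix. The function names are hypothetical. In the usual naming convention (as used by mclust), the three letters indicate whether volume, shape and orientation are equal (E) or variable (V) across components, with I denoting the identity.

```python
import numpy as np


def decompose_covariance(sigma):
    """Split a covariance matrix into volume, orientation and shape.

    Returns (volume, orientation, shape) with
    sigma = volume * orientation @ shape @ orientation.T and det(shape) = 1.
    """
    sigma = np.asarray(sigma, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(sigma)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    volume = np.exp(np.mean(np.log(eigvals)))     # equals det(sigma) ** (1 / d)
    shape = np.diag(eigvals / volume)             # normalized so det(shape) = 1
    return volume, eigvecs, shape


def compose_covariance(volume, orientation, shape):
    """Rebuild the covariance matrix from its three geometric parts."""
    return volume * orientation @ shape @ orientation.T


if __name__ == "__main__":
    sigma = np.array([[4.0, 1.0], [1.0, 2.0]])
    lam, d_mat, a_mat = decompose_covariance(sigma)
    print(lam, np.linalg.det(a_mat))
    print(compose_covariance(lam, d_mat, a_mat))
```

Constraining a family member then amounts to sharing some of these parts across components: for example, a spherical model fixes the shape and orientation to the identity, so that only the volume remains.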
In model-based clustering, model selection criteria are generally used to choose the best model and to select the number of groups. In this paper, the Bayesian Information Criterion (Schwarz, 1978) and the Akaike Information Criterion (Akaike, 1973) are used for model selection.
where
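As a rough illustration of how the two criteria are computed and compared across candidate models, the sketch below uses the "larger is better" sign convention (twice the maximized log-likelihood minus a penalty), which is the convention used by mclust. This convention is an assumption: the exact form of the criteria in this paper is not recoverable from the text, and the more common "smaller is better" form is simply the negative of each value. The function names and the example numbers are hypothetical.

```python
import math


def information_criteria(loglik, n_params, n_obs):
    """Return (BIC, AIC) in the 'larger is better' convention (an assumption)."""
    bic = 2.0 * loglik - n_params * math.log(n_obs)
    aic = 2.0 * loglik - 2.0 * n_params
    return bic, aic


def select_best(candidates, n_obs):
    """Pick the candidate with the largest BIC.

    `candidates` maps a model name to (maximized log-likelihood, free parameters).
    """
    scores = {name: information_criteria(ll, m, n_obs)
              for name, (ll, m) in candidates.items()}
    best = max(scores, key=lambda name: scores[name][0])
    return best, scores


if __name__ == "__main__":
    # Made-up log-likelihoods and parameter counts, for illustration only.
    fits = {"model_a": (-520.0, 10), "model_b": (-500.0, 25)}
    print(select_best(fits, n_obs=200))
```

Because the BIC penalty grows with the logarithm of the sample size, it favors smaller models than AIC whenever there are more than about eight observations.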
This section provides experimental validation and illustrative examples for model based clustering using finite mixtures of multivariate geometric skew normal distribution. The performance of the model based clustering using multivariate geometric skew normal mixture models is illustrated with real and simulated datasets.
Banknote dataset
The Swiss banknote dataset is considered for the analysis. It consists of 200 samples and 6 variables, with 100 counterfeit notes and 100 genuine notes. The variables are the length of the bill, the width of the left edge, the width of the right edge, the bottom margin width, the top margin width and the length of the diagonal. All measurements are in millimeters, and all variables are used in this study. The dataset was recorded by Flury et al. (1988) and is available in the mclust package (Fraley et al., 2006).
Mardia’s test has been used to check the skewness of the multivariate dataset. The
The effectiveness of different covariance structures for clustering based on the finite mixture of multivariate geometric skew normal distributions is investigated. Initial values for the finite mixture models are obtained from the procedure described in the algorithm in Section 2. A summary of the clustering results for the multivariate mixture models is given in Table 2.
Clustering performance of various multivariate mixture models on the Banknote dataset
Classification table MSN, MN and MGSN using five covariance models
Clustering plot for banknote dataset using different covariance models for MGSN mixture models.
The covariance model with the highest BIC and AIC values is selected. Among the five covariance structures of MGSN mixture models, the EEV model attains the highest values of BIC (1925.19) and AIC (1892.052). Table 2 shows that the EEV model also achieves the lowest misclassification rate (0.005) and the highest ARI (0.9893), indicating a close match to the true labels. The other covariance models also provide reasonable clustering results. The EEV model is further compared with the Multivariate Skew Normal (MSN) and Multivariate Normal (MN) mixture models. Table 3 gives the classification table for the banknote dataset, with results from mclust, EMMIXskew and the finite mixture of MGSN distributions. Clustering using the finite mixture of MGSN distributions attains higher BIC and AIC values than the multivariate skew normal and multivariate normal mixture models.
These results indicate that the MGSN mixture model outperforms the other mixture models on this dataset. Figure 1 shows the scatter plots for the banknote dataset with the five covariance models. Figure 2 shows the scatter plots for the MSN and MN mixture models.
In this simulation study, a dataset (
Different covariance structures in MGSN mixture models are considered. The best results of model based clustering using MGSN mixture models are also compared with other multivariate mixture models. The clustering results of the simulated dataset are provided in Table 4.
Table 4 shows that the VVV model gives the lowest misclassification rate (0.018). Its ARI is 83%, with BIC (18912.02) and AIC (19382.39). Among the five covariance structures of MGSN mixture models, the VVV model achieves the highest ARI. Table 5 gives the classification tables for the multivariate mixture models, comparing the results from mclust, EMMIXskew and the finite mixture of MGSN distributions with five covariance models, and reports the number of misallocated observations for the simulated dataset.
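The two performance measures reported in these tables can be computed from the contingency table of true and estimated labels. The sketch below is a plain NumPy illustration with hypothetical function names. The misclassification rate is taken here as the smallest error over all matchings of cluster labels to true classes, found by brute force over permutations, which is only practical for a small number of clusters; this matching rule is an assumption.

```python
from itertools import permutations

import numpy as np


def contingency(true, pred):
    """Cross-tabulate true classes (rows) against estimated clusters (columns)."""
    t_vals, t_idx = np.unique(np.asarray(true).ravel(), return_inverse=True)
    p_vals, p_idx = np.unique(np.asarray(pred).ravel(), return_inverse=True)
    table = np.zeros((t_vals.size, p_vals.size), dtype=np.int64)
    np.add.at(table, (t_idx, p_idx), 1)
    return table


def adjusted_rand_index(true, pred):
    """Adjusted Rand index computed from pair counts in the contingency table."""
    table = contingency(true, pred)

    def pairs(m):
        return m * (m - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(table.sum())
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return float((sum_cells - expected) / (max_index - expected))


def misclassification_rate(true, pred):
    """Smallest fraction of misallocated points over all label matchings."""
    table = contingency(true, pred)
    size = max(table.shape)
    square = np.zeros((size, size), dtype=np.int64)
    square[: table.shape[0], : table.shape[1]] = table  # pad to a square table
    best = max(sum(square[i, perm[i]] for i in range(size))
               for perm in permutations(range(size)))
    return 1.0 - best / table.sum()
```

Both measures are invariant to how the clusters are numbered, so the labels returned by a mixture model can be compared with the true classes directly.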
Clustering performance of various multivariate mixture models on the simulated dataset
a) Clustering plot using multivariate normal mixture models; b) Clustering plot using multivariate skew normal mixture models.
Classification table MSN, MN and MGSN using five covariance models
Scatter plot of the original dataset.
Figure 4 shows the scatter plots for the five covariance structures of the MGSN mixture model. The scatter plots for the multivariate skew normal and multivariate normal mixture models are shown in Figure 5. The goal of this study is to check whether multivariate geometric skew normal mixture models can be fitted using the model-based clustering approach.
Clustering plot for simulated dataset using different covariance structures for MGSN mixture models.
a) Clustering plot using multivariate normal mixture models; b) Clustering plot using multivariate skew normal mixture models.
This paper presents non-Gaussian model-based clustering for skewed data using multivariate geometric skew normal (MGSN) mixture models. Parameter estimation using the EM algorithm is outlined, and different covariance structures are considered for the MGSN mixture models. The approach is evaluated on both simulated and real datasets. Clustering based on finite mixtures of MGSN distributions produced the lowest misclassification rate and the highest ARI values when compared with the MSN and MN mixture models.
