Abstract
Simplex distribution has been proved useful for modelling double-bounded variables in data directly. Yet, it is not sufficient for multimodal distributions. This article addresses the problem of estimating a density when data is restricted to the (0,1) interval and contains several modes. Particularly, we propose a simplex mixture model approach to model this kind of data. In order to estimate the parameters of the model, an Expectation Maximization (EM) algorithm is developed. The parameter estimation performance is evaluated through simulation studies. Models are explored using two real datasets: i) gene expressions data of patients’ survival times and the relation to adenocarcinoma and ii) magnetic resonant images (MRI) with a view in segmentation. In the latter case, given that data contains zeros, the main model is modified to consider the zero-inflated setting.
Introduction
Some statistical processes can be assumed to result from a mixing of several known or unknown distributions. The natural representation of these phenomena is called mixture of distributions, and it attempts to recover the information of each component in the mixture model. This recovering step is generally translated into the estimation of the parameters of those distributions. Often such estimation is the main aim of the investigation, but sometimes, the purpose of the research may be to scrutinize the original distribution of each particular observation. This step is the main goal of the model-based clustering or classification.
In model-based clustering, the empirical distribution is modelled by a finite mixture of distributions. Usually, Gaussian ones are the most ubiquitously considered (Carreira-Perpiñán, 2000; McLachlan and Peel, 2000; Contreras-Reyes and Cortés, 2016), and consequently, this kind of mixtures lie in the real space. In practice, nevertheless, most variables are bounded because they are subject to the impossility of infinite precision in measuring devices. We are concerned about this limited aspect of mixture modelling. Though the literature is limited about variables bounded in an interval, much effort has been made recently to manage variables limited to the
We present a distribution mixture model restricted to the
This article proceeds as follows: Section 2 briefly describes the main features of simplex distribution and the way we are going to compare models. Section 3 resumes the classical expectation maximization (EM) algorithm for mixtures model. Section 4 shows some simulations to evaluate the performance of the estimation. Later in the same section, two real data applications are presented and some conclusions are given in Section 5.
Background and related work
Dispersion models
The development of the simplex distribution model is highly related with DMs. The (reproductive) dispersion model is represented by densities taking the form
The advantages of the parametrization of this family distribution are similar to those that natural exponential models (McCullagh and Nelder, 1989) offer in the way that parametrization allows to recover a significant amount of information from the general expression. From function
The simplex distribution can be obtained by conditioning
In this work, we lean to a directional and symmetric test developed by Vuong (1989) for choosing between models in a non-nested context (Greene, 2012). Vuong's method is helpful to test if two underlying densities have the same Kullback–Liebler information criterion (Kullback and Leibler, 1951). The test takes classical likelihood ratio statistic theory as basis and computes the difference of maximum log-likelihood values for two competing normalized models. Specifically, let
Mixture distribution
According to McLachlan and Peel (2000),
Some simplex mixture densities.
In this section, we briefly describe the procedure to estimate the mixture distribution parameters using maximum likelihood method. Traditional log-likelihood function for model (3.1) has the form
·
·
However, since the solution of equation (3.2) is not available in a closed form, we use a numerical approximation to get the updated estimates. Among several available methods, we settled for a finite-difference scheme to obtain the derivatives, such as the general purpose Nelder–Mead's optimization algorithm. Another possible option presents the BFGS quasi-Newton method (Nocedal and Wright, 2006). Finally, iterations must repeatedly alternate between the E- and M-step until the difference
As in the generalized linear model, a transformation on parameters is often defined using strictly monotonic and twice differentiable link functions that map parameter space into ℝ. Then, to connect the structural part of the model with the predicted parameter, we have chosen the logit and logarithm functions for location and dispersion parameters, respectively. We have not attempted to measure the efficiency of this selection, although some studies have been conducted in this regard (Andrade, 2007). This selection also provides comparative benefits because some current alternative modelling softwares use those functions as default links (Grün et al., 2012).
Furthermore, we used as starting values for the location parameters, the means produced by an exploratory
On the other hand, we have used the moment estimator for dispersion parameters (Song, 2007)
Simulation
To evaluate the performance of simplex mixture distribution estimation, we use a Monte Carlo simulation as outlined further. The mixtures are simulated with 2, 3, 4 and 5 components using the same parameters of Figure 1, and then they are estimated by carrying out the EM algorithm described in Section 3.2, using particularly the BFGS algorithm for the maximization step. The simulations were repeated for
Note that RB and MSE are expressed in percentage and squared units, respectively. In addition, classical quantities for the classification problem, such as percentage of overall accuracy (% OA), sensitivity (Sens.) and specificity (Spec.), are also considered. Furthermore, in order to measure the tightness of data and estimated density model, we use the Hellinger distance (HD). Particularly, for a pair of distributions with densities
The simulation results are summarized in Table 1. The second column shows true parameter values: the first half are locations and the second half dispersion parameters. As expected, the estimation tends to be more precise, in RB terms, when sample size increases. Nonetheless, the dispersion parameters show the most inaccurate estimation. For instance, the parameter whose real value is 0.1 in the fifth mixture model, was estimated more than two times beside its correct value. This is represented under the respective RB values, 164.57, 154.60 and 142.18%.
On the other hand, classification results (% OA, Sens. and Spec.) perform pretty well when number of clusters are of moderate size. For instance, the 93% of individual membership was correctly recovered, on average, for three clusters of 50 size each. Generally, the values are over 70% of correct classification. The same behaviour is present in the HD measure where, moreover, in all cases, the estimation are as closer or less than 30%. However, mixture model rate classification decreases largely when the number of components jumps from fourth to fifth group. In our experience, this distortion is due to the numerical method's inability to manage different and increasing sources of dispersion (Yao, 2010).
Estimation (Estim.), relative bias (RB) and MSE of simulations
Estimation (Estim.), relative bias (RB) and MSE of simulations
Estimation of the gene expression models
Gene expression data
Ji et al. (2005) used a set of genes to investigate the relation between the survival times of patients and adenocarcinoma. The authors arbitrarily selected one gene, tyrosinehydroxylase (TH), from the original list and computed 7 128 correlation coefficients. After applying the transformation
We take the same gene to compare several models: mixture of simplex, mixture of beta and mixture of normal using two components, as in the original work. Results of the fitted models appear in Table 2. Elapsed times for estimations were 33.37, 323.58 and 1.29 seconds, respectively. The process was obtained on an Intel Core i5 processor and 7.7 GB of random-access memory running GNU/Linux (kernel 4.10). The table has both pairs of parameters for three models, that is, for simplex distribution location and dispersion parameters, for beta distribution mean and precision, and the mean and variance parameters for normal distribution. Location parameters are similar across the models. Estimated precision parameters of mixture beta distribution are notably high and, according to their standard deviations, they are not significant. Additionally, Vuong's test recommends the simplex over the beta selection model. Values of likelihoods for each model are added to Table 2. However, Ji et al. (2005) have found in practice that most traditional model selection criteria, mainly based in log-likelihood values, do not discriminate correctly the size of
Vuong's test relates to the first column model, that is, simplex mixture is the M
Glioblastoma multiforme data
Fitted results in gene expression example: _______: simplex, ______: beta, ____: normal.
Fitted results in gene expression example: _______: simplex, ______: beta, ____: normal.
Totalized Voung's test for GBM images. Greatest log-likelihood results in parenthesis. Median Hellinger Distance (HD) values are appended
Three GBM images (patients) at 100
100 pixels size: (a), (b) and (c) refer to an original image, ZIMS and ZIMB model fit, respectively, for
. Images represented in batches (d)–(f) and (g)–(i) follow the same guidelines for
and
, correspondingly.
Continuation of Figure 3. (a)–(c) represent a GBM image (patient) for K = 5 and its estimation under ZIMS and ZIMB. This is the case when K = 6 is shown in (d)–(f).
In the next section, an individual image taken from Burzynski et al. (2014) in the treatment of an inoperable, recurrent, persistent or progressed GBM is explored in detail.
An MRI was undertaken by patient number 4 of Burzynski et al.(2014) before the treatment of ‘pazopanib and other agents, in combination with administered sodium phenylbutyrate’, and is shown in Figure 5a. There were estimated five zero-inflated models mixturing beta and simplex distributions, respectively (Table 4). Estimation times have ranged from 35 to 7 194.02 seconds for simplex models and from 98 to 2 039.33 seconds for beta models. According to the greatest log-likelihood value, we chose the model with two components.
Results for different ZIMS and ZIMB models of the Burzynski image example
Results for different ZIMS and ZIMB models of the Burzynski image example
In addition, a ZIMB mixture distribution with the same structure is estimated, as well as a normal mixture model with two components (see Table 5). Based on this, it can be noted that one of the two groups has very similar values in mean or location value (around
Results for Burzynski image example for K = 2
Figures 5b–5d suggest the use the ZIMS model because it recovers the tumour area. It is interesting to note that the Vuong's test criterion gives a non-conclusive result between the first and second model (
(a): Original Burzynski image at 100 × 100 pixels size. Models with two components, (b) ZIMS estimation, (c) ZIMB estimation and (d) mixture normal estimation. Models with four components: (e) ZIMS estimation and (f) ZIMB estimation.
Some other explorations are made. In particular, the second model with higher log-likelihood (3 531.34) is that with four components (Figures 5e–5f). Vuong's test between inflated models yields
Results for Burzynski image example for K = 4
Burzynski image example: Models with two components: _____ zero-inflated mixture simplex estimation. ______: zero-inflated mixture beta estimation. ______: mixture normal estimation
Assuming that the tumour area in this particular example lie between the values 0.45 and 0.60, we are able to estimate the performance measures as in the simulation section. They are shown at the last line of Tables 5 and 6. Figures 6 and 7 offer the approximation to the histogram of this image under
Burzynski image example: Models with four components: ______: zero-inflated mixture simplex estimation. ______: zero-inflated mixture beta estimation. ______: mixture normal estimation.
In this article, we proposed a new model for data that is restricted to the
The proposed methods were applied to two kinds of real datasets. The first application was concerned with an already worked dataset about correlation among genes. Our proposed model provides slightly better adjustment than the referenced models, according to log-likelihood values and Vuong's test. In the second application, we worked with medical images. Colours in images are naturally limited to an interval and, in some cases, it is possible to get values at extremes of these intervals. For instance, in a greyscale image, the image can contain absolutely white or black colour in a pixel. Therefore, we have modified the original model of equation (3.1) to allow zero values to be treated within images. Depending on the situation, the model can be modified to include other inflated values. The estimation times for presented simplex mixture models have been mostly similar to the established beta models, although both have notably increased when zero-inflated values have been incorporated. Normal mixtures, which are not correct for this kind of data, have times of computation really low, around 10 seconds for the most expensive data.
This study has not been focused on proving the superiority of one model over any competitors. Rather, the goal has been its introduction and its assessment using simulation and real data. Compared to the widely applied beta model, results for these examples pointed out that no one model clearly captures all modes and clusters in the data. Evaluated criteria do not lead to a generally superior model to be employed in most cases. This is because criteria suggest the use of the simplex-based model for some values of
Footnotes
Acknowledgments
F.O. López Quintero research was partially supported by UTFSM under PIIC No. 017/2014. J.E. Contreras-Reyes research was funded by CONICYT No. 21160618 (Res. Ex. 4128/2016). We would like to thank the editor and an anonymous reviewer for their helpful comments, suggestions and valuable discussion of this work.
