Abstract
This article aims to identify the high-risk provinces in Iraq using a finite Poisson mixture. Under this methodology, the levels of relative risk are determined by identifying the number of mixture components. We do not investigate spatial correlation among regions and assume that the levels of risk observed in different regions are independent of each other. Estimation of the model parameters and model selection are performed using the Bayesian approach, which allows each province to be allocated to an identified risk level. We consider data on Coronavirus disease (COVID-19) infections in 18 provinces of Iraq and determine the levels of relative risk of this pandemic. The results are displayed on a map, and the best Bayesian model for the data is a three-component model (high, medium and low risk).
Introduction
Identifying the level of disease risk in a given region is of great importance to governments. The spatial diagnosis of infections, or disease mapping, gives governments the opportunity to review their health systems and to control diseases. The role of statistical modelling here is to uncover those risks using one of several approaches. Combining statistical modelling with disease mapping can give a clear picture of geographical disease variation across the study regions (Fernández & Green, 2002). A traditional measure, the standardized mortality ratio (SMR), is often used in the literature as an epidemiological measure to determine and map the relative risk. However, this measure has the disadvantage of being unstable, especially in small regions (Böhning et al., 2000). Another approach to modelling the relative risk in disease mapping is the Gamma-Poisson model, which has been studied extensively by Wolpert and Ickstadt (1998), Catelan et al. (2010) and Lawson (2013). However, this approach may cause a problem in identifying the level of risk for each sub-region and the corresponding proportion of the overall population (Böhning et al., 2000). This suggests a hidden part in the model, since the sub-regions to which each region belongs remain unobserved. Thus, it is more realistic to include unobserved (latent) variables in the model that act as indicators for each sub-region. Consequently, latent class models such as mixture models and hidden Markov models can be more appropriate for such data (McLachlan & Peel, 2004).
In this article, we follow a discrete mixture taking
Under the Bayesian paradigm, a mixture model can be formulated as a hierarchical model by including latent variables. Adding latent variables to the model results in the so-called data augmentation approach. The benefit of this approach is that it simplifies inference for multidimensional models, and it became widely used after the seminal papers by Tanner and Wong (1987) and Gelfand and Smith (1990). Hence, the data augmentation approach has become a common basis for Bayesian analysis of such models. The Gibbs sampler (Geman & Geman, 1984), based on the data augmentation approach, is one of the Markov chain Monte Carlo (MCMC) methods often used for estimating mixture model parameters. In this article, we adopt this sampler for model estimation. Although much of the literature uses the Metropolis-Hastings (M-H) sampler (Metropolis et al., 1953) as an alternative to the Gibbs sampler, the M-H sampler has some downsides. First, convergence of the chain depends on the proposal density. A proposal density with large jumps to places far from the support of the posterior has a low acceptance rate and causes the Markov chain to stand still most of the time. Second, a proposal density with small jumps and a high acceptance rate may cause the chain to move slowly and to become stuck in one region of the state space. Third, in multi-dimensional cases, the M-H algorithm can require a proposal density for the whole vector, which is extremely difficult to construct when the dimension is high (Casella & Robert, 2004).
Model selection is a key issue, since it determines how many components the best-fitting model has. So, the number of model components,
The article is organized as follows. Section 2 contains the definition of the model, the data source and the implementation of the model. In Section 3, we present the results and discussion of the component-based relative risk model. Finally, we give the conclusions in Section 4.
Materials and methods
Finite mixture model with Poisson distribution
We use a model with a finite number of Poisson mixture components to model region-specific relative risks over a certain period of time. We first estimate the relative risk using the so-called standardized morbidity ratio (SMR) (Lawson, 2013). That means the SMR represents the component-specific parameter, and we shall denote it as the parameter
and
where
where
where
The formal representation of mixture models using a latent allocation process makes them useful for interpretation and also a convenient tool for complicated numerical computations. For a mixture with
which is called the complete-data likelihood function. The task of the allocation variable,
The log-likelihood function in Eq. (7) can be approximated over the posterior distribution. Given
The levels of relative risk are determined by selecting the model that best fits the data (i.e. the number of levels equals the number of components). For this purpose, we perform MCMC runs for several models with different numbers of components and then select the model that gives the best fit to the data. We adopt the deviance information criterion (DIC) (Spiegelhalter et al., 2002) as the criterion for determining the best model. Several versions of this criterion were proposed by Celeux et al. (2006), who recommended the version based on the complete-data likelihood. In this paper, we apply this version, which is given by:
with its effective number of parameters,
where
Gibbs sampling for a Poisson mixture model with
for
Generate
Generate
Generate
Compute
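The sampling steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the exact algorithm of this article: it assumes that the count in region i follows a Poisson distribution with mean E_i * theta_k when the region is allocated to component k, a Gamma(a, b) prior (shape a, rate b) on each theta_k, and a symmetric Dirichlet(delta) prior on the weights. The function name `gibbs_poisson_mixture` and all hyperparameter defaults are hypothetical.

```python
import numpy as np

def gibbs_poisson_mixture(y, E, K, n_iter=2000, a=1.0, b=1.0, delta=1.0, seed=0):
    """Gibbs sampler sketch for a K-component Poisson mixture of relative risks.

    Assumed model (not taken from the article):
      y_i | z_i = k ~ Poisson(E_i * theta_k)
      theta_k ~ Gamma(shape=a, rate=b),  w ~ Dirichlet(delta, ..., delta)
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    E = np.asarray(E, dtype=float)
    n = y.size

    # Initial values: prior draws for the risks, equal weights.
    theta = rng.gamma(a, 1.0 / b, size=K)
    w = np.full(K, 1.0 / K)

    out = {
        "theta": np.empty((n_iter, K)),
        "w": np.empty((n_iter, K)),
        "z": np.empty((n_iter, n), dtype=int),
    }
    for t in range(n_iter):
        # Allocations: P(z_i = k) is proportional to w_k * Poisson(y_i | E_i * theta_k).
        # Terms that do not depend on k (E_i^y_i / y_i!) are dropped.
        log_p = np.log(w)[None, :] + y[:, None] * np.log(theta)[None, :] \
            - E[:, None] * theta[None, :]
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(log_p)
        p /= p.sum(axis=1, keepdims=True)
        u = rng.random(n)
        z = np.minimum((p.cumsum(axis=1) < u[:, None]).sum(axis=1), K - 1)

        # Weights: Dirichlet full conditional given the component counts.
        counts = np.bincount(z, minlength=K)
        w = rng.dirichlet(delta + counts)

        # Component risks: Gamma full conditional (conjugate update).
        sum_y = np.bincount(z, weights=y, minlength=K)
        sum_E = np.bincount(z, weights=E, minlength=K)
        theta = rng.gamma(a + sum_y, 1.0 / (b + sum_E))

        out["theta"][t], out["w"][t], out["z"][t] = theta, w, z
    return out
```

The sketch does not handle label switching or burn-in. In practice one would discard the initial draws and, for example, order the components by their risk before summarizing the allocations.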
In this section, we check the predictive performance of the traditional SMR model and of the proposed Poisson mixture model using the posterior predictive distribution (PPD) (Gelman et al., 2013). The PPD is based on computing the predictive observations,
The PPD of the Poisson mixture model, in turn, is computed as follows:
After obtaining the predictive observations from both models, we check the predictive performance of each model using the logarithmic score (LS) (Gneiting & Raftery, 2007), which is defined as:
where LS is a negative log-likelihood, so a smaller score indicates better predictive performance. Along with LS, a graphical tool is also used to check the predictive performance of the model.
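As an illustration of how such a score might be evaluated from MCMC output, the sketch below computes a negative log predictive density for Poisson counts. It assumes that the predictive density of each observation is approximated by averaging the Poisson probability over posterior draws of the region-specific mean; this Monte Carlo form, and the names `poisson_logpmf` and `log_score`, are assumptions made for the example rather than the definition used in this article.

```python
import numpy as np
from math import lgamma

def poisson_logpmf(y, rate):
    """Log of the Poisson probability mass function, with broadcasting."""
    y = np.asarray(y, dtype=float)
    rate = np.asarray(rate, dtype=float)
    log_factorial = np.vectorize(lgamma)(y + 1.0)
    return y * np.log(rate) - rate - log_factorial

def log_score(y, rate_draws):
    """Negative log predictive density, summed over regions.

    rate_draws has shape (n_draws, n_regions); each row holds one
    posterior draw of the Poisson mean of every region.
    """
    y = np.asarray(y, dtype=float)
    rate_draws = np.atleast_2d(np.asarray(rate_draws, dtype=float))
    logp = poisson_logpmf(y[None, :], rate_draws)  # (n_draws, n_regions)
    # Log of the average probability over draws, via log-sum-exp.
    m = logp.max(axis=0)
    log_pred = m + np.log(np.exp(logp - m).mean(axis=0))
    return -log_pred.sum()
```

With draws from the Gibbs sampler, `rate_draws` for the mixture model could be built as the expected count of each region multiplied by the risk of its allocated component in each iteration. The model with the smaller returned value predicts the observed counts better.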
The data used in this study concern the number of COVID-19 infections in 18 provinces in Iraq. The observed count of infections,
The number of COVID-19 infections from March 2020 to January 2021, population, expected infections and SMR
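Expected counts and SMR values of the kind reported in this table could be computed along the following lines. The sketch assumes internal standardization, in which the expected count of a region is its population multiplied by the overall infection rate across all regions; this is an assumption for illustration, and the function name `smr` and the numbers in the usage example are hypothetical, not the Iraqi data.

```python
import numpy as np

def smr(observed, population):
    """Expected counts and standardized morbidity ratios.

    Assumes internal standardization: E_i = n_i * (sum of y) / (sum of n),
    and SMR_i = y_i / E_i.
    """
    observed = np.asarray(observed, dtype=float)
    population = np.asarray(population, dtype=float)
    overall_rate = observed.sum() / population.sum()
    expected = population * overall_rate
    return expected, observed / expected

# Hypothetical counts and populations for two regions of equal size.
expected, ratio = smr([10, 30], [100, 100])
print(expected)  # expected counts per region
print(ratio)     # SMR per region; values above 1 indicate excess risk
```

Under this standardization the expected counts sum to the observed total, so an SMR above 1 marks a region with more infections than its share of the population would suggest.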
