Spatial identification of component-based relative risks

Abstract

This article aims to identify the high-risk provinces in Iraq using a finite Poisson mixture model. In this methodology, the levels of relative risk are determined by identifying the number of mixture components. We do not investigate spatial correlation among regions and assume that the levels of risk observed in different regions are independent of each other. Parameter estimation and model selection are performed using a Bayesian approach, which allows each province to be allocated to an identified risk level. We consider data on Coronavirus disease (COVID-19) infections in 18 provinces of Iraq and determine the levels of relative risk of this pandemic. The results are displayed on a map and show that the best Bayesian model for the data is a three-component model (high, medium and low risk).

Keywords

MCMC, relative risk, mapping, mixture model, COVID-19

1. Introduction

Identifying the level of disease risk in a given region is of great importance to governments. Spatial diagnostics, or disease mapping, of infections gives governments the opportunity to review their health systems and to control diseases. The role of statistical modelling here is to uncover those risks using one of several approaches. Combining statistical modelling with disease mapping gives a clear picture of the geographical variation of disease across the study regions (Fernández & Green, 2002). A traditional measure, the standardized mortality ratio (SMR), is often used in the literature as an epidemiological measure to determine and map the relative risk. However, this approach has the disadvantage of being unstable, especially in small regions (Böhning et al., 2000). Another approach to modelling the relative risk in disease mapping is the Gamma-Poisson model, which has been studied extensively by Wolpert and Ickstadt (1998), Catelan et al. (2010) and Lawson (2013). However, this approach may have difficulty identifying the level of risk for each sub-region and the corresponding proportion of the overall population (Böhning et al., 2000). This suggests a hidden part in the model, since the sub-regions to which each region belongs remain unobserved. It is therefore more realistic to include unobserved, or latent, variables in the model that act as indicators for each sub-region. Consequently, latent class models such as mixture models and hidden Markov models can be more appropriate for such data (McLachlan & Peel, 2004).

In this article, we use a discrete mixture with $K$ components and their respective probabilities to model the relative risk of disease. The relative risk of a given region is represented by the SMR, which is computed from the population of the region and its count of infections (Lawson, 2013). In our approach, the population is assumed to consist of $K$ sub-populations with different levels of disease risk, leading to a mixture of Poisson distributions with a different mean for each sub-population.

Under the Bayesian paradigm, a mixture model can be formulated as a hierarchical model by including latent variables. Adding latent variables to the model gives rise to the so-called data augmentation approach. The benefit of this approach is that it simplifies inference for multidimensional models, and it has been widely used since the seminal papers by Tanner and Wong (1987) and Gelfand and Smith (1990). Hence, the data augmentation approach has become a basis for Bayesian analysis. The Gibbs sampler (Geman & Geman, 1984), which builds on data augmentation, is one of the Markov chain Monte Carlo (MCMC) methods often used to estimate mixture model parameters. In this article, we adopt this sampler for model estimation. Although much of the literature uses the Metropolis-Hastings (M-H) sampler (Metropolis et al., 1953) as an alternative to the Gibbs sampler, the M-H sampler has some downsides. Firstly, convergence of the chain depends on the proposal density. A proposal density with large jumps to places far from the support of the posterior has a low acceptance rate and causes the Markov chain to stand still most of the time. Secondly, a proposal density with small jumps and a high acceptance rate may cause the chain to move slowly and to become stuck in one region. Thirdly, in multi-dimensional cases, the M-H algorithm can require a proposal density for the whole vector, which is extremely difficult when the dimension is high (Casella & Robert, 2004).

Model selection is a key issue, since it determines how many components the best-fitting model has. The number of components, $K$, is therefore a parameter that has to be estimated like the other model parameters. In this paper, we do not wish to make strong a priori assumptions about the number of components; instead, we interpret the model in relation to a fixed number of states. We therefore assume a fixed but unknown number of components $K$, where $K\in\{1,2,\ldots,K_{\max}\}$. Model selection can be carried out by several well-known methods. For example, criteria such as the Akaike information criterion (AIC) (Akaike, 1973) and the Bayesian information criterion (BIC) (Schwarz, 1978) have been used to determine the best mixture model in many applications (Zucchini & MacDonald, 2009). However, these criteria may lead to under-fitting or over-fitting, where the selected number of components is smaller or greater than the true number. They are also based on point estimates of the parameters, and hence do not naturally incorporate the uncertainty in those quantities. For this reason, we use the deviance information criterion (DIC) proposed by Spiegelhalter et al. (2002), which is based on Bayesian theory. This paper aims to develop a relative risk classification model based on a finite mixture of Poisson distributions. In other words, we seek to model and determine the spatial-specific relative risk over a certain time period for every region. We do not investigate spatial correlation among regions and assume that the levels of risk observed in different regions are independent of each other. Model fitting and selection proceed by fitting several models to the data, each with a known number of components, and then selecting the best-fitting model using the DIC. As an application, we consider Coronavirus disease (COVID-19) infections in the provinces of Iraq and determine the levels of spatial-specific relative risk.

The article is organized as follows. Section 2 presents the definition of the model, the data source and the implementation of the model. Section 3 presents the results and discussion of the component-based relative risk model. Finally, Section 4 gives the conclusions.

2. Material and methods

2.1 Finite mixture model with Poisson distribution

We use a finite mixture of Poisson distributions to model spatial-specific relative risks over a certain time period. We first estimate the relative risk using the standardized morbidity ratio (SMR) (Lawson, 2013). The SMR thus plays the role of the component-specific parameter, and we denote it by $\theta$ throughout this article. The parameter $\theta$ is the ratio between the observed and expected number of events for a particular region. Let $y_{i}$ be the number of observed cases in the $i^{\text{th}}$ region and $e_{i}$ the expected number of cases in the same region; then $\text{SMR}_{i}=\theta_{i}$ is given by:

$\displaystyle\theta_{i}=\frac{y_{i}}{e_{i}},$ (1)

and $e_{i}$ is given by (Lawson, 2013):

$\displaystyle e_{i}=\frac{n_{i}\times\sum_{j=1}^{L}y_{j}}{\sum_{j=1}^{L}n_{j}},$ (2)

where $n_{i}$ is the population size of the $i^{\text{th}}$ region and $L$ is the total number of regions. To formulate our model, assume that $y_{i}$, the number of observed cases in the $i^{\text{th}}$ region over a certain time period, follows a Poisson distribution, i.e.

$\displaystyle y_{i}|\lambda_{i}\sim\text{Poisson}(\lambda_{i}=e_{i}\theta_{i}),$ (3)

where $\lambda_{i}$ is the mean of the Poisson distribution, which is a function of the expected number of cases, $e_{i}$, and the relative risk, $\theta_{i}$, of the $i^{\text{th}}$ region. A Poisson mixture model with allocation proportions $w_{j}$, $j=1,2,\ldots,K$, is given as follows:

$\displaystyle\Pr(y_{i}|\underline{e},\underline{\theta},\underline{w},K)=\sum_{j=1}^{K}w_{j}f(y_{i}|\lambda_{j}=e_{j}\theta_{j}),$ (4)

where $f(\cdot|\lambda_{j})$ is the Poisson probability mass function, $K$ is the number of components, the vector $\underline{\lambda}=(\lambda_{1},\lambda_{2},\ldots,\lambda_{K})$ represents the component-specific parameters and the vector $\underline{w}=(w_{1},w_{2},\ldots,w_{K-1})$ represents the allocation probabilities, which satisfy $w_{j}>0$ and $\sum_{j=1}^{K}w_{j}=1$. The observed-data likelihood of the model is given by

$\displaystyle\Pr(\bm{y}|\underline{\lambda},\underline{w},K)=\prod_{i=1}^{n}\sum_{j=1}^{K}w_{j}f(y_{i}|\lambda_{j}).$ (5)
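To make Eq. (5) concrete, the following Python sketch evaluates the observed-data log-likelihood of a Poisson mixture. It is an illustration under stated assumptions rather than the code used for the analysis: the function names `log_poisson` and `mixture_loglik` are hypothetical, and the mean of region $i$ in component $j$ is taken to be $e_{i}\theta_{j}$, so that setting every $e_{i}=1$ gives $f(y_{i}|\lambda_{j})$ with $\lambda_{j}=\theta_{j}$. The sum over components is computed on the log scale to avoid numerical underflow for large counts.

```python
import math


def log_poisson(y, mean):
    """Log of the Poisson probability mass function at count y."""
    return y * math.log(mean) - mean - math.lgamma(y + 1)


def mixture_loglik(y, e, theta, w):
    """Log of Eq. (5), assuming region i has mean e[i] * theta[j] in component j."""
    total = 0.0
    for yi, ei in zip(y, e):
        # log(w_j) + log f(y_i | e_i * theta_j) for every component j
        terms = [math.log(wj) + log_poisson(yi, ei * tj)
                 for wj, tj in zip(w, theta)]
        # log-sum-exp over the components
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


# Hypothetical toy data: three regions and a two-component mixture.
y_toy = [12, 30, 7]
e_toy = [10.0, 20.0, 10.0]
print(mixture_loglik(y_toy, e_toy, theta=[0.8, 1.5], w=[0.6, 0.4]))
```

With a single component and weight 1, the function reduces to the ordinary Poisson log-likelihood, which is a convenient check of the implementation.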

Representing mixture models through a latent allocation process is useful for interpretation and is also a convenient tool for complicated numerical computations. For a mixture with $K$ components, the model is represented by introducing $n$ independent discrete variables, $z_{1},z_{2},\ldots,z_{n}$, with the multinomial distribution $p(z_{i}=j|\underline{\theta},\underline{w},K)=w_{j}$, for $j=1,2,\ldots,K$. Given $\bm{z}=(z_{1},z_{2},\ldots,z_{n})$, and writing $z_{ij}=1$ if $z_{i}=j$ and $z_{ij}=0$ otherwise, Eq. (5) can be written as

$\displaystyle\Pr(\bm{y},\bm{z}|\underline{\lambda},\underline{w},K)=\prod_{i=1}^{n}\prod_{j=1}^{K}[w_{j}f(y_{i}|\lambda_{j})]^{z_{ij}},$ (6)

which is called the complete-data likelihood function. The task of the allocation variable, $z_{i}$, is to assign the observation $y_{i}$ to one of the mixture components. Taking the logarithm of Eq. (6), we obtain:

$\displaystyle\ell(\underline{\lambda},\underline{w}|\bm{y},\bm{z})=\log\left(\prod_{i=1}^{n}\prod_{j=1}^{K}[w_{j}f(y_{i}|\lambda_{j})]^{z_{ij}}\right)=\sum_{i=1}^{n}\sum_{j=1}^{K}z_{ij}\log[w_{j}f(y_{i}|\lambda_{j})].$ (7)

The log-likelihood function in Eq. (7) can be approximated by averaging over draws from the posterior distribution. Given $(\ell^{(0)},\ell^{(1)},\ldots,\ell^{(M)})$ computed over a full MCMC run, we obtain the estimated log-likelihood by post-processing the posterior output:

$\displaystyle\hat{\ell}(\underline{\lambda},\underline{w}|\bm{y},\bm{z})=\frac{1}{M}\sum_{m=1}^{M}\sum_{i=1}^{n}\sum_{j=1}^{K}z_{ij}^{(m)}\log\left[w_{j}^{(m)}f(y_{i}|\lambda_{j}^{(m)})\right].$ (8)
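The post-processing step can be sketched in Python as follows. This is a minimal illustration, not the code used for the analysis: the names `complete_loglik` and `mean_complete_loglik` are hypothetical, each MCMC draw is assumed to be stored as a tuple of allocations, component risks and weights, and the mean of region $i$ in component $j$ is assumed to be $e_{i}\theta_{j}$. The allocations are stored as labels $z_{i}\in\{0,\ldots,K-1\}$, so only the component to which a region is allocated contributes to the sum.

```python
import math


def log_poisson(y, mean):
    """Log of the Poisson probability mass function at count y."""
    return y * math.log(mean) - mean - math.lgamma(y + 1)


def complete_loglik(y, e, z, theta, w):
    """Complete-data log-likelihood for one set of allocations z (0-based labels)."""
    return sum(math.log(w[zi]) + log_poisson(yi, ei * theta[zi])
               for yi, ei, zi in zip(y, e, z))


def mean_complete_loglik(y, e, draws):
    """Average the complete-data log-likelihood over stored MCMC draws.

    `draws` is a list of (z, theta, w) tuples, one per retained iteration.
    """
    values = [complete_loglik(y, e, z, theta, w) for z, theta, w in draws]
    return sum(values) / len(values)


# Hypothetical toy example with two stored draws.
y_toy = [12, 30, 7]
e_toy = [10.0, 20.0, 10.0]
draws_toy = [
    ([0, 1, 0], [0.9, 1.5], [0.6, 0.4]),
    ([0, 1, 0], [1.0, 1.4], [0.7, 0.3]),
]
print(mean_complete_loglik(y_toy, e_toy, draws_toy))
```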

The levels of relative risk are determined by selecting the model that best fits the data (i.e. the number of levels equals the number of components). For this purpose, we run MCMC for several models with different numbers of components and then select the model that gives the best fit to the data. We adopt the DIC (Spiegelhalter et al., 2002) as the criterion for determining the best model. Several versions of this criterion were proposed by Celeux et al. (2006), who recommended the version based on the complete-data likelihood. In this paper, we apply this version, which is given by:

$\displaystyle\text{DIC}=-4\text{E}_{\underline{\lambda},\underline{w},\bm{z}}[\log\Pr(\bm{y},\bm{z}|\underline{\lambda},\underline{w},K)]+2\text{E}_{\bm{z}}[\log\Pr(\bm{y},\bm{z}|\hat{\underline{\lambda}}(\bm{z}),\hat{\underline{w}}(\bm{z}))],$ (9)

with its effective number of parameters, $p_{\text{DIC}}$, defined as follows:

$\displaystyle p_{\text{DIC}}=-2\text{E}_{\underline{\lambda},\underline{w},\bm{z}}[\log\Pr(\bm{y},\bm{z}|\underline{\lambda},\underline{w},K)]+2\text{E}_{\bm{z}}[\log\Pr(\bm{y},\bm{z}|\hat{\underline{\lambda}}(\bm{z}),\hat{\underline{w}}(\bm{z}))],$ (10)

where $\hat{\underline{\lambda}}(\bm{z})$ and $\hat{\underline{w}}(\bm{z})$ are the complete-data posterior modes of the parameters $\underline{\lambda}$ and $\underline{w}$, respectively, which are computed for each sample from the posterior $p(\bm{z}|\bm{y},\underline{\lambda},\underline{w})$.
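Once the two expectations in Eqs. (9) and (10) have been approximated by averages over the MCMC output, the criterion itself is simple arithmetic. The Python sketch below assumes that two lists are already available: the complete-data log-likelihood evaluated at each draw of $(\underline{\lambda},\underline{w},\bm{z})$, and the complete-data log-likelihood evaluated at the plug-in estimates $\hat{\underline{\lambda}}(\bm{z}),\hat{\underline{w}}(\bm{z})$ for each sampled $\bm{z}$. The function name `dic_complete_data` and the input values are hypothetical.

```python
def dic_complete_data(loglik_at_draws, loglik_at_plugin):
    """Return (DIC, p_DIC) following Eqs. (9) and (10).

    loglik_at_draws:  log Pr(y, z | lambda, w) at each MCMC draw.
    loglik_at_plugin: log Pr(y, z | lambda_hat(z), w_hat(z)) for each sampled z.
    """
    mean_draws = sum(loglik_at_draws) / len(loglik_at_draws)
    mean_plugin = sum(loglik_at_plugin) / len(loglik_at_plugin)
    dic = -4.0 * mean_draws + 2.0 * mean_plugin
    p_dic = -2.0 * mean_draws + 2.0 * mean_plugin
    return dic, p_dic


# Hypothetical log-likelihood values for illustration only.
dic, p_dic = dic_complete_data([-10.0, -12.0], [-9.0, -9.0])
print(dic, p_dic)
```

Subtracting Eq. (10) from Eq. (9) shows that the DIC equals $-2$ times the posterior mean of the complete-data log-likelihood plus $p_{\text{DIC}}$, so the criterion trades goodness of fit against model complexity. When several values of $K$ are compared, the model with the smallest DIC is preferred.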

Gibbs sampling for a Poisson mixture model with $K$ components

Initialization: choose $\underline{w}^{(0)}$ and $\underline{\lambda}^{(0)}$.

Iteration: for $m=1,2,\ldots,M$:

Compute the allocation probabilities for $i=1,2,\ldots,n$ and $j=1,2,\ldots,K$:

$w_{j}^{(m)}=\Pr(z_{i}^{(m)}=j)={\displaystyle\frac{w_{j}^{(m-1)}(\lambda_{j}^{(m-1)})^{y_{i}}\exp{(-\lambda_{j}^{(m-1)})}}{\sum_{l=1}^{K}w_{l}^{(m-1)}(\lambda_{l}^{(m-1)})^{y_{i}}\exp{(-\lambda_{l}^{(m-1)})}}}.$

Generate $z_{i}^{(m)}$ from $\text{Multinomial}(w_{1}^{(m)},w_{2}^{(m)},\ldots,w_{K}^{(m)}).$

$\text{Compute: }n_{j}^{(m)}=\sum_{i=1}^{n}\mathbb{I}_{{z}_{i}^{(m)}=j}.$

Generate ${{\lambda}}_{j}^{(m)}$ from $\text{Gamma}(\alpha+\sum_{i:z_{i}^{(m)}=j}{y}_{i},\beta+\sum_{i:z_{i}^{(m)}=j}n_{i}).$

Compute $\ell^{(m)}(\underline{\lambda},\underline{w})=\sum_{i=1}^{n}\sum_{j=1}^{K}\Pr(z_{ij}^{(m)}=1|\lambda^{(m)},\bm{y})\log[w_{j}^{(m)}\text{Poisson}(y_{i}|\lambda_{j}^{(m)})]$
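The sampler above can be sketched in Python as follows. This is an illustration under several assumptions that are not taken from the text: the mean of region $i$ in component $j$ is $e_{i}\theta_{j}$, so the Gamma update uses the expected counts $e_{i}$ as the exposure; the mixture weights are refreshed from a Dirichlet full conditional with a hypothetical prior parameter `delta`; and the hyperparameters `alpha`, `beta`, the iteration counts and the function name `gibbs_poisson_mixture` are illustrative choices. Label switching is not handled.

```python
import math
import random


def log_poisson(y, mean):
    """Log of the Poisson probability mass function at count y."""
    return y * math.log(mean) - mean - math.lgamma(y + 1)


def gibbs_poisson_mixture(y, e, K, n_iter=600, burn_in=100,
                          alpha=1.0, beta=1.0, delta=1.0, seed=0):
    """Return posterior-mean risks, posterior-mean weights and the last allocations."""
    rng = random.Random(seed)
    n = len(y)
    # Initialise the component risks at spread-out quantiles of the crude SMRs.
    smr = sorted(yi / ei for yi, ei in zip(y, e))
    theta = [max(smr[int((j + 0.5) * n / K)], 1e-6) for j in range(K)]
    w = [1.0 / K] * K
    z = [0] * n
    theta_sum, w_sum, kept = [0.0] * K, [0.0] * K, 0
    for m in range(n_iter):
        # Allocation step: sample z_i with probability proportional to
        # w_j * Poisson(y_i | e_i * theta_j), computed on the log scale.
        for i in range(n):
            logp = [math.log(w[j]) + log_poisson(y[i], e[i] * theta[j])
                    for j in range(K)]
            top = max(logp)
            probs = [math.exp(lp - top) for lp in logp]
            z[i] = rng.choices(range(K), weights=probs)[0]
        counts = [z.count(j) for j in range(K)]
        # Weight step (assumed): Dirichlet(delta + n_j) via normalised Gamma draws.
        g = [rng.gammavariate(delta + counts[j], 1.0) for j in range(K)]
        w = [gj / sum(g) for gj in g]
        # Risk step: Gamma(shape, rate); gammavariate takes a scale, hence 1 / rate.
        for j in range(K):
            shape = alpha + sum(yi for yi, zi in zip(y, z) if zi == j)
            rate = beta + sum(ei for ei, zi in zip(e, z) if zi == j)
            theta[j] = rng.gammavariate(shape, 1.0 / rate)
        if m >= burn_in:
            kept += 1
            theta_sum = [s + t for s, t in zip(theta_sum, theta)]
            w_sum = [s + x for s, x in zip(w_sum, w)]
    theta_hat = [s / kept for s in theta_sum]
    w_hat = [s / kept for s in w_sum]
    return theta_hat, w_hat, list(z)


# Hypothetical toy data: five low-risk and five high-risk regions.
y_toy = [50] * 5 + [200] * 5
e_toy = [100.0] * 10
theta_hat, w_hat, z_last = gibbs_poisson_mixture(y_toy, e_toy, K=2)
print(theta_hat, w_hat, z_last)
```

On these well-separated toy data the two estimated risks should lie close to 0.5 and 2.0, and the final allocations should split the regions into the two groups.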

2.2 Model validation

In this section, we check the predictive performance of the traditional SMR model and the proposed Poisson mixture model using the posterior predictive distribution (PPD) (Gelman et al., 2013). The PPD is based on computing predictive observations, $\bm{y}^{*}=(y_{1}^{*},y_{2}^{*},\ldots,y_{n}^{*})$, given the estimated parameters of the model. Given the estimated relative risk $\hat{\theta_{i}}=\frac{\sum_{i=1}^{n}\frac{y_{i}}{e_{i}}}{n}$, which is computed directly from the observed and expected counts of infections, the PPD of the SMR model is defined as follows:

$\displaystyle y_{i}^{*}\sim\text{Poisson}(e_{i}\hat{\theta_{i}}).$ (11)

The PPD of the Poisson mixture model is computed as follows:

$\displaystyle y_{i}^{*}\sim\hat{w_{z_{i}}}\text{Poisson}(e_{i}\hat{\theta_{z_{i}}}).$ (12)

After obtaining the predictive observations from both models, the predictive performance of each model is checked using the logarithmic score (LS) (Gneiting & Raftery, 2007), which is defined as:

$\displaystyle LS=-\sum_{i=1}^{n}\log p(y_{i}|\hat{\theta_{i}}),$ (13)

where LS is the negative log-likelihood, and a smaller score indicates better predictive performance. In addition to the LS, a graphical tool is also used to check the predictive performance of the model.
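As an illustration of Eq. (13), the Python sketch below computes the logarithmic score for a model that supplies one fitted Poisson mean per region, for example $e_{i}\hat{\theta_{i}}$ for the SMR model or $e_{i}\hat{\theta_{z_{i}}}$ for the mixture model. The function name `log_score`, the toy counts and the risk values are hypothetical and are not taken from the analysis.

```python
import math


def log_poisson(y, mean):
    """Log of the Poisson probability mass function at count y."""
    return y * math.log(mean) - mean - math.lgamma(y + 1)


def log_score(y, means):
    """Eq. (13): negative sum of the Poisson log-probabilities of the observed counts."""
    return -sum(log_poisson(yi, mi) for yi, mi in zip(y, means))


# Hypothetical toy data: three regions with expected counts e.
y_toy = [12, 30, 7]
e_toy = [10.0, 20.0, 10.0]

# Model A: one common relative risk of 1 for every region.
means_common = [ei * 1.0 for ei in e_toy]
# Model B: the second region is given a higher relative risk of 1.5.
means_two_levels = [ei * t for ei, t in zip(e_toy, [1.0, 1.5, 1.0])]

ls_common = log_score(y_toy, means_common)
ls_two_levels = log_score(y_toy, means_two_levels)
print(ls_common, ls_two_levels)
```

In this toy comparison the second set of means matches the second region's count exactly, so its score is smaller, which is the sense in which a smaller LS indicates better prediction.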

2.3 Data source

The data used in this study are the numbers of COVID-19 infections in 18 provinces of Iraq. The observed counts of infections, $y_{i}$, for the period from March 2020 to January 2021 and the population of each province are presented in Table 1. The infection counts were obtained directly from the World Health Organization (WHO) (World Health Organization, 2020), and the population size, $n_{i}$, of each province was obtained from the Central Statistical Organization of Iraq (CSO) (Central Statistical Organization of Iraq, 2019), based on the estimated population for 2019. The data show that Baghdad, the capital of Iraq, was the province with the most infections, while Al-Anbar had the fewest.

Table 1

The number of COVID-19 infections from March 2020 to January 2021, population, expected infections and SMR

Province	$y_{i}$	$n_{i}$	$e_{i}$	$\text{SMR}=\theta_{i}$
Al-Anbar	8080	1818318	28893.01817	0.2796523
Al-Basrah	40198	2185073	34720.74405	1.1577516
Al-Muthanna	12524	835797	13280.78912	0.9430162
Al-Najaf	22558	1510338	23999.22526	0.9399470
Al-Qadissiya	18566	1325031	21054.70262	0.8817982
As-Sulaym	33495	2219194	35262.92500	0.9498644
Babil	21222	2119403	33677.24974	0.6301583
Baghdad	185662	8340711	132533.6462	1.4008669
Diyala	22345	1680328	26700.36123	0.8368801
Duhok	35063	1326562	21079.03016	1.6634067
Arbil	36613	1903608	30248.27370	1.2104161
		1250806	19875.26961	1.1483114
At-Ta'mim	33845	1639953	26058.80370	1.2987933
Maysan	18316	1141966	18145.80529	1.0093792
Ninewa	25780	3828197	60829.93485	0.4238044
Sala ad-Din	15369	1637232	26015.56709	0.5907616
Thi Qar	23898	2150338	34168.80595	0.6994098
Wasit	32672	1415034	22484.84757	1.4530674
Total	609029	38327889
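The expected counts and SMR values in Table 1 follow from Eqs. (2) and (1). The short Python sketch below reproduces them for three provinces, using the totals from the last row of the table. The function names `expected_count` and `smr` are hypothetical, and the choice of provinces is only for illustration.

```python
def expected_count(population, total_cases, total_population):
    """Eq. (2): cases expected if the overall infection rate applied to the region."""
    return population * total_cases / total_population


def smr(observed, expected):
    """Eq. (1): ratio of observed to expected cases."""
    return observed / expected


# Totals and three rows (observed cases, population) taken from Table 1.
TOTAL_CASES = 609029
TOTAL_POPULATION = 38327889
rows = {
    "Al-Anbar": (8080, 1818318),
    "Baghdad": (185662, 8340711),
    "Duhok": (35063, 1326562),
}

results = {}
for province, (y_i, n_i) in rows.items():
    e_i = expected_count(n_i, TOTAL_CASES, TOTAL_POPULATION)
    results[province] = (e_i, smr(y_i, e_i))
    print(f"{province}: e_i = {e_i:.2f}, SMR = {results[province][1]:.4f}")
```

An SMR above 1 means that a province recorded more infections than expected from its share of the population, and an SMR below 1 means that it recorded fewer.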