Abstract
When modelling multivariate binomial data, it often occurs that it is necessary to take into consideration both clustering and overdispersion, the former arising from the dependence between data, and the latter due to the additional variability in the data not prescribed by the distribution. If interest lies in accommodating both phenomena at the same time, we can use separate sets of random effects that capture the within-cluster association and the extra variability. In particular, the random effects for overdispersion can be included in the model either additively or multiplicatively. For this purpose, we propose a series of Bayesian hierarchical models that deal simultaneously with both phenomena. The proposed models are applied to bivariate repeated prevalence data for hepatitis C virus (HCV) and human immunodeficiency virus (HIV) infection in injecting drug users in Italy from 1998 to 2007.
Introduction
Clustering and overdispersion are two major issues that must be addressed when modelling data that cannot be assumed to be normally distributed, e.g., aggregated binary data and count data.
The clustering issue refers to the hierarchical structure of data, where measurements belonging to the same cluster are assumed to be associated. This issue can be accommodated using cluster-specific random effects, usually assumed to be normally distributed, which induce association between the repeated or multivariate measurements. Such models can be easily fitted within the framework of generalized linear mixed models (GLMMs, Breslow and Clayton, 1993; Molenberghs and Verbeke, 2005).
We encounter issues of overdispersion when the data present additional variability than that prescribed by the mean-variance relation of the distribution. The phenomenon of overdispersion has been widely considered in the literature, mostly in relation to the binomial and the Poisson distributions. Ignoring to account for overdispersion can lead to the underestimation of the standard errors and therefore to a wrong inference for the regression parameters. Possible solutions to this issue can be of two types (Hinde and Demétrio, 1998). The first approach consists in generalizing the variance function by including additional parameters, such as, the heterogeneity factor in overdispersed binomial data, and then estimating the regression parameters using quasi-likelihood methods (Agresti, 2002). The second approach assumes a two-stage model, where in the first stage we define for the data a distribution depending on certain parameters, whose distribution is then specified in the second stage. Examples are the beta-binomial model (Skellam, 1948) for binomial data and the negative binomial model (Breslow, 1984) for count data, but also some versions of the GLMMs can be used. A wide review of approaches that can deal with overdispersion can be found in Hinde and Demétrio (1998).
However, often interest may lie in simultaneously combining these two phenomena, clustering and overdispersion. Both marginal and random-effects models can be used to address them. If the focus is on the estimation of the fixed effects rather than on modelling the correlation structure, marginal models can be used to adjust the variance-covariance structure to accommodate clustering and overdispersion. For instance, both the GEE2 approach (Qaqish and Liang, 1992) and, especially suited for binomial data, alternating logistic regression (Carey et al., 1993) can be used to model the marginal means of the outcomes as well as the correlation between pairs of within-cluster measurements. Chen and Ahn (1997) go further in this direction and develop a marginal model for multivariate overdispersed binomial data where the mean structure depends on two multiplicative nested random effects. On the other hand, conditional models depending on random effects are a better choice if we are more interested in modelling the individual profiles rather than the population mean, as well as in estimating the correlation among the measurements. For instance, Booth et al. (2003) developed a negative binomial loglinear mixed model and they applied it to dependent and overdispersed count data. More generally, Molenberghs et al. (2007, 2010) proposed a class of extended GLMMs that accommodate clustering and overdispersion, making use of two separate sets of random effects, which then are estimated with maximum likelihood (ML) methods. These are meant to be used for modelling normal, binomial, Poisson and time-to-event data.
In this article, we focus on conditional models, and we thus extend the work presented in Molenberghs et al. (2010) by focusing on modelling multivariate, repeated and overdispersed binomial data. For this purpose, we develop a series of Bayesian hierarchical models that account for overdispersion through a set of random effects, either additively or multiplicatively included in the model, while dealing with clustering. We chose to use the Bayesian approach in order to avoid the computational difficulties encountered with GLMMs and ML estimation (Molenberghs et al., 2010). All the Bayesian hierarchical models presented in the article are fitted using Monte Carlo Markov Chain (MCMC) methods (Clayton, 1996). In this way, it is possible to specify a prior distribution for the unknown parameters, in particular for the overdispersion random effects and for their covariance matrix, and then calculate their posterior distribution through Gibbs sampling.
We apply the proposed methodology to a sample of prevalence data of hepatitis C virus (HCV) and human immunodeficiency virus (HIV) infection of injecting drug users (IDUs) in treatment from 20 Italian regions from 1998 to 2007. The article is organized as follows. In Section 2, we introduce the data. In Section 3, we describe the proposed methodology, giving details about the Bayesian hierarchical models with additive and multiplicative overdispersion parameters. Section 4 describes posterior computation, model selection and model checking. Section 5 is dedicated to the presentation of the main results, while in Section 6 we wrap up with a discussion of the proposed methodology. Model diagnostic and convergence assessment are presented in the supplementary material web appendix of the article.
Data
The data analyzed in the paper were reported to the European Monitoring Centre for Drugs and Drug Addiction (EMCDDA) and consist of diagnostic testing binomial data providing information about the HCV and HIV infection prevalence from samples of IDUs in treatment from 20 regions of Italy between 1998 and 2007. For each IDU, a serum specimen was taken and tested for antibodies against HCV and HIV. Further details about data collection can be found in Del Fava et al. (2011). Note that the data analyzed in this article are updated to 2007.
Figure 1 shows the observed prevalence profiles over the years for HCV and HIV infections, with a bold line representing the national prevalence profile, obtained by pooling together the regional results. We notice that the prevalence of HCV infection is much higher than the prevalence of HIV infection, reflecting the fact that HCV is reported to be about 10 times more infectious than HIV (Coutinho, 1998). In addition, the two plots reveal a pattern of large between-region and within-region variability, revealing an issue of overdispersion within regions over the years.
Methodology
A joint model for HCV and HIV infection
The data consist of multivariate repeated binomial measurements collected in a period of 10 years, from 1998 to 2007. Let

The primary interest is in the estimation of πij1 and πij2, which are the prevalence of HCV and HIV infections in the θth region in year j, respectively, and in the association between the two infections at the population level, ρ(πij1, πij2).
In the second stage of the hierarchical model, for infection k, they specified a logistic model for the prevalence πij. It contains unstructured fixed effects adjusting for time, α and βj, and region-specific random intercepts,
The time-specific coefficients, αj and βj, are assumed to be independently normally distributed with mean zero and variance depending on the hyperparameter
We then specify a uniform distribution for the variance hyperparameter σb:
Gelman (2006) argued that this choice of prior on σ is preferable over other possible priors, e.g., the inverse-Gamma(ε, ε) or the half-Cauchy, because it is, in practice, less ‘informative’. The same prior will be therefore used for all the univariate variance hyperparameters in the other models.
The random intercepts γik, which are assumed to follow a bivariate normal distribution, account for the association between the repeated measurements within region, independently of time. Depending on the structure of their covariance matrix
The infection-specific variances of the random intercepts,
Prior distributions for the other unknown parameters can be found in Del Fava et al. (2011), where this basic model is introduced.
In this section, we propose extensions to the random-intercept joint model (3.2) to deal simultaneously with clustering and overdispersion. This is achieved by including separate sets of random effects for overdispersion and for clustering. We opt for a set of random effects θijk, which, for convenience, are assumed to be independent of the random intercepts γik. In this section, we focus on additive overdispersion parameters θijk (McLachlan, 1997), which are introduced on the same scale of the linear predictor. We consider four possible situations of interest for the overdispersion random effects: (i) they are shared by the two infections; (ii) they are differentiated by infection and independent; (iii) they are differentiated by infection and allowed to be dependent; (iv) they are differentiated by infection and by year, leading thus to time-specific covariance matrices,
Shared overdispersion parameters
The first model we consider is the shared overdispersion model, where the overdispersion parameters are shared between the infections, i.e., θij1 = θij2 = θij:
We assume that θij has a normal prior distribution,
Another possible model is the independent overdispersion model, which, unlike model (3.7), includes infection-specific random effects, θijk:
For θijk we now assume a bivariate normal prior distribution, with covariance between θij1 and θij2 equal to zero:
Regarding the variance components for the overdispersion parameters θijk, we assume that
This model assumes that, although there is overdispersion in the time evolution of prevalence among the regions, all correlation between HCV and HIV infections at the regional level is captured fully by the random intercepts, not by the overdispersion parameters.
As a further extension, model (3.8) can be expressed as a correlated overdispersion model. For this model, the joint distribution of θij1 and θij2 can be rewritten as
For the covariance matrix
Assuming an unstructured covariance matrix for
The last additive model that we consider extends model (3.10) relaxing the assumption of a constant correlation between HCV and HIV infections captured by the overdispersion parameters. We now let the covariance matrix
For the covariance matrix
We consider a setting in which we account for overdispersion using multiplicative effects (McLachlan, 1997; Molenberghs et al., 2010). While the random intercepts γik induce association between the clustered measurements, the parameters θijk account for additional overdispersion. Hence, in this section, we assume that πijk = θijkκijk and
Note that 0 ≤ θijk ≤ 1 must hold to ensure that 0 ≤ θijkκijk ≤ 1.
We specify a beta prior distribution for θijk. As a special case, we use a beta distribution with parameters equal to 1, equivalent to a uniform distribution over the range (0,1), which implies a flat prior for the overdispersion parameters:
However, in general, we can assume that the parameters of the beta prior distributions are hyperparameters to estimate:
Then, the infection-specific variance of the overdispersion random effects can be calculated as
For the prior distribution of the hyperparameters a and b in (3.15), we choose independent flat uniform distributions within the range [0,100], i.e., a, b ~ U(0, 100).
Posterior computation
The hierarchical Bayesian models presented in the previous section are fitted to data using MCMC methods, specifically Gibbs sampling implemented through the JAGS software (Plummer, 2003). For each model, we used three chains of 250 000 iterations each, burn-in of 125000 and thinning of 125.
Initial values for each of the three chains were generated randomly according to prespecified distribution. For instance, initial values for the coefficients αj and βj, as well as for the random effects γik and θijk, were drawn from independent standardized normal distributions; initial values for the variance hyperparameter σb were drawn from a uniform distribution within the range [0,1]; finally, initial values for the covariance matrices
For convergence assessment, we used diagnostic tools contained in the R package CODA (Plummer et al., 2006), namely, the potential scale reduction factor
Model selection
A hierarchical mixed-effects model may be seen as a missing-data problem, where the random effects are regarded as the missing information. When dealing with missing data problems, Celeux et al. (2006) showed that the deviance information criterion (DIC, Spiegelhalter et al., 2002), which is the typical selection criterion used for Bayesian models, does not work properly with distributions outside the exponential family. Moreover, Plummer (2008) suggested that the approximation used routinely to compute the DIC is valid only when the effective number of parameters (the penalty pD) is much smaller than the number of independent observations μ.
In their work, Celeux et al. (2006) suggest some alternative versions of the original DIC, which vary among them depending on the focus of the analysis. The most straightforward among these alternative is a ‘conditional’ DIC, denoted as DIC7 by the authors, which considers the random effects,
where
Moreover, keeping in mind the specific concerns raised against the DIC, we decided to use two further criteria to select the best model: the penalized expected deviance (PED, Plummer, 2008), and the difference in posterior deviance (Aitkin et al., 2009; Aitkin, 2010).
The PED can be considered as a loss function when predicting the data Y using the same data Y. The issue of using the data twice (for estimation as well as for prediction) makes the expected deviance
Even though outside the exponential family it is hard to calculate the optimism parameter popt; it can be estimated for general models using MCMC methods. According to Plummer (2011), the software JAGS uses importance sampling to estimate the parameter popt. However, the author warns that the estimates may turn out to be numerically unstable when the effective number of parameters is high, as it typically occurs with random-effects models. Similar to the DIC, the smaller the PED, the better.
The difference in posterior deviances (Aitkin et al., 2009; Aitkin, 2010) permits to compare pairs of models to select the best one. This approach is based on the observation that models with growing numbers of parameters are automatically penalized by the increasing diffuseness of the posterior distributions in their parameters. Thus, we can base the selection between two models on the difference between the whole posterior distributions of their deviances,
where M is the length of the MCMC chain. We can derive the posterior probability that Model 1 is better than Model 2,
where the constant K is assumed to be equal to -2log(9) = 4.39, a value that is calibrated to correspond to a likelihood ratio test favouring Model 1 with a posterior probability of 0.9 (Aitkin, 2010). It is also possible to derive 95% credible intervals (CI) for the difference in deviances: a totally negative 95% CI implies that we favour Model 1 over Model 2.
Once we have selected the best models, it is necessary to assess the goodness-of-fit of these chosen models. Indeed, it might be that the best fitting model still fits the data poorly. Here we follow the approach of conducting posterior predictive checks, by comparing the observed data with the replicated data obtained from the posterior predictive distribution (Gelman et al., 1996; Neelon et al., 2010). The idea behind this approach is that the distance between the observed data
In addition to the Bayesian p-value, we also compute the mean squared error (MSE) for the in-sample and the out-sample predictions. In particular, for the out-sample predictions, we remove the observations for the last two years (2006 and 2007) and then we predict them using the model based on the remaining observations. The MSEs for both types of predictions are given by the following formulae:
and
Of course, the smaller the MSE, the closer the replicated data to the observed data and thus the better the goodness-of-fit of the model.
Table 1 presents the information criteria for model selection, discussed in Section 4.2, while MSEs and Bayesian p-values, discussed in Section 4.3, are presented in Table 2. We refer to the supplementary material web appendix for the posterior means and respective 95% CI of the variance components of each of the models.
Table 1 shows a comparison of all the fitted models in terms of DIC7, PED (the latter based on further 5000 iterations of the MCMC at convergence), as well as according to the difference between couples of models in their posterior deviances. We notice that the PED and the difference in posterior deviances lead us to favour different models. As expected, the worst model is the basic joint model (model (3.2)). Among the models with additive overdispersion parameters, the PED indicates that the shared random-effects model is the best one (model (3.7)), while the DIC7 and the difference in posterior deviances favour the model with correlated overdispersion parameters and time-specific correlation coefficients between HCV and HIV infection (model (3.11)).
Information criteria for model selection
Information criteria for model selection
MSE for in-sample and out-sample predictions and Bayesian p-values for model checking
Therefore, we discard both the models with independent overdispersion parameters (model (3.8)), which rules out any correlation between the overdispersion parameters between the two infections, as well as the correlated overdispersion mode, which assumes a constant correlation of 0.36 between the overdispersion parameters for HCV and HIV infection. As a matter of fact, from Figure 2 (panel d), which shows the posterior means of the time-dependent correlation coefficients with their respective 95% CI, we observe that the correlation is never significantly different from zero, except for 2004, when it is borderline positive, and for 2006, when it becomes instead significantly positive.
For the multiplicative models, according to the difference in posterior deviances and DIC7, model (3.13) outperforms model (3.14), while the PED ranks the models the other way around.
Finally, when comparing the best additive model and the best multiplicative model, the PED favours the additive model with shared random effects, while, according to the difference in posterior deviance, we do not have enough confidence to choose between model (3.11) and model (3.13). The DIC7 favours the additive time-specific correlated model (3.11).
If we consider the MSE estimates shown in Table 2, those for the in-sample predictions allow to label the basic model the additive shared model and, at a lesser extent, the multiplicative Be(a,b) model as poorly fitting models. More helpful, the MSEs for the out-sample predictions highlight the additive time-specific correlated model (3.11) as good predicting model. In this sense, the latter MSEs agree with the conclusions drawn by the DIC7 and the difference in posterior deviances. Concerning the Bayesian p-values, the chosen discrepancy measures do not allow us to discriminate between the various additive and multiplicative models, since all the PB are close to 0.50.

Table 3 gives us more details about the estimated parameters, since it shows, for the basic model and the three selected overdispersion models, the so-called ‘time evolution odds ratios’ of year j, j = 2, …, 10, with reference year 1998. These odds ratios are computed by exponentiating the unstructured coefficients αj (for HCV infection) and βj (for HIV infection), that is to say, ORHCV|j,1998 = exp(αj) and ORHIV|j,1998 = exp(βj). The posterior means of the odds ratio are accompanied by the posterior means of the respective standard errors. These time evolution odds ratios compare the prevalence odds of each year compared to 1998. We notice that, generally, the prevalence odds are smaller than one, implying that the prevalence of HCV and HIV infection decreased over the years, mostly after 2004. The estimated odds ratios are more or less similar among all the models, but not their standard errors, which are larger for the additive overdispersed model and, to a greater extent, for the multiplicative one. This occurs because these models take into account the additional variability in the data, not considered by the basic model.
Time evolution odds ratios for HCV and HIV infection, comparing prevalence odds of year j, j = 2, …, 10, with reference year 1998
Figure 2 (panels a–c), for each fitted model, presents the posterior means of the variance components (with 95% CI) for HCV and HIV infection, respectively, and the correlation for the random effects γik. For the models with additive random effects only (models (3.2)–(3.11)), we notice that misspecifying the overdis-persion random effects or even ignoring them affects neither the estimates of the variances components for the clustering random effects, nor the length of their credible intervals (panels a and b). This may be in relation to the fact the γik and θik are assumed to be independent and are both introduced on the same scale of the linear predictor; therefore, the former accommodate the within-region association all over the years, while the latter capture the unexplained additional variability. However, the same argument does not hold for the models with multiplicative random effects. For instance, for model (3.13) and, to a lesser extent, for model (3.14), we notice that
We refer to Figures 3–4 for a graphical representation of the results. We notice that model (3.2) in Figure 3 (panel a) shows parallel regional profiles, because only the cluster-specific random effects are specified; thus, the model fails to account for the exceeding variability in the data. This is instead accomplished by the best additive and multiplicative models (according to PED and the difference in posterior deviances) plotted in panels b–d. What is most striking from the graphical representation is that the fitted regional profiles from the pair of additive models look very similar, as also happens with the pair of multiplicative models. Finally, Figure 4 displays the marginal prevalence of HCV and HIV infection with the respective 95% CI for the basic model as well as the overdispersion models with additive and multiplicative random effects. For comparison, we plotted also the national prevalence per year, obtained by pooling together all the regional results. We notice that the five estimated prevalence profiles are fairly close to the national observed prevalence. What really distinguishes the marginal prevalence estimates from each other is their credible intervals, which account for all the variability shown by the regional profiles.

In this article, we discussed a set of hierarchical models that account simultaneously with clustering and overdispersion for binomial data. This objective has been achieved using two separate sets of random effects, one to induce association among the within-region measurements and between the infections, the other to account for the extra binomial variability in the data. For the latter set of random effects, two different settings have been considered: (1) the overdispersion parameters are included additively into the model, on the same scale of the linear predictor; (2) the overdispersion parameters are included multiplicatively in the model, correcting the prevalence in order to encompass the additional variability. All fitted combined models outperform the so-called basic joint model (Del Fava et al., 2011), which accommodates only the clustered nature of the data. This result was expected because the random-intercept model (3.2) cannot capture the complete association structure in the data.

Under the additive approach, the overdispersion parameters can be seen as random slopes adjusting for clustering, as it happens with normal outcomes (Molenberghs et al., 2010), where overdispersion and cluster-specific random effects coincide. Indeed, the region-specific random intercepts γik induce association averaging out the yearly measurements, while the overdispersion random parameters θijk further adjust for the variation within each year. Under the multiplicative approach, the random effects deal differently with the clustering and the overdispersion. While the random intercepts γik induce the within-region association and are responsible for shifting the regional profiles, the overdispersion random effects θijk act directly on the prevalence πijk and adjust the standard mean-variance relation of the binomial distribution by inflating the variance of the estimated prevalence according to the time-specific variations.
Several models have been proposed for each setting according to the dependence structure among the overdispersion random effects and therefore to their covariance matrix
A very critical point with the presented models is that model selection can be quite a difficult task to accomplish, considering that the small differences among the models, all related to the parameterization of the overdispersion parameters. The original DIC (Spiegelhalter et al., 2002) cannot be considered a valid option when the condition pD << n does not hold (Plummer, 2008), because it tends to under-penalize more complex models. The other selection criteria here used, i.e., the DIC7, the PED and the difference in posterior deviances, may lead to different conclusions. To our experience, it seems that PED, whose penalization is almost twice the penalty for DIC (Plummer, 2008), tends to favour less complex models, while the DIC7 and the difference in posterior deviances select more complex models. In our case, the posterior means estimated by the different best models look similar, which is apparent from the graphical representation of prevalence. For a more detailed discussion of model selection in the Bayesian context, we refer to Lesaffre and Lawson (2012).
Furthermore, another critical issue of these Bayesian hierarchical models is clearly the modelling of the variance components. In the article, we used for the covariance matrices a prior distribution based on a standard inverse-Wishart distribution with degrees of freedom equal to k. However, recent literature highlights that the standard inverse-Wishart distribution is not exactly ‘non-informative’, since it shows a strong dependence between the variance and correlation: the higher the variance, the higher the correlation. Besides being dependent on each other, the variances and the correlation cannot be estimated independently. A possible solution is using a ‘scaled’ version of the inverse-Wishart distribution (O’Malley and Zaslavsky, 2008), where the parameterization keeps separate the correlation and the variances, so that they can be estimated independently from data. Future research is needed to incorporate such a prior in our modelling framework.
Finally, we note that, apart from temporal association and overdispersion, the design of our study, consisting of 20 Italian regions, suggests the possible presence of spatial correlation. Nonetheless, given the typical way of transmission of these infections (by injecting), we expect their transmission to occur more locally, in small injecting networks, and to be less influenced by transmission in neighbouring regions (in contrast to airborne infectious diseases, for which the spatial patterns matter). From a more general point of view, while such a source of association could well be weaker than the ones incorporated in the model, it should not be excluded a priori, even though it will require substantial additional work. It is a topic of future research.
Even though in this article we presented several extensions to the basic model of Del Fava et al. (2011), some enhancements are still possible. A possible extension for the model with unstructured time-dependent overdispersion parameters could be a parametric model for the correlation between HCV and HIV infections at the level of the overdispersion parameters over time. Furthermore, we could model the correlation between the overdispersion parameters for HCV and HIV using information collected at regional level, concerning risk factors related to injecting drug use (percentage of sharing syringes or other paraphernalia, etc.) or results of interventions (percentage of supplies of clean drug injection equipment, opioid substitution and other forms of drug dependence treatments, antiviral treatments for HIV, health promotion, etc.). An investigation of these type of models is a subject for an ongoing research and will be published in the near future.
Footnotes
Appendix
This appendix provides supplementary material for the analyses presented in Del Fava et al. (2013), giving more details about the output of the presented models. Section A1 gives details about the convergence assessment of the variance components for the models discussed in Section 3 in Del Fava et al. (2013). Section A2 presents individual prevalence and marginal prevalence profiles, estimated by the models discarded by the selection criteria. In Section A3 we discuss the implementation of the models in the JAGS software and in Section A4 the raw data are presented. An Excel file with the data is available upon request from the first author at the following e-mail address:
