Abstract
Count data are most commonly modeled using the Poisson model, or by one of its many extensions. Such extensions are needed for a variety of reasons: (1) a hierarchical structure in the data, e.g., due to clustering, the collection of repeated measurements of the outcome, etc.; (2) the occurrence of overdispersion (or underdispersion), meaning that the variability encountered in the data is not equal to the mean, as prescribed by the Poisson distribution; and (3) the occurrence of extra zeros beyond what a Poisson model allows. The first issue is often accommodated through the inclusion of random subject-specific effects. Though not always, one conventionally assumes such random effects to be normally distributed. Overdispersion is often dealt with through a model developed for this purpose, such as, for example, the negative-binomial model for count data. This can be conceived through a random Poisson parameter. Excess zeros are regularly accounted for using so-called zero-inflated models, which combine either a Poisson or negative-binomial model with an atom at zero. The novelty of this article is that it combines all these features. The work builds upon the modelling framework defined by Molenberghs et al. (2010) in which clustering and overdispersion are accommodated for through two separate sets of random effects in a generalized linear model.
Introduction
Count data are encountered in a wide range of applications, including medical and biomedical research. As we will now explain, they may exhibit a variety of features that somehow need to be taken into account in the modelling process: zero-inflation, overdispersion, and correlation. In this contribution, we will accommodate them all at once, within a likelihood framework, allowing for easy implementation in standard statistical software.
Univariate count data is typically modeled within the class of generalized linear models (GLM, Nelder and Wedderburn, 1972; McCullagh and Nelder, 1989; Agresti, 2002) and the exponential family is used to formulate distributional assumptions (McCullagh and Nelder, 1989). In this regard, Poisson regression models provide a standard basis for the analysis of count data. Nevertheless, it has been clear for several decades that a key feature of the GLM framework and many of the exponential family members, the so-called ‘mean-variance relationship’, may be overly restrictive. By this relationship, we indicate that the variance is a deterministic function of the mean; for the Poisson model v(n) = n. In several cases, one can expect that the Poisson assumption, however, is not satisfied: (1) when the data are hierarchically structured, as happens in longitudinal studies or when individuals are grouped into clusters; (2) when overdispersion is present in the data, for example, due to unobserved confounding; (3) when an excess of zeros is present in the data. The novelty of our work is that it combines these three features in into a single, flexible framework. As such, it goes beyond what is available in the literature. A general model framework for counts will be presented in which hierarchy, overdispersion, and zero-inflation can be modeled, extending the work by Molenberghs et al. (2010). While a relatively straightforward extension is given earlier work, it has practical relevance as our data analysis will illustrate. It will be investigated whether the different components are conveniently and jointly estimable. Indeed, zero-inflation is a special case of overdispersion, but then it is of a very particular kind. We will also examine how failure to account for at least one of the features affects the results.
There is a lot of literature on Poisson-model extensions. Breslow (1984) targets overdispersion in the Poisson model. One of the important models in this respect is the negative-binomial model, where the natural parameter is assumed to follow a gamma distribution, which has the effect of relaxing the mean–variance relationship. Lawless (1987) also contributed to this class of extensions.
When focusing on hierarchical data, the so-called generalized linear mixed model (GLMM, Breslow and Clayton, 1993; Wolfinger and O’Connell, 1993; Engel and Keen, 1994) has gained popularity as a tool to accommodate overdispersion and/or hierarchy-induced association for outcomes that are not necessarily of a Gaussian type. Booth et al. (2003) extended the negative binomial log-linear model to the case of dependent counts, where dependence among the counts is handled by including linear combinations of random effects.
Overdispersion and excess zeros for cross-sectional count data are studied by, for example, Lambert (1992) and Greene (1994). Multi-level zero-inflated Poisson regression is considered by Lee et al. (2006). Zero-inflated count models provide a way of modelling the excess zeros in addition to allowing for overdispersion by using two simultaneously operating data generation processes; one generates only zeros and the other is either a Poisson or negative-binomial data generating process. The hurdle model, which is a two-part model, is also available to model excess zeros (Mullahy, 1986). One part is a binary model for whether the response outcome is zero or positive. Conditional on a positive outcome, the second part uses a truncated Poisson or negative-binomial that modifies an ordinary distribution by conditioning on the positive outcomes. For zero-inflated correlated data, the hurdle model has been studied by Min and Agresti (2005).
The article is organized as follows. In Section 2, two motivating case studies with count outcomes are described, with analyses reported in Section 7. In Section 3, a review is given of the so-called ‘combined model’ the general modelling framework proposed by Molenberghs et al. (2010) to model overdispersed hierarchical data. Section 4 proposes an extension of this modelling framework to also deal with zero-inflation. Avenues for parameter estimation and ensuing inferences are explored in Section 5, with particular emphasis on so-called partial marginalization. Section 6 deals with a simulation study to investigate the importance of accounting for clustering, overdispersion, and a preponderance of zero counts. The case studies are analyzed in Section 7.
Motivating case studies
The Jimma Infant Growth Study
The Jimma Infants Survival Differential Longitudinal Growth Study is a survey to study infant survival in Ethiopia. Risk factors, including socio-economic, maternal, and infant-rearing factors, were recorded to be able to study their relationship with the child’s early survival. The study is described in detail by Asefa and Tessema (2002). Children born in Jimma, Keffa, and the Illubabor Zones, Southwestern Ethiopia were examined for their first-year growth characteristics. At baseline, there were a total of 7969 infants whereby 4317, 1494, and 2158 were from rural, urban, and semi-urban areas, respectively. The children were visited every two months starting from birth until the age of one year (Table 1).
Jimma Infant Growth Study. The mean number of days of illness and standard deviation at each of the seven follow-up times
Jimma Infant Growth Study. The mean number of days of illness and standard deviation at each of the seven follow-up times
One of the questions of interest in the survey is to assess the diarrheal disease burden. In this article, it is investigated whether the number of days of diarrheal illness in the two-month period prior to each visit, changes over time (i.e., age), whether the evolution differs for gender (male or female), place of residence (urban or rural), medical care (medical help given or not) and breast feeding behaviour (breast or artificial feeding). Of the 49000 observations in total, only about 8 000 observations are non-zero, indicating that there is a non-negligible dominance of zero counts.
These data are obtained from a randomized, double-blind, parallel group multi-centre study for the comparison of placebo with a new anti-epileptic drug (AED), in combination with one or two other AED’s. The study is described in full detail in Faught et al. (1996) and Molenberghs and Verbeke (2005). In this study, 45 patients were randomized to the placebo group and 44 to the new treatment group. The number of epileptic seizures was recorded on a weekly basis during a 16-week period. Thereafter, patients were entered into a long-term open-extension study, which contains follow-up measurements of patients up to 27 weeks. The key research question is whether or not the additional new treatment reduces the number of epileptic seizures.
Zero counts represent 33% of the measurements; the sample average and standard deviation are 3.18 and 6.14, respectively. Thus, there is a large proportion of zeros, as well as evidence of overdispersion and correlation stemming from the longitudinal aspect of the data.
Review of the combined model
Molenberghs et al. (2010) proposed a unified modelling framework for the analysis of overdispersed and hierarchical non-Gaussian data by bringing together normal random effects and conjugate random effects within the generalized linear model framework. On the one hand, the GLMM (Breslow and Clayton, 1993; Wolfinger and O’Connell, 1993; Engel and Keen, 1994) with normally distributed subject-specific random effects is likely the most frequently used model in the context of non-Gaussian repeated measurements. On the other hand, an elegant way to accommodate overdispersion is through a two-stage approach, in which a conjugate measurement-specific random effect on the scale of the natural parameter is used, leading to models such as the negative-binomial model for count data. These two features are brought together.
We apply the following notational convention. The model that brings both features together, i.e., the combined mode, is denoted as (PNG), where the first symbol ‘P’ refers to basic Poisson model, the second symbol ‘N’ is for normal random effects and the final one for gamma random effects. Three special cases follow by leaving out one or more of the random-effects structures: (P--) for the Poisson model, (PN-) for the Poisson-normal GLMM, and (P-G) for the negative-binomial model.
Let Yij be the jth outcome measured for subject i, with i = 1,..., N and j = 1,..., ni. In general, the combined family is given by:
with η and ϕ the natural and dispersion parameter, respectively, and
It is computationally convenient, but not strictly necessary, to assume that the two sets of random effects, θ
i
and
For count data, we assume that:
In zero-inflated count models, it is assumed that there are two processes that can generate zeros: zeros may come from both a point mass (process 1) as well as from the count component (process 2). It is assumed that for observation i at time j, process 1 is chosen with probability πij and process 2 with probability
leading to the probabilities
The zero-inflation component
The model is denoted as ZI(PNG), as an obvious extension with earlier notational conventions. Three obvious special cases are ZI(PN-), ZI(P-G), and ZI(P--). Also, all four models without zero inflation are special cases as well. The conditional mean and variance of the ZI(PNG) are:
It can be seen that the conditional variance is inflated as a result of either overdispersion in the data (parameter α), or as a result of zero-inflation (parameter πij), or both.
Likelihood estimation of the (PNG) is done by integrating over the random effects, assembling the marginal likelihood, and maximizing it in the usual way. Molenberghs, Verbeke, and Demétrio (2007) and Molenberghs et al. (2010) marginalized analytically over the gamma random effect, with then further numerical integration over the normal random effects. This enables the use of a flexible normal random-effects tool such as the SAS procedure NLMIXED. Example code can be found in the Appendix. The partially marginalized (PNG) takes the form:
This idea extends in a straightforward fashion to the ZI(PNG):
with
In this section, we report on a simulation study set up to examine the bias in estimating the regression parameters when dealing with overdispersed, longitudinal count data with excess zeros. For such data, the bias is likely to result from not appropriately accounting for the excess zero counts, misspecification of the overdispersion, which is a very common situation for count data in a way that the prescribed mean-variance link is violated and misspecification of the correlation results from the repeated-measurements nature of the data.
Simulation setting
Data are generated along a design inspired by the Jimma Infant Study. Age in months, status of getting medical help, and breast feeding behaviour were among the covariates of interest in the study, and are used in the simulation study as well.
We randomly generated 200 data sets from the zero-inflated combined model for 2000 subjects with 10 measurements per subject. The response vector
Three different scenarios were considered for data generation: S1: without excess zeros; S2: with an excess of zeros of around 20%; S3: with an excess of zeros of roughly 40%. The corresponding total zero percentages are 48%, 68%, and 88%, respectively. This was achieved, for each scenario, by appropriately choosing the zero-inflation coefficients. The true parameter values used to generate the data were
Simulation results
The simulated data are analyzed by the ZI(PNG), ZI(P-G), ZI(PN-), and ZI(P--), as well as by their non-zero-inflated counterparts. Mean, relative bias (rbias) and predicted probabilities of zero counts are summarized for the three scenarios in Tables 2–4, respectively.
Parameter estimates of the ZI(PNG) were in agreement with their true model in all scenarios. This shows that the different components: zero-inflation, overdispersion and correlation, can be well separated in practice, in settings like the ones considered here. The zero-inflated model converged for almost all simulated sets of data, apart from one perhaps idiosyncratic failure to do so.
Simulation study under scenario S1. Mean, standard error, and relative bias of the parameter estimates in ZI(PNG), ZI(P-G), ZI(PN-), ZI(P--), and its non-zero-inflated counterparts
Simulation study under scenario S1. Mean, standard error, and relative bias of the parameter estimates in ZI(PNG), ZI(P-G), ZI(PN-), ZI(P--), and its non-zero-inflated counterparts
Simulation study under scenario S2. Mean, standard error, and relative bias of the parameter estimates in ZI(PNG), ZI(P-G), ZI(PN-), ZI(P--), and its non-zero-inflated counterparts
Simulation study under scenario S3. Mean, standard error, and relative bias of the parameter estimates in ZI(PNG), ZI(P-G), ZI(PN-), ZI(P--), and its non-zero-inflated counterparts
Under S1, as shown in Table 2, the ZI(PNG) and the (PNG) performed well and fairly similar in terms of relative bias, except for the intercept p0 for which a larger bias is observed in the (PNG). The percentage of zero counts (48%) is nearly equally predicted in both cases. But, severe impact starts to emerge in the non zero-inflation models when excess zero counts are present, but not accounted for, as evidenced in Tables 3 and 4. The predicted number of zero counts is largely underestimated in the non-zero-inflated models. When many zeros are allowed for, as in S3, the effect is more pronounced in the intercept term and the negative-binomial parameter α as compared to S2. Moreover, the bias in the standard deviation of the random-effects, for instance, in the ‘true’ model tends to increase in S3, which gets substantially higher for models with neglected zero-inflation component, such as the (PNG) and (PN-).
The impact of omitting the overdispersion is remarkable. This can be clearly observed, for example, from the considerable increment in the relative bias of the ZI(PN-). When overdispersion is omitted, the zero-inflation component will try to recover part of the overdispersion.
When the correlation stemming from the repeated measurements is misspecified, substantial impact appears in inferences of the ZI(P-G), which gets even worse in the (P-G), as evidenced quite clearly from the larger relative bias of the intercept term. When correlation is omitted from the model, the overdispersion term will try to recover for this misspecification.
Unlike in S1, the ZI(PNG) significantly beats the (PNG), confirming the importance of accounting for the excess zeros in addition to the repeated measures nature and the overdispersion.
We conclude that failure to account for excess zeros, overdispersion, and/or correlation has a substantial impact on bias and predicted probabilities. This was clearly shown on such key model parameters as the intercept term, the overdispersion parameter, and the variance of the random effects. All scenarios suggest that the zero-inflated combined model is the preferred one in terms of relative bias and predicted probabilities of zeros.
The Jimma Infant Growth Study
We will fit the ZI(PNG) to the data, introduced in Section 2.1, and compare it to its special cases: (P--), (P-G), (PN-) (PNG), ZI(P--), ZI(PN-), and ZI(P-G). We model
and the zero-inflation probability
with Ri an indicator for rural residence and Ui for urban residence. The semi-urban residence category is taken as the reference. Further, Gi is a gender indicator and Tij is the time point at which the jth measurement is taken for the ith subject; Bij and Hij denote, respectively, whether or not the ith infant is breastfed and given any medication between the (j – 1) st and jth measurement occasions.
Clearly, as can be observed from Tables 5 and 6, the zero-inflated models performed much better than their respective non-zero-inflated counterparts, resulting in a substantial improvement in fit, thence implying that the extra zeros need to be accommodated, which is expected given the excessive zero counts in these data as shown in Section 2.1.
The ZI(PN-) model is an important improvement, in terms of likelihood, relative to the ZI(P--), while much more improvement is gained in the case of the ZI(P-G) relative to the ZI(P–). Moreover, considering the ZI(PNG), there is a strong improvement in fit when the gamma and normal random effects, in addition to zero-inflation, are simultaneously included. A similar observation can be made for the non-zero-inflated models.
There is a very strong improvement in fit of the ZI(P-G), when compared to the ZI(PN-). It points to the fact that overdispersion is more important an effect than the repeated-measures nature, hence the ZI(P-G) is able to perform better from the start. It underscores, once more, that overdispersion with count data is a very common situation. Eventually, both are needed.
The zero-inflation regression coefficients are similar in all models, statistically significant, and can be interpreted as model coefficients for the proportion of extra zeros.
Jimma Infant Growth Study. Parameter estimates and standard errors for the regression coefficients in (P--), (P-G), (PN-), and (PNG)
Jimma Infant Growth Study. Parameter estimates and standard errors for the regression coefficients in (P--), (P-G), (PN-), and (PNG)
ZI(PNG) and ZI(P-G) exhibit similar fits, not only in terms of parameter estimates but also in inference, except that gender is significant in the former (p = 0.0311) while this is not the case for the latter (p = 0.0922). Both models suggest that medical help, breast feeding, main effect of rural place of residence are significant; the same is true for time interactions with breast feeding and urban place of residence.
Jimma Infant Growth Study. Parameter estimates and standard errors for the regression coefficients in ZI(P--), ZI(P-G), ZI(PN-), and ZI(PNG)
We analyze the epilepsy data, introduced in Section 2.2. Let Yij represent the number of epileptic seizures that patient i experiences during week j of the follow-up period. Also, let tij be the time-point at which Yij has been recorded. Consider the combined model (1)–(4), with parameterization similar to the one in Molenberghs et al. (2010), but now accounting for zero inflation, assuming that counts are generated from a (PN-) process with mean mij:
or from a (PNG) process with mean
The zero-inflation probability
Epilepsy Study. Parameter estimates and standard error in ZI(P--), ZI(P-G), ZI(PN-), ZI(PNG), (P--), (P-G), (PN-), and (PNG)
The ZI(P-G) is an important improvement relative to the ZI(P--), while much more improvement is gained in the case of the ZI(PN-). Moreover, the ZI(PNG) leads to a substantially improved fit. Further, we observe that, omitting either the overdispersion or the correlation underestimates the predicted probability of zeros, which becomes worse when both are omitted at the same time. The ZI(PNG), fitted without random effects in the zero-inflation part, results in -2log-likelihood of 5386.8, and predicted probability of zeros equal to 0.3271. This implies that inclusion of random effects in the zero-inflation part tends to have little impact on the predicted probability of zeros. However, based on likelihood comparison, model fit improves considerably. This same phenomenon is also evident in the ZI(PN-) fitted with random effects included only in the non-zero count part (-2log-likelihood is 5971.9, and predicted probability of zeros 0.3112).
None of the zero-inflated models suggests evidence of significance in slope difference and slope ratio, except for the ZI(P--), where significance is maintained for the slope difference (p = 0.0004). However, the latter, unrealistically, omits correlation and overdispersion. The zero-inflation regression coefficients can be interpreted as model coefficients for the proportion of extra zeros, and are statistically significant in all ZI models.
There is quite a bit of research on longitudinal count data, with or without overdispersion, and with or without excess zeros. In particular, the combined model by Molenberghs, Verbeke and Demétrio (2007) and Molenberghs et al. (2010) uses normal random effects to capture the hierarchy in the count data and some overdispersion, with gamma random effects to more flexibly capture overdispersion. Also, zero inflation has been studied in the literature. The novelty of our work is that all these features are combined into one model, with more conventional models following as special cases.
In terms of estimation, we have focused on maximum likelihood estimation, in such a way that standard statistical software, such as the SAS procedure NLMIXED, can be used. An example of such code is given in the Appendix.
Of course, with the considerations of not only one but multiple sets of random effects, comes the obligation to reflect on the precise nature of such latent structures. As underscored by Verbeke and Molenberghs (2010), full verification of the adequacy of a random-effects structure is not possible based on statistical considerations alone, because there is a many-to-one map from hierarchical models to the implied marginal model. Of course, this should not stop the user from considering such models, but rather issues a word of caution.
In this sense, it would be of interest to study extensions of or alternative formulations for the normal random effects. For example, normal mixtures for the normal random effects could be used. Such mixtures can be generated by assuming normality conditional on the mean vector, which itself is assumed to be sampled from a discrete distribution with as many support points as the number of mixture components (see, for example, Verbeke and Molenberghs, 2000, Ch. 12).
Footnotes
Acknowledgements
The authors are grateful to M. Assefa and F. Tessema for the permission to use the data. Financial support from the Institutional University Cooperation of the Council of Flemish Universities (VLIR-IUC) is gratefully acknowledged. The authors gratefully acknowledge support from IAP research Network P7/06 of the Belgian Government (Belgian Science Policy).
