While several statistical tests exist for detecting zero modification in count data regression models, these rely on asymptotic results and do not transparently distinguish between zero inflation and zero deflation. In this manuscript, a novel non-asymptotic test is introduced which makes direct use of the fact that the distribution of the number of zeros under the null hypothesis of no zero modification can be described by a Poisson-binomial distribution. The computation of critical values from this distribution requires estimation of the mean parameter under the null hypothesis, for which a hybrid estimator involving a zero-truncated mean estimator is proposed. Power and nominal level attainment rates of the new test are studied and turn out to be very competitive with those of the likelihood ratio test. Illustrative data examples are provided.
There are many reasons why one would observe (or suspect) that the number of zeros in a given count dataset is unusually large or unusually small. These reasons can be roughly classified into two major categories: (a) bias arising from the data collection procedure, and (b) structural zeros due to an underlying physical reason. To give an example for (a), we cite Dietz and Böhning (2000), who modelled zero-deflated ‘Decayed, Missing and Filled Teeth’ (DMFT) index data from a dental epidemiological study previously published by Mendonca (1995). Specifically, the DMFT index quantifies the dental status of an individual through a count of ‘Decayed, Missing and Filled Teeth’, and it was noted that an ‘incorrect sampling procedure’ had led to the non-inclusion of some children whose score was zero.
An example for (b) is illustrated through the two datasets displayed in Table 1, which report results from laboratory (in vitro) experiments where frequencies of chromosome aberrations were counted after exposing blood samples to 200 kV X-rays (Heimers et al., 2006). To be more precise, blood (from healthy volunteers) was mixed and then divided into five parts, with each part being exposed to one of the five doses listed in Table 1. The radiation exposure may lead to double-strand breaks, which, when incorrectly repaired by the DNA-damage response mechanism, can produce dicentric chromosomes (i.e., chromosomes with two centromeres) or centric rings, which can be counted under a microscope. While Table 1 (left) represents data collected under a ‘whole–body–exposure’ scenario, Table 1 (right) represents a partial exposure scenario in which 25% exposed blood was mixed with 75% unexposed blood. It is clear that the three quarters of blood which have not been exposed to radiation will contribute very few chromosome aberrations (there does exist a background prevalence of such aberrations, for instance caused by naturally occurring ionizing radiation, but this rate is very low). Hence, one would naturally expect many ‘structural’ zeros in this dataset, as is indeed observed.
Such considerations lead to the question of what it actually means to speak of ‘too few’ or ‘too many’ zeros. Usually, this notion is related to a specific statistical model. For instance, in the field of radiation biodosimetry, the model of choice has been traditionally the Poisson model, based on solid physical arguments and empirical evidence. If the number of zeros is too large or too small relative to what would be expected under the assumed model, be it due to bias or for structural reasons, the Poisson model will fit poorly. A possible solution to the problem is to resort to a more complex model. In the case of partial body radiation exposure, a zero-inflated model appears to be a natural choice, though a plethora of alternative models including the negative binomial distribution and Hermite models have been suggested for this kind of data (Oliveira et al., 2016).
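Table 1. Number of chromosome aberrations in blood samples exposed to sparsely ionizing radiation. Left: whole body exposure scenario (A3); right: partial body exposure scenario (C1). Entries are the frequencies of the aberration counts given in the column headings. These datasets have been labelled (A3) and (C1) in Oliveira et al. (2016).

| Dose (A3) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Dose (C1) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 715 | 268 | 15 | 2 | 0 | 0 | 0 | 0 | 1 | 2 713 | 78 | 8 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 638 | 298 | 56 | 8 | 0 | 0 | 0 | 0 | 2 | 1 302 | 71 | 22 | 5 | 0 | 0 | 0 | 0 | 0 |
| 3 | 247 | 225 | 85 | 37 | 6 | 0 | 0 | 0 | 3 | 1 116 | 46 | 28 | 7 | 2 | 1 | 0 | 0 | 0 |
| 4 | 99 | 129 | 92 | 52 | 21 | 5 | 2 | 0 | 4 | 929 | 18 | 14 | 22 | 13 | 2 | 0 | 1 | 1 |
| 5 | 48 | 88 | 97 | 99 | 36 | 25 | 5 | 2 | 5 | 726 | 17 | 18 | 12 | 9 | 13 | 1 | 4 | 0 |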
Setting aside the decision on which alternative model to choose, there remains the immediate question of whether or not there is evidence for deflation or inflation of zeros relative to the baseline model. While this seems quite likely in Table 1, where the right-hand table features many more zeros (and far fewer ones) than the left-hand table, this question would be much harder to assess if we had not seen the left-hand table. Hence, there is a need for quantitative methods which, relative to a given model, help to decide whether zero inflation or deflation exists. The terms zero–inflation and zero–deflation have sometimes been combined to zero modification, meaning that there are either too few or too many zeros in the data, relative to the specified count data model. We follow this convention henceforth.
Of course, such methods already exist, in principle, in the statistician's toolbox, with the most prominent representative being the likelihood ratio (LR) test, which we will outline in detail in Section 2.2. Score (Rao) and Wald tests are also available for this purpose. While these tests are all viable, they rely upon asymptotic results and hence implicitly on large samples, and in their standard form they do not transparently distinguish between zero inflation and zero deflation (at least not without proper adjustment, which will be unknown to many applied users). It should also be noted that although Vuong's test for non-nested models has recently become popular as a test for zero inflation, Wilson (2015) shows such use to be methodologically erroneous.
We propose here a new and intuitive test of zero modification that avoids such issues elegantly and which possesses similar attainment rates and power to previous tests. The proposed test relates more directly to the character of zero modification than the other tests: the test will employ the number of zeros in the data as the test statistic, and tests whether this number is consistent with the non-zero-modified model. We will demonstrate that this statistic, under the null hypothesis of no zero-modification, follows a Poisson-binomial distribution, based on which critical values can be obtained.
The rest of the article develops as follows. Section 2 will introduce the test problem, and review the LR test in this context. Section 3 will introduce our new test of zero modification. Section 4 will discuss the important question of how to robustly estimate the Poisson mean parameter (which is needed for the computation of our test statistic) in the absence of the knowledge of whether or not the Poisson assumption is correct. Section 5 provides real data examples with and without covariates, including a detailed study of the chromosome aberration data. Section 6 provides concluding remarks.
Testing for zero modification
Hypotheses
To fix terms, denote by $y = (y_1, \ldots, y_n)$ an independent sample drawn from count random variables $Y_1, \ldots, Y_n$. We denote further the mean of $Y_i$ by $\mu_i$, and the number of zeros in $y$ by $n_0$, which can be considered as a realization of the random variable $N_0$.
The question of interest is whether the distribution of the $Y_i$ is zero-modified with respect to a given count distribution $F$ with densities $f(y \mid \mu_i, \theta)$, where the mean parameter $\mu_i$ may depend on a set of covariates $x_i$, in some pre-specified form, and $\theta$ captures further model parameters such as shape or scale parameters of $F$. (In principle, $\theta$ could be modelled by further covariates, though for ease of presentation we assume that this is not the case.) That is, we assume that $g(\mu_i) = x_i^T \beta$ with some monotonic and known link function $g$, and model parameters $\beta$ which will have to be estimated. While the new test procedure is applicable to test for zero modification w.r.t. any baseline count distribution, in this work the most important application will be the Poisson distribution, in which case $\mu_i$ corresponds just to the Poisson parameter, and $\theta$ is empty.
Expressing the foregoing general framework in other words, we wish to establish whether the distributional assumption $Y_i \sim F(\mu_i, \theta)$ is consistent with the number of zeros observed. It is clear that both the count distribution $F$ and the predictor specification for $\mu_i$ impact on the model fit. We consider our test as a tool to assess the adequacy of $F$ given the specification of $\mu_i$, but not as a tool to simultaneously assess $F$ and $\mu_i$. Hence, we use $F$ in what follows as short-hand notation for the entire model specification, that is, we identify notationally $F \equiv F(\mu_i, \theta)$.
We formulate the null hypothesis and three possible alternatives as follows:

$$H_0: \text{the } Y_i \text{ are distributed according to } F \quad \text{versus} \quad \begin{cases} H_1^{(a)}: & \text{the } Y_i \text{ are zero-modified w.r.t. } F,\\ H_1^{(b)}: & \text{the } Y_i \text{ are zero-inflated w.r.t. } F,\\ H_1^{(c)}: & \text{the } Y_i \text{ are zero-deflated w.r.t. } F. \end{cases} \tag{2.1}$$
Notably, our approach will not require fitting the model under the alternative, which is a property shared with the score test but not with the Wald test and the LR test. The latter procedure, which can be considered as the most prominent among the three asymptotic sister tests, is briefly reviewed in the following.
Likelihood ratio tests
LR tests are usually employed to determine whether a larger model fits significantly better than a competing smaller (or ‘restricted’) model that is nested within it (though some variants for non-nested models have also been proposed; see, for example, Cox (1962) and Vuong (1989)). In the case of testing for zero inflation, the larger model takes the shape

$$f_{ZI}(y \mid \mu, \theta, \pi) = \pi\, \delta_0(y) + (1 - \pi)\, f(y \mid \mu, \theta), \tag{2.2}$$

where $\pi$ is the zero-inflation parameter, $f$ is some base density (corresponding to the restricted model) such as Poisson or negative binomial, and $\delta_0$ is a point mass at $0$, that is, $\delta_0(y) = 1$ if $y = 0$ and $\delta_0(y) = 0$ otherwise. The test problem of zero inflation can then be stated as

$$H_0: \pi = 0 \quad \text{versus} \quad H_1: \pi > 0.$$

For nested models where the smaller model does not sit on the boundary of the parameter space of the larger model, it is well known that the distribution of the LR test statistic (under the restricted model, corresponding to the null hypothesis) follows a $\chi^2$ distribution, with the degrees of freedom being equal to the number of parameters by which the two models differ. However, when testing for zero-inflation we are precisely in a scenario where the restricted model, for $\pi = 0$, does sit on that boundary. Molenberghs and Verbeke (2007) showed that the resulting LR test statistic

$$LR = 2\left\{ \ell(\hat{\mu}, \hat{\theta}, \hat{\pi}) - \ell(\hat{\mu}^{0}, \hat{\theta}^{0}, 0) \right\}, \tag{2.3}$$

where $\ell(\mu, \theta, \pi) = \sum_{i=1}^{n} \log f_{ZI}(y_i \mid \mu_i, \theta, \pi)$, with $\mu = (\mu_1, \ldots, \mu_n)$, and the superscript $0$ indicating that all model parameters have been estimated under the restriction $\pi = 0$, follows an equal mixture of a $\chi^2_0$ (i.e., a point mass at zero) and a $\chi^2_1$ distribution. Table 2 compares the theoretical 95%, 98% and 99% quantiles of such a distribution with the estimates of those quantiles of the distribution of the log-likelihood ratios (based upon 10 000 resamples) when zero-inflated Poisson and Poisson models are fitted to samples of sizes $n = 1\,000$, $40$ and $20$ drawn from Poisson data with parameters $\mu = 2.0$, $0.8$ and $0.5$, respectively. As is apparent, even for relatively large sample sizes and Poisson means, the approximation is somewhat poor.
While model (2.2) was originally only conceived as a ‘zero-inflated’ model, it actually allows for zero deflation. In the particular case of zero-modified Poisson (ZMP), where $f(y \mid \mu) = e^{-\mu} \mu^{y} / y!$, one can show that the density is still well defined for all $\pi \in [-1/(e^{\mu} - 1),\, 1]$. The test problem of zero modification can then be stated as

$$H_0: \pi = 0 \quad \text{versus} \quad H_1: \pi \neq 0,$$

and in this case the asymptotic distribution reverts to $\chi^2_1$. However, note that especially (but not only) in the presence of covariates, often a monotonic link function $r$ is applied to $\pi$, with common choices being the complementary log–log (cloglog) link or the logit link, $r(\pi) = \log\{\pi / (1 - \pi)\}$. Notably, the use of a logit or cloglog link excludes the detection of zero deflation since they imply the restriction $0 < \pi < 1$. Hence, if zero deflation is to be detected, then the identity link is the best choice. It is finally noted that, even though the LR test can be used to test for any of zero inflation, zero deflation or zero modification in principle, the LR test statistic (2.3) as such is uninformative for the direction of the modification.
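To make the admissible range of the zero-modification parameter concrete, the following is a minimal Python sketch of the zero-modified Poisson mass function, assuming the mixture form of model (2.2) with a Poisson base density. The function names and the example mean are illustrative choices and are not taken from the paper.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability mass at y for mean mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

def zmp_lower_bound(mu):
    """Smallest zero-modification parameter for which P(Y = 0) >= 0."""
    return -1.0 / math.expm1(mu)

def zmp_pmf(y, mu, pi):
    """Zero-modified Poisson mass: pi * delta_0(y) + (1 - pi) * Poisson(y | mu)."""
    if not zmp_lower_bound(mu) <= pi <= 1.0:
        raise ValueError("pi outside the range where the density is well defined")
    point_mass = 1.0 if y == 0 else 0.0
    return pi * point_mass + (1.0 - pi) * poisson_pmf(y, mu)

mu = 1.0  # hypothetical Poisson mean
for pi in (0.3, 0.0, -0.3):
    # pi > 0 inflates, pi = 0 leaves unchanged, pi < 0 deflates the zero mass
    print(f"pi = {pi:+.1f}  P(Y = 0) = {zmp_pmf(0, mu, pi):.4f}")
```

Negative values of the parameter remove probability mass from zero, and at the lower bound the zero class is emptied entirely, which is why a logit or cloglog link cannot represent deflation.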
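Table 2. Observed quantiles of the distribution of the log-likelihood ratios ZIP versus Pois and the theoretical quantiles under a $0.5\chi^2_0 + 0.5\chi^2_1$ distribution

| n | μ | 95% quantile | 98% quantile | 99% quantile |
|---|---|---|---|---|
| 1 000 | 2.0 | 2.592 | 4.099 | 5.461 |
| 40 | 0.8 | 2.340 | 3.758 | 5.082 |
| 20 | 0.5 | 2.235 | 4.415 | 4.785 |
| Theoretical | | 2.706 | 4.218 | 5.412 |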
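The theoretical row of Table 2 can be checked with a short Python sketch. For an equal mixture of a point mass at zero and a chi-squared distribution with one degree of freedom, the distribution function at $c > 0$ is $0.5 + 0.5\{2\Phi(\sqrt{c}) - 1\} = \Phi(\sqrt{c})$, so the quantile is the squared standard normal quantile. The function name is an illustrative choice.

```python
from statistics import NormalDist

def mixture_quantile(gamma):
    """gamma-quantile of 0.5*chi2_0 + 0.5*chi2_1, for 0.5 < gamma < 1."""
    if not 0.5 < gamma < 1.0:
        raise ValueError("the closed form below assumes 0.5 < gamma < 1")
    # P(LR <= c) = Phi(sqrt(c)), hence c = (Phi^{-1}(gamma))^2
    z = NormalDist().inv_cdf(gamma)
    return z * z

for gamma in (0.95, 0.98, 0.99):
    print(f"{gamma:.2f}: {mixture_quantile(gamma):.3f}")
# prints 2.706, 4.218 and 5.412, matching the theoretical row of Table 2
```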
The proposed test
Distribution of test statistic
Assume $H_0$ is true and hence $F$ is the correct model, with density $f(y \mid \mu_i, \theta)$. Let $p_i = f(0 \mid \mu_i, \theta)$ (i.e., in the Poisson case, $p_i = e^{-\mu_i}$), and $Z_i$ a random variable which takes the value $1$ if $Y_i = 0$, and $0$ otherwise. Clearly, $Z_i$ is a Bernoulli random variable with parameter $p_i$, and so the random variable $N_0 = \sum_{i=1}^{n} Z_i$, which serves as test statistic, can be formulated as the sum over $n$ independent Bernoulli experiments $Z_1, \ldots, Z_n$.
Based on this simple observation, consider the special case that there are no covariates, that is $\mu_i \equiv \mu$. In this case, the $p_i$’s are equal also, $p_i \equiv p$, and so the distribution of $N_0$ is the binomial distribution Bin$(n, p)$, and thus has mean $np$ and variance $np(1 - p)$. Based on this distribution, one can immediately compute quantiles corresponding to a given significance level, and use these as critical values for the test; see Section 3.2.
The situation is more interesting when $\mu_i$ does depend on covariates $x_i$, that is, $\mu_i = \mu(x_i)$, and hence the $p_i$’s are not all equal. The distribution of a sum of independent Bernoulli random variables with different success probabilities is known as a ‘Poisson–binomial’ distribution (Chen and Liu, 1997), with probability mass function

$$P(N_0 = k) = \sum_{A} \Big( \prod_{i \in A} p_i \Big) \Big( \prod_{j \in A^c} (1 - p_j) \Big),$$

where $k = 0, 1, \ldots, n$, $A^c = \{1, \ldots, n\} \setminus A$, and the summation is over all possible combinations $A = \{i_1, \ldots, i_k\}$ of $k$ distinct indices from $\{1, \ldots, n\}$.
Note that this is not a ‘compound’ Poisson-binomial distribution. Daskalakis et al. (2012) remark that ‘It is believed that Poisson (1837) was the first to consider this extension of the binomial distribution, and the distribution is sometimes referred to as “Poisson's binomial distribution”.’
The R package poibin (Hong, 2013b) implements both exact and approximate methods for computing the cumulative distribution function of the Poisson-binomial distribution based upon algorithms presented by Hong (2013a). It also provides the probability mass function, quantile function and random number generation for the Poisson-binomial distribution. Four options for the model fitting algorithms are available in poibin; throughout this article we use the default DFT-CF algorithm.
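For readers without access to poibin, the Poisson-binomial probability mass function can also be computed by a simple convolution recursion over the observations. The Python sketch below uses this recursion rather than the DFT-CF algorithm employed in this article. The function name and the fitted means are hypothetical.

```python
import math

def poisson_binomial_pmf(probs):
    """Return [P(N0 = 0), ..., P(N0 = n)] for independent Bernoulli(p_i)."""
    pmf = [1.0]  # distribution of an empty sum
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this observation is non-zero
            nxt[k + 1] += mass * p      # this observation is a zero
        pmf = nxt
    return pmf

# Hypothetical fitted Poisson means, so that p_i = exp(-mu_i).
mus = [0.5, 1.0, 1.5, 2.0]
pmf = poisson_binomial_pmf([math.exp(-m) for m in mus])
print([round(v, 4) for v in pmf])
```

The recursion needs of the order of $n^2$ operations, which is adequate for moderate sample sizes. When all probabilities are equal it reproduces the binomial distribution.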
Test procedure
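To carry out the actual test, specify a significance level $\alpha$ and decide for one of the test scenarios (a), (b) or (c) as given in (2.1). Denote by $q_\gamma$ an appropriate $\gamma$-quantile of the Poisson-binomial distribution of $N_0$ (to be discussed later). The test consists of carrying out the following procedure:

(a) Fit the relevant count data regression model to the data, yielding means

$$\hat{\mu}_i = g^{-1}(x_i^T \hat{\beta}) \tag{3.1}$$

and, if relevant, further distributional parameters $\hat{\theta}$;
(b) for each $i$, estimate $\hat{p}_i = f(0 \mid \hat{\mu}_i, \hat{\theta})$;
(c) use a Poisson-binomial distribution with parameters $\hat{p}_1, \ldots, \hat{p}_n$ to determine the distribution of $N_0$. [This reduces to the binomial distribution in the absence of covariates, where $\hat{p}_i \equiv \hat{p}$.]
(d) Depending on the chosen alternative, do one of the following:
    Alternative (a): reject $H_0$ in favour of $H_1^{(a)}$ if $n_0 < q_{\alpha/2}$ or $n_0 > q_{1-\alpha/2}$.
    Alternative (b): reject $H_0$ in favour of $H_1^{(b)}$ if $n_0 > q_{1-\alpha}$.
    Alternative (c): reject $H_0$ in favour of $H_1^{(c)}$ if $n_0 < q_{\alpha}$.
    Otherwise, one fails to reject $H_0$.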
For the use in our test, appropriate quantiles, or, equivalently, p-values, need to be extracted from the relevant Poisson-binomial distribution. For instance, for test problem (b), the customarily defined quantile and p-value are given by $q_{1-\alpha} = \min\{t : P(N_0 \le t) \ge 1 - \alpha\}$ and $P(N_0 \ge n_0)$, respectively. However, it has been argued in the literature that these quantities behave unfavourably for discrete distributions, both from a theoretical and practical viewpoint (Franck, 1986; Ma et al., 2011). The former reference strongly advocates the use of the ‘mid-p-value’, drawing on previous research by Lancaster (1961), Dempster and Schatzoff (1965) and Stone (1969). Specifically, for a given value $t$ of the test statistic $N_0$, the mid-p-value is given by

$$p_{\text{mid}}(t) = P(N_0 > t) + \tfrac{1}{2} P(N_0 = t)$$

and, under the null hypothesis, enjoys the property that $E\{p_{\text{mid}}(N_0)\} = 1/2$, unlike for the customarily defined p-value for which this expectation may range between $1/2$ and $1$ for discrete distributions (Franck, 1986).
Following similar lines of reasoning, one can motivate and define the ‘mid–quantile’ (Ma et al., 2011), the mathematical definition of which is a bit lengthy and is therefore omitted here. For a precise formulation, in the context of the test under study, see Wilson and Einbeck (2017). For all application studies to be carried out in Section 5, we will employ mid-p-values and mid-quantiles. We refer to the interval $[q_{\alpha/2}, q_{1-\alpha/2}]$ of mid-quantiles as a mid-quantile interval (MQI) for $N_0$.
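The mid-p-value computation can be sketched in a few lines of Python once the null probability mass function of the number of zeros is available as a list. The sketch below works in the covariate-free binomial case. Two points are assumptions of this illustration and are not taken from the paper: the two-sided mid-p-value is obtained by doubling the smaller tail (one common convention), and the sample size, mean and observed count are hypothetical.

```python
import math

def binomial_pmf(n, p):
    """pmf of Bin(n, p), the null distribution of N0 without covariates."""
    return [math.comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]

def mid_p_values(pmf, n0):
    """Return (upper, lower, two_sided) mid-p-values for the observed count n0."""
    upper = sum(pmf[n0 + 1:]) + 0.5 * pmf[n0]  # small if there are too many zeros
    lower = sum(pmf[:n0]) + 0.5 * pmf[n0]      # small if there are too few zeros
    two_sided = min(1.0, 2.0 * min(upper, lower))  # assumed doubling convention
    return upper, lower, two_sided

def decide(pmf, n0, alpha=0.05):
    """Reject/fail-to-reject decision for each of the three alternatives."""
    upper, lower, two_sided = mid_p_values(pmf, n0)
    return {
        "zero modification": two_sided < alpha,
        "zero inflation": upper < alpha,
        "zero deflation": lower < alpha,
    }

# Hypothetical example: n = 50 Poisson counts with mean 1 and 25 observed zeros.
pmf = binomial_pmf(50, math.exp(-1.0))
print(mid_p_values(pmf, 25))
print(decide(pmf, 25))
```

The upper and lower mid-p-values always sum to one, which reflects the symmetric treatment of the probability mass at the observed value.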
Estimating the Poisson parameter
A key component of our test which has not been discussed in detail yet is how to estimate the mean function (3.1) in step (a) of the test introduced in Subsection 3.2. The reader may be surprised that there is an issue at this stage; the problem is that we need to estimate a Poisson mean parameter without knowing whether this Poisson assumption is correct, that is, whether there is zero modification or not. For a given sample without covariates, the ‘obvious’ choice under the Poisson assumption would be the ‘whole sample mean’ $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$, which corresponds to the maximum likelihood estimator, and is unbiased for $\mu$. However, as we will demonstrate in Subsection 4.2, this estimate may lead to a severe underestimation of $\mu$ if the data is in fact zero-inflated, or an overestimation if the data is in fact zero-deflated. We therefore consider in Subsection 4.1 an alternative mean estimator based on the zero-truncated distribution, which resolves this problem at the expense of an increased variance. A hybrid version of the two estimators is introduced in Subsection 4.3, and its properties in terms of the test under consideration are analysed in Subsection 4.4. For ease of presentation, all considerations in Subsections 4.1 to 4.4 are provided in the case without covariates. The required adaptations when including covariates are discussed in Subsection 4.5.
Estimation through zero-truncated distribution
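As before, we denote by $n_0$ the number of zero-valued observations in $y$. When the latter is Poisson, then the distribution of the $n - n_0$ non-zero observations will follow a zero-truncated Poisson distribution with probability mass function

$$f_{ZTP}(y \mid \mu) = \frac{\mu^{y}}{y!\,(e^{\mu} - 1)} \tag{4.1}$$

for $y = 1, 2, \ldots$. It is well known that

$$E(Y \mid Y > 0) = \frac{\mu}{1 - e^{-\mu}} \tag{4.2}$$

and hence an estimator $\hat{\mu}_t$ of $\mu$ is obtained as the solution of

$$\frac{1}{n - n_0} \sum_{i:\, y_i > 0} y_i = \frac{\hat{\mu}_t}{1 - e^{-\hat{\mu}_t}}. \tag{4.3}$$

Irwin (1959) gives an explicit expression for $\hat{\mu}_t$ involving a Lagrange series expansion; this is sometimes slow to converge and not conveniently implementable. Plackett (1953) shows that $\mu$ can be estimated without bias through the expression $\frac{1}{n - n_0} \sum_{i:\, y_i \ge 2} y_i$. Ridout and Demétrio (1992) show that a very accurate estimate of $\hat{\mu}_t$ may be obtained from an approximation, referred to as (4.4), which is computed from $\bar{y}_+$, the mean of the positive observed data. We use estimator (4.4) in what follows.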
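As an illustration of estimation from the zero-truncated sample, the Python sketch below solves the moment equation for the zero-truncated Poisson mean, namely mean of the positive counts $= \mu / (1 - e^{-\mu})$, by bisection. This numerical root-finder is a stand-in and is not the Ridout and Demétrio (1992) approximation used in this article. The sketch also includes the simple unbiased estimator attributed to Plackett (1953), taken here to be the sum of all counts of at least two divided by the number of positive counts. Function names are illustrative.

```python
import math

def ztp_mean_estimate(ys, iterations=200):
    """Solve mean(positive ys) = mu / (1 - exp(-mu)) for mu by bisection."""
    positives = [y for y in ys if y > 0]
    if not positives:
        raise ValueError("need at least one positive count")
    ybar_plus = sum(positives) / len(positives)
    if ybar_plus <= 1.0:
        return 0.0  # all positive counts equal 1: boundary solution
    lo, hi = 0.0, ybar_plus  # the root lies strictly between 0 and ybar_plus
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid / -math.expm1(-mid) < ybar_plus:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def plackett_estimate(ys):
    """Sum of the counts >= 2 divided by the number of positive counts."""
    positives = [y for y in ys if y > 0]
    return sum(y for y in positives if y >= 2) / len(positives)

# Positive part of the horse kicks data: 91 ones, 32 twos, 11 threes, 2 fours.
ys = [1] * 91 + [2] * 32 + [3] * 11 + [4] * 2
print(ztp_mean_estimate(ys), plackett_estimate(ys))
```

Neither estimator uses the number of zeros, which is the property exploited in what follows.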
Bias and precision of estimators
For $\bar{y}$, it is important to recognize that although, unconditionally, $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$ is an unbiased estimator of $\mu$, this is not the case when conditioning on the number of observed zeros, $n_0$. With $N_0$ as defined in Subsection 2.1, and assuming w.l.o.g. that the first $n - n_0$ observations $y_1$, …, $y_{n - n_0}$ give the non-zero results, one has from (4.2)

$$E(\bar{Y} \mid N_0 = n_0) = \frac{1}{n} \sum_{i=1}^{n - n_0} E(Y_i \mid Y_i > 0) = \frac{n - n_0}{n} \cdot \frac{\mu}{1 - e^{-\mu}}. \tag{4.6}$$

If we substitute $n e^{-\mu}$ (i.e., $E(N_0)$) for $n_0$ in (4.6), the right-hand side reduces to $\mu$, and hence if $n_0 > n e^{-\mu}$, the Poisson parameter tends to be underestimated, and if $n_0 < n e^{-\mu}$, it tends to be overestimated. It is worth noting that the derivation of (4.6) remains valid when allowing for zero modification (that is when assuming the Poisson assumption to hold only for the non-zero part).
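The direction of the conditional bias can be checked numerically. The Python sketch below evaluates the conditional expectation of the whole sample mean given the number of zeros, $\frac{n - n_0}{n} \cdot \frac{\mu}{1 - e^{-\mu}}$, for a hypothetical Poisson mean and sample size. The function name and the numbers are illustrative.

```python
import math

def conditional_mean_of_ybar(mu, n, n0):
    """E(Ybar | N0 = n0) = (n - n0) / n * mu / (1 - exp(-mu))."""
    return (n - n0) / n * mu / -math.expm1(-mu)

mu, n = 1.0, 100                     # hypothetical values
expected_zeros = n * math.exp(-mu)   # about 36.8 zeros expected under Poisson
for n0 in (25, expected_zeros, 50):
    print(f"n0 = {n0:6.2f}  E(Ybar | n0) = {conditional_mean_of_ybar(mu, n, n0):.4f}")
```

With fewer zeros than expected the conditional expectation exceeds $\mu$, with exactly the expected number it equals $\mu$, and with more zeros it falls below $\mu$.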
In contrast, the estimator $\hat{\mu}_t$ of (4.3) does not incur bias when conditioning on $n_0$, since the number of zeros is not involved in its calculation. However, it is less precise than $\bar{y}$. This is illustrated in Figure 1 which shows the estimates of the Poisson means obtained when observations are sampled from a Poisson distribution. The black circles indicate whole sample mean (Poisson) estimates $\bar{y}$, and the grey crosses the estimates $\hat{\mu}_t$ obtained from the positive observations. The horizontal axis gives the number of zeros, $n_0$, with the expected number of zeros under the Poisson model, $n e^{-\mu}$, highlighted by a dotted line. It is clear that the whole sample mean estimator has smaller variance, but is biased if the observed number of zeros is far from their expected number. On the other hand, the ZTP-derived mean estimator does not demonstrate a noticeable bias, at the expense of a large variance.
Figure 1. Estimation from the zero-truncated and whole sample
The unsuitability of using either $\bar{y}$ or $\hat{\mu}_t$ on its own in our test problem is shown by Figure 2. Its two diagrams illustrate the attainment of the test of $H_0$ at a fixed nominal significance level and sample size, with the left-hand diagram using one of the two estimators and the right-hand diagram the other. Clearly, even for such a sample size, neither estimator is suitable.
Figure 2. Attainment rate using $\bar{y}$ and $\hat{\mu}_t$
A hybrid estimator
We propose here a hybrid estimator for the Poisson parameter, $\hat{\mu}_h$, that balances the precision of $\bar{y}$ with the accuracy of $\hat{\mu}_t$:

$$\hat{\mu}_h = w\, \hat{\mu}_t + (1 - w)\, \bar{y}, \qquad 0 \le w \le 1. \tag{4.7}$$

Iterative schemes which alternately optimize $w$ (in terms of MSE) and update $\hat{\mu}_h$ were considered, but found rather unsuitable since the additional variance created in this process defeats the purpose of the hybrid estimator. Instead, we give the following, simpler, recommendations based on simulation studies which are presented in summarized form in Figure 3. It is apparent that for larger mean parameter values, the value of $w$ is less critical than for smaller values, and that

(a) a fixed value of $w$

returns a parameter estimate that results in good power and attainment of the nominal level of significance for all values of the Poisson parameter. Based on comprehensive simulations which we have carried out but do not present in detail, we also suggest an ‘adaptive’ selection method for $w$ that results in slightly improved power and attainment, namely

(b) a value of $w$ that varies with the size of the mean parameter,

the rationale being that smaller mean parameters will lead to many zeros and thus few positive observations; hence the weight of the truncated estimator should decrease in this case. (The constant in the adaptive rule is chosen so that $w$ is continuous.) A detailed study of the performance of schemes (a) and (b), for one-sided and two-sided tests, is provided as follows.
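The mixing step itself is simple to express in code. The Python sketch below assumes that the hybrid rule is a convex combination of the whole sample mean and the zero-truncated estimate, with the weight attached to the truncated estimate. The weight value and the two input estimates used here are hypothetical placeholders and are not the values recommended under schemes (a) or (b).

```python
def hybrid_mean(whole_sample_mean, truncated_mean, weight):
    """Convex combination: weight on the truncated estimate, rest on the sample mean."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * truncated_mean + (1.0 - weight) * whole_sample_mean

def hybrid_means(whole_fits, truncated_fits, weights):
    """Case-wise version for fitted values from models with covariates."""
    return [hybrid_mean(p, t, w) for p, t, w in zip(whole_fits, truncated_fits, weights)]

# Hypothetical estimates and weight, for illustration only.
print(hybrid_mean(0.70, 0.78, 0.5))
print(hybrid_means([0.6, 1.2], [0.7, 1.1], [0.3, 0.6]))
```

A weight of zero recovers the whole sample mean and a weight of one recovers the zero-truncated estimate, so the weight trades precision against robustness to zero modification.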
Figure 3. Left: observed attainment under various values of the mixing weight $w$; right: observed power
Attainment and power of the proposed test
In a simulation study, the nominal level attainment of the proposed tests was studied under a nominal 0.05 level of significance and sample sizes of 500, 100 and 30. Figure 4 shows the attainment rates as a function of the true Poisson parameter for both schemes, ‘fixed’ and ‘adaptive’, with the corresponding rates for the LR test shown for comparative purposes. Results are presented for both a two-tailed test of zero modification (alternative hypothesis $H_1^{(a)}$) and a one-tailed test of zero-inflation ($H_1^{(b)}$). It is apparent that, for both test scenarios, both the fixed and adaptive mixing parameters have excellent attainment rates, the latter especially so.
Figures 5 and 6 show the power of the proposed zero-modification and zero-inflation tests, respectively, for sample sizes of 500, 100 and 30 and Poisson parameters of 0.5, 1 and 2. The power of the LR test is shown for comparative purposes. It is observed that under both test scenarios, the adaptive and fixed mixing parameters lead to tests with nearly identical powers which are either extremely similar to that of the LR test, or greater. The relatively weaker power of the LR test becomes more pronounced for small Poisson parameters and sample sizes, noting however that for very small sample sizes, the comparison becomes difficult since then all attainment curves behave rather erratically (Figure 4 bottom).
Concerning the execution of the simulation, in the right-hand side diagrams of Figure 4 and the diagrams of Figure 6, which pertain to the one-sided version of the test, the competing models of the LR test are a Poisson model, where $\mu$ is modelled by a log link, and a zero-inflated Poisson model, where $\mu$ is modelled by a log link and $\pi$ by a logit link. For the estimates required for the proposed test, the estimate of $\mu$ also uses a log link. In the left-hand side diagrams of Figure 4 and the diagrams of Figure 5, which pertain to the two-sided version of the test, the competing models of the LR test are a Poisson model, where $\mu$ is modelled by an identity link, and a ZMP model, where both $\mu$ and $\pi$ are modelled by an identity link, and the estimates of $\mu$ required for the proposed test are also derived using identity links. Depending upon the value of the Poisson parameter and the sample size, either 5 000 or 25 000 resamples were used to determine the rejection rates. It is further noted that where power curves appear incomplete (such as Figure 6 bottom left), the configuration of parameters led to occasional samples which consisted almost entirely of zeros, and hence could not be reliably fitted within the framework of a simulation study.
Figure 4. Attainment rate under hybrid estimator. Left: alternative $H_1^{(a)}$; right: $H_1^{(b)}$
Figure 5. Power under hybrid estimators (test of zero–modification; $H_1^{(a)}$)
Figure 6. Power under hybrid estimators (test of zero–inflation; $H_1^{(b)}$)
Mean function estimation in the presence of covariates
In this case, the hybrid version of the estimated mean for case $i$ may be obtained by computing fitted values under a Poisson model (say, $\hat{\mu}_{p,i}$) and a zero-truncated Poisson model ($\hat{\mu}_{t,i}$), respectively, and then applying the hybrid technique (4.7) on the respective pairs of fitted values. The fitted values from the ZTP model can be obtained using statistical software such as the R-package VGAM (Yee, 2010). (Of course, this methodology could also be applied in the absence of covariates, in which case the fitted values will all be equal.) Denoting by $\hat{\mu}_{h,i}$, $i = 1, \ldots, n$, the resulting hybrid mean estimates, this implies that for scheme (a) one simply applies the fixed mixing weight to each pair of fitted values,
whereas for scheme (b), an adaptive choice of the mixing weight is obtained separately for each case $i$, yielding a case-wise hybrid rule.
Figure 7 illustrates the power and attainment of the proposed test in comparison to the LR test in the presence of covariates. The left-hand diagram displays the powers obtained when observations are simulated from a zero-modified Poisson model, with zero-modification parameter $\pi$ (on the horizontal axis) and a Poisson parameter depending on a single covariate $x$, where $x$ is a random draw of 50 observations from a uniform distribution. The right-hand diagram illustrates the powers obtained when observations are simulated from a zero-modified Poisson model with a Poisson parameter depending on two covariates $x_1$ and $x_2$, where $x_1$ and $x_2$ are both random draws of 100 observations from a uniform distribution. The adaptive hybrid parameter has been used, but the results remain similar for the fixed estimator. Overall, these plots give evidence that the proposed test compares strongly to the LR test also in the presence of covariates.
Figure 7. Power under hybrid estimator (covariate model)
Examples
In this section, we present a collection of examples, with and without covariates. R Code to reproduce these examples will be provided in the Statistical Modelling archive under www.statmod.org/smij/.
We initially present an example of the proposed test applied to covariate-free data, in which case the Poisson-binomial distribution reduces to a binomial distribution, and proceed with two covariate-bearing examples in the subsections which follow. For all the examples of this section, the adaptive (scheme (b)) hybrid estimator of the Poisson mean was used in the execution of the proposed test.
For the one-sided tests of zero inflation, the Poisson parameter is modelled by a log link and the zero-inflation parameter by a logit link; for the one-sided tests of zero deflation and the two-sided test of zero modification, both the Poisson and zero-inflation parameter are modelled by identity links. The estimates of the Poisson parameters necessary for the proposed test are derived from the Poisson and truncated-Poisson models with log link for the one-sided tests, and with identity link for the two-sided tests.
The ‘Prussian horse kicks’ data
Table 3. The ‘Prussian horse kicks’ data

| y | 0 | 1 | 2 | 3 | 4 | ≥ 5 |
|---|---|---|---|---|---|---|
| Count | 144 | 91 | 32 | 11 | 2 | 0 |
Table 3 is the famous ‘horse kicks’ data of von Bortkiewicz (1898), which summarizes the number of deaths by horse or mule kicks per Prussian army corps annually between 1875 and 1894. Table 4 illustrates the use of one-sided and two-sided versions of the proposed test. Concerning the latter, we fail to reject $H_0$: data is Poisson in favour of $H_1^{(a)}$: data is ‘zero-modified’ Poisson as the test statistic, that is, the observed number of zeros $n_0 = 144$, lies within the 95% MQI, or equivalently as $p = 0.30 > 0.05$. Note that this is in agreement with the results of an LR test of the same hypothesis.
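Table 4. One- and two-sided tests of zero modification

| Alternative | $n_0$ | 95% MQI | 90% MQI | p-value (proposed test) | LR statistic | p-value (LR test) |
|---|---|---|---|---|---|---|
| (a) zero modification | 144 | [120.07, 148.23] | | 0.30 | 1.026 | 0.288 |
| (b) zero inflation | 144 | | [118.09, 150.87] | 0.137 | 1.026 | 0.144 |
| (c) zero deflation | 144 | | [118.09, 150.87] | 0.863 | | 0.856 |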
Further, we fail to reject $H_0$: data is Poisson in favour of the zero-inflated alternative $H_1^{(b)}$ since the observed number of zeros is not greater than the 95% quantile of the relevant binomial distribution (i.e., the upper limit of the 90% MQI in Table 4), or equivalently as $p = 0.137 > 0.05$; similarly, we fail to reject $H_0$: data is Poisson in favour of the zero-deflated alternative $H_1^{(c)}$ as the test statistic, that is, the observed number of zeros, is not less than the 5% quantile of the relevant binomial distribution (the lower limit of the 90% MQI in Table 4), or equivalently as $p = 0.863 > 0.05$. Again both these results are in agreement with the results of an LR test of the same hypotheses.
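The covariate-free calculation for the horse kicks data can be retraced with a short Python sketch. For simplicity it plugs the plain whole sample mean into the binomial null distribution, whereas the analysis reported in Table 4 uses the adaptive hybrid estimator. The resulting mid-p-value is therefore an illustration of the mechanics and is not expected to reproduce the tabulated value.

```python
import math

counts = {0: 144, 1: 91, 2: 32, 3: 11, 4: 2}  # Table 3
n = sum(counts.values())                      # 280 corps-years
n0 = counts[0]                                # observed number of zeros
ybar = sum(y * c for y, c in counts.items()) / n  # whole sample mean
p0 = math.exp(-ybar)                          # P(Y = 0) under the Poisson model

# Binomial null distribution of the number of zeros (no covariates).
pmf = [math.comb(n, k) * p0 ** k * (1.0 - p0) ** (n - k) for k in range(n + 1)]
mid_p_inflation = sum(pmf[n0 + 1:]) + 0.5 * pmf[n0]

print(f"n = {n}, mean = {ybar:.2f}, expected zeros = {n * p0:.2f}")
print(f"mid-p-value for zero inflation (whole sample mean): {mid_p_inflation:.3f}")
```

The observed 144 zeros exceed the roughly 139 expected under this plug-in fit, but the excess is well within the binomial variability, in line with the conclusion drawn above.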
Chromosome aberration data
We consider four datasets consisting of chromosome aberration counts in human blood cells after in vitro exposure to ionizing radiation. These datasets have previously been studied by Oliveira et al. (2016), where detailed descriptions of the datasets can be found.
Table 5 summarizes the results obtained when the proposed test and an LR test are used to test for zero-inflation relative to a Poisson regression model with log link and a quadratic linear predictor for covariate ‘absorbed radiation dose’ [Gy]. The third column provides the 90% MQI, the upper bound of which coincides with the critical value for the zero-inflation test ($q_{0.95}$). We see that for all datasets except A3, the observed number of zeros exceeds $q_{0.95}$, hence clearly rejecting the Poisson model in favour of the zero-inflated Poisson model for A1, B1 and C1, but not for A3. These results are in full agreement with the corresponding one-sided LR test. Note that the lower limit of the 90% MQI is included in Table 5 for informational purposes, but is not required for the test.
Table 5. Analyses of chromosome aberration data. Data labels refer to notation in Oliveira et al. (2016)

| Data | $n_0$ | 90% MQI | p-value (proposed test) | LR statistic | p-value (LR test) |
|---|---|---|---|---|---|
| A1 | 14 430 | [14 213.9, 14 318.4] | | 16.37 | |
| A3 | 2 747 | [2 726.5, 2 814.3] | 0.368 | 0.98 | 0.322 |
| B1 | 7 280 | [6 716.6, 6 818.4] | | 85.31 | |
| C1 | 6 786 | [5 041.1, 5 152.8] | | 1 330.65 | |
Trajan data
The data are the number of roots produced by micropropagated shoots of the columnar apple cultivar ‘Trajan’. During the rooting period, all shoots were maintained under identical conditions, but the shoots themselves were cultured on media containing different concentrations of the cytokinin BAP, in growth cabinets with an 8- or 16-hour photoperiod. Full details of the experiment are to be found in Marin et al. (1993). A striking feature of the data is that although almost all of the 140 shoots produced under the 8-hour photoperiod rooted, only about half of the 130 shoots produced under the 16-hour photoperiod did. Overall, 64 shoots produced zero roots, of which only 2 were from the shorter photoperiod.
These data were analysed by Ridout and Demétrio (1992) and Ridout et al. (1998). The latter paper presents a table of the fits of various Poisson and negative binomial models, and their zero-inflated counterparts, and finds evidence of zero-inflation with respect to both models. The authors comment that there is little evidence of an effect due to BAP concentration, but the effect of photoperiod is significant.
The results when the proposed method is used with a Poisson model (where the mean is modelled by photoperiod) as the model of the null hypothesis are summarized in Table 6, noting again a very good agreement between the proposed and the LR test.
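Table 6. Poisson analyses of Trajan data.

| Data | Observed zeros | Proposed test: 90% MQI | Proposed test: p-value | LR test: statistic | LR test: p-value |
|------|----------------|------------------------|------------------------|--------------------|------------------|
| All | 64 | [0, 4.9] | | 314.2 | |
| Period = 16 | 62 | [0, 4.13] | | 316.8 | |
| Period = 8 | 2 | [0, 0.55] | 0.003 | 7.846 | 0.003 |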
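The non-integer interval limits in the table can be understood through mid-quantiles. The sketch below assumes that the MQI is an interval of mid-quantiles obtained by linearly interpolating the mid-distribution function, in the spirit of Ma et al. (2011); the exact definition used for the reported intervals may differ. The function names and the toy null distribution are hypothetical.

```python
def mid_quantile(pmf, q):
    """Mid-quantile of a discrete distribution on 0..n, with pmf[k] = P(N = k).

    The mid-distribution function F_mid(k) = P(N < k) + 0.5 * P(N = k) is
    evaluated at each support point and inverted by linear interpolation.
    """
    support = [k for k, mass in enumerate(pmf) if mass > 0.0]
    mids, cum = [], 0.0
    for k in support:
        mids.append(cum + 0.5 * pmf[k])  # F_mid at support point k
        cum += pmf[k]
    if q <= mids[0]:
        return float(support[0])         # clamp to smallest support point
    if q >= mids[-1]:
        return float(support[-1])        # clamp to largest support point
    for j in range(1, len(support)):
        if q <= mids[j]:
            weight = (q - mids[j - 1]) / (mids[j] - mids[j - 1])
            return support[j - 1] + weight * (support[j] - support[j - 1])

def mid_quantile_interval(pmf, level=0.90):
    """Central interval of mid-quantiles with the given coverage level."""
    tail = (1.0 - level) / 2.0
    return mid_quantile(pmf, tail), mid_quantile(pmf, 1.0 - tail)

# Toy null distribution for the number of zeros: Binomial(2, 0.5)
toy_pmf = [0.25, 0.5, 0.25]
lower, upper = mid_quantile_interval(toy_pmf)
```

Under this definition, an observed number of zeros above the upper limit of the 90% interval corresponds to an upper-tail mid-p-value below 0.05, and a lower limit of 0 simply means that the smallest support point already carries enough probability mass.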
Conclusion
We have developed a novel test for zero inflation or zero deflation in count data models with or without covariates. It tackles the problem more directly than existing asymptotic tests by assessing whether or not the observed number of zeros is plausible under the hypothesized count distribution. The plausibility is assessed with reference to appropriate quantiles of a Poisson-binomial distribution. Essential to this procedure is the estimation of the parameters of the count data model. The question of how to estimate the mean parameter robustly has been given detailed attention in the case of the Poisson hypothesis. A ‘hybrid’ rule, which mixes the whole-sample mean with a zero-truncated mean estimator, has been developed and yields excellent level attainment and power properties for the resulting zero-modification test. This hybrid estimator was developed specifically for the purpose of the proposed test, but may be of more general use beyond the application presented here. The extension of the test to other base distributions is straightforward; however, the need for, and the form of, robust parameter estimation techniques such as the hybrid estimator for base distributions other than the Poisson requires further investigation.
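To make the zero-truncated ingredient of the hybrid rule concrete, the following minimal sketch computes the standard maximum likelihood estimate of a Poisson mean from the positive counts only, by solving λ / (1 − exp(−λ)) = mean of the positive counts. The function name, the bisection solver and the example counts are assumptions for illustration; the rule that mixes this estimate with the whole-sample mean is not reproduced here.

```python
import math

def zero_truncated_poisson_mean(counts):
    """Estimate the Poisson mean using only the positive counts.

    Solves lam / (1 - exp(-lam)) = mean(positive counts) by bisection,
    so the result does not depend on how many zeros were observed.
    """
    positives = [y for y in counts if y > 0]
    if not positives:
        raise ValueError("at least one positive count is required")
    m = sum(positives) / len(positives)
    if m <= 1.0:
        return 0.0  # all positives equal 1: the estimate sits on the boundary
    lo, hi = 1e-12, m  # the root lies in (0, m) because lam / (1 - exp(-lam)) > lam
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid / -math.expm1(-mid) < m:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical sample: the two zeros have no influence on the estimate
example_counts = [0, 0, 1, 2, 3, 4]
lam_hat = zero_truncated_poisson_mean(example_counts)
```

Because zeros are ignored, such an estimate is unaffected by zero inflation or deflation, which is the property that makes it attractive for estimating the mean under the null hypothesis.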
Acknowledgments
We would like to thank two anonymous referees for their insightful comments and suggestions which have led to a major improvement in the presentation and clarity of this manuscript.
References
1. Chen SX, Liu JS (1997) Statistical applications of the Poisson-binomial and conditional Bernoulli distributions. Statistica Sinica, 7, 875–92.
2. Cox DR (1962) Further results on tests of separate families of hypotheses. Journal of the Royal Statistical Society Series B, 24, 406–23.
3. Daskalakis C, Diakonikolas I, Servedio RA (2012) Learning Poisson binomial distributions. Proceedings of the 44th Symposium on Theory of Computing, Karloff H, ed., 709–28, New York.
4. Dempster AP, Schatzoff M (1965) Expected significance level as a sensitivity index for test statistics. Journal of the American Statistical Association, 60, 420–36.
5. Dietz E, Böhning D (2000) On estimation of the Poisson parameter in zero-modified Poisson models. Computational Statistics & Data Analysis, 34, 547–48.
6. Franck W (1986) p-values for discrete test statistics. Biometrical Journal, 28, 403–06.
7. Heimers A, Brede HJ, Giesen U, Hoffmann W (2006) Chromosome aberration analysis and the influence of mitotic delay after simulated partial-body exposure with high doses of sparsely and densely ionising radiation. Radiation and Environmental Biophysics, 45, 45–54.
8. Hong Y (2013a) On computing the distribution function for the Poisson binomial distribution. Computational Statistics & Data Analysis, 59, 41–51.
9. Hong Y (2013b) poibin: The Poisson Binomial Distribution. URL http://CRAN.Rproject.org/package=poibin
10. Irwin JO (1959) On the estimation of the mean of a Poisson distribution from a sample with the zero class missing. Biometrics, 15, 324–26.
11. Lancaster HO (1961) Significance tests in discrete distributions. Journal of the American Statistical Association, 60, 233–34.
12. Loeys T, Moerkerke B, De Smet O, Buysse A (2012) Expert tutorial: The analysis of zero-inflated count data: Beyond zero-inflated Poisson regression. British Journal of Mathematical and Statistical Psychology, 65, 163–80.
13. Ma Y, Genton M, Parzen E (2011) Asymptotic properties of sample quantiles of discrete distributions. Annals of the Institute of Statistical Mathematics, 63, 227–43.
14. Marin JA, Jones OP, Hadlow WC (1993) Micropropagation of columnar apple trees. Journal of Horticultural Science, 68, 289–97.
15. Mendonca L (1995) Longitudinalstudie zu kariespräventiven Methoden, durchgeführt bei 7- bis 10-jährigen urbanen Kindern in Belo Horizonte (Brasilien) [Longitudinal study on methods for the prevention of caries, carried out with urban children of 7-10 years of age in Belo Horizonte (Brazil)]. Inaugural-Dissertation zur Erlangung der zahnmedizinischen Doktorwürde am Fachbereich Zahn-, Mund- und Kieferheilkunde der Freien Universität Berlin [Dissertation presented for the degree of Dr med. dent. Free University of Berlin].
16. Molenberghs G, Verbeke G (2007) Likelihood ratio, score, and Wald tests in a constrained parameter space. The American Statistician, 61, 22–7.
17. Oliveira M, Einbeck J, Higueras M, Ainsbury E, Puig P, Rothkamm K (2016) Zero-inflated regression models for radiation-induced chromosome aberration data: A comparative study. Biometrical Journal, 58, 259–79.
18. Plackett RL (1953) The truncated Poisson distribution. Biometrics, 9, 485–88.
19. Poisson SD (1837) Recherches sur la probabilite des jugements en matiere criminelle et en matiere civile [Researches into the Probabilities of Judgements in Criminal and Civil Cases]. Paris: Bachelier.
20. Ridout MS, Demétrio CB (1992) Generalized linear models for positive count data. Revista de Matemática e Estatística, 10, 139–48.
21. Ridout MS, Demétrio CB, Hinde J (1998) Models for count data with many zeros. Proceedings of the XIXth International Biometric Conference, 19, 179–92.
22. Stone M (1969) The role of significance testing: Some data with a message. Biometrika, 56, 485–93.
23. von Bortkiewicz L (1898) Das Gesetz der kleinen Zahlen. Leipzig: BG Teubner.
24. Vuong Q (1989) Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307–33.
25. Wilson P (2015) The misuse of the Vuong Test for non-nested models to test for zero-inflation. Economics Letters, 127, 51–3.
26. Wilson P, Einbeck J (2017) Sample quantiles corresponding to mid p-values for zero-modification tests. Proceedings of the 32nd International Workshop on Statistical Modelling, Groningen, Grzegorczyk M, Ceoldo G, eds, 275–79.
27. Yee TW (2010) The VGAM package for categorical data analysis. Journal of Statistical Software, 32, 1–34.