Abstract
When count data exhibit excess zero, that is more zero counts than a simpler parametric distribution can model, the zero-inflated Poisson (ZIP) or zero-inflated negative binomial (ZINB) models are often used. Variable selection for these models is even more challenging than for other regression situations because the availability of p covariates implies 4 p possible models. We adapt to zero-inflated models an approach for variable selection that avoids the screening of all possible models. This approach is based on a stochastic search through the space of all possible models, which generates a chain of interesting models. As an additional novelty, we propose three ways of extracting information from this rich chain and we compare them in two simulation studies, where we also contrast our approach with regularization (penalized) techniques available in the literature. The analysis of a typical dataset that has motivated our research is also presented, before concluding with some recommendations.
Introduction
Excess zeros in count data is a situation where we observe more zeros than expected under the assumption that the response variable follows a count distribution such as the Poisson, negative binomial or even Poisson log-normal. Count data with excess zeros are often encountered in health or biomedical sciences, but also in other fields. A non-exhaustive list of typical examples include modelling of dental caries (Mwalili et al. 2008; Preisser et al. 2012), health care utilization (Deb and Holmes 2000), sexual risk behaviour (Yu et al. 2013), highway safety (Lord et al. 2007) and abundance in marine species conservation (Wenger and Freeman 2008). In Section 4, we analyze a popular dataset from Riphahn et al. (2003) on the demand of health care, measured as the number of visits to the doctor in the last three months before the survey, which represents the type of data that have motivated our work. The proportion of zero responses is
For independent data, the zero-inflated regression model (Lambert 1992) and zero-altered or hurdle model (Mullahy 1986) are both commonly used. The zero-inflated (ZI) model is a mixture model which add a point mass distribution on top of a count distribution, whereas the hurdle model describes the zero counts and the positive counts separately. In both models, covariates are used parametrically to describe the two components of the model. Nonparametric extensions to the hurdle model have been proposed by Barry and Welsh (2002), whereas parametric versions with random effects are a current research topic (e.g., Cantoni et al. 2017). For a review of these (and related) models and their properties, we refer to Ridout et al. (1998) and Zuur et al. (2009).
Variable selection is important in applications because it allows to explain data in the simplest way for better interpretability (e.g., efficient identification of risk factors). It is also a mean for cost management, if the model is next used for prediction. From the statistical point of view, parsimonious models have to be preferred for better prediction accuracy and better interpretation. In addition, collinearity-related problems are mitigated when there are fewer variables involved. We focus on the more challenging ZI model even though our methodology can be applied to the hurdle model as well: for
The ZI model being a regression model fitted by maximum likelihood, all of the classical techniques for variable selection based on the likelihood are a priori available. These include several kinds of information criteria, stepwise and subset selection. We refer to Burnham and Anderson (2002) as a general reference on model selection. The mentioned selection methods either imply exploring all possible models, which is unfeasible in presence of a large number of covariates, or require selecting a prior set of promising covariates (or models), which make them prone to fail in identifying the correct model which has generated the data. This is one of the reasons why stochastic random search procedures are used (e.g., George and McCulloch 1993; Qian and Field 2002; Cantoni et al. 2007; Jochmann 2013). They have the advantage to avoid an exhaustive search. Markov Chain Monte Carlo (MCMC) techniques are used to screen potential models and ensure efficiency in selecting the set of candidate models. Alternatives that do not require fitting of all possible models are regularization techniques which simultaneously fit the model and perform variable selection; for example, Buu et al. (2011), Wang et al. (2014a), Tang et al. (2014) and Wang et al. (2015) for ZIP and/or ZINB models.
We extend to ZI models the approach by Qian and Field (2002), which combines the use of an MCMC procedure with a criterion for model assessment. We will use an information criterion as in Qian and Field (2002), but also an estimated prediction error based on cross-validation (CV). Within this framework where the algorithm produces an MCMC chain, we pay particular attention to how we can take advantage of the rich output produced and suggest to look at three new concepts: (a) the consensus model, which selects the variables whose marginal frequencies of appearance are above a certain threshold, (b) the minimum criterion, which retains the model with minimum value of the criterion used for model assessment and (c) the model vote, which selects the model which has been most visited.
We contrast our proposal to recent regularization techniques. The simulation show that our proposal is effective in identifying the variables generating the data and either outperforms or is comparable to its competitors. The benefits of our technique are also evident in our real data example, where the number of possible models is
The manuscript has the following structure. In Section 2, we introduce the ZI model and our proposal for stochastic variable selection. Section 3 presents two simulations settings, where two versions of our approach are contrasted with existing alternatives. A real data analysis is presented in Section 4 and the article ends with a concluding section.
Methodology
The ZI model assumes that for a random variable
Let
We define
The parameter estimates
Stochastic search procedures consist of moving randomly through the model space, containing all possible models, and draw a subset of promising models. The advantage of such procedures is that they avoid selecting a prior set of candidate models based on potential misleading assumptions. As Qian and Field (2002) put it, the key idea is ‘to convert the model selection procedure into a problem of random sample generation from a finite population’. To address the main concern that the procedure moves well through the model space and guarantees to select the optimal model or a set of models very close to it, MCMC techniques such as the Gibbs sampler or the Metropolis—Hastings algorithm are generally implemented. The idea is to generate from the finite population of all possible models a ‘sample of candidate models which contain the optimal model
We will use the Metropolis–Hastings (MH) algorithm (Metropolis et al. 1953; Hastings 1970), which can be summarized with the three following steps: Choose an initial model At iteration Repeat the procedure for
In Step 2, a candidate
We recommend to choose the initial model
The set of neighbouring models
The operating transition kernel
The scaling parameter
The length of the Markov chain
The consistency properties of this stochastic search approach have been derived in Qian and Field (2002) in a general setting. They proved that, under some ergodic conditions, the probability that the ‘true’ model appears at least once in the chain tends to one with the length of the chain. With high probability, the ‘true’ model will appear more frequently and early in the chain, and therefore the model with the highest marginal frequency and the one with the smallest utility measure are consistent and converge to the true model almost surely.
The utility measure
The Mallow's
A second approach is to use an estimate of prediction error computed with CV, as in Cantoni et al. (2007). This alternative can also be used in situations where the likelihood is not available. Besides an estimate of the prediction error, one can also obtain an estimate of its standard error and use this information to select a subset of good models in the spirit of the one-standard-error rule (Friedman et al. 2001).
Following Shao (1993), we define a. For each split b. Compute the average prediction error c. Draw
We recommend to use
We can reparametrize
The choice of the utility measure can also be driven by the computational burden implied. With respect to the use of an information criterion, the CV approach involves, for each candidate model
How to summarize results
We describe three ways of summarizing the output of the MCMC chain obtained with our stochastic search. This rich output is a by-product of the procedure and is an important feature from which we can take advantage. The performance of these alternatives will be compared in our simulations.
Simulation study
We assess the performance of our proposed approach and contrast it to competing alternatives introduced in Supplementary Section 3.1. In Study 1, we consider a moderated-sized design (
Study 1
We consider the same set of potential covariates for both parts of a ZIP model, so that
For
The coefficients in model (3.1) are defined such that the
Our preliminary calibration has found that
We discuss in detail the results for
Summary of the marginal frequency of appearance computed over the 10 000 visited models of the Markov chain,
. The variables included in the true data-generating model are highlighted in bold
The marginal frequency of appearance of each variable gives a sense of whether our procedure visit the most interesting models. Table 1 summarizes the frequencies for
If we compare CV and AIC for the same type of data (either independent or correlated), we observe that CV provides larger median marginal frequencies of appearance for the true variables and smaller ones for the variables not included in the data-generating model, but that the distribution of the marginal frequencies is wider for CV than for AIC (as expressed by the first and third quartile).
Focusing on the correlated setting, we observe that the correlation between the pairs of variables
Proportion of simulated samples (over the last 5 000 visited models of each MCMC chain) including the variable in the consensus model. In dark grey are the variables included in the true data-generating model
Proportion of simulated samples (over the last 5 000 visited models of each MCMC chain) including the variable in the consensus model. In dark grey are the variables included in the true data-generating model
We compare three different inclusion levels for the consensus model, namely,
Regularizations techniques
We also did the same plots for the penalized approach. We consider three versions of the Buu et al. (2011) approach (BJLT) for each of the LASSO and SCAD penalty, see options 1), 2) and 3) in Supplementary Section 3.1. For the approach of Wang et al.(2014a) and (Wang et al. (2015) (WMW), we consider the three penalties: LASSO, SCAD and MCP. The results are given in Figure 2 (independent data), and in Supplementary Figure 2 (correlated data). For the independent data setting, focusing on BJLT, the different behaviour of the technique depending on the choice of the penalty (either LASSO or SCAD) is apparent at first sight. The LASSO tends to include lots of superfluous variables, in particular with versions 2) and 3), where the tuning parameters depend on the standard error of the coefficients. At the contrary, SCAD is very selective: it captures the stronger signals, but often misses the weakest. There is less difference between the three versions of SCAD than there is between the three versions of LASSO. We also notice that LASSO, and less so SCAD, seem affected by the nature of the covariates: see, for example, the behaviour for
Proportion of simulated samples including the variable in the final model. In dark grey are the variables included in the true data-generating model
Proportion of simulated samples including the variable in the final model. In dark grey are the variables included in the true data-generating model
Comparing Figure 2 with Supplementary Figure 2 we see that, similarly as for the consensus model, the performance of the penalized approaches in recognizing the signal associated with
In Table 2 are displayed the proportion (across the entire MCMC chain) of inclusion of each variable in the model with minimum criterion. We notice that the minimum criterion technique (either with CV or AIC) seems globally less effective than the
Proportion of simulated samples for which each variable is included in the optimal model that minimizes the criterion across the entire MCMC chain (
)
Proportion of simulated samples for which each variable is included in the optimal model that minimizes the criterion across the entire MCMC chain (
)
Proportion of simulated samples for which each variable is included in the optimal model with largest number of votes over the last 5 000 visited models of each MCMC chain (
)
The proportions of inclusion of each variable in the final model obtained by the model votes technique are summarized in Table 3. These proportions are computed after burn-in removal. We notice that the proportion of inclusion of the true and superfluous variables for the CV version are very close to the inclusion probabilities of the
Number of times the true model has been correctly identified (Exact), the number of times one or more data-generating variables are missing from the selected model, but with possibly many extra spurious variables (Underfit) and the number of times a largest model containing the true one (Overfit) has been selected (
)
Number of times the true model has been correctly identified (Exact), the number of times one or more data-generating variables are missing from the selected model, but with possibly many extra spurious variables (Underfit) and the number of times a largest model containing the true one (Overfit) has been selected (
)
If we look at Supplementary Table 2 which summarizes the median number of votes for the best model across the simulated samples, we notice that the most visited model is generally more popular (i.e., gets relatively more votes) in the CV setting than in the AIC setting. On the other hand, the median number of model visited within the last 5 000 models of the chain is much larger when AIC is used than when CV is used. These observations indicate that the AIC method explore more models than the CV method.
Finally, Table 4 (independent dataset) and Supplementary Table 3 (correlated dataset) give a picture of the correct model identification of each technique considered here. The table gives the number of times the true model has been correctly identified (Exact), the number of times one or more data-generating variables are missing from the selected model, but with possibly many extra spurious variables (Underfit) and the number of times a largest model containing the true one (Overfit) has been selected. As a general comment, we can say that, given the mixture of weak and strong signals in the data-generating model, it is very difficult (for each technique) to identify exactly the true model. Also, due to the presence of weaker signals, the general tendency is to miss some variables rather than to overfit. A notable exception are the BJLT-LASSO2 and BJLT-LASSO3 techniques, in particular for the independent data. This is in line with what we have observed in Figure 2.
For the independent data, the consensus model, either based on CV or AIC, very often selects a model where one or two data-generating variables are missing. The minimum criterion misses the signal more heavily than the consensus model, even more so for AIC, whereas the performance of the approach of model votes is comparable to that of the consensus model. As already mentioned, BJLT–LASSO2 and BJLT–LASSO3 either largely overfit (three extra variables) or miss one variable, whereas BJLT–LASSO1 heavily discard variables that should appear in the selected model (two or more variables missing). The three versions of BJLT–SCAD tend to behave like BJLT–LASSO1 (with some exceptions), which means that they miss two or more true variables, making it less appealing than our consensus model approach, for example. WMW shows a similar behaviour to BJLT–SCAD, irrespective of the penalty used.
For the correlated dataset, the conclusions are similar, but with a stronger tendency to underfit. This is particularly so for BJLT–LASSO2 and BJLT–LASSO3.
We remark that better performance in identifying the signals, in particular the weak ones, is accompanied by the inclusion of more superfluous variables in the final model. This is quite sensible.
Our simulation exercises show that for the consensus model, both versions which make use of either CV or AIC are good in identifying strong signals. CV is more sensitive than AIC to the inclusion level when it comes to the identification of lower signal variables. As a consequence, CV tends to include more superfluous variable than AIC. Globally speaking, CV and AIC exhibit comparable performances regarding the proportion of inclusion of the true data-generating variables. Both versions are affected by the correlation between covariates, but mainly in the binary part of the model. These same comments apply to the model votes technique, which shows a very similar behaviour to the 50% consensus model. The minimum criterion approach is globally less effective than the 50% consensus model, in particular concerning the weak effects. CV seems to behaves slightly better than AIC, though, in identifying the variables with weak effects. Lower performance of the minimum criterion is observed for the correlated dataset, but only for the correlated variables themselves.
Between the penalized techniques, the three versions of WMW clearly outperform BJLT–LASSO. BLJT–SCAD1 and WMW–SCAD are comparable in terms of signal identification (e.g., Figure 2), but WMW–SCAD tend to miss fewer variables (e.g., Table 4).
Study 2
In this study, we consider one of the scenarios in Wang et al. (2015), namely their example 1. A ZINB model is considered with
Based on the conclusions of Study 1, we compare here our consensus model based on AIC (with inclusion levels of
For consistency with the original reference, the results are summarized by looking at the median of the MSE ratio between the identified model and the ‘full’ ZINB model, the sensitivity (proportion of correctly identified number of non-zero coefficients) and the specificity (proportion of identified number of zero coefficients). In Table 5, the median MSE values show that the accuracy of our AIC approach is better in the binary part of the model, but that WMW performs better in the count part. Increasing the inclusion level of the consensus model seems to have the effect of decreasing MSE ratio and sensitivity, but increasing specificity (note, however, that the standard deviation is larger than the observed differences, in particular for the sensitivity). Our approach has better sensitivity for both the binary and the count part of the model, whereas the performance in terms of specificity is better for the WMW approach. The MCP penalty provides the smaller MSE ratios between the WMW versions and also gives good sensitivity. The LASSO penalty exhibits good performance in terms of specificity. Identifying zero and non-zero coefficients is a competing goal, and for all approaches a trade-off takes place.
Median of MSE ratio between the model identified by each technique and the full ZINB model. Mean and standard deviation (between parentheses) of sensitivity and specificity
Median of MSE ratio between the model identified by each technique and the full ZINB model. Mean and standard deviation (between parentheses) of sensitivity and specificity
This dataset contains a sub-sample of 1 812 German male individuals in 1994 used in Riphahn et al.(2003) to study the demand for health care. The dataset is publicly available in the R package
Variable selection results for the health care demand dataset. The couple
indicates whether a variable has been retained (x) or discarded (–) in the binary part and the count part of the model, respectively
Variable selection results for the health care demand dataset. The couple
indicates whether a variable has been retained (x) or discarded (–) in the binary part and the count part of the model, respectively
We apply to this dataset our consensus model approach, the minimum criterion technique, the model votes technique (all of them with both CV (
Few variables have been selected by all the techniques: health and children in the binary part, and for health in the count part. Some variables have been (almost) consistently discarded. It is the case for age40TRUE, age55TRUE, age60TRUE and their interaction with health, schooling, civil, bluec, employed and addon in the binary part of the model, and for almost all the age factors, schooling and children in the count part of the model. For some other variables, there is less agreement between the methods. If we focus only on our proposal, we remark that CV retains more variables than AIC.
We propose a technique for variable selection when the number of covariates is large and the screening of all possible models is unfeasible. The technique is simulation based and comes with several options for using the rich computed information. From our two simulation studies, we have learned that our technique is very effective in identifying the correct variables that have generated the data.
Based on the results of Study 1, we recommend the use of either the
In the second simulation study, where we compare more closely our consensus model to the WMW approach in a larger setting, we observe that specificity is improved at the price of sensitivity. The consensus approach has better sensitivity, whereas the WMW approach has better specificity. Whether one should prefer a method that exhibits better specificity or better sensitivity depends on the final goal of the analysis. For instance, sensitivity is more important in studies where one does not want to miss potential effects. On the other hand, specificity is more important to avoid suggesting nonexistent relationships.
This second simulation study confirms the good performance of the
Footnotes
Acknowledgments
We would like to thank the Editor, the Associate Editor and the anonymous referees for their encouragements and their careful reading of our manuscript. Their comments have led to an improved version of our article.
