Abstract
Missingness in explanatory variables requires a model for the covariates even if the interest lies only in a model for the outcomes given the covariates. An incorrect specification of the models for the covariates or for the missingness mechanism may lead to biased inferences for the parameters of interest. Previously published articles either use semi-/non-parametric flexible distributions for the covariates and identify the model via a missing at random assumption, or employ parametric distributions for the covariates and allow a more general non-random missingness mechanism. We consider the analysis of binary responses, combining a missing not at random mechanism with a non-parametric model based on a Dirichlet process mixture for the continuous covariates. We illustrate the proposal with simulations and the analysis of a dataset.
Keywords
Introduction
In many studies, data are missing for some explanatory variables (
When all variables in
In cases where at least one explanatory variable is continuous, we may not have a priori any information for a plausible parametric model. Hence, some authors adopt either semi-parametric or non-parametric models for the marginal distribution of
Incorrect assumptions, either for the missingness mechanism or for the distribution of the covariates, may generate biased inferences for the conditional distribution of the responses given the covariates. We consider a flexible distribution for
In Section 2, we describe the dataset that will be used to illustrate the methods considered in the remainder of the manuscript. In Section 3, we present an overview of the non-parametric Bayesian approach with Dirichlet process for complete data analysis. In Section 4, we extend the model to additionally accommodate the missing data generating mechanism. We perform a simulation study to evaluate the proposed method in Section 5 and analyze our working example in Section 6.
The pulmonary embolism data
Wicki et al. (2001) analyzed data from 1 090 patients that were consecutively admitted to the emergency ward of the University Hospital of Geneva for suspected pulmonary embolism, i.e., blockage of the main artery of the lung or one of its branches. We are interested in evaluating how the probability of occurrence of cardiovascular disease relates to diagnostic tests and other easily obtained information. For simplicity, we consider here only some of the explanatory variables included in the final model presented by these authors.
The indicator of the presence of pulmonary embolism (response variable) as well as four explanatory variables (age, previous pulmonary embolism or deep vein thrombosis, recent surgery, and pulse rate) were observed for all patients, while two variables that indicate presence of certain characteristics (platelike atelectasis and elevation of hemidiaphragm) had missing values for a single patient. On the other hand, the partial pressure of carbon dioxide (PaCO2), obtained from arterial blood gas analysis, was missing for 103 (9%) patients.
In Figure 1, we exhibit histograms of PaCO2 and Gaussian kernel method density estimates based on observed data and on sampled values from posterior predictive distributions obtained from the fit of the non-parametric model of Section 3 and from two parametric models in the normal, log-normal and gamma distribution families that had the best fit. The observed data seem to be better accommodated by the posterior predictive distribution of the non-parametric model than by the corresponding densities of the parametric models. The result is corroborated by Kolmogorov–Smirnov tests for the comparison of the empirical distribution of the observed data with the posterior predictive distributions for a new observation from the non-parametric, normal, log-normal and gamma models (p-values 0.207, 0.002, < 0.001 and 0.004, respectively).

Let Xi, i = 1, …, n, be a random sample of size n from a distribution function F. In the parametric approach, we assume a known form for F, indexed by a finite-dimensional parameter specified a priori, but generally unknown. To allow greater flexibility in modelling and robustness against misspecification of F, we consider non-parametric models; paradoxically this does not mean that the corresponding models are completely free from parameters, but rather indicates that the number and nature of parameters are variable and determined somehow by the data, potentially reaching infinity. Reviews of some non-parametric methods in the Bayesian and classical paradigms are presented, respectively, by Müller and Quintana (2004) and Scott (1992).
One way to avoid the specification of the form of F is to employ random probability measures (RPM), which are probability distributions over the space of probability measures. Ferguson (1973) introduced the Dirichlet process (DP) as an RPM. Admitting that F follows a DP, symbolically, F + DP(a, F0), means that for any measurable partition A1, …, AM of the sample space, the probability vector [F(A1), …, F(AM)] follows a Dirichlet distribution with parameter vector [aF0(A1), …, aF0(AM)], where a is a precision parameter and F0 is a reference distribution measured on the sample space. Under this parametrization, F0 is the prior expectation of the distribution F and as a increases there is a greater concentration of F around F0, up to the extreme case where a → 3 indicates that F is assumed to be equal to F0; on the other hand, small values of a (e.g., < 5) allow, in general, F to deviate considerably from F0 (Congdon, 2006, p. 201). Given n independent and identically distributed observations, the posterior distribution is F|(τ1, …, xn) + DP(a + n, F1), where F1 = (aF0 + nFn)/(a + n) and Fn is the empirical distribution function of the observations.
The simplicity of the properties of the DP and the ease with which the posterior distribution is obtained highlight why this model is so attractive. However, the DP generates a discrete distribution almost surely, which may not be appropriate for many applications. A simple way to generate an RPM compatible with absolutely continuous distributions is to assume that Xi follows an absolutely continuous distribution given the value of a specific parameter θi and that, in turn, θi, i = 1, …, n, are a random sample of a DP, i.e.,
Assuming that the parameters {θi} follow a prior distribution of the DP type centred on G0, instead of the common approach of assuming that these parameters directly follow a parametric distribution G0 (Walker et al., 1999) adds the desired flexibility to the model. The term Dirichlet process mixture (DPM) stems from the hierarchical formulation (3.1)-(3.2) which implies that the marginal distribution for Xi is a mixture, i.e.,
This model shall not be confounded with the mixture of Dirichlet processes (MDP), suggested by Antoniak (1974), a parametric mixture induced by distributions imposed on the parameters of the DP, regarded as hyper-parameters. As with the DP, the MDP generates a discrete distribution almost surely, whereas the DPM, considering or not prior distributions for the parameters of the DP, produces absolutely continuous distributions.
The constructive definition of the DP, presented by Sethuraman (1994), shows that G|(a, G0) + DP(a, G0) can be represented by:
for any measurable subset A of the space of values of {ij}, where,
(Walker et al., 1999). This very useful result allows the design of efficient algorithms to fit DPM models by rewriting (3.3) as:
The construction of the random weights {pj} for the mixture in (3.5) is the so-called stick-breaking procedure. Note that
In practice, however, for simplicity, it is common to truncate the mixture (3.7) to M components (see, e.g., Ishwaran and James, 2002), which is equivalent to approximating the DP(a, G0) by a truncated Dirichlet process (TDP), denoted by TDP(a, G0, M). In this case, to obtain the weights p1, …, pM, we generate the variables Vj + Beta(1, a), j = 1, …, M – 1, and set VM = 1.
The choice of M is the key issue in the approach of obtaining TDP prior distributions. First, we could appeal to the limit M = n, because at most one would have each sample unit xi associated to a different ij in (3.7). Second, because the exact value of the continuous variable is often rounded by the measurement instrument or by the observation process, the number of distinct values contained in the sample is usually much smaller than n, and then, as in the previous case, it makes no sense to assume that M exceeds this number. In the study of pulmonary embolism, for example, from the 987 observed values of PaCO2, there are only 243 distinct values. Finally, even if all values are distinct, Antoniak (1974) shows that (i) the DP naturally provides clusterings of {θi}, assuming that nearby observations are associated to the same value of ij, (ii) a prior distribution for a induces a prior distribution for the number of distinct values of {θi}, denoted by M*, and (iii) for large n,
In Table 1, we display some values for E(M*|a), varying a and n in (3.8). The dependence of M* and a can be understood from (3.5) because as a decreases, the distributions {Vj} concentrate on values away from zero and therefore tend to have a smaller number of weights that are not as close to zero.
Average number of distinct groups, E(M*|a), obtained from (3.8), varying a and n.
West (1992) and Escobar and West (1995) present analyses setting a = 1, but both the latter along with Escobar and West (1998) and Ishwaran and James (2002), among others, suggest the use of a prior distribution given by:
where Gamma(λ1, λ2) denotes a gamma distribution with shape parameter λ1 and scale parameter λ2, such that the average is λ1/λ2. Following the suggestion of Ishwaran and James (2002), we consider λ1 = λ2 = 2, which concentrates around 98% of the a values between 0 and 3, allowing G to deviate considerably from G0.
The posterior distribution of M* can be used to evaluate the truncation of the DP, i.e., whether it was based on a too small value for M. Thus, if the 97.5% quantile of the posterior distribution of M* is very close to M, it is reasonable to increase the value of M; on the other hand, if the 97.5% quantile of M* is far from M, we can decrease the value of M without harming the approximation of the DP by the TDP and yet obtaining results faster, since larger values of M are associated with greater computational effort. The posterior distribution of a along with (3.8) may help in the selection of the new value. In the analysis of PaCO2 in the pulmonary embolism study, for example, the 97.5% quantile of the posterior distribution of M* was 11, a value reasonably smaller than the adopted M = 20; as the 97.5% quantile of the posterior distribution of a was a little smaller than 2, using the results in Table 1, we can decrease M to a value close to 13. The posterior distribution of a does not deviate much from the prior distribution, even with these substantial sample sizes, and this might be a consequence of some potential identifiability problems related to a as discussed by Leonard (1996).
One of the simplest DPM models is the Poisson-gamma, described by Escobar and West (1998) and Congdon (2006, p. 205), wherein
Ishwaran and James (2001) show that this Gibbs sampler has the following characteristics.
The elements of (i1, …, iM) are conditionally independent given the other variables, and the full conditional distribution of ij is proportional to
The elements of (V1, …, VM–1) are conditionally independent given the other variables, and the full conditional distribution of Vj is Beta(1 + aj, a + βj), where aj is the number of {si} equal to j and βj is the number of {si} greater than, j, j = 1, …, M – 1.
The elements of (s1, …, sn) are conditionally independent given the other variables, and each si is updated from the distribution P(si = j|xi) \ pjf(xi|ij), j = 1, …, M.
To avoid a higher degree of overparametrization of the model described in the next section, we consider the simplest version of the univariate normal model proposed by West (1992):
where N(μi, V) denotes the normal distribution with mean μi and variance V, and
West (1992) sets a value for μ0 and adopts the prior distributions:
where S0/s0 is the prior guess for V and s0 measures the prior belief in this guess. He did not mention how to choose w and W, but used w = 2 and W = 10 in his analysis, considering this an approximately diffuse distribution as a prior. Furthermore, he warns that, in general, there is not much information about τ in the sample, but that a prior for τ is necessary, paralleling the subjective choice of the smoothing parameter in kernel methods. Finally, he states that large values of τ induce a greater number of modes for the posterior predictive distribution of X. Escobar and West (1995) consider the heteroskedastic case, i.e., wherein each Xi in (3.10) may have a different variance, Vi, and adopt the prior
with A–1 → 0. In their example, they first evaluate the distribution of the number of modes induced by different values of τ and, then, use the hyper-parameters w = 1 and W = 100, considered compatible with their beliefs.
Ishwaran and James (2002) consider different alternatives. In particular, instead of (3.12), they assume
where the variance of the reference distribution for μi is a priori independent from the variance of Xi. Instead of using (3.14) as the prior distribution, they suggest to set the value of τ in such a way that (3.16) covers the values that {μi} may assume. They mention that a good choice is to take
where Unif[0, T] denotes a continuous uniform distribution in the range [0, T], taking T equal to the variance of the data. They conclude that the gamma prior distribution can be too informative in certain circumstances, even when employing hyper-parameters related to non-informative distributions, and that this may smooth the data improperly. As a consequence, they suggest that the uniform prior distribution may be a more interesting alternative.
Due to the difficulty in choosing the hyper-parameters of (3.14) when using the reference distribution (3.12) and because it is possible that (3.13) may be more informative than assumed for the reference distribution (3.16), we follow the suggestions of Ishwaran and James (2002) adopting the uniform prior distribution for V. We do not consider here the heteroskedastic versions to avoid a higher degree of overparametrization, not only because the sample does not contain much information about a, but also because the more general missingness mechanisms considered in the next sections already suffer from identifiability problems.
Let Yi denote a binary response always observed, Xi, a continuous covariate with potentially missing values, and Ri, an indicator variable assuming the value of 1 if Xi is observed or 0, if Xi is missing, i = 1, …, n. Although interest lies only in the conditional distribution of Yi given Xi, it is necessary to consider a model for Xi, as we do not want to discard the portion of the sample wherein Xi is missing. As we admit that the missing data generating mechanism may depend on the unobserved values, we also need to model Ri.
Employing the so-called selection model factorization (Little and Rubin, 2002), we consider the model:
where Bern(θi) denotes the Bernoulli distribution with success probability θi, along with the prior distributions
all mutually independent.
Values for the hyper-parameters of these prior distributions are indicated in the applications. Note that model (4.1)–(4.3) does not lead to the likelihood of the observed data, but only to the likelihood of the complete data, which is precisely what is required by the computational packages of the BUGS project. Thus, in one of the stages of the MCMC sampling scheme, the algorithm randomly draws a value for the missing data from its conditional distribution given the other variables (observed and unobserved). Regarding the DP, in each iteration of the MCMC, the algorithm draws (1) μ0, (2) nj, j = 1, …, M, i.e., the M distinct values of {μi}, (3) a, (4) V, (5) Vj, j = 1, …, M – 1 (to obtain p1, …, pM), and (6) chooses which of the nj, j = 1, …, M, will be allocated to each μi, i = 1, …, n.
The model is considered semi-parametric because it employs the non-parametric approach of the previous section for the marginal distribution of Xi and conventional parametric models for the conditional distributions of Yi given Xi and Ri given Yi and Xi.
The missingness mechanism (4.1) is non-random because it considers that the probability of having missing covariates may depend on their unobserved values. On the other hand, if we include the missing at random assumption
the missingness mechanism becomes ignorable under the viewpoint of Bayesian inferences for β0 and β1 due to the assumed prior independence between (δ0, δ2) and the other parameters (Little and Rubin, 2002). A subclass of the MAR model is the missing completely at random (MCAR) mechanism that can be formulated by setting
In this setup with missingness in explanatory variables, it is important to note that the so-called complete case analysis (CCA), where units with missing data are discarded, commonly generates unbiased inferences for β0 and β1 not only under the MCAR mechanism but also under any other missingness mechanisms that do not depend on the response Yi such as in the reduced version of the missing not at random mechanism,
A CCA of data generated under the non-random missingness mechanism (4.13) results in biased inferences for the marginal distribution of Xi, but not for the conditional distribution of Yi given Xi. Also, the CCA does not require the specification of a marginal model for Xi if the interest lies only in the conditional distribution of Yi given Xi.
We consider the following distributions for the explanatory variable
where Log-normal(μ, σ2) denotes a log-normal distribution, and μ and σ are, respectively, the mean and the standard deviation of the underlying variable on the logarithmic scale. The mean and the standard deviation of XL and XC coincide with the corresponding parameters of XN, although the densities are very different, as illustrated in Figure 2.

In order to assess the impact of results obtained under different distributional assumptions for the covariate, we generated 1000 replicates of X from each of the three distributions (5.1), (5.2) and (5.3) with sizes n = 50, 100, 200, 400 and 1000; then, for each value generated under each of the distributions of the covariates, we generated Y from (4.2) with β0 = 6 and β1 = –0.5; finally, we generated R from (4.1) with δ0 = –3, δ1 = 0.5 and δ2 = δ3 = 0. For each of the generated datasets (with XN, XL e XC), we fitted the semi-parametric model of the previous section as well as normal and log-normal parametric models. For normal and log-normal parametric models, the non-parametric model (4.3) is replaced, respectively, by
and, for both, the prior distributions (4.6)-(4.9) are replaced by (3.14), with hyper-parameters associated with vague distributions, namely, large A (106 and 103 for, respectively, normal and log-normal cases) and small w (2 for both cases). For the semi-parametric model, the hyper-parameters of (4.6)-(4.10) were chosen as described in Section 3. For all models, we adopted vague prior distributions for dj and βj employing the hyper-parameters
In Table 2, we display the Monte Carlo estimates for the coverage of the 95% equal-tailed credible intervals for β0 and β1 under CCA and analyses of all available data with parametric (normal and log-normal) and non-parametric models assumed for the covariate. An advantage of the MNAR model defined by constraint (4.13) is that we can compare the results to those obtained under the proposed models to that of the CCA, since in this setup it does not lead to biased inferences for β0 and β1. Consequently, the coverages for the CCA and for the correct parametric models were the closest to the desired level for all sample sizes. For small sample sizes, there were instances where incorrect parametric models exhibited coverages slightly closer to 95% than the ones of the non-parametric model, but in general, as sample sizes increased, the non-parametric model had results much better than the incorrect parametric models, even though for XC the coverage was still far away from the desired level with n = 1000.
Monte Carlo estimates for the coverage of the 95% equal-tailed credible intervals (in percentage) for β0 and β1 under CCA and analyses of all available data with parametric (normal and log-normal) and non-parametric models assumed for the covariate generated under an identifiable MNAR mechanism.
As the MNAR model under the constraint (4.13) is identifiable, we avoid dealing with problems caused by non-identifiability, at least initially. In order to explore this issue, we repeated the simulation study, i.e., we employed model (4.1) without considering constraint (4.13) by setting δ0 = –6, δ1 = 0.5, δ2 = 1 and δ3 = 0.5 to generate values for R. When fitting non-identifiable models for incomplete categorical data, Poleto et al. (2011) note that to assess convergence, chains much larger than the ones that would be required for identifiable models (such as a MAR model) are needed; they also observe that this scenario is further aggravated when larger samples or more vague prior distributions are considered. Therefore, we chose to use more concentrated prior distributions for dj, j = 1, 2, 3, i.e., with
In Table 3, we present the results with both prior hyper-parameters, but only for β0 as the results for β1 were pretty similar. We observe that, although the results depend on the prior distributions, the impact is actually smaller than expected. As in the present case the CCA leads to biased inferences for β0, it is not surprising that the corresponding coverages are the worst ones. The coverages of the correct parametric models and the non-parametric model were in general closest to the desired level, especially for n $ 200. By comparing Tables 2 and 3, we note that the incorrect parametric models had much worse results in the latter than in the former. This is probably a consequence of the fact that the average percentages of missing data were approximately 9% in the former and 19% in the latter.
Monte Carlo estimates for the coverage of the 95% equal-tailed credible intervals (in percentage) for β0 under CCA and analyses of all available data with parametric (normal and log-normal) and non-parametric models assumed for the covariate with hyper-parameters
Among several models fitted to data of Section 2, the most interesting for illustrative purposes is specified by
where Yi is the indicator of pulmonary embolism, the explanatory variables X1i, …, X7i are, respectively: (i) an indicator of recent surgery, (ii) an indicator of previous pulmonary embolism or deep vein thrombosis, (iii) an indicator of platelike atelectasis on chest x-ray film, (iv) an indicator of elevation of a hemidiaphragm on chest x-ray film, (v) age, in decades, (vi) arterial pulse rate, in hundreds of beats per minute (bpm) and (vii) partial pressure of carbon dioxide (PaCO2), in kPa, Ri is the indicator of observation of PaCO2 (X7i) and
where
Perrier (personal communication) stated that PaCO2 was missing for some patients because the arterial blood gas analysis was not performed, as patients were not very sick or were so sick that they needed the administration of oxygen. There are no records of which of the two reasons is responsible for the missing PaCO2 data. Perrier’s comments suggest that it is reasonable to assume that the probability of observing PaCO2 (θi): (i) is maximum for patients with probability of pulmonary embolism (πi) close to the prevalence of pulmonary embolism (r) and (ii) decreases as the probability of pulmonary embolism is farther from the prevalence. With this in mind, the piecewise regression model in (6.1) allows the θi to decrease with different speeds as πi → 0 or as πi → 1 and, thus, we expect that δ1 > 0 and δ2 < 0. By including parameters of the measurement process in the missing data generating mechanism, the model being used here is of a shared-parameter kind (Molenberghs and Kenward, 2007). The selection model framework of Section 4 can be applied with this change (and likewise can the pattern-mixture model) without any difficulty. The only patient for whom the two chest x-ray data were missing was not considered in the analyses.
As in the previous sections, we need to specify a model for X7i in addition to (6.1) and (6.2). We could adopt a conditional model for X7i given the other explanatory variables. However, preliminary analyses show that only the pulse rate and the occurrence of previous pulmonary embolism or deep vein thrombosis helped to explain the variability of PaCO2, although, very weakly, since the coefficient of determination for the linear model was only 1% and we did not observe any clear non-linear association in the corresponding scatter plots. Having this in mind, we adopted a marginal rather than a conditional model for X7i. We considered normal, log-normal and gamma parametric as well as the non-parametric models. We used prior distributions as described in the previous sections; the means and variances of the adopted normal distributions for βj, j = 0, …, 7 and dj, j = 0, 1, 2 are all equal to, respectively, 0 and 103. We show the posterior summaries for {βj} and {dj} for the non-parametric model and the ones for {βj} for the complete case analysis in Table 4. As opposed to the simulation study, results for the normal, log-normal and gamma parametric models were pretty similar to the corresponding ones of the non-parametric model. In all analyses, the magnitudes of the Monte Carlo errors were smaller than the precision of the figures presented in the table. Standard diagnostic methods were used to evaluate the convergence of the Markov chains generated for β0 and β1 (Heidelberger and Welch, 1983; Gelman and Rubin, 1992; Raftery and Lewis, 1992) and did not show evidence against their convergence.
Posterior means, standard deviations (SD) and 95% equal-tailed credible intervals (CI).
With the exception of the parameter associated with PaCO2, the posterior standard deviations of the other parameters are in general smaller in the analyses that include all the data than in the complete cases analysis. In Figure 3, we display the estimates of θi (probability of observing PaCO2) obtained from the posterior means of δ0, δ1 and δ2 of the non-parametric model, and estimates of πi (probability of pulmonary embolism), calculated for the observed data. The probability of observing PaCO2 is higher for patients with high rather than low probability of pulmonary embolism. By using the information that the probability of observing PaCO2 is smaller either for cases where the probability of pulmonary embolism is less or is greater than the prevalence, the model yields a weaker association between the presence of pulmonary embolism and the values of PaCO2 than the corresponding one obtained in models that do not take these assumptions into account. Note, for example, the difference between the posterior means of PaCO2 in Table 4. Analyses of all available data are more suitable than complete case analyses, because, by embedding assumptions about missing data, they should provide less biased results on the association between pulmonary embolism and PaCO2, and generate more precise results for the other associations.

We focused on the modelling of binary responses in the case where a single continuous explanatory variable has non-random missing values. We showed that the Bayesian approach with a non-parametric model based on a Dirichlet process mixture for the continuous covariate is a viable alternative to avoid possible biases in the inferences of interest introduced by the choice of an incorrect parametric distribution. In line with Poleto et al. (2011), Bayesian sensitivity analyses of the missingness mechanism via over-parameterized models and proper prior distributions, allows one to avoid too stringent untestable assumptions, while still leading to reasonable answers, i.e., interval estimates that, even though being wider due to the additional ignorance about the missingness mechanism, still contain the true values if the prior distributions embrace the correct missingness model.
Some additional extensions may be considered. First, two or more continuous variables may have missing data. This must be considered both in the model for the missingness mechanism and in the model for the explanatory variables. Both can be specified with multivariate distributions or with a product of univariate conditional distributions. For the missingness mechanism, some authors (e.g., Lipsitz and Ibrahim, 1996; Ibrahim et al., 1999) express a preference for the latter strategy, that is, use of a product of Bernoulli distributions instead of a multivariate Bernoulli distribution. For the explanatory variables, we also believe that it may be more practical to work with unidimensional Dirichlet process mixtures, as described by Ishwaran and James (2002), than with the multivariate version of the non-parametric model considered by Escobar and West (1995), where a multivariate normal distribution with the usual multivariate normal and inverted Wishart prior distributions are employed. Furthermore, the modelling setting of Escobar and West (1995) is an extension of the univariate case described by West (1992) that, as discussed in Section 3, presents some difficulties with regard to the choice of hyper-parameters. On the other hand, by employing unidimensional Dirichlet process mixtures, the normal and uniform priors of Ishwaran and James (2002) can be used as long as the priors for the Dirichlet process mixtures can be considered all mutually independent.
Second, there might be interest in considering other distributions for the responses as well as multivariate cases. The replacement of these distributions may be performed in the modelling setting with nearly the same effort required in complete case analyses.
Third, some of the response variables may be subject to missingness. These cases can be handled by a simultaneous modelling of the indicators of observation for these variables.
Finally, even though the Dirichlet process mixture is the non-parametric Bayesian approach most frequently employed in the literature, other non-/semi-parametric alternatives for the distribution of covariates can be considered, such as the Pólya tree prior distribution, a generalization of the Dirichlet process, resulting in a random probability measure compatible with continuous distributions. Paddock (2002) used this type of prior distribution on the analysis of responses with ignorable missingness. However, the use of this type of prior distribution generates predictive distributions with discontinuities, which may be inappropriate in some situations.
With or without the extensions described above, the biggest challenges for applying the models we deal with are likely the cases wherein assumptions for the missingness mechanism generate non-identifiable models and the sample size is too large and/or the prior distributions for all parameters are too vague. Under these conditions, the samples generated for the posterior distribution obtained by MCMC become extremely autocorrelated, thus, requiring very long chains to detect convergence and also to obtain Monte Carlo errors small enough to ensure the desired precision in the inferences of interest. In particular, these scenarios already highlighted in Poleto et al. (2011) become more severe with large sample sizes because the approximation of the Dirichlet process by its truncated version requires a greater number of components. Possibly other MCMC schemes as those described in Griffin and Holmes (2010) can attenuate this problem.
Footnotes
Acknowledgements
We gratefully acknowledge the financial supports to this research: Frederico Z. Poleto and Julio M Singer, from Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil, Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), Brazil, and Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil (project 308613/2012-2); Carlos Daniel Paulino, from Fundação para a Ciência e Tecnologia (FCT) through the research centre CEAUL-FCUL, Portugal and project Pest-OE/MAT/UI0006/2014; Geert Molenberghs, from the IAP research Network P7/06 of the Belgian Government (Belgian Science Policy). The authors are grateful to Dr Arnaud Perrier and to Dr Henri Bounameaux, from Division of General Internal Medicine of Geneva University Hospitals, for providing the data.
Hyper-parameter values
The hyper-parameters employed in Sections 5 and 6 are summarized in Table 5.
