We provide four case studies that use Bayesian machinery to making inductive reasoning. Our main motivation relies in offering several instances where the Bayesian approach to data analysis is exploited at its best to perform complex tasks, such as description, testing, estimation, and prediction. This work is not meant to be either a reference text or a survey in Bayesian statistical inference. Our goal is simply to provide several examples that use Bayesian methodology to solve data-driven problems. The topics we cover here include analysis of times series and analysis of spatial data.
In this manuscript, we provide four case studies that use Bayesian machinery to making inductive reasoning. Our main motivation relies in offering several instances where the Bayesian approach to data analysis is exploited at its best to perform complex tasks, such as description (probabilistic summary of main data patterns), testing (evaluation of competing theories of data formation), estimation (evaluation of parameters in a presumed model), and prediction (forecast of missing or future observations).
We focus on model-based approaches to perform statistical inference. In particular, our case studies are totally based on Bayesian statistics. A statistical model fully describes the data generating process under which a given dataset might have arisen. Thus, the vector of observations is assumed to be generated by an unknown probability distribution with probability density/mass function . Therefore, we have that all information in the data is contained in .
Furthermore, in Bayesian statistics we treat the model parameters that index , , as random variables, which in turn are assigned a prior distribution with probability density/mass function (even though it is an abuse of mathematical notation, since the probability density/mass function associated with is also denoted with ). The prior distribution captures the uncertainty of the researcher about the value of before observing (another way to think about the prior is as summarizing all information about that is external to ).
If we use probability theory to express all forms of uncertainty associated with our statistical model , then Bayes theorem can be used to update our knowledge about , through the posterior distribution
The posterior combines information in the data with any other information external to it that has been encoded in the prior. Again, we slightly abuse notation by using to denote the distribution of as well as that of .
The Bayesian approach is convenient for several reasons. First of all, in a Bayesian setting all forms of uncertainty quantification are granted. Moreover, it is natural to think about hierarchies, and therefore, borrowing of information (hierarchical models can be also fitted employing non-Bayesian approaches, but they are not as natural). Also, with the advent of Markov chain Monte Carlo (MCMC) and other simulation-based approaches, computation has become in some ways easier than computation for non-Bayesian approaches (particularly for complex models). Lastly, since both observations and parameters are random variables, dealing with missing and censored data is simpler.
This work is not meant to be either a reference text or a survey in Bayesian statistical inference. Our goal is simply to provide several examples that use Bayesian methodology to solve data-driven problems. Therefore, we expect that the reader has a working understating of Bayesian methods for model building and computation (both code and data are available for all those readers that explicitly ask it from the corresponding author). For a deep treatment of Bayesian data analysis methods, we refer the reader to Gelman et al. (2014), Robert (2007), Jackman (2009), among many others. Finally, the topics we cover here, include problems in Bayesian nonparametrics (e.g., Müller et al., 2015), Bayesian analysis of times series (e.g., Prado & West, 2010), and Bayesian analysis of spatial data (e.g., Banerjee et al., 2014).
The rest of the document is structured as follows: Section 2 presents an example using synthetic data in the context of nonparamteric regression; Section 3 offers the analysis of a times series using dynamic linear models; Section 4 provides an illustration of a space-time model for areal unit data; finally, Section 5 gives a brief discussion.
Nonparametric regression modelling
Consider the Gaussian process (GP) regression setting (e.g., Taddy & Kottas, 2010) with a single continuous covariate,
where and are the response and covariate observations, respectively, are independent and identically distributed (iid) random errors from a distribution, and is the regression function, which is assigned a GP prior with constant mean function , and power exponential covariance function , with unknown and , but fixed (the special cases for and correspond to the Exponential and Gaussian covariance functions, respectively). Here, defines the smoothness of the function, determines the amplitude of oscillation, and captures the volatility of orientation. Lastly, recall that based on the GP definition, for any collection of index points , the random vector follows an -variate Normal distribution with mean vector , and positive definite covariance matrix whose -th element is given by .
In order to perform full Bayesian inference under this setting, we consider the following independent prior distributions for the remaining model parameters:
Note that by letting , with for , denote the unknown function values, it follows that and , with , and , where , which means that once has been integrated out .
Now, we provide a Gibbs sampling algorithm designed to draw samples from the posterior distribution
where . Let be the vector of model parameters at iteration of the algorithm, for . Given a starting point , we consider a Gibbs sampler (with a Metropolis step) updating to until convergence as follows:
Sample , with
Sample .
Sample , with
Sample .
Sample using a Metropolis step:
Set .
Sample , with a tuning parameter.
Compute , with
where is a constant.
Update to with probability .
Now, in order to test the previous algorithm, we generate synthetic data according to Eq. (1) with , , for , and the true regression function (this example is included in a technical report by Radford Neal, which is available on-line from http://www.cs.toronto.edu/∼radford/mc-gp.abstract.html). Panel (a) in Fig. 1 shows the simulated dataset. Here, we adopt an empirical Bayes approach by setting , , , since marginally and ; and also, and , which let vary across a wide range of values. Thus, we fit the model with and run the algorithm choosing (step 5.) carefully to produce a reasonable average mixing rate between 30% and 50%.
Posterior summaries for , , , and
Parameter
Mean
SD
2.5%
50%
97.5%
0.0334
0.0063
0.0228
0.0327
0.0476
0.6117
0.3932
0.1654
0.6144
1.3862
0.8075
0.4035
0.3303
0.7009
1.8700
0.3590
0.1740
0.1200
0.3296
0.7710
Data along with five sample paths chosen at random from the posterior distribution of , and posterior inference on .
The results presented below are based on samples from the posterior distribution obtained after thinning the original chain every 10 observations and a burn-in period of 5000 iterations. Chains mixed reasonably well and effective sample sizes for range from 22921 to 26436. Panel (a) in Fig. 1 shows five sample paths chosen at random from the posterior distribution of . We see how these paths fit very well to the data. On the other hand, Table 1 presents summaries the posterior distribution of , , , and . Note that the 95% quantile-based credible interval for includes the true value (0.04). Finally, Panel (b) in Fig. 1 shows the posterior mean along with 95% credible intervals for . Almost the entire true function falls within the credible bands, which shows that the model specification as well as the prior elicitation was suitable in this case.
Dynamic linear modelling
Figure 2 shows the detrended zero-centered weekly change series associated with the U.S. 3-year Treasury Constant Maturity Interest Rate from March 18, 1988 to September 10, 1999. We have equally spaced measurements in total. This series does not exhibit additional trends and seems to be stationary (which is confirmed below), so we do not differentiate the data again.
Autoregressive model
As a naive analysis of the series, we consider a model of the form
which corresponds to an autoregressive model of order , . We fit this model to the data using a reference prior distribution of the form , with , and the conditional likelihood
where since the first observations are considered as initial values, for , and . This formulation corresponds to a regular linear model setting, and therefore, we can apply standard theory (Prado & West, 2010, pp. 19–22).
Combining the likelihood Eq. (3) with the prior distribution , we can easily obtain samples from the posterior distribution using direct sampling as follows: first, sample from , and then, for each in the previous step sample from a . Using , it follows that , is the residual sum of squares, which in turn leads to the MLE of , and is the MLE of . Posterior summaries of 25000 samples from are provided in Table 2. We see that the posterior parameter estimates are very similar to their MLE counterpart.
Posterior summaries for and
Parameter
Mean
SD
2.5%
50%
97.5%
0.2267
0.0407
0.1462
0.2267
0.3066
0.0063
0.0420
0.0753
0.0058
0.0896
0.1132
0.0408
0.0334
0.1132
0.1936
0.0130
0.0007
0.0116
0.0129
0.0145
Detrended zero-centered weekly change series of the U.S. 3-year Treasury Constant Maturity Interest Rate from March 18, 1988 to September 10, 1999.
Dynamic linear model
Now, going forward in the analysis, we consider a popular model given in (Tsay, 2010, p. 632) for detecting outliers:
Under this formulation, denotes the presence or absence of an additive outlier at time , and its corresponding magnitude when present. Thus, additive outliers are allowed to occur at every time point with a probability equal to 0.2 (further details about the prior elicitation can be found in Tsay, 2010, p. 636). Furthermore, defines an autoregressive process with order . Then, we have model parameters, namely, , , , , and . Also, we also assume the conjugate priors , for , and .
Note that Eq. (3.2) is a non-Gaussian dynamic linear model (DLM). However, conditional on the outlier indicators , this model can be written as a standard DLM. For instance, with the model aquires the form:
with
and
We use the previous fact to develop the second step of the MCMC algorithm described below. There, we implement a forward filtering backward sampling algorithm (Prado & West, 2010, p. 137) to get samples from , which automatically allow us to draw samples from .
Now, we describe the general form of the MCMC to obtain samples from the posterior distribution
where and . Full conditional distributions are derived looking at the dependencies in . Let be the vector of model parameters at iteration of the algorithm, for . Given a starting point , we consider a Gibbs sampler (with a FFBS step) updating to until convergence as follows:
Sample , for , with
Sample , for , with
Sample , with
Sample , with
Sample , for , using a FFBS step. Here we use standard DLM notation, in particular that corresponding to the Kalman filter (West & Harrison, 1999, Theorem 4.1, pp. 103–105):
Use the filter equations to compute , , , and , for :
with , , and .
At time , sample .
For , sample , where
with .
Now, we implement the MCMC provided above. To do so, we need to choose appropriate values for , , , and . First, and are reasonable values in accordance with our introductory analysis of the series using a standard autoregressive model. On the other hand, we weakly concentrate the prior distribution of around from our introductory analysis, by setting and , which leads to with infinite variance. The results presented below are based on samples from the posterior distribution obtained after a burn-in period of 5000 iterations. Chains mixed very well and there is no sign of lack of convergence. Effective sample sizes range from 398.7 to 29019.3.
Posterior summaries for and
Parameter
Mean
SD
2.5%
50%
97.5%
0.4692
0.1605
0.1618
0.4665
0.7904
0.1341
0.1932
0.2694
0.1492
0.4727
0.1296
0.1367
0.1495
0.1309
0.4029
0.0017
0.0004
0.0010
0.0016
0.0026
(a) Posterior probability that an observation is an outlier (lines correspond to 0.5 and 0.8, which serve as probability thresholds). (b) Posterior mean of an outlier size.
Table 3 displays summaries for the posterior distribution of and . Estimates for and are consistent with those obtained in Table 2 using a standard autoregressive model. However, estimates for and are utterly different. Such a difference is expected since the autoregressive model does not consider additive outliers, and as a consequence, is not able to characterize fundamental features of the series such as its mean evolution. On the other hand, Fig. 3 shows the posterior probability that an observation is an outlier along with the corresponding posterior mean of an outlier size , for . We identify 18 observations such that , seven of which are greater than 0.8, namely, observation at weeks 65, 74, 201, 323, 415, 418, and 513. Tsay (2010, p. 564) also classifies these points as outliers (employing an approach with mild differences from the one described here) and highlight observations at times (May 20, 1994) and (January 17, 1992). At the former, there was a 0.6% drop in the weekly interest rate, while at the later, there was a jump of about 0.35%. Finally, Fig. 4 displays the estimated autoregressive process as well as the fitted values along with the original data. We see that the model captures reasonably well the “volatility” of the series.
(a) Posterior inference on the autoregressive process. (b) Data (gray) and fitted values (red).
Areal unit space-time modelling
We consider an ecological study in which a spatiotemporal model for areal unit data is considered for modelling a set of binary observations , for and , in a discrete time period over a fixed lattice . Here, we consider the Google flu data corresponding to the first 12 weeks of 2013 for the contiguous continental states in the United States (Alaska and Hawaii are omitted from the analysis). Thus, we record a binary variable that is equal to 1 if the Google flue index for state during week is greater than 7500 cases, and 0 otherwise.
We have that the observations are Bernoulli distributed with probability of success , i.e., . Here, we consider the following areal hierarchical generalized linear model (Banerjee et al., 2014, Sec. 6.4) including a temporal component:
with a hierarchical prior distribution of the form
for , , where is a fixed positive integer, are fixed positive real numbers, and , are location-specific fixed effects, , and are location-specific random effects. Thus, denotes the number of random effects associated with the locations. Finally, notice that in Eq. (4) we have included the auxiliary variables in order to ease computation and performance Chib (1995).
The link function relating the and the linear predictor is based in a Student-t distribution with degrees of freedom, which is particularly useful to detect rare events (those with small ). Since with , and , then the marginal distribution of (and therefore ) is Student-t with degrees of freedom. Thus, due to the symmetry of the t distribution, we have that , and therefore, , where is the cumulative distribution functions of a random variable following a t distribution with degrees of freedom, which means that the link function is .
Note that the main effects for non-spatiotemporal covariates have a standard linear regression structure , with a weakly informative prior for (Gelman et al., 2014, p. 55). Also, the main effects for space covariates have a standard linear regression structure , following an improper conditionally autoregressive (CAR) model for (Banerjee et al., 2014, p. 81). Furthermore, the main effects for time are included in the linear predictor through , with a weakly informative prior for . Lastly, the main effects for spatiotemporal interactions have two components. On the one hand, captures region-wide heterogeneity (unstructured variation) by means of , where is a precision parameter controlling the magnitude of . And on the other, captures regional clustering (spatially structured variability) via , where is a precision parameter as before.
CAR models are very convenient computationally, since the method for exploring the posterior distribution is in itself a conditional algorithm, the Gibbs sampler, which also eliminates the need for matrix inversion. However, CAR models have two main theoretical and computational challenges, namely, model impropriety and precision parameter selection. Banerjee et al. (2014, p. 155) recommend to ignore impropriety and work with the intrinsic CAR specification, since we are using it just as a prior distribution and not to model the data directly. Even though this is the usual approach it requires some care: This improper CAR prior is a pairwise difference prior that is identified only up to an additive constant. Thus, in order to identify an intercept term (say ), we must add the constraint (this constrain is imposed numerically during computation). Moreover, and cannot be chosen arbitrarily large, since and would become unidentifiable. After all, we have only a single in order to estimate two random effects for each and each . To make the prior specification and sensible, Banerjee et al. (2014, p. 156) recommend a prior specification that leads to
where is the average number of neighbors.
States and regions neighbors used to specify precision matrices. Red boxes denote neighbor elements.
Define , , , , , , , and also, . Parameter estimates can be obtained from the posterior distribution
with specified as in Eq. (7), and is a square matrix of size such that
where is the number of neighbors of (see Fig. 5), and means that units and are neighbors. Full conditional distributions are derived looking at the dependencies in . Let be the vector of model parameters at iteration of the algorithm, for . Given a starting point , we consider a Gibbs sampler (with a step with a numerical constraint) updating to until convergence as follows:
Sample
for , .
Sample
where , , , , with for , , with for .
Sample
where and is a precision matrix (covariance matrix inverse) of size defined accordingly (see Fig. 5, panel b).
Sample
where .
Regional offices according to the U.S. Department of Health and Human Services.
Sample
where , , , , , and , for .
Sample
where and , for .
Center around its own mean in order to impose the constrain , for .
Sample , for .
Sample , for .
Sample , for .
Now, we implement the model by considering the Google flu data corresponding to the first 12 weeks of 2013. Outbreaks of influenza are commonly observed during weeks 1 to 8 (Winter season). The proportion of states with Google flue index greater than 7500 cases goes down straight to zero after week 8. Then, we have an align dataset with states (subjects) and weeks (period times) that lead to 637 observations possibly correlated in time and space. Thus, we consider the linear predictor given by
where and are the proportion of white population and the proportion of population over 65 years old, respectively, for state in 2013, according to Kaiser Family Foundation (http://kff.org/), and is an indicator variable pointing out if state belongs to region or not, according to according to the U.S. Department of Health and Human Services (HHS, http://www.hhs.gov/iea/regional/; see Fig. 6). Therefore, is the mean global effect, is the fixed effect of covariate , for , is the fixed spacial effect of region , for , is the fixed effect of time, and is the random spatiotemporal effect considering heterogeneity and clustering.
Posterior mean and 95% quantile-based credible intervals for , , and .
Posterior mean for the dynamic random effects . The blue line corresponds to Maine, which is the state with the highest proportion of white population in 2013.
We implement the MCMC algorithm given above drawing samples from the posterior distribution after thinning the original chain every 10 observations and a burn-in period of 5000 iterations. To do so, we weakly concentrate the prior distribution of , , and around 0, and that of , , and around 1, by following the prior specification given in Eq. (8) with , , and . Notice that is set equal to 2 because a distribution has an infinite variance. Trace and autocorrelation plots of the regression parameters show that the corresponding chains achieve convergence quickly. However, in some particular cases, there are signs of serious autocorrelation (in order to address this issue we thin the chain as specified).
Figure 7 summarizes the posterior distribution of , , and . The proportion of the white population have a positive significant impact on the response, which leads to an increment in the probability that the state is above the threshold, whereas the global mean and the proportion of population over 65 do not. Even thought most deaths associated with influenza in industrialized countries occur among the elderly over 65 years of age, this covariate results not significant because it has little variation across the states () and the proportion of population over 65 in each state is just a small fraction of the whole state population (be aware of the “ecological fallacy”). In addition, such a finding makes sense because we are considering incidence as opposed to mortality. Furthermore, all the regions delimited by the HHS have a significant impact on the response, excepting regions 6 and 9 (see Fig. 6). This finding is reasonable since these regions are neighbors and are the smallest regions in terms of extension and population. In addition, it is not surprising that in general regions have a negative influence on the response since these regions independently address needs of communities and individuals through HHS programs and policies. Finally, the effect of time is also significant since it is well known that outbreaks of flue are highly correlated with seasons and we consider a time transition frame.
On the other hand, Fig. 8 displays the posterior mean for the sum of the heterogeneity and clustering random effects . As expected, these effects clearly go down toward 0 after week 8 due to the seasonal behavior of the response. Even though all the states reveal similar patterns, the model is also useful to identify subject-specific dynamic effects. For instance, consider the case of Maine (ME), which is the state with the highest proportion of white population in 2013.
Discussion
In this work we have offered four detailed case studies that use Bayesian modelling techniques in order to answer complex questions. At every instance, we have provided specifics about model formulation, prior elicitation, simulation-based algorithms computation, and posterior inference. We hope that the reader benefits from such effort and considers the Bayesian paradigm a natural way to solve data-driven problems.
From a frequentist point of view, an extra effort is needed in order to fit the kind of models that we consider in this work. On the one hand, for “random effects”, once we specify a stochastic dependence for a given parameter (this is quite natural under the Bayesian paradigm because all parameters are random quantities), its condition becomes unclear for carrying out frequentist inference: Are random effects parameters or not? On the other hand, from a Bayesian perspective, since both observations and parameters are random variables, dealing with missing and censored data as well as uncertainty quantification is straightforward, even more so since simulation-based approaches became widely available in the early 90s (Robert & Casella, 2011). Finally, it is noteworthy that frequentists deal directly (in an ad-hoc fashion) with hierarchical models by maximizing either marginal likelihoods or restricted maximum likelihoods Patterson and Thompson (1971), or treating random effects (or any other set of parameters) as missing data and estimate them using and EM algorithm (Dempster et al., 1977).
Even though the Bayesian approach to statistical inference is extremely beneficial, some challenges remain. The most challenging aspects of Bayesian statistical inference correspond to eliciting prior distributions and performing the prior-to-posterior updates in non-conjugate models.
The prior distribution is an integral part of the model and needs as much attention as the likelihood function. If the prior information is “bad”, then it could negatively affect the results (on the flip side, “good” prior information can substantially improve the quality of your results, particularly for small sample sizes). Also, the prior elicitation can be very complex when many hyperparameters need to be set (often associated to parametric families). Even after the introduction of more powerful computational tools based on simulations, the choice of prior distributions is often driven by a desire to facilitate computation.
Typically, the prior-to-posterior updates is a much more challenging problem than that encountered in classical statistical methods. Computation for frequentist methods requires that we compute maximums and derivatives. Computation for Bayesian methods usually requires that we compute integrals. This was a major obstacle for the practical adoption of Bayesian methods before cheap and powerful computers were widely available until the early 90s (see our discussion above). Finally, we recommend consider alternative inference methods in order to account for “big data”, which is currently an active research area in computational statistics (e.g., variational methods). See for example Ormerod and Wand (2010).
Lastly, we devote a few words in order to demystify the gap between frequentist and Bayesian thinking. After reviewing some advantages and challenges of the Bayesian paradigm, the reader should be aware that several links with the frequentist world remain. For instance, as part of the revision process, one of the referees pointed out the connection between the frequentist treatment of mixed effects models and empirical Bayes methods: The best linear unbiased prediction for random effects under the frequentist paradigm can also be obtain directly from empirical Bayes principles (e.g., Karunamuni, 2002). Other similarities are available, mostly under noninformative prior distributions and large sample sizes, but they will be discussed elsewhere.
Footnotes
Appendix
Notation and acronyms
The cardinality of a set is denoted by . If P is a logical proposition, then if P is true, and if P is false. denotes the floor of , whereas denotes the set of all integers from 1 to , i.e., . The Gamma function is given by .
Matrices and vectors with entries consisting of subscripted variables are denoted by a boldfaced version of the letter for that variable. For example, denotes an column vector with entries . We use and to denote the column vector with all entries equal to 0 and 1, respectively, and to denote the identity matrix. A subindex in this context refers to the corresponding dimension; for instance, denotes the identity matrix. The transpose of a vector is denoted by ; analogously for matrices. Moreover, if is a square matrix, we use to denote its trace and to denote its inverse. The norm of , given by , is denoted by .
The following list summarizes all the acronyms used in this manuscript. All the probability distributions can be found in Appendix A of (Gelman et al., 2014):
E: Expected value.
Var: Variance.
SD: Standard deviation.
CV: Coefficient of variation.
Pr: Probability.
Ber: Bernoulli distribution.
N: Normal distribution.
TN: Truncated normal distribution.
G: Gamma distribution.
IG: Inverse-gamma distribution.
U: Uniform distribution.
CAR: Conditionally autoregressive model.
GP: Gaussian process.
References
1.
BanerjeeS.CarlinB., & GelfandA. (2014). Hierarchical modeling and analysis for spatial data. Crc Press.
2.
ChibS. (1995). Marginal likelihood from the gibbs output.Journal of the American Statistical Association, 90(432), 1313-1321.
3.
DempsterA.P.LairdN.M., & RubinD.B. (1977). Maximum likelihood from incomplete data via the em algorithm.Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1-22.
4.
GelmanA.CarlinJ.SternH., & RubinD. (2014). Bayesian data analysis, Vol. 2. Taylor & Francis.
5.
JackmanS. (2009). Bayesian analysis for the social sciences, Vo. 846. John Wiley & Sons.
6.
KarunamuniR.J. (2002). An empirical bayes derivation of best linear unbiased predictors.International Journal of Mathematics and Mathematical Sciences, 31(12), 703-714.
7.
MüllerP.QuintanaF.A.JaraA., & HansonT. (2015). Bayesian nonparametric data analysis. Springer.
PattersonH.D., & ThompsonR. (1971). Recovery of inter-block information when block sizes are unequal.Biometrika, 58(3), 545-554.
10.
PradoR., & WestM. (2010). Time Series: Modeling, Computation, and Inference. Chapman and Hall/CRC.
11.
RobertC. (2007). The Bayesian choice: from decision-theoretic foundations to computational implementation, Vol. 2. Springer.
12.
RobertC., & CasellaG. (2011). A short history of markov chain monte carlo: Subjective recollections from incomplete data.Statistical Science, 26(1), 102-115.
13.
TaddyM., & KottasA. (2010). A bayesian nonparametric approach to inference for quantile regression.Journal of Business & Economic Statistics, 28(3), 357-369.
14.
TsayR.S. (2010). Analysis of Financial Time Series. CourseSmart. Wiley.
15.
WestM., & HarrisonJ. (1999). Bayesian Forecasting and Dynamic Models. Springer Series in Statistics. Springer New York.