A Bayesian Primer for the Organizational Sciences

Abstract

When first learning Bayesian statistics, the organizational scholar may be confronted by a number of conceptual and practical challenges. The present article seeks to minimize these by first explicating how the Bayesian process can be understood simply as the combination of two complementary sources of information: prior beliefs and data. In turn, we describe how each source is derived from Bayes’s theorem and mathematically formalized, essential knowledge for the Bayesian analyst. However, the beginner can also be undermined by practical difficulties such as software implementation. To this end, we offer a walkthrough of how a Bayesian logistic regression model is coded within BugsXLA, a user-friendly Excel add-in for Bayesian estimation. The data for this example come from a previously published study that identified a subpopulation of “job hobos,” individuals characterized by their frequent voluntary turnover and positive attitudes toward quitting. In the original frequentist analysis, exploring the predictors of hoboism proved to be inefficient and inconclusive. We contrast this standard approach with Bayesian estimation, whose results provide rich and novel insights on the topic.

Keywords

Bayesian BugsXLA WinBUGS regression turnover hobo syndrome

In recent decades, we have witnessed an explosion of Bayesian methodologies in a wide range of fields, including astrophysics (Loredo, 1992), genetics (Shoemaker, Painter, & Weir, 1999), epidemiology (Dunson, 2001), and the modeling of perception (Adelson & Pentland, 1996; Feldman, 2001; Weiss, 1997), learning (Courville, Daw, & Touretzky, 2006; Griffiths & Tenenbaum, 2005), and language processing within cognitive science (Griffiths, Steyvers, & Tenenbaum, 2007; Xu & Tenenbaum, 2007). As an alternative to standard frequentist procedures, researchers are increasingly encouraged to adopt Bayesian approaches to statistical inference that avoid “null effects,” are not constrained by small samples, and do not impose harsh penalties on curious researchers testing multiple hypotheses (Dienes, 2011; Kruschke, 2010a, 2010b, 2011; Kruschke, Aguinis, & Joo, 2012; Lee & Wagenmakers, 2007; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011; Wetzels et al., 2011). In addition to circumventing the traditional problems associated with the current frequentist paradigm, Bayesian statistics has its intrinsic benefits as well: Estimation provides a wealth of information on which to base inferences, previous findings are naturally incorporated with new data, and the increasingly complex models seen in contemporary research are easily accommodated.

The Neglect of Bayes

The well-known advantages of Bayesian methods are contrasted by the scarcity of their use in our field. In a search of fifteen prominent journals in the organizational sciences (e.g., Academy of Management Journal, Journal of Applied Psychology, Journal of Management, Personnel Psychology), Kruschke et al. (2012) found that fewer than 0.5% of articles published from January 2000 to December 2010 used Bayesian analyses. Since then, little has changed. Using their same search parameters for the following period of January 2011 to June 2014, we found that this percentage rose to only approximately 0.9%. Importantly, what constituted “using Bayesian analyses” in both reviews was merely if the word Bayes or Bayesian was mentioned in the text, not whether Bayesian methodologies were the primary analytical techniques used. As frequentist procedures sometimes incorporate Bayesian-related aspects (e.g., Bayesian information criterion in a latent class cluster analysis or empirical Bayes in hierarchical linear modeling), even these small percentages are surely inflated. We find both this trend and current state to be problematic. This is not out of allegiance to one paradigm over another, but because Bayesian approaches remain underutilized despite growing awareness of their efficacy in improving both the process and yield of scientific research.

These findings beg the question, “With increased visibility and evident benefits, why have these rates not improved?” There are many potential explanations (see Sharpe, 2013), and we discuss three possible reasons here. First, researchers may be somewhat aware of the advantages of Bayesian statistics, but perhaps assume that data analysis based on null-hypothesis significance testing (NHST) is “good enough.” That is, they may be vaguely aware of the benefits of the former and problems of the latter, but not recognize their significance in actual research contexts. Kruschke et al. (2012) delineated the many advantages of Bayesian estimation, and Zyphur and Oswald (in press) provided several examples of when this approach superseded traditional frequentist methods (viz., in an analysis of variance and structural equation model). Expanding on their previous work, we use previously published data (Woo, 2011) to contrast the Bayesian and standard frequentist approaches to estimation in an applied regression context. We demonstrate how a Bayesian analysis leads to more comprehensive inferences and better answers to the research questions being posed.

Second, the knowledge required for Bayesian techniques is markedly different than that needed for classical procedures. Probability theory takes a central role in the former and is often divorced from statistics in graduate courses (Lynch, 2007, p. 9). Performing Bayesian estimation also requires a greater background knowledge of statistical theory, including familiarity with various probability distributions and their parameters, an understanding of Bayes’s theorem and its implications, and competence in Markov chain Monte Carlo (MCMC) simulation methods. Though methodologically capable, organizational researchers are not statisticians and use statistics pragmatically—that is, as a tool to address pressing research questions. Therefore, introductory Bayesian texts that are mathematically intensive (e.g., Gelman, Carlin, Stern, & Rubin, 2004; Gill, 2002; Ntzoufras, 2009) may not be appropriate for this audience. Zyphur and Oswald (in press) provided an informative and accessible introduction to a wide range of topics in Bayesian estimation, including model specification, methods for assessing model fit, and concepts in MCMC sampling. However, the underlying mathematics were eschewed in favor of a more conceptual overview. This is understandable, but competency in Bayesian estimation requires an understanding of its underlying mathematical process. In particular, it is essential to conceptually understand (a) the three mathematical functions that compose a Bayesian model, (b) how they are constructed, and (c) how various probability distributions are used within this modeling framework. Therefore, the current article contributes to the literature by outlining the overall process of Bayesian estimation and then explaining the general mathematics in a comprehensive yet accessible way. We provide an introduction to the concept of probability distributions and their underlying functions, offer guidance for properly specifying a Bayesian model, and discuss the logic and process of MCMC sampling.

A final impediment to the spread of Bayesian methods is the task of learning the requisite software. Many software packages supporting Bayesian statistics are not user-friendly relative to those commonly used in organizational research (e.g., SPSS, Stata) and generally require considerable knowledge of a programming language. Organizational researchers often lack a strong programming background, and basic explanations of how Bayesian models are coded within these programs are not nearly as common as they ought to be. To remedy this, we introduce the reader to a relatively new program for Bayesian estimation, BugsXLA (Woodward, 2011a, 2011b). This program is freely available and represents a significant advancement in Bayesian statistical software: It is an Excel add-in that is entirely driven by a graphical user interface (GUI). The advantages of this program are discussed in the current article, and to help familiarize the reader with this software, we also offer a guided tutorial of how the Bayesian logistic regression model of our empirical example was specified. It is our sincere hope that in reading this article, organizational researchers will come to have a better holistic understanding of the process of Bayesian estimation, its mathematical foundations, and how it can be implemented in contemporary software.

Part 1: The “Two Sources” in Bayesian Inference

Preliminaries

The entire process of Bayesian inference can be simply demonstrated by likening it to a general reasoning process each of us performs daily. The basic structure is as follows: We start with some prior beliefs about an aspect of the world (e.g., there is probably some leftover milk in the refrigerator), observe some related information (observing a family member with a large glass of milk), and then update our beliefs accordingly (there might not be any more milk left after all).

Though the content of this example is irrelevant, the form of this process is identical to Bayesian inference: Prior beliefs meet observed data and become updated. However, in statistics our interests are more specific; typically, we are interested in estimating the underlying population characteristics that generated our data, or parameter values (means, variances, correlations, etc.). Thus, Bayesian estimation can be characterized as beginning with some prior beliefs regarding a parameter, observing some data, and then updating these initial beliefs to reflect the influence of this new information. In this way, the Bayesian process is simply the integration of two unique and complementary sources of information—prior beliefs and data—to make educated conclusions regarding parameters. In Bayesian parlance, the source containing one’s prior beliefs is fittingly referred to as the prior, the source containing the data, the likelihood, and one’s updated beliefs, the posterior. Each of these is formally derived from the terms in Bayes’s theorem, to which our discussion now turns.

An Introduction to Bayes’s Theorem

Thomas Bayes’s (1702-1761) original theorem is a mathematical equation built on simpler probability axioms (Jackman, 1999). Its basic form is

p (A | B) = \frac{p (B | A) p (A)}{p (B)},

where p indicates probability, and the “|” symbol denotes a conditional probability, the probability of one event given that another event has occurred—for example, the probability that I will eat given that I am hungry. (Thus, the left side of the equation, $p (A | B)$ , is the probability of A given that B has occurred.) Although this equation may seem complex, its utility is straightforward: It allows us to calculate any conditional probability of interest, provided that we can specify the terms on the right side of the equation. As previously stated, in statistics we are typically interested in estimating the parameter values that led to the observed data. Thus, our primary interest can be said to lie in one conditional probability above all others: $p (p a r a m e t e r | d a t a)$ . That is, we are interested in the probability that a specific parameter estimate is the true population value given the data we have observed. By translating Bayes’s original equation into statistical terms, we can calculate this value:

p (p a r a m e t e r | d a t a) = \frac{p (d a t a | p a r a m e t e r) p (p a r a m e t e r)}{p (d a t a)} .

The Logic of Using Functions

Using this version of Bayes’s theorem, one could hypothetically calculate the probability for any single estimate of a parameter. However, it would be much more useful if we could use this equation to simultaneously calculate the probability of each possible parameter value; obtaining the probability for each potential value would be extremely informative and allow one to make highly informed statistical inferences. Bayesian estimation is a means to this end and is therefore often described as a remarkably rich source of information on which to base inferences (Kruschke, 2013; Kruschke et al., 2012). Mathematically, this is accomplished through a revolutionary step: replacing the point probabilities in Bayes’s original formula with entire probability distributions that represent the entire range of potential parameter values. A probability distribution is simply a mathematical function: a relation that produces an output for a given input. For instance, $f (x) = x^{2}$ is a function where the relation between the inputs (the values on the x-axis) and outputs (the values on the y-axis) is the square of each x-value (i.e., it is a function of x). What makes a probability distribution unique is that its outputs on the y-axis are probabilities, and in Bayesian statistics, the inputs on the x-axis are the range of possible parameter values. Thus, using functions allows the analyst to assess the probability of each possible value simultaneously.

Users of frequentist methods are assumedly familiar with several probability distributions, including the univariate normal, Student’s t, and F-distributions, which are commonly used for testing the statistical significance of a parameter estimate. However, they are perhaps less familiar with the underlying mathematical functions that govern their graphical shape. For example, the form of the univariate normal distribution is governed by the following function,

f (x) = \frac{1}{\sqrt{2 π σ^{2}}} e x p {\{- \frac{(x - μ)}{2 σ^{2}}\}}^{2},

and the F-distribution by this function:

f (x) = \frac{\sqrt{\frac{(d_{1} x)^{d_{1}} d_{2}^{d_{2}}}{(d_{1} x + d_{2})^{d_{1} + d_{2}}}}}{x B (\frac{d_{1}}{2}, \frac{d_{2}}{2})} .

It is important to note that, although these functions appear complex and markedly different, one does not work with them directly when doing Bayesian estimation; as in frequentist procedures, the actual mathematics are performed by the software. However, it is important to understand conceptually that Bayesians use probability distributions for statistical modeling and that these all serve the same purpose: to return a measure of probability for each input value on the x-axis.

Because we can calculate the relative probability for all possible parameter values by using the functions that underlie probability distributions, Bayes’s formula originally defined for single probabilities can be reexpressed as,

f (B | A) = \frac{f (A | B) f (B)}{f (A)},

or in statistical terms,

f (p a r a m e t e r | d a t a) = \frac{f (d a t a | p a r a m e t e r) f (p a r a m e t e r)}{f (d a t a)} .

An added convenience is that when using functions Bayes’s formula reduces to

f (p a r a m e t e r | d a t a) \propto f (d a t a | p a r a m e t e r) \times f (p a r a m e t e r),^{1}

where the “∝” symbol means “proportional to” and can be interpreted as an equals sign for the purposes of this article. This simplified version of Bayes’s theorem is the form that is explicitly used in Bayesian estimation; every Bayesian is highly familiar with it, and its three terms represent the posterior, likelihood, and prior, respectively. The posterior distribution on the left side of the equation is the ultimate goal of Bayesian estimation: the probability distribution of the parameters given the data that have been observed. The function underlying this probability distribution is simply the product of the two functions on the right that correspond to the two different sources of information that comprise these updated beliefs. The first term, $f (d a t a | p a r a m e t e r)$ , is the likelihood, a function that captures the information provided by the data regarding the plausibility of different parameter values. The second term, $f (p a r a m e t e r)$ , represents the prior distribution and is a function that formalizes one’s prior knowledge regarding the parameters.

The Posterior as a “Compromise”

As noted above, the function that underlies the posterior distribution is simply the product of those that underlie the likelihood and prior (i.e., the two contributing sources of information: data and prior knowledge). Therefore, it is important to understand that one only needs to specify a particular probability distribution for the likelihood and prior, whose underlying functions, when multiplied together, are the posterior. In other words, one does not specify a probability distribution for the posterior directly; it naturally results from the combination of these “two sources.” It is therefore useful to think of the Bayesian posterior as “a point that represents a compromise between the prior information and the data, and the compromise is controlled to a greater extent by the data as the sample size increases” (Gelman et al., 2004, p. 36). That is, even if we start out with some very strong prior beliefs, heavily weighing particular parameter values over others, a large amount of contrary data will overwhelm them, and the resulting posterior distribution will be closer in form to the likelihood. It is especially helpful to understand this relationship graphically. Figure 1 depicts how the shape of the posterior distribution for a parameter (here, a probability) changes under three different combinations of likelihoods and priors. The upper panel illustrates a case where the analyst has specified strong prior beliefs regarding the probability of certain parameter values: Values near 1 are heavily favored. However, the likelihood function—based on how the data “speak” to the probability of the different parameter values—strongly contradicts these prior beliefs. Because the posterior can be viewed as a “compromise” between the two functions, its form and location reflect the influence of both. However, it is ultimately much closer in form to the likelihood, as the large number of observations in the likelihood caused it to be weighted more heavily in the analysis relative to the prior.

Figure 1.

Three examples of the likelihood and prior combining to form the posterior distribution. Note that in the bottom panel, the posterior is identical to the likelihood function, as the flat noninformative prior exerts no influence in the analysis.

The center panel illustrates how the same data (i.e., the likelihood) interacts with a prior distribution that is notably more similar. The resulting posterior is very similar to that of the first plot due to the influence of the data, but importantly, is less spread out (i.e., it has a reduced variance). This results from the fact that, relative to the first plot, the likelihood and prior contain information that is much more consistent, manifesting in less uncertainty (i.e., spread) within the posterior distribution.

Finally, the bottom panel displays a “flat,” noninformative prior distribution that weighs each possible parameter value equally so that the posterior is entirely dominated by the data in the likelihood function. In fact, because the prior distribution asserts no influence in the analysis, the posterior is identical to the likelihood and is completely overlapped by it in the figure.

As can be seen, it is the product (i.e., combination) of the likelihood and prior that determines the underlying function and corresponding graphical shape of the posterior—the goal of Bayesian estimation. In fact, because in a mathematical sense they are the posterior, specifying these terms essentially is the process of Bayesian modeling. It is therefore crucial to understand how these functions are mathematically constructed—the process of how specific probability distributions are selected for these terms in Bayes’s theorem. Depending on the specific analysis at hand, the distributions might be those cited above (the normal or F-distribution) or other, perhaps less familiar, forms (e.g., the Poisson, binomial, or multivariate normal distributions). Specifying an appropriate distribution for the likelihood and prior in Bayes’s theorem (Equation 7) will be a central topic in the forthcoming sections, where each of these two sources is described in turn.

Addendum: Probability Versus Probability Density

It is important to note that in the plots in Figure 1 and in Bayesian estimation in general, the y-axis does not usually represent probability per se. This is because for continuous probability distributions, the y-axis represents probability density, and not probability proper. The reason for this is that a continuous distribution contains an infinite number of continuous values on the x-axis. For instance, the normal distribution ranges from –∞ to +∞, but given any two particular values (e.g., 0 and 1) there are an infinitude of numbers between them (.1, .01, .001, etc.). This fact makes the true probability of any one specific value infinitesimally small. Thus, probability density is used as the metric of probability in these distributions which is defined as the amount of probability within a particular interval divided by its width. In other words, it is the “density” of probability within a specific interval along the x-axis. This is why the continuous probability distributions seen in Figure 1 allow for raw y-axis values greater than 1 (i.e., greater than unity). However, for discrete probability distributions the value on the y-axis is, in fact, probability properly conceived. Intuitively, a discrete distribution is one whose x-axis contains a number of distinct values (e.g., integers only) which allows each value to be allotted its own value of probability. Though the probability versus probability density distinction may seem to be a hindrance and add a layer of unnecessary complexity, the practical consequences are actually minimal; regardless of whether continuous or discrete distributions are used, the prior, likelihood, and posterior still provide accurate measures of relative probability, which is what the analyst is interested in—that is, which parameter values are the most likely relative to the others (and by how much) given the observed data. For a more thorough discussion regarding this distinction see Kruschke (2011, chap. 3).

Source 1: Data and the Likelihood Function

Data analysis begins first and foremost with a sample of observed data. Supposing that a set of empirical data has been collected, how can it be used to formalize which parameter values are more likely relative to others? The answer lies in a close examination of Bayes’s theorem. Recall that the posterior, $f (p a r a m e t e r | d a t a)$ , is the probability distribution (i.e., a function) of different parameter values given the data: The x-axis provides the entire range of these values, and the y-axis yields the probability of each. Conversely, the likelihood, $f (d a t a | p a r a m e t e r)$ , is a function that provides the probability of the observed data for each potential parameter value. Here, the x-axis still represents the range of possible parameters, but the y-axis gives the probability of the data under each of these values (Kruschke, 2011). That is, it yields the probability of the data given each parameter value, referred to as the likelihood of that value. Given a particular parameter value (e.g., a mean of 10), the observed data (e.g., observations of 90, 100, and 110) might be highly improbable, and other parameter values might be much more likely to lead to the data that had been observed (e.g., a mean of 100). Conceptually, this is how the likelihood function probabilistically weighs the different candidate parameter values and will be concretely demonstrated in the following sections.

Selecting a Distribution for the Likelihood

Because the likelihood (the first term in Bayes’s theorem) is a function that models the data, the first step in its construction is selecting a probability distribution that is appropriate to the dependent variable(s) being modeled. That is, one must select a distribution that plausibly represents the underlying population that generated the observed data. For instance, if one is undertaking a multivariate analysis with multiple dependent variables, then a multivariate distribution must be chosen for the likelihood (e.g., the multivariate normal or multinomial distributions), as multivariate observations cannot come from a univariate distribution. Similarly, if the dependent variable is categorical, then a discrete probability distribution is most appropriate, rather than one that is continuous. Here, the goal is to select a specific distribution that is most accurately able to model the population from which the sample came.

Evidently, what should determine this selection is the actual nature of the values being modeled. Therefore, when initially specifying the distribution for the data in the likelihood function, it is always useful to first ask oneself several questions, such as, “Are my observations univariate or multivariate?” “Are they continuous or discrete?” and “Are my data restricted to certain values?” (e.g., integers, nonnegative values). A useful heuristic is to view this selection as, in part, a process of elimination, filtering out distributions that are clearly not suitable as population models for the data. Furthermore, data often come in one of several canonical forms (e.g., continuous or dichotomous values), and there are standard choices for what distribution to select for a particular data type. Accordingly, Table 1 lists different common forms of data, the distributions that are conventionally chosen as their population model for the likelihood, and the common statistical analyses associated with each. Appendix A provides intuitive conceptual descriptions of many different distributions and how they are conventionally used in Bayesian estimation.

Table 1.

Standard Distributions for Modeling Different Types of Data in the Likelihood Function.

Type of Dependent Variable	Standard Distribution	Continuous or Discrete	Associated Analyses
Continuous	Normal	Continuous	Univariate analyses (e.g., linear and curvilinear regression; t test; ANOVA; multilevel models)
Continuous	Student’s t	Continuous	Robust univariate analyses
Dichotomous	Bernoulli or binomial	Discrete; discrete	Logistic regression
Ordinal	Normal or categorical	Continuous; discrete	Ordinal regression
Count	Poisson	Discrete	Poisson regression
Multivariate continuous	Multivariate normal	Continuous	Multivariate analyses (e.g., multivariate regression, MANOVA, latent variable models)
Multivariate continuous	Multivariate Student’s t	Continuous	Robust multivariate analyses
Multivariate categorical	Multinomial	Discrete	Multivariate analyses (e.g., latent variable models)

Note: Although these represent conventional selections for modeling the data, the reader should also be aware that alternative models may be used to address the same research question, just as in the frequentist paradigm.

Example of a Likelihood Function: Estimating a Population Mean

To make the process of modeling the data in the likelihood function more concrete, suppose that we wish to perform Bayesian estimation on a particular parameter, say, the mean number of past deviant work behaviors performed by job applicants. Suppose further that we have collected data from 100 random applicants, where each observation represents their number of prior deviant behaviors. Thus, each observation is an integer, and we are therefore modeling count data (i.e., integer data). Examining the nature of this variable using the preceding questions (e.g., “Is it continuous or discrete?”), we know that we require a probability distribution that is univariate, discrete, and ideally bound at 0 (because no individual can have a negative number of deviant behaviors). Looking at Table 1, the standard distribution for modeling count data is the Poisson distribution, and from the descriptions in Appendix A, we know that it is a probability distribution that fits each one of these desired characteristics. The following mathematical function underlies its form:

p (x) = \frac{e^{- λ} λ^{x}}{x!} .

In this function, x represents an observed data value, like every probability distribution, and the variable λ is a parameter—the mean of the distribution and the parameter of interest. Substituting a data value for x and leaving λ as a free parameter will cause this function to return the probability of this particular observation under each possible value of λ, as discussed in the preceding section. For instance, an observation of 3 (as a value of x) will not be very probable given a population mean (λ) of 100. Intuitively, this observation is most probable given values of λ nearest to 3, while the probability of this observation decreases as the value of λ becomes further removed. To illustrate this concept, the top panel of Figure 2 displays a Poisson likelihood function when the observed data value is 3. In this way, the data “speak” as to which specific parameter values are more or less likely. Finally, now that a parameter has been specified in the model, Bayes’s general theorem can be rewritten to reflect our current working example:

f (λ | d a t a) \propto f (d a t a | λ) \times f (λ) .

Figure 2.

Three Poisson likelihood functions with different numbers of observed values that make different values of the parameter, λ (the mean), more or less likely. The dashed line represents the point in the function that corresponds to the most likely value of λ given the data (3, 4, and 1.86, respectively).

Constructing a Full Likelihood

This is how a single observation is formalized into knowledge regarding a parameter. First, a probability distribution is selected that represents a plausible population model that generated the observation. Then, the observed value is substituted for x within its underlying function to yield the probability of that specific value under the range of possible parameter values. However, more generally we wish to know how probable the data set is as a whole relative to the range of possible parameter values; the function constructed from a single observation will not be very informative. If a single observation is modeled by a probability distribution, how then do we combine the information contained in the entire data set? The answer to this question lies in basic probability theory, where a fundamental axiom is that the probability of two independent events occurring is merely the product of their individual probabilities: For example, the probability of events A and B both occurring is $p (e v e n t A) \times p (e v e n t B)$ . This rule can be extended to the functions underlying probability distributions as well. That is, we can insert each independent data value within the particular function we have selected and then multiply them all together, just as when working with the point probabilities above. For instance, if we have selected a Poisson distribution as a suitable population model for the data, and there are only two observations in the data set, say, 3 and 5, then the full likelihood function for λ will be the following equation:

f (d a t a | λ) = \frac{e^{- λ} λ^{3}}{3!} \times \frac{e^{- λ} λ^{5}}{5!} .

This equation is nothing more than two Poisson distribution functions multiplied together with each observation substituted as x. However, in actual research, data sets normally contain more than two observations, and so a Poisson likelihood for an entire data set can be expressed more generally as

f (d a t a | λ) = \frac{e^{- λ} λ^{x_{1}}}{x_{1}!} \times \frac{e^{- λ} λ^{x_{2}}}{x_{2}!} \times \dots \times \frac{e^{- λ} λ^{x_{N}}}{x_{N}!},

where x₁ , x₂ , and x_N represent the individual observations, and λ is the population mean that is left as the free parameter on the x-axis. Placing this within the context of our example, suppose that the first three individuals in our fictitious data set (N = 100) reported 3, 5, and 4 prior instances of deviant workplace behavior, respectively. In constructing the likelihood function, we would substitute each data value as x in the particular function we have chosen (for us, the one underlying the Poisson distribution) and then multiply each together:

f (d a t a | λ) = \frac{e^{- λ} λ^{3}}{3!} \times \frac{e^{- λ} λ^{5}}{5!} \times \frac{e^{- λ} λ^{4}}{4!} \times \dots \frac{e^{- λ} λ^{x_{100}}}{x_{100}!} .

Equation 12 represents the full likelihood function for our example, with the remaining 97 observations excluded for brevity. Appendix B shows how Bayes’s theorem has been updated now that the likelihood has been fully defined.

The total number of observed values also influences the likelihood function. The center panel of Figure 2 displays a Poisson likelihood that results from having only three observed values (x = 3, 5, 4) in the data set. Notice that relative to the top panel, which displays the likelihood function from a single observation (i.e., 3), when more observations are included the spread (i.e., width) of the likelihood function is reduced. Thus, with accumulating consistent information, our uncertainty is reduced regarding which parameter values led to the collected data; each observation is more or less probable under different parameter values (here, λ, the mean) and contributes to the final form of the likelihood function. The bottom panel of Figure 2 displays a Poisson likelihood with four additional observations: 0, 0, 1, 0. Because these values are lower than the previous observations, the likelihood has shifted over to the left, reflecting the fact that a lower value of λ is more likely to lead to this entire set of data. The spread (variance) is also significantly reduced, as with accumulating information, λ values greater than 4 are increasingly unlikely.

Summarizing the preceding discussion, the following represents the general process of constructing a likelihood function. First, one assesses the nature of the dependent variable(s) and selects a probability distribution that appropriately reflects these characteristics to serve as a model for the population that generated each observation. Then, each individual observation takes the place of x within the function underlying the chosen distribution and each function is multiplied together to yield the full likelihood. Although our example focused on a Poisson likelihood, the very same principles apply for the construction of all likelihood functions and can be simply extrapolated accordingly (e.g., if a normal distribution is chosen, then the likelihood will be comprised of N normal distribution functions multiplied together). Once these steps have been accomplished, the first term in Bayes’s theorem, $f (d a t a | p a r a m e t e r)$ , will have been specified. However, before the function underlying the posterior distribution is fully defined (the final goal of Bayesian estimation), the other term in Bayes’s theorem, the prior distribution— $f (p a r a m e t e r)$ —must specified as well. Similar to the likelihood function, it too will be modeled by a specific probability distribution, but does not provide information regarding the parameters based on data. Rather, its source of information is the analyst’s prior beliefs.

Source 2: Prior Knowledge

After the likelihood has been defined, the next step in Bayesian modeling is to specify a probability distribution that appropriately models one’s previous beliefs regarding the plausibility of the different parameter values. What constitutes these “previous beliefs” is ultimately a subjective choice made by the analyst as to how various parameter values are to be weighted at the outset of the analysis (i.e., before being combined with the likelihood function). In some cases, a robust literature on the topic of interest exists and can provide reliable information regarding how probable certain parameter values are relative to others. A meta-analytic effect size is one such source of information. A researcher may use this previous knowledge to specify an informative prior for a parameter—that is, a prior that appreciably affects the form of the posterior distribution. The upper and center panels of Figure 1 display such informative priors that differentially weigh various parameters values. Conversely, little may be known regarding the probability of different parameter values; all may seem equally likely, or only a small set may be reasonably excluded. Oftentimes, a researcher may have access to substantial previous research but choose to use a noninformative prior, a prior distribution that does not influence the form of the posterior (e.g., the bottom plot in Figure 1). This is done for the posterior to be totally determined by the likelihood (i.e., the data) so that inferences remain unaffected by this potential source of subjectivity. In fact, most published research to date has used noninformative priors (Lynch, 2007, p. x). In any case, prior distributions should be chosen judiciously, as they affect the final shape of the posterior and influence parameter estimation. However, the good news is that in Bayesian estimation one is not limited to a single choice; multiple models using priors with varying amounts of information can be specified and then compared (see Zyphur & Oswald, in press, for a further discussion of different priors and how they can be used to inform research).

However, if previous research is available, then it should generally be incorporated into the model to reflect the cumulative and iterative nature of science (Kruschke, 2011; Zyphur & Oswald, in press). This feature also represents a significant advantage of Bayesian estimation over standard frequentist procedures: Previous findings can be formally incorporated into each and every analysis. Critics of Bayesian statistics often demur to the subjective nature of prior distributions (Howson & Urbach, 1991, p. 372; see also Gelman, 2008), but these criticisms are baseless; statistics is inherently subjective (Berger & Berry, 1988), a choice that is subjective does not entail that it is arbitrary, and the information included within the prior should always be justifiable to a scientific audience (Kruschke, 2011).

Selecting a Prior Distribution

Irrespective of how much information is included in the prior, the first step in specifying this part of the Bayesian model is to select a probability distribution that appropriately reflects the properties of the parameter to be modeled—in much the same way the data are modeled by a distribution in the likelihood function. Returning to our example, the parameter of interest is λ, the mean number of deviant behaviors. What are the desirable characteristics of a distribution for modeling this parameter? The questions used for assessing the nature of the data can also be used here: For example, “Is this parameter univariate or multivariate?” “Continuous or discrete?” “Bound to a certain range?” To start, this particular parameter is univariate and continuous, as means are not limited to integers. We also know that this particular mean cannot be negative, as no individual can have a negative number of deviant behaviors. Selecting a prior distribution for this parameter is made easier by the fact that, just like when modeling data, there are conventional choices in Bayesian statistics for modeling parameters. Table 2 lists typical prior distributions for the parameters in different likelihood functions. Looking at this table, the most common prior distribution for modeling λ, the mean of the Poisson distribution, is the gamma distribution. This distribution is univariate, continuous, and bound at zero—all desirable characteristics for modeling this particular parameter.

Table 2.

Common Selections for Prior Distributions and Their Hyperparameters.

Likelihood Distribution	Associated Parameters in Likelihood	Standard Prior Distributions for Each Parameter	Hyperparameters^a Associated With Each Prior
Normal	1. μ: location parameter (mean) 2. σ²: scale parameter (variance)	1. Normal 2. Inverse gamma	1. μ: location parameter (mean); σ²: scale parameter (variance) 2. α: shape parameter; β: scale parameter
Student’s t	1. μ: location parameter (mean) 2. σ²: scale parameter (variance) 3. ν: degrees of freedom	1. Normal 2. Inverse gamma 3. Uniform	1. μ: location parameter (mean); σ²: scale parameter (variance) 2. α: shape parameter; β: scale parameter 3. a: location parameter; b: scale parameter
Bernoulli	1. p: probability of event (mean)	1. Beta	1. α: shape parameter; β: shape parameter
Binomial	1. p: probability of event (mean) 2. n: number of total trials	1. Beta 2. None (fixed from data)^b	1. α: shape parameter; β: shape parameter 2. None
Poisson	1. λ: both a location and scale parameter (mean and variance)	1. Gamma	1. α: shape parameter; β: rate parameter

Note: These priors represent many of the classic choices for modeling the parameters of likelihood functions. However, there are additional distributions that can be used, and the selection should always depend on the particular analytic context. For additional information on these distributions and others, see Gelman, Carlin, Stern, and Rubin (2004, Appendix A) and Lynch (2007, pp. 25-33).

^aHyperparameters are the parameters of prior distributions.

^bThe n parameter representing the number of total trials does not require a prior distribution to quantify uncertainty because it is given by the data.

Types of Parameters

However, the gamma distribution, like any probability distribution, can take on many different shapes. Though every distribution has its own specific, unchanging characteristics (whether it includes negative values, is continuous or discrete, etc.), there are variables in its underlying function that set its particular form. For instance, the univariate normal distribution function has two variables that define the specific form it will take: the mean (μ) and variance (σ²). Although it is always symmetrical, univariate, and continuous, varying either one of these values alters the ultimate form of the distribution: The mean determines the location of the center, and the variance determines its spread. These variables that determine the particular form of a probability distribution are themselves referred to as parameters. Although each distribution has its own unique parameters, they often perform similar functions and can usually be classified into one of three broad categories: (1) location parameters that determine where the distribution is located on the x-axis (e.g., μ, the mean of the normal distribution), (2) scale and rate parameters that define the spread of the distribution (i.e., increasing a scale parameter such as σ², the variance of the normal distribution, increases its width, while increasing a rate parameter, such as β of the gamma distribution, decreases its width), and (3) shape parameters that affect neither the location nor the width directly, but alter the shape of the distribution in a particular way (e.g., by altering its kurtosis or the width of its tails).

That said, Bayesian statistics requires not only a working knowledge of the different distributions available for modeling, but also how they are parameterized (i.e., their underlying parameters). Becoming familiar with the different types of distributions and their parameters is a necessary yet rewarding challenge for the student of Bayesian analysis. To help the beginner through this process, Appendix A contains conceptual descriptions of many distributions and their parameters, along with whether or not they are classified as a particular type (viz., location, scale/rate, or shape). Furthermore, Appendix C provides R code that enables the user to plot a number of distributions under different values of their parameters. Experimenting with various values promotes an intuitive understanding of how different parameter values affect the final graphical form of a distribution.

Hyperparameters: Defining Prior Knowledge

Returning to our example, we have decided that the gamma distribution is an appropriate choice to model the parameter of interest: the mean number of prior deviant behaviors by current job applicants. Its underlying algebraic function is

f (x) = \frac{β^{α}}{(α - 1)!} x^{α - 1} e^{- β x},

where α and β are the distribution’s parameters that define its form. The parameter α is a shape parameter that controls its skewness (as α increases the gamma distribution decreases in skew and approaches normality, as explained in Appendix A), and β is a rate parameter (i.e., as its value increases, the width of the distribution decreases). Because these are higher-level parameters—that is, parameters of the prior distribution that models our beliefs regarding a parameter—they are conventionally referred to as hyperparameters. This terminology can be somewhat confusing for fledgling Bayesian analysts: Parameters in the likelihood function (e.g., the mean number of deviant work behaviors) are modeled by prior distributions whose forms are themselves determined by additional, higher-order parameters.² However, the distinction is made because Bayesian models are hierarchical by nature, and using this terminology allows one to distinguish the parameters found at different levels. Figure 3 depicts the hierarchy of our current Bayesian model. At the bottom level, the data is modeled by the Poisson distribution which has a single parameter, λ, the mean. In turn, our prior beliefs regarding this parameter are modeled by the gamma distribution, which has its own parameters: the hyperparameters α and β.

Figure 3.

A hierarchical illustration of the current Bayesian model in the style of Kruschke (2011).

As varying either one of these variables will alter the ultimate form of this distribution, we wish to select specific values that will make it congruent to our prior beliefs regarding λ. That is, if we have prior knowledge, we may wish to make use of this by weighing certain parameter values more strongly at the outset of the analysis and fix the values of the hyperparameters accordingly. A suggested method of doing this is by plotting multiple distributions with the parameters varied each time. This allows one to see how the form of the distribution changes and promotes an intuitive understanding of the intimate relationship between its shape and its parameter values. Appendix C contains further R code for plotting overlaid distributions for comparison purposes.

We now return to our running example of estimating the population mean (λ) of the number of past deviant work behaviors for job applicants. Despite this being merely a hypothetical example, previous research could potentially inform the analysis. Dilchert, Ones, Davis, and Rostow (2007) reported the frequency of officially documented deviant work behaviors by individuals applying for police officer jobs. Approximately 77% of individuals had no officially documented cases, 9% had one case, 5% had two, and 7% had three or more. Other studies have found similar general rates of deviant behaviors (e.g., Bolton, Becker, & Barber, 2010; Mount, Ilies, & Johnson, 2006). This information can be used to define our prior beliefs regarding which parameter values are more probable relative to others. Based on this information, the most likely values for λ are quite low (the mean of the Dilchert et al., 2007, sample was around 0.4), so we would like to weigh these values more heavily. Weighing low values relative to higher ones entails that the distribution be significantly right-skewed. We therefore initially set the shape parameter that determines the skewness of the gamma distribution, α, to a very low value, say, 1. (As noted in Appendix A, the gamma distribution is increasingly skewed as this parameter is reduced.) We also would like the distribution to be somewhat wide to capture our uncertainty regarding this parameter; we have previous evidence for a low mean, but weighing certain values too heavily can influence the shape of the posterior more strongly than we would like. We therefore initially set the rate parameter β to another low value: 0.5. The form of this distribution is shown in the top panel of Figure 4. As can be seen, values below 2 are weighted very strongly with the probability density of larger values decreasing quickly, accurately reflecting the prior research findings as well as reflecting some of our subjective uncertainty regarding the true population value.

Figure 4.

An informative gamma prior distribution based on previous research findings and a noninformative gamma prior distribution for the same hypothetical example.

Our prior beliefs have now been formalized by this probability distribution: It strongly favors low values, while those that are farther away are increasingly less probable. As stated earlier, a common practice in Bayesian modeling is to use a noninformative priors to let the data fully dominate the form of the posterior distribution (e.g., the bottom panel of Figure 1). These are much simpler to specify because (a) they do not require precise calibration of the hyperparameters to reflect specific prior knowledge and (b) there are conventional noninformative priors that are used to model various parameters (e.g., the uniform distribution discussed in Appendix A). Often, specifying a noninformative prior simply requires increasing the spread of the distribution so that all plausible parameter values are weighted equally. This can often be done by either increasing the value of a scale parameter or, equivalently, by decreasing the value of a rate parameter (which, again, has the inverse effect of decreasing a scale parameter). For example, if one has chosen a normal distribution as a prior distribution for a parameter (e.g., a mean or regression slope), then setting its variance to a very large number (e.g., 10,000) will result in a very wide distribution where all plausible values are weighted equally. This either reflects the analyst’s total lack of knowledge regarding the potential parameter values or their desire to let the data alone determine the form of the posterior. Returning to our example, a noninformative gamma distribution is commonly specified by significantly reducing both the shape and rate hyperparameters to 0.001 (see Spiegelhalter, Thomas, Best, & Lunn, 2003). This noninformative gamma prior is shown in the bottom panel of Figure 4. Although this distribution seems to “spike” as it approaches zero, it is essentially flat for all plausible values of λ.

Now that both functions in Bayes’s theorem—the likelihood and prior—have been fully specified by different probability distributions, the posterior distribution of our example is defined. Again, the posterior distribution provides us with our conditional probability of interest, $f (p a r a m e t e r | d a t a)$ , the probability of all potential parameter values given the observed data, and is itself a function—one that is the product of the likelihood function and the function underlying the prior distribution. (See Appendix B for the full mathematical function that represents the posterior in our example.) Furthermore, because the posterior represents a “compromise” between these two terms, their combination will shift our beliefs away from the prior to the extent that is supported by the data, a highly attractive feature of Bayesian estimation.

Summary

We close this section by offering a general summary of the process of specifying a prior distribution. First, each parameter in the model must be defined by a prior distribution that reflects the amount of previous knowledge the analyst wishes to incorporate within the analysis. Selecting an appropriate distribution to model parameter is a similar process to modeling data in the likelihood function: The nature of the values to be modeled is the primary focus (e.g., discrete or continuous, bound at certain intervals). Once a suitable distribution has been selected, the parameters of this prior distribution (i.e., its hyperparameters) must be specified to reflect the amount of desired information included within the prior. Often, flat, noninformative priors are used for simplicity and to forestall potential criticisms regarding this “subjective” component (Howson & Urbach, 1991, p. 372; see also Gelman, 2008). However, if reliable information is available, this can and should be used to increase the precision of the analysis.

Part 2: An Introduction to MCMC Sampling

The Logic of Sampling

As can be seen in Appendix B, the final mathematical function that underlies the posterior distribution is very complex—even for a simple one-parameter model. This is very often the case, especially as model complexity increases. A consequence is that directly calculating summary statistics of this distribution is either very difficult or impossible using traditional methods, an obstacle that had historically inhibited Bayesian statistics and allowed the frequentist paradigm to become dominant (Dienes, 2011; Lynch, 2007, p. 2). Fortunately, Markov chain Monte Carlo (MCMC) simulation methods (e.g., Gelfand & Smith, 1990) provide an elegant solution to this problem and allow estimation of complex Bayesian models. Briefly, MCMC simulation involves starting an algorithm that draws parameter values at random from the posterior distribution. As the number of samples becomes increasingly large (i.e., thousands), the entire set of samples will begin to approximate the actual posterior distribution. (Imagine a histogram to which thousands of values are added until it begins to approximate a smooth distribution.) The utility of this method is that computing summary statistics from a large set of values is a trivial task compared to conventional methods (viz., calculus), and these statistics will approximate those of the actual posterior with great accuracy. The trick of this process is to ensure that (a) the algorithm is actually drawing samples that are representative of the posterior (i.e., that most values are drawn from its center and not its tails) and (b) that a sufficient number of samples have been collected to derive accurate parameter estimates.

“Burn-in,” Convergence, and Mixing

These two criteria are essential for proper Bayesian estimation and are satisfied when the MCMC algorithms display what is referred to as convergence and mixing—the two primary concerns associated with MCMC sampling (Lynch, 2007, p. 132). Briefly, when the Markov chain algorithm begins, it starts drawing values at an initially specified point and then “jumps” around the posterior distribution, the next drawn value determined only by its current state. However, it may take a number of steps before the algorithm reaches the desired regions of the posterior (i.e., its bulk). As a result, many of the initially drawn samples are not used for estimation because they are biased due to the starting position of the algorithm. Instead, they are discarded in what is referred to as the burn-in period (the samples are “burned”). Ideally, the length of the burn-in period should be a function of how many steps it takes for the algorithm to begin drawing from the appropriate regions of the posterior, but this can be difficult to estimate as it varies with both model complexity and the starting position of the algorithm. However, burn-in periods of 5,000, 10,000, or even 20,000 samples are common, and the good news is that there are no drawbacks for discarding an indefinitely high number of early samples (other than longer waiting times). Since thousands of samples can usually be collected within minutes, it is usually wise to specify a longer than necessary burn-in period.

After the burn-in period, the subsequently drawn values are then saved and used to summarize the parameters of the posterior distribution. The goal here is to ensure that these samples are representative of the posterior. That is, the algorithm must display convergence—that the algorithm has reached the central regions of the posterior and has drawn representative values (viz., mostly from the bulk of the distribution with appropriately less from its tails). The time it takes for an algorithm to “converge” is, relatedly, a function of model complexity and the starting values of the Markov chains (i.e., the algorithms). Since there are no real drawbacks for saving very large numbers of samples, it is good practice to save many thousands of values to ensure accurate estimates (e.g., at least 10,000). It is also important to note that one can start multiple chains at different points in the posterior. When these chains have converged, they will all sample from the same general regions of the posterior. Using multiple chains (e.g., 2-4) is always recommended (Lynch, 2007, p. 148), as the performance of one can be compared to the others, allowing for better assessments of convergence.

After a chain has converged, it should then display adequate mixing. While convergence represents a point in the sampling process, mixing is a characteristic of the algorithm in regard to the rest of the drawn values. Proper mixing occurs when the algorithm has sampled thoroughly from the range of the posterior distribution—that is, adequately sampling from both its center and tails to reflect the entire posterior.

Methods for Assessing MCMC Performance

Informal Methods

Assessment of Markov chain algorithm performance (viz., convergence and mixing) is accomplished through both formal and informal methods. It is important to note that there is no single, definitive method that guarantees that proper MCMC sampling has occurred (Lynch, 2007, p. 135). However, a combination of both formal and informal methods can provide sufficient evidence for proper MCMC performance. Despite being more subjective, informal methods need not be any less rigorous than those that are formal. In fact, these are often the best methods for diagnosing good or poor chain performance. Primarily, this entails examining trace plots. A trace plot displays the value of each sampled parameter value (y-axis) at each step in the chain (x-axis) and can provide evidence for both convergence and mixing. Convergence is evinced when the chains appear to “settle” in the same general region, indicating that they have reached the central regions (i.e., the bulk) of the distribution. Mixing is indicated when the chains exhibit enough variability (i.e., vertical movement) around this region to sample properly from the entire posterior. Therefore, each chain ought to not only consistently sample around a general region (its bulk), but also display variability around it. Figure 5 depicts two trace plots of different parameters whose posteriors have each been sampled by two chains started at different values. The two chains in the upper plot (represented by the two lines) have clearly not reached convergence, as they are completely isolated; converged chains will always overlap, as they sample values from the same general region of the posterior. In contrast, the chains in the lower plot appear to have converged almost immediately, as both have settled around the same general region and are consistently sampling comparable values. They also demonstrate good mixing as there is adequate variability around this region. To ensure proper estimation, thousands of samples should be taken regardless of how quickly the chains seeming to converge. Note that these plots only show the first 400 values sampled by the chains. This should emphasize how variable convergence rates can be across different models and even for parameters within the same model. (The two parameters displayed here are from the same Bayesian model.)

Figure 5.

An example of trace plots generated by WinBUGS that show the value of the sampled parameter value (y-axis) at each step in the chain (x-axis). Both plots contain two Markov chains each represented by a different line.

Formal Methods

Formal methods for diagnosing convergence and mixing involve the calculation of statistics. When multiple chains are used, the Brooks–Gelman–Rubin convergence diagnostic (BGR; also called the potential scale reduction factor; Gelman & Rubin, 1992) is a common method for assessing whether or not the Markov chains have reached convergence (i.e., the bulk of the posterior). Essentially, this statistic partitions the variance of the sampled parameter values into its between and within-chains variance. Significant between-chain variance is undesirable because it indicates that the different chains are sampling notably different parameter values and implies that at least one chain has not yet reached convergence. When the chains have converged, there should be very little between-chain variance because they will be sampling similar values. Conversely, within-chain variance is desirable, as it indicates mixing—that is, there is variability in the samples drawn by the chains to represent the entire posterior. The BGR convergence diagnostic is simply a ratio of the total variance (both within and between-chains) to the within-chains variance of the sampled parameter values. Thus, when this ratio approaches 1 (e.g., < 1.10), this indicates that there is little between-chain variance and provides evidence that the chains have converged.

A formal method for determining if proper mixing has occurred, and also if enough samples have been collected, is to calculate the Monte Carlo error for each parameter. This statistic assesses the accuracy of the summary statistics that would be calculated from the current set of samples (Spiegelhalter et al., 2003). A general rule is that the number of collected samples is sufficient if the Monte Carlo error is below at least 5% of the standard deviation of the set of samples drawn for each parameter (Spiegelhalter et al., 2003).

The aforementioned methods represent the primary tools for assessing MCMC performance; identifying convergence and mixing (or lack thereof) are often accomplished through these diagnostics. However, other supplementary methods exist, and for a more additional information regarding MCMC diagnostics the reader is directed to Lynch (2007, chap. 6).

Conclusion on MCMC Simulation

As can be seen, the specifics behind MCMC sampling are nearly as complex and jargon-laden as Bayesian model construction itself (i.e., specifying the likelihood and prior). Therefore, we feel compelled to state that, although important, MCMC methods are somewhat orthogonal to the actual process of Bayesian modeling; they are merely practical tools for providing descriptive statistic of the parameters in the posterior distribution. The beginner’s attempts at learning Bayesian statistics can be undermined by premature exposure to the mathematics behind these algorithms, when all that is really required is a general conceptual understanding of MCMC sampling and methods for performance diagnostics (as is offered here). When first learning Bayesian estimation, what is far more important is gaining familiarity with the different probability distributions that are available for modeling and how these should be used in specific applications. That is, one should achieve a sound understanding of model specification before investigating the particulars of MCMC sampling. Finally, although the specific MCMC algorithms and their underlying mathematics lie beyond the scope of this article, the interested reader is guided to the following discussions of the topic: Gelman et al. (2004, p. 290), Jackman, (2000), Kruschke (2011, chap. 7), and Lynch (2007, p. 88).

Part 3: Summarizing and Interpreting the Posterior Distribution

Once enough representative samples have been collected via MCMC simulation, the task for the Bayesian analyst becomes providing descriptive statistics that appropriately summarize the information contained in the posterior. In contrast to the estimates provided by a frequentist analysis, Bayesian estimation yields an entire probability distribution of potential parameter values. This sheer amount of information allows for flexible summaries and more comprehensive inferences. In the frequentist paradigm, it is standard practice to report (a) a point parameter estimate, (b) a confidence interval, and (c) the results of a significance test. However, because a posterior distribution contains such a wealth of information, the best methods for summarizing the results are inherently more ambiguous. This is especially the case when the posterior displays ambiguity—that is, it is widely spread out over many plausible values. We therefore offer the following principles which the reader can use to adequately summarize a Bayesian posterior distribution. For additional information we direct the reader to Kruschke (2011), Lynch (2007), and Zyphur and Oswald (in press).

Similar to the conventions of frequentist reporting, the results of a Bayesian analysis should always include a point and interval estimate of each parameter. For general point estimates, the mean, median, or mode of the posterior distribution may be used. When the posterior is heavily skewed, the median or mode may better reflect the most plausible value, as the mean may not provide the best measure of central tendency. In any case, a point estimate should always be supplemented by an interval estimate, such as a 95% credible interval. Credible intervals are similar to frequentist confidence intervals in that they are equal-tailed: A 95% credible interval excludes the 2.5% of each end of the posterior distribution. However, unlike confidence intervals, they allow for direct probability statements (Howell, 2013, p. 194); a 95% credible interval of 1.22 to 3.31 means that there is a 95% probability that the true parameter value lies within that interval. This stands in contrast to the unintuitive meaning of the frequentist confidence interval—that an interval constructed in the same way will contain the parameter value 95% of the time. (As statisticians say, the confidence lies in the method, not the numbers themselves; Good, 1999.) Confidence intervals also do not provide precise measures of probability for the values within the interval (Kruschke et al., 2012), while this feature is intrinsic to any intervals constructed from a posterior distribution.

Point and interval estimates represent merely the beginning of what the analyst can do with a posterior distribution. One of the most beneficial features of Bayesian estimation is that practical significance becomes the focus of the analysis. Often in research, one is interested in whether a parameter estimate is sufficiently different than zero to conclude that a practically significant effect exists. Because statistical significance (a) provides no direct information regarding effect size, (b) is guaranteed given a large enough sample size, and (c) is not directly related to practical significance, it is not a sufficient basis for answering this question that bears on our substantive hypotheses. With a Bayesian posterior, practical significance can be assessed by defining a range around the null value that constitutes the “region of practical equivalence” (the ROPE)—the values around the null that are substantively considered to be practically nonsignificant (Kruschke, 2011). One can then measure how much of the posterior distribution lies within and on either side of this critical region (see Kruschke, 2011, 2013). For instance, if 75% or 95% of the posterior lies outside of the ROPE, then this provides considerable support for a practically significant effect. Conversely, if the ROPE contains 50% or more of the distribution, then this indicates that a practically significant effect is unlikely (i.e., the effect size is very small or practically nonexistent), especially if the remaining 50% is split on either side of the ROPE. In any case, the relative confidence in a conclusion should reflect the degree of support provided by the posterior. Furthermore, although defining the ROPE represents a subjective choice on the part of the analyst, it should be based on sound reasoning and prior theory and be justifiable to a scientific audience. As an alternative to summarizing the preceding section, Table 3 provides a checklist that outlines each step in performing Bayesian estimation—from model specification to interpreting the posterior. It is a concise summary of the previous three sections.

Table 3.

Outline of the Essential Steps for Conducting and Reporting Bayesian Estimation.

Overall Process of Bayesian Estimation	Specific Steps	Considerations
Phase 1: Model specification	Step 1: Select a distribution for the likelihood function	1. Examine the nature of the dependent variable(s). Determine if it is restricted to a certain range of values, whether the observations are multivariate or univariate, and if they are continuous or discrete (see Table 1). 2. Depending on these characteristics, select a probability distribution that best represents the population distribution that plausibly generated each observation (i.e., select a population model for the likelihood). 3. The full likelihood is constructed after multiplying the function underlying the chosen distribution with each data value inserted as x. However, this step is nearly always performed by the software of choice.
Phase 1: Model specification	Step 2: Specify a prior distribution for each parameter in the likelihood	1. Assess the nature of the free parameters in the likelihood function and, again, determine if they are limited to certain ranges, univariate or multivariate, and continuous or discrete (see Table 2). 2. Select a prior distribution that reflects these characteristics and appropriately models the parameter (e.g., the beta distribution for a probability; a normal distribution for a mean). 3. For each free parameter, decide whether to specify an informative or noninformative prior distribution. 4. If informative priors are used that may affect substantive inferences, they should be defended by prior theory and/or research. 5. Fix the hyperparameters of the prior distributions to accurately reflect the desired amount of previous knowledge (if any) regarding each free parameter.
Phase 2: MCMC sampling	Step 3: Specify the MCMC sampling	1. Specify how many Markov chains are to begin sampling values from the posterior distribution. Two chains are usually sufficient, but for very complex models more chains can be beneficial for assessing convergence. 2. Decide how many samples are to be discarded in the burn-in period. The number of samples that should be “burned” varies with model complexity, and because there are no drawbacks for discarding many values, it is best to specify a long burn-in period (e.g., at least 5,000 samples). 3. Decide how many subsequent samples are to be saved for the descriptive statistics of the estimated parameters to be accurate. Again, the number of samples required varies with model complexity, and without restrictions, it is good practice to draw more of samples than is probably necessary (e.g., at least 10,000).
	Step 4: Perform MCMC diagnostics	1. After the chains have drawn their values, assess their performance through informal and formal methods to ensure that both convergence and proper mixing have occurred. 2. Informally, trace plots should be inspected to determine if the chains have settled around the same general region (convergence) and that there is some variability within the drawn values to represent the entirety of the posterior (mixing). 3. Formally, calculate the BGR convergence diagnostic and the Monte Carlo error for each parameter. The BGR should approach 1, and the Monte Carlo error should be less than 5% of the standard deviation of the saved samples for each parameter. 4. If the performance of the MCMC algorithms is revealed to be suspect through any of these methods, the final parameter estimates may not accurately reflect the posterior, and the sampling should be conducted again with a longer burn-in period and/or an increase in the number of saved samples.
Phase 3: Report the results	Step 5: Describe the model specification	1. State the distribution that was used to model the data in the likelihood function. 2. Report what prior distributions were selected to model each free parameter, how they informed the analysis (e.g., strongly or not at all), and what the hyperparameters were set as to model these prior beliefs. 3. Describe the process of the MCMC simulations (viz., the number of chains, length of the burn-in period, the number of saved samples) and provide evidence that proper sampling has occurred.
	Step 6: Summarize the posterior	1. Visually assess the shape of the posterior to guide its summarization (e.g., “Is it strongly skewed? Does it have a very large variance?”). 2. Provide point estimates of the posterior, such as the mean, median, or mode. These estimates will often align, but may not if the posterior is significantly skewed. Ensure that the point estimate used for theoretical interpretation represents the most plausible candidate value of the parameter of interest. 3. Supplement the point estimates with interval estimates that indicate a range of plausible values. Using 95% intervals is standard practice, but smaller intervals can provide added precision and insight. 4. If one is interested in whether or not an effect is practically significant, a region around the null value that constitutes the range of practically equivalent values can be defined (i.e., the ROPE). One can then assess and report how much of the distribution falls within, above, and below this region, providing evidence for either practical (i.e., substantive) significance or nonsignificance.
	Step 7: Interpret the findings	1. Interpret the point and interval estimates in accordance with the ROPE (if one has been specified). 2. Make substantive conclusions that are consistent with, and accurately reflect, the large amount of information contained in the posterior distribution.

Part 4: The Past, Present, and Future of Bayesian Software

We now turn to an issue of practical interest. We believe that a significant obstacle facing the spread of Bayesian methodologies is the state of currently available software. Although the logic of Bayesian estimation is intuitive (Dienes, 2011; Kruschke, 2013), its contemporary software is not. Programs such as WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000), R (R Development Core Team, 2012), MATLAB (MathWorks, 2012), JAGS (Plummer, 2003), and Stan (Stan Development Team, 2014) require extensive knowledge of how to code a Bayesian model within the conventions of a particular syntax. Organizational researchers often lack robust training in software coding, and this additional requirement can be just as onerous as learning the concepts and mathematics behind Bayesian estimation.

To date, the most popular and widely used software for Bayesian statistics is WinBUGS. This program has many excellent features and is relatively user-friendly: The program is largely menu-driven (i.e., it is based in a GUI), and specifying the MCMC sampling process, MCMC diagnostics, and summarizing the posterior are accomplished by simply clicking the appropriate buttons in a dialog box. Accordingly, it serves as the companion software program to many seminal introductory Bayesian texts, including Bayesian Data Analysis (Gelman et al., 2004), Doing Bayesian Data Analysis: A Tutorial with R and BUGS (Kruschke, 2011), Bayesian Modeling Using WinBUGS (Ntzoufras, 2009), and The Bugs Book: A Practical Introduction to Bayesian Analysis (Lunn, Jackson, Best, Thomas, & Spiegelhalter, 2012).

However, even this program requires some programming know-how, as the probability model (i.e., prior and likelihood) must be specified in a complex syntax. Unfortunately, amid the conventions of the code it can be difficult to see how the priors and likelihood (i.e., the statistical model) are actually represented. In addition, the data must be written in a very specific format, and minor mistakes in either the model code or data format invariably result in compilation errors, the causes of which can be very difficult to locate. Thus, software implementation may actually represent the most substantive barrier to the spread of Bayesian methods and undoubtedly contributes to its anemic rate of growth in the organizational sciences, quantified by both Kruschke et al. (2012) and in the introduction of the present article.

BugsXLA

As an attempt to remedy this situation, we present an introduction to a novel and exciting addition to the repertoire of Bayesian statistical software: BugsXLA (Woodward, 2011a, 2011b).³ Though currently in a beta stage of development, this free software is an entirely GUI-driven add-in for Excel that outsources the actual MCMC simulation process to WinBUGS. That is, everything is specified within an Excel GUI (the likelihood function, prior distributions, number of MCMC samples, etc.), but the actual MCMC simulation is driven by the WinBUGS engine. After the sampling is over, the results—both the thousands of parameter samples and their summary statistics—can be imported directly into a new Excel spreadsheet, allowing for easy summaries and plots of the parameters in the posterior distribution. We believe that this software represents the future of Bayesian statistics in the social sciences, and it has been described as “Bayes for the common man” (Woodward, 2005). Indeed, there is an impressive array of features that make this software both incredibly versatile and user-friendly.

Its strong relationship to WinBUGS allows the user to utilize the best features of both. The model specified in the BugsXLA GUI automatically generates a WinBUGS model script so that an individual who is comfortable with the WinBUGS syntax can model in additional complexities if needed. In addition, although BugsXLA itself has no current functions for MCMC diagnostics, these can be performed in WinBUGS using its GUI. Finally, data that had been previously specified in the WinBUGS format can be imported into Excel and then promptly used by BugsXLA.

BugsXLA automatically generates text files of the results which can then be directly imported into R for further analyses and diagnostics. R is an open-source statistical program that is the software of choice for many statisticians and has become increasingly popular within the organizational sciences (Culpepper & Aguinis, 2011). There are a number of R packages that assist Bayesian analyses, a list of which can be found at http://cran.r-project.org/web/views/Bayesian.html. In addition, BugsXLA contains an option to automatically generate an R script containing premade code for plotting the MCMC samples and conducting a comparable frequentist analysis.

At each window in the BugsXLA GUI is a “Help” button that offers explanations of each function and option within each dialog box. It is immensely helpful, as the terms by themselves in the GUI can seem initially vague.

In addition, there are many other attractive features of this software (e.g., data are automatically standardized to assist MCMC sampling), and future editions will provide more options for the user, allowing for more flexibility in model and MCMC specification (P. Woodward, personal communication, April, 9, 2014). A general walkthrough of the program is available at http://www.philwoodward.co.uk/bugsxla/walkthrough.html. However, an individual new to Bayesian estimation may still find this walkthrough overwhelming, as there are many novel concepts present in both Bayesian estimation and MCMC simulation. To this end, we now present a walkthrough of how a Bayesian logistic regression is coded in BugsXLA using an example relevant to the organizational sciences.

Our Empirical Example

Our example uses data from Woo’s (2011) study, which identified a subpopulation of U.S. employees as “job hobos” (Ghiselli, 1974). These workers were characterized by their high rates of voluntary turnover (M_hobos = 5.05 vs. M_overall = 2.42) and positive attitudes toward quitting (M_hobos = 4.40 vs. M_overall = 2.07). In the original article, frequentist logistic regression analyses were performed linking hobo syndrome (operationalized as a dichotomous variable of latent cluster membership) with two sets of variables. In the first logistic regression model, seven distinct motivational forces of withdrawal (i.e., forces that drive employees away from the organization; Maertz & Griffeth, 2004) were entered as predictors of hobo syndrome: affective (lack of emotional attachment to the employer), alternative (employment opportunities available elsewhere), calculative (negative prospect of long-term goal attainability), contractual (lack of felt obligations to the employer), behavioral (low costs of leaving), normative (family demands and expectations), and constituent forces (lack of attachment to groups or individuals within the organization). The probability of belonging to the hobo cluster was hypothesized to be positively related to these withdrawal motives, as individuals with hobo syndrome were expected to display “lower levels of attachment or commitment to a particular organization compared to other workers” (Woo, 2011, p. 465). A second logistic regression model used two personality predictors thought to be positively linked to hobo syndrome: openness to experience and impulsivity. We reanalyzed these data using Bayesian estimation in BugsXLA and now present a general walkthrough of how this was conducted. The data for this example are available at http://www1.psych.purdue.edu/˜sewoo/Bayesian.php, and the results of these analyses are presented in Part 5.

BugsXLA Walkthrough

After installation, BugsXLA is represented in Excel via a toolbar with several icons. Data should be structured in an Excel spreadsheet with a variable in each column and the rows containing observations. The first row of each column should contain the variable’s name. In our data set, the dependent variable was titled “Hobo” and the predictors were also intuitively labeled (e.g., “Constituent” and “Impulsivity”). Once the data has been formatted, the model (i.e., likelihood and priors) can be specified through the “Model Specification” button in the toolbar which brings up the dialog box depicted in Figure 6.

Figure 6.

The model specification window in BugsXLA for the first logistic regression model.

The first step is to specify the data used in the Bayesian analysis by clicking the “Data Range” box and then manually selecting the data in Excel. Next, each dependent and independent variable must be defined as a “Factor” or a “Variate” using the “Set Variable Types” button. “Factors” represent categorical variables, “Variates,” continuous variables. In our example, all nine predictors in both models were continuous and therefore set as “Variates.” Our dependent variable was categorical (viz., dichotomous—whether or not each individual had been classified as a job hobo) and was therefore defined as a “Factor.” Importantly, if a factor is included in the model, one must then go on to define its levels—that is, tell BugsXLA the specific factor levels that correspond to the dummy values in the raw data (e.g., 1, 2, 3). Our dichotomous data consisted of 1s (representing event occurrence) and 0s (representing nonoccurrence). Importantly, for a dichotomous factor (such as ours), event occurrence (i.e., 1) must be set as the first level in BugsXLA, nonoccurrence (i.e., 0), the second level.

After the variables have all been defined, the actual Bayesian model can then be specified. The first step in doing this, as described in Part 1, is specifying a probability distribution that appropriately models the data (i.e., constructing the likelihood function). Our data consisted of single dichotomous outcomes, and looking at Table 1, a standard distribution for this type of dependent variable is the Bernoulli distribution. As described in Appendix A, the Bernoulli distribution is merely a special case of the binomial distribution. Accordingly, the Bernoulli distribution can be specified in BugsXLA by selecting the binomial distribution in the “Distribution” box (see Figure 6). The “Response” section contains the name of the dependent variable (here, “Hobo”), and the dummy value that denotes instances of the event (viz., 1).

When using a generalized linear model (as in logistic regression) the “Link” option sets the type of link function to be used, and we used the standard logit function. Finally, specifying predictors in the regression equation is accomplished in the lower half of the window shown in Figure 6. In our example, the first model included the seven motivational forces of withdrawal as continuous predictors, and these were entered in the “Independent” row of the “Covariates” section using their predefined variable names: “Affective + Alternative + Behavioral + Calculative + Normative + Contractual + Constituent.” The regression equation of the second model was written as: “Openness + Impulsivity.” The intercept is included automatically, and interactions (if desired) can be specified using the “*” symbol (e.g., Openness * Impulsivity).

After defining the likelihood, the MCMC sampling process is then specified using the “MCMC and Output Options” button. This brings up a dialog box in which the number of chains, burn-in period length, and number of subsequent samples can be set. Because in the current example, both models were not particularly complex, two Markov chains were used with a burn-in period of 10,000 samples. An additional 20,000 samples were specified to be drawn from the posterior to serve as the basis for the parameter estimates. This window also allows for R scripts to be generated and if the MCMC samples are to be imported from WinBUGS into a new Excel spreadsheet. To plot the approximated posterior distributions using BugsXLA, these samples must be imported, so this option was selected in the current analysis.

To complete specification of the Bayesian model, the prior distributions for the regression parameters must be defined. When one is finished specifying the likelihood and MCMC options, clicking “OK” will bring up a dialog box that allows a GUI specification of these prior distributions and their hyperparameters. For convenience, default noninformative priors are provided by BugsXLA and were used for the present analysis (as there is minimal prior research in this area). Once the prior distributions for each parameter have been set, clicking the “Run WinBUGS” icon will open WinBUGS and automatically begin the MCMC simulation process. Once this is finished, the user can utilize the WinBUGS GUI for MCMC diagnostics (clicking on the “Samples” option located under the “Inference” heading).⁴ After one has analyzed and saved these results, exiting the WinBUGS program will prompt the statistics and samples resulting from the analysis to be imported directly into Excel.

As a final note, it is unlikely that the reader will feel completely comfortable running Bayesian analysis in BugsXLA after this brief tutorial. Instead, this walkthrough is best used as a basic resource of how Bayesian models are generally specified. A prerequisite is a sound understanding of model construction, and to that end we recommend perhaps the two most accessible introductory texts to Bayesian estimation: Kruschke (2011) and Lynch (2007). We also recommend the companion book to BugsXLA: Bayesian Analysis Made Simple: An excel GUI for WinBUGS (Woodward, 2011a). It is our experience that other purportedly introductory texts are appropriate for a statistics audience but were not written for organizational and social science researchers with limited statistical and programming backgrounds.

Part 5: An Example of Reporting the Results of Bayesian Estimation

Frequentist Analysis

We now present the (a) raw statistical results and (b) substantive findings of both the Bayesian and original frequentist analyses. It will be shown that Bayesian estimation is superior in the former and thus supersedes traditional NHST-centered frequentism in the latter. Data represent the formalization of our constructs, and statistical analyses are where our related hypotheses and theory are formally tested and explored. Thus, superior statistical methodology is directly linked to a greater ability to generate theoretical developments. Juxtaposing the two paradigms in this section will make these claims evident.

Before presenting the results of the Bayesian analyses, we first describe the frequentist results of Woo (2011) which are summarized in Table 4. For ease of interpretation, the standardized slopes have been converted to odds ratios, where an odds ratio of 2.50 means that a one-standard deviation increase in the predictor variable results in an increase in the odds of hobo cluster membership by a factor of 2.5. (And 1 is the null value; that is, either outcome is equally likely given this increase in the predictor.)

Table 4.

Comparison of Logistic Regression Results Across Statistical Paradigms.

	Frequentist NHST		Bayesian Posterior Summaries				ROPE
Model 1	β	95% CI for β	M	Mdn	SD	95% Interval	% Within	% Below	% Above
Affective	0.65^†	[0.42, 1.01]	0.64	0.64	1.26	[.41, 1.00]	5.48	93.75	0.77
Alternative	1.38^†	[0.95, 2.01]	1.40	1.40	1.21	[0.97, 2.08]	9.18	1.04	89.78
Behavioral	1.57*	[1.07, 2.30]	1.60	1.60	1.22	[1.09, 2.38]	2.59	0.08	97.33
Calculative	1.26	[0.72, 2.23]	1.26	1.26	1.34	[0.71, 2.27]	20.09	12.29	67.62
Normative	1.18	[0.82, 1.70]	1.18	1.18	1.21	[0.81, 1.73]	27.53	7.65	64.82
Contractual	3.68**	[2.11, 6.41]	3.79	3.77	1.34	[2.19, 6.82]	0	0	100
Constituent	1.46^†	[0.98, 2.18]	1.49	1.48	1.23	[1.00, 2.25]	5.82	0.62	93.56
Model 2	β	95% CI for β	M	Mdn	SD	95% Interval	% Within	% Below	% Above
Openness	1.64**	[1.17, 2.30]	1.19	1.18	1.18	[0.86, 1.64]	27.74	4.46	67.80
Impulsivity	0.80	[0.57, 1.11]	0.96	0.97	1.18	[0.70, 1.32]	46.32	32.84	20.84

Note: Hosmer–Lemeshow test for Model 1: χ² = 2.588, df = 8, p = .957. Hosmer–Lemeshow test for Model 2: χ² = 6.445, df = 8, p = .597. The Bayesian intervals are credible intervals. For ease of interpretation all predictor variables were standardized, and the regression slopes (β) represent odds ratios.

^† p < .10, *p < .05, **p < .01.

In the two frequentist regression models, a total of three out of the nine predictors reached statistical significance at α = .05. Hobo syndrome was predicted by high levels of contractual forces (β = 3.68, 95% CI [2.41, 6.41], p < .001), behavioral forces (β = 1.57, 95% CI [1.07, 2.30], p = .02), and openness to experience (β = 1.64, 95% CI [1.17, 2.30], p = .004), consistent with the author’s theoretical expectations. The constructed confidence intervals also indicated the precision of these estimates. However, the majority of the results were ambiguous and reflect the inefficiency of frequentist regression. Three predictors, affective (β = 0.65, p = .05), alternative (β = 1.38, p = .09), and constituent forces (β = 1.46, p = .06), were significant only at the α = .10 level. Thus, they were “approaching significance” or “marginally significant” (whatever that may mean). Other predictors had higher p values, for example, calculative (β = 1.26, p = .42), normative (β = 1.18, p = .38), and impulsivity (β = 0.80, p = .18), but were still not highly probable under the null.

As inferences cannot be made when one fails to reject the null (Fisher, 1935), these nonsignificant parameter estimates were ignored, given only a passing mention in the results section (Woo, 2011). This was the case even for slopes similar in magnitude to those that reached significance, yet failed on account of their larger standard errors (e.g., constituent: β = 1.46, p = .06 vs. behavioral: β = 1.57, p = .02). Thus, the ritual of “nil hypothesis” testing (Gigerenzer, 2004; Gigerenzer, Krauss, & Vitouch, 2004) inhibits inferences by inefficiently assessing whether an effect is an artifact of sampling error.

Importantly, statistically nonsignificant parameter and effect size estimates are becoming increasingly unreported in many scientific literatures (Fanelli, 2012). Because, ceteris paribus, statistically nonsignificant effects are generally smaller, this induces a publication bias toward larger effects (Kepes & McDaniel, 2013; O’Boyle, Banks, & Gonzalez-Mulé, in press), a problem compounded by the chronically low power to find small effects (Cohen, 1962; Maxwell, 2004). Meta-analysis is a powerful tool that can be used by both frequentists and Bayesians alike that integrates previous research findings (Cumming, 2014; Kruschke et al., 2012; Le, Oh, Shaffer, & Schmidt, 2007). However, the significant publication bias resulting from NHST most likely inflates these meta-analytically derived estimates (Cumming, 2012; O’Boyle et al., in press).

The reason for these many inefficiencies lies in the convoluted process of NHST and meaning of the p value; failing to reject the null is not evidence for it (Wagenmakers, 2007), though it is all too frequently reported as such (Cumming, 2012; Hoekstra, Finch, Kiers, & Johnson, 2006). Even so, many researchers are fully aware of this distinction, yet believe that the one is strongly related to the other: that is, when $p (d a t a | n u l l)$ is relatively low, then it is safe to say that the null is plausibly false (Burdette & Gehan, 1970; Hubbard & Bayarri, 2003). However, this is not necessarily the case. Cohen (1994) demonstrated that, for a fictitious schizophrenia test that accurately identifies true cases at least 95% of the time, the probability of a positive result given that the individual is healthy ( $p (d a t a | H_{0}))$ is p < .05. However, using Bayes’s theorem, he calculated that the probability of an individual being healthy given a positive test result ( $p (H_{0} | d a t a))$ was as high as .61. That is, even given a “statistically significant” result, the null hypothesis had a 20% higher probability than the alternative (for an example applied to employee selection, see Kruschke et al., 2012, p. 726). Thus, the surface similarity of $p (d a t a | H_{0})$ (i.e., a p value) and $p (H_{0} | d a t a)$ (what researchers want to know and what Bayesian estimation provides) is misleading.

Furthermore, even statistically “nonsignificant” predictors surely have some relationship to the criterion variable and perhaps a substantial one; the “nil hypothesis” of an absolutely zero effect size is virtually always false (Cohen, 1994; Tressoldi, Giofré, Sella, & Cumming, 2013), despite this never being part of the original logic of NHST (Fisher, 1955, 1956; Gigerenzer, 2004). Thus, although the NHST-based frequentist approach can and has offered valuable insights, it is founded on questionable logic, leads to ambiguous results, and is ultimately inefficient relative to the investments made by researchers.

Bayesian Analysis

Model Specification and MCMC Diagnostics

When reporting a Bayesian analysis, there are several components that must be communicated (Anderson, Link, Johnson, & Burnham, 2001; Kruschke, 2011, chap. 23): One must first describe the model for the data, the choice of priors, and the process and diagnostics of the MCMC sampling (see Table 3). In our analyses, the dependent variable was dichotomous and modeled in the likelihood function by the Bernoulli distribution. The intercept and slope parameters were given noninformative priors due to the lack of previous work in this subject area. They were modeled by the univariate normal distribution with a mean (μ) of 0 and variance (σ²) of 10,000 to ensure a very wide distribution so that all plausible parameter values were weighted equally at the outset of the analysis. The first 10,000 MCMC samples were discarded in the burn-in period, and posterior inferences were made after an additional 20,000 samples were collected. The Monte Carlo error for each parameter was less than 5% of its standard deviation (range: 0.01-1.28%), indicating that enough samples had been collected for accurate estimation. Both regression models converged very quickly, as evidenced by trace plots and a sufficiently low BGR convergence diagnostic. (For all parameters it was < 1.07 by the end of the burn-in period.) Finally, proper mixing was also evidenced by trace plots and the low Monte Carlo error for each parameter.

The Current Results

Before making Bayesian inferences, it is useful to first visually assess the posterior distribution approximated by the set of MCMC samples so that appropriate and theoretically relevant descriptions can be made. We therefore begin our analysis by looking at the posterior distributions of the three regression slopes that were statistically significant in the frequentist analysis: those relating hobo syndrome to contractual forces, behavioral forces, and openness to experience, displayed in Figure 7. The BugsXLA tool for posterior plotting juxtaposes a parameter’s prior distribution with the approximated posterior, showing how one’s initial beliefs have been updated by the observed data (i.e., the “Bayesian process” discussed in Part 1). Because our analyses used flat, noninformative priors, the posterior effectively represents the likelihood function—that is, how the data differentially weigh candidate parameter values. It is also important to note that the y-axis in these plots does not display probability proper because these distributions only approximate the posterior through the set of samples drawn by the Markov chains. In other words, it is a histogram that, given proper convergence and mixing, will accurately represent the posterior, the “relative frequencies” of the y-axis indicating how probable a particular parameter value is relative to the others (i.e., it is functionally equivalent to measures of relative probability).

Figure 7.

BugsXLA posterior plots of the contractual, behavioral, and openness slopes. The black line at the bottom of the distribution is the “flat” noninformative prior, and the darker area constitutes the region 0.90-1.10, which was defined as practically equivalent to a null effect (the ROPE).

Because we are interested in the presence of practically significant effects, we defined the region of practical equivalence (the ROPE) as the range 0.90 to 1.10. That is, for a predictor to be “practically significant,” a standardized one-unit increase must make the odds of hobo syndrome increase or decrease by at least 10% (i.e., an odds ratio greater than 1.10 or less than 0.90, respectively). The approximate region of the ROPE is depicted by the darker shade in the posterior plots. BugsXLA has a function to retrieve the probability that the parameter value is above or below a specific value in the distribution. Once the ROPE has been defined, one can simply use this function to yield the probability that the parameter lies outside or within this region.

In deciding which point estimates to use, the summary statistics of Table 4 indicate that the means and medians for each parameter are virtually identical (i.e., the posteriors are approximately normal), so using any point estimate will yield a similar conclusion. Just as in the frequentist analysis, the contractual slope parameter was the strongest predictor of hobo syndrome. It had a posterior mean of 3.79, a 95% credible interval (CI) of [2.19, 6.82], and 100% of its distribution was located above the ROPE. Similarly, the regression slope for behavioral forces had a mean value of 1.60, 95% CI [1.09, 2.38], and 97.33% of the distribution was located above the ROPE, indicating that low perceived costs of leaving are positively related to hobo syndrome. Examination of the openness posterior (mean: 1.19, 95% CI [.86, 1.64]) also indicated a positive relationship, with 67.80% of the posterior located above the ROPE. Despite the fact that this posterior is more ambiguous, the majority of the evidence still supports a positive relationship, even if the parameter value is most likely smaller than the previous two. Had only 50% been located above the ROPE, then it would have been prudent to abstain from making substantial conclusions regarding the presence of an effect. Again, because the posterior contains such a large amount of information, the analyst must be able to make flexible assessments that are appropriate given the nature of the present evidence.

Expectedly, these posterior estimates were similar to the maximum likelihood estimates of the frequentist models because, when noninformative priors are used, Bayesian and frequentist estimators often coincide (Bayarri & Berger, 2004, p. 63; Gelman et al., 2004, p. 363). In research settings where the use of prior knowledge is justified, Bayesian estimation can differ markedly from maximum likelihood estimation; the former can be more precise as the latter estimation procedure is unable to incorporate previous findings. Moreover, even in the current situation where noninformative priors were used, the posterior speaks directly on the probability of all possible parameter values and yields point estimates that are unsuppressed by NHST.

In the frequentist regression, the vast majority of predictors (six out of nine) failed to reach statistical significance at α = .05. This leads to ambiguous results (“approaching significance”) and an inability to make inferences. However, a considerable advantage of Bayesian estimation is that even when estimates are more ambiguous (which is often the case in research), inferences can still be made, and these estimates can inform subsequent research. The posteriors of the three “marginally significant” slopes of the frequentist analysis (p > .05 but < .10), are presented in Figure 8.

Figure 8.

BugsXLA posterior plots of the affective, alternative, and constituent slopes.

Here, the slope relating affective forces to hoboism had a posterior mean of 0.64, 95% CI [0.41, 1.00], and 93.75% of its posterior lay below the ROPE, indicating the existence of a practically significant negative relationship. The alternative forces slope had a posterior mean of 1.40, 95% CI [0.97, 2.08], and over 89.78% of its distribution supported a positive relationship. Finally, the constituent forces slope had a mean of 1.48, 95% CI [1.00, 2.25], and over 93.56% of its posterior was also located above the ROPE. Though these point estimates were similar across both paradigms, unconstrained by NHST, Bayesian data analysis allows for substantive inferences that are essential to understanding the topic of interest. In contrast to the original predictions, hobos were predicted by higher levels of positive affect toward the organization. Thus, these results suggest that hobos do not quit because they are averse to the organization. In fact, they may actually view the organization more favorably because they do not feel restricted by it, an explanation consistent with the very large contractual forces slope (i.e., that they feel little contractual obligation to stay). Practically, this is a crucial insight because it suggests a fundamental limit to the extent to which organizations are responsible for the chronic turnover exhibited by this group—that is, it may not be a function of job satisfaction.

The analysis also indicated that hobos are sensitive to the job market; they are more likely to believe outside alternatives are available and perceive them as desirable (alternative forces). They also display reduced personal attachment to individuals or groups within the organization (constituent forces). Their patterns of frequent turnover and positive attitudes toward quitting perhaps inhibit motivation to create lasting workplace relationships. Alternatively, this finding may indicate features of their general personality and social functioning: Hobos may feel low attachment to others and groups in general, organizations being one particular manifestation of this pattern of interpersonal behavior. In any case, this is a novel research finding that was suppressed using conventional frequentist practices. Through Bayesian estimation, these “marginally significant” effects have provided novel insights into the underlying cognitive and affective processes of job hobos.

As for the “nonsignificant” slopes of the frequentist analysis, the distributions of the calculative forces, normative forces, and impulsivity regression slopes are displayed in Figure 9. The posterior mean of the calculative slope was 1.26, 95% CI [0.71, 2.27], and 67.62% of its posterior was located above the ROPE. Thus, much like the openness posterior, the majority of the evidence supports a practically significant positive relationship. The posterior mean of the normative slope was 1.18, 95% CI [0.81, 1.73], and 64.82% of its posterior was located above the ROPE, also supporting a positive relationship. For the impulsivity slope, the posterior mean was 0.96, 95% CI [0.70, 1.32], and its posterior was located largely within the ROPE (46.32%), with the remaining probability split on either side (below 32.84%; above 20.84%).

Figure 9.

BugsXLA posterior plots of the calculative, normative, and impulsivity slopes.

As expected, these slopes were not as large as those that had lower p values in the frequentist analysis. The data here were more ambiguous, and we are inevitably less confident when drawing conclusions about these parameters. Though nonnegligible percentages of the posterior were located within the ROPE, they still provide support for two practically significant effect sizes. Certainly, further research is warranted, but tentative inferences can still be made.

From our results, it is probable that hobos feel less external pressure from individuals outside the organization (e.g., family members) to retain their job (normative forces). This lends further support to the idea that hobos have weaker personal attachments in general. Alternatively, they may indeed have these relationships but simply assign lower importance to these perceived obligations. With some confidence, we can also state that hobos perceive lower probabilities of attaining long-term career goals at their current organization (calculative forces). The weight of the evidence (46.32%) supported that impulsivity is not practically related to hobo classification. However, there is still some ambiguity as this percentage is not particularly large, and 32.84% of the posterior supported a negative relationship. What is more important is that our uncertainty in this set of analyses is not a detriment that is systematically ignored, but organized information that supports tentative inferences and informs future research in an unbiased way.

Closing Remarks

By producing a probability distribution of parameter values, Bayesian estimation can be a more informative system of statistical inference than frequentist NHST. Bayesian results are intuitive because they directly speak to the conditional probability of interest, $p (p a r a m e t e r | d a t a)$ , as opposed to the logically unsound frequentist p value, $p (d a t a | n u l l)$ . They can be summarized in a multitude of ways that is flexibly based on the particular research context. More often than not, data are ambiguous. Bayesian analyses do not balk at this ambiguity by ignoring effects that do not meet an arbitrary standard of statistical significance. Instead, the use of probability distributions gracefully accommodates this ambiguity; it is an intrinsic feature of Bayesian modeling. By not having to meet prespecified alpha levels, Bayesian analyses are also better suited to test multiple hypotheses and for research where large samples are either difficult or impossible.

Though these advantages are significant, we believe that a complete transfer to Bayesian methods is currently unrealistic. Understanding the theory, concepts, and software necessary to conduct Bayesian analyses requires time and perseverance. Thus, an increase in introductory materials (books, articles, etc.) must be accompanied by the refinement of user-friendly software programs and other basic resources (e.g., graduate courses in Bayesian methodologies).

We also believe that an exclusive transfer to these methods is not desirable either. The new statistics (Cumming, 2012, 2014) is a burgeoning frequentist framework that eschews NHST and is similarly based in estimation. It relies on effect size estimates, confidence intervals, and meta-analysis and shares many of the desirable features of Bayesian estimation. In addition, there are cases when using both approaches is not only complementary (Berger, 2003; Berger, Boukai, & Wang, 1997; Berger, Brown, & Wolpert, 1994), but required (Bayarri & Berger, 2004). Thus, as the Bayesian statistician Gelman stated, “It does not seem like a good thing for a generation of statistics to be ignorant of experimental design and analysis of variance, instead becoming experts on the convergence of the Gibbs sampler” (a particular MCMC algorithm; Gelman, 2008, p. 446). Expertise in a particular statistical approach is far less important than competence in research design, knowing how to make claims that are consistent with the data, and understanding when a particular technique is more appropriate than another. Bayesian statistics is not a panacea for the unprepared researcher. However, insofar as researchers are obliged to produce the best and most clear work possible, they should be encouraged to learn these methods and become aware of their vast utility.

Footnotes

Appendix A

Appendix B

Appendix C

Acknowledgements

We would like to thank Scott Parrigon, Fred Oswald, Mike Zyphur, and Deborah Rupp for their helpful comments and feedback during the writing process. The authors are also indebted to John Kruschke for his “prior” work in Bayesian statistics, without which the present article could not have been written.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

References

Adelson

E. H.

Pentland

A. P.

(1996). The perception of shading and reflectance. In Knill

Richards

(Eds.), Perception as Bayesian inference (pp. 409–423). Cambridge, UK: Cambridge University Press.

Anderson

D. R.

Link

W. A.

Johnson

D. H.

Burnham

K. P.

(2001). Suggestions for presenting the results of data analyses. Journal of Wildlife Management, 65, 373–378. doi:10.2307/3803088

Bayarri

M. J.

Berger

J. O.

(2004). The interplay of Bayesian and frequentist analysis. Statistical Science, 19, 58–80.

Berger

(2003). Could Fisher, Jeffreys, and Neyman have agreed on testing? Statistical Science, 18, 1–32.

Berger

J. O.

Berry

D. A.

(1988). Statistical analysis and the illusion of objectivity. American Scientist, 76, 159–165.

Berger

J. O.

Boukai

Wang

(1997). Unified frequentist and Bayesian testing of a precise hypothesis. Statistical Science, 12, 133–160.

Berger

J. O.

Brown

L. D.

Wolpert

R. L.

(1994). A unified conditional frequentist and Bayesian test for fixed and sequential simple hypothesis testing. Annals of Statistics, 22, 1787–1807.

Bolton

L. R.

Becker

L. K.

Barber

L. K.

(2010). Big Five trait predictors of differential counterproductive work behavior dimensions. Personality and Individual Differences, 49, 537–541.

Burdette

W. J.

Gehan

E. A.

(1970). Planning and analysis of clinical studies. Springfield, IL: Charles C Thomas.

10.

Cohen

(1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.

11.

Cohen

(1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.

12.

Courville

A. C.

Daw

N. D.

Touretzky

D. S.

(2006). Bayesian theories of conditioning in a changing world. Trends in Cognitive Science, 10, 294–300.

13.

Culpepper

S. A.

Aguinis

(2011). R is for revolution: A cutting-edge, free, open source statistical package. Organizational Research Methods, 14, 735–740.

14.

Cumming

(2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York, NY: Routledge.

15.

Cumming

(2014). The new statistics: Why and how. Psychological Science, 25, 7–29. doi:10.1177/0956797613504966

16.

Dienes

(2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6, 274–290.

17.

Dilchert

Ones

D. S.

Davis

R. D.

Rostow

C. D.

(2007). Cognitive ability predicts objectively measured counterproductive work behaviors. Journal of Applied Psychology, 27, 616–627.

18.

Dunson

D. B.

(2001). Commentary: Practical advantages of Bayesian analysis of epidemiologic data. American Journal of Epidemiology, 153, 1222–1226.

19.

Fanelli

(2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90, 891–904. doi:10.1007/s11192-011-0494-7

20.

Feldman

(2001). Bayesian contour integration. Perception and Psychophysics, 63, 1171–1182.

21.

Fisher

R. A.

(1935). The design of experiments. Edinburgh, UK: Oliver & Boyd.

22.

Fisher

R. A.

(1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, 17, 69–77.

23.

Fisher

R. A.

(1956). Statistical methods and scientific inference. Edinburgh, UK: Oliver & Boyd.

24.

Gelfand

A. E.

Smith

A. M.

(1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.

25.

Gelman

. (2008). Objections to Bayesian statistics. Bayesian Analysis, 3, 445–450.

26.

Gelman

Carlin

J. B.

Stern

H. S.

Rubin

D. B.

(2004). Bayesian data analysis (2nd ed.). Boca Raton, FL: CRC Press.

27.

Gelman

Rubin

D. B.

(1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457–511.

28.

Ghiselli

E. E.

(1974). Some perspectives for industrial psychology. American Psychologist, 80, 80–87.

29.

Gigerenzer

(2004). Mindless statistics. Journal of Socio-Economics, 33, 587–606.

30.

Gigerenzer

Krauss

Vitouch

(2004). The null ritual: What you always wanted to know about null hypothesis testing but were afraid to ask. In Kaplan

(Ed.), Handbook on quantitative methods in the social sciences (pp. 389–406). Thousand Oaks, CA: Sage.

31.

Gill

(2002). Bayesian methods: A social and behavioral science approach. Boca Raton, FL: Chapman & Hall.

32.

Good

P. I.

(1999). Resampling methods: A practical guide to data analysis. Boston, MA: Birkhauser.

33.

Griffiths

T. L.

Steyvers

Tenenbaum

J. B.

(2007). Topics in semantic association. Psychological Review, 114, 211–44.

34.

Griffiths

T. L.

Tenenbaum

J. B.

(2005). Structure and strength in causal induction. Cognitive Psychology, 51, 354–384.

35.

Hoekstra

Finch

Kiers

Johnson

(2006). Probability as certainty: Dichotomous thinking and the misuse of p values. Psychonomic Bulletin & Review, 13, 1033–1037.

36.

Howell

D. C.

(2013). Statistical methods for psychology. Belmont, CA: Cengage.

37.

Howson

Urbach

(1991). Bayesian reasoning in science. Nature, 350, 371–374.

38.

Hubbard

Bayarri

M. J.

(2003). Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing. American Statistician, 57, 171–182.

39.

Jackman

S. D

. (1999). Bayesian modeling in the social sciences: An introduction to Markov-chain Monte Carlo. Retrieved from http://jackman.stanford.edu/mcmc/icpsr99.pdf

40.

Jackman

S. D.

(2000). Estimation and inference via Bayesian simulation: An introduction to Markov chain Monte Carlo. American Journal of Political Science, 44, 375–404.

41.

Kepes

McDaniel

M. A.

(2013). How trustworthy is the scientific literature in I-O psychology? Industrial and Organizational Psychology: Perspectives on Science and Practice, 6, 252–268.

42.

Kruschke

J. K.

(2010a). Bayesian data analysis. Wiley Interdisciplinary Reviews: Cognitive Science, 1, 658–676.

43.

Kruschke

J. K.

(2010b). What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences, 14, 293–300.

44.

Kruschke

J. K.

(2011). Doing Bayesian data analysis: A tutorial with R and BUGS. Burlington, MA: Academic Press.

45.

Kruschke

J. K.

(2013). Bayesian estimation supersedes the t-test. Journal of Experimental Psychology: General, 142, 573–603.

46.

Kruschke

J. K.

Aguinis

Joo

(2012). The time has come: Bayesian methods for data analysis in the organizational sciences. Organizational Research Methods, 15, 722–752.

47.

I. S.

Shaffer

Schmidt

(2007). Implications of methodological advances for the practice of personnel selection: How practitioners benefit from meta-analysis. Academy of Management Perspectives, 21, 6–15.

48.

Lee

M. D.

Wagenmakers

E. J.

(2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662–668.

49.

Loredo

T. J.

(1992). The promise of Bayesian inference for astrophysics. In Feigelson

E. D.

Babu

G. J.

(Eds.), Statistical challenges in modern astronomy (pp. 275–297). New York, NY: Springer-Verlag.

50.

Lunn

D. J.

Thomas

Best

Spiegelhalter

(2000). WinBUGS: A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.

51.

Lunn

Jackson

Best

Thomas

Spiegelhalter

(2012). The BUGS book: A practical introduction to Bayesian analysis. Boca Raton, FL: Chapman & Hall/CRC Press.

52.

Lynch

S. M.

(2007). Introduction to applied Bayesian statistics and estimation for social scientists. New York, NY: Springer.

53.

Maertz

C. P.

Jr Griffeth

R. W.

(2004). Eight motivational forces and voluntary turnover: A theoretical synthesis with implications for research. Journal of Management, 30, 667–683.

54.

MathWorks. (2012). MATLAB and statistics toolbox release. Natick, MA: Author.

55.

Maxwell

S. E.

(2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147–163. doi:10.1037/1082-989X.9.2.147.

56.

Mount

Ilies

Johnson

(2006). Relationship of personality traits and counterproductive work behaviors: The mediating effects of job satisfaction. Personnel Psychology, 59, 591–622.

57.

Ntzoufras

(2009). Bayesian modeling using WinBUGS. Hoboken, NJ: John Wiley.

58.

O’Boyle

Banks

Gonzalez-Mulé

. (in press). The chrysalis effect: How ugly initial results metamorphosize into beautiful articles. Journal of Management.

59.

Plummer

. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Hornik

Leisch

Zeileis

(Eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing, March 20-22. Vienna, Austria: DSC. Retrieved from http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/

60.

R Development Core Team. (2012). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/

61.

Sharpe

(2013). Why the resistance to statistical innovations? Bridging the communication gap. Psychological Methods, 18, 572–582.

62.

Shoemaker

J. S.

Painter

I. S.

Weir

B.S.

(1999). Bayesian statistics in genetics: A guide for the uninitiated. Trends in Genetics, 15, 354–358.

63.

Spiegelhalter

Thomas

Best

Lunn

(2003). WinBUGS user manual, version 1.4. Cambridge, UK: MRC Biostatistics Unit.

64.

Stan Development Team. (2014). Stan: A C++ library for probability and sampling, version 2.2. Retrieved from http://mc-stan.org

65.

Tressoldi

P. E.

Giofré

Sella

Cumming

(2013). High impact = high statistical standards? Not necessarily so. PLOS ONE, 8, e56180.

66.

Wagenmakers

E. J.

(2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.

67.

Wagenmakers

E. J.

Wetzels

Borsboom

van der Maas

(2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100, 426–432.

68.

Weiss

(1997). Interpreting images by propagating Bayesian beliefs. In Mozer

M. C.

Jordan

M. I.

T. Petsche

(Eds.), Advances in neural information processing systems (pp. 908–915). Cambridge, MA: MIT Press.

69.

Wetzels

Matzke

Lee

M. D.

Rouder

J. N.

Iverson

G. J.

Wagenmakers

E. J.

(2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6, 291–298.

70.

Woo

S. E.

(2011). A study of Ghiselli’s hobo syndrome. Journal of Vocational Behavior, 29, 461–469.

71.

Woodward

(2005). BugsXLA: Bayes for the common man. Journal of Statistical Software, 14, 1–18.

72.

Woodward

(2011a). Bayesian analysis made simple: An excel GUI for WinBUGS. Boca Raton, FL: CRC Press.

73.

Woodward

. (2011b). Bayesian analysis made simple: An excel GUI for WinBUGS [Software]. Available from http://www.philwoodward.co.uk/bugsxla/index.html

74.

Tenenbaum

J. B.

(2007). Word learning as Bayesian inference. Psychological Review, 14, 245–72.

75.

Zyphur

M. J.

Oswald

F. L

. (in press). Bayesian estimation and inference: A user’s guide. Journal of Management. doi: 10.1177/014920631350120