Bayesian analysis for social science research

Abstract

In this manuscript, we discuss the substantial importance of Bayesian reasoning in Social Science research. In particular, we focus on the foundational elements needed to fit models under the Bayesian paradigm. We aim to offer a frame of reference for a broad audience, not necessarily with specialized knowledge in Bayesian statistics, yet interested in incorporating these kinds of methods in the study of social phenomena. We illustrate Bayesian methods through case studies regarding political surveys, population dynamics, and standardized educational testing. Specifically, we provide technical details on topics such as conjugate and non-conjugate modeling, hierarchical modeling, Bayesian computation, goodness of fit, and model testing.

Keywords

Bayesian models; Bayesian statistics; Monte Carlo methods; social sciences; statistical inference

1. Introduction

Bayesian methods refer to data analysis tools derived from the principles of Bayesian inference, i.e., inductive learning through Bayes’ Theorem (Van de Schoot et al., 2021). These methods allow us to estimate parameters with good statistical properties, predict or impute future or missing observations, and make optimal decisions according to prespecified utility functions. Moreover, Bayesian techniques are intrinsically linked to sophisticated computational algorithms for model estimation, selection, and validation (Hoff, 2009). A wide variety of scientific publications in the Social Sciences point out that Bayesian methods have been a predominant methodology for the analysis of different phenomena since the early 1990s (e.g., Western & Jackman, 1994; Jackman, 2004; Gill & Walker, 2005; Barberá, 2015; Moser et al., 2021; Lynch & Bartlett, 2019; Fairfield & Charman, 2022).

Several reasons justify adopting a Bayesian approach to statistical inference in Social Sciences. Quantitative research shows that social phenomena differ markedly from their counterparts in the experimental sciences, so their characteristics and methodological requirements are closer to the Bayesian paradigm than to the assumptions of the frequentist approach (e.g., Jackman, 2009; Fairfield & Charman, 2019). That is why we discuss the advantages (and challenges!) of adopting a Bayesian spirit in Social Science research.

We provide foundations about model fitting under the Bayesian paradigm, from the prior distribution and sampling distribution specification to posterior inference mechanisms, including model verification and evaluation. Additionally, we illustrate the essentials of Bayesian methodologies through case studies in contexts such as political surveys, population dynamics, and standardized educational testing.

Unlike other authors (Lenhard, 2022; Sosa & Buitrago, 2022; Van de Schoot et al., 2021; Kruschke, 2021; Van de Schoot et al., 2014; Draper, 2009; Walker et al., 2007; Jackman, 2004), we provide a review of the Bayesian paradigm focused exclusively on Bayesian machinery framed in Social Science research. We aim to offer a frame of reference for a broad audience, not necessarily with specialized knowledge in Bayesian statistics, yet interested in incorporating these kinds of methods in the study of social phenomena.

This paper is structured as follows. Section 2 draws a parallel between Bayesian and frequentist inference. Section 3 presents the rationale for a Bayesian approach to scientific research in Social Science. Section 4 shows the importance of sensitivity analysis, the suitability of conjugate analysis, model evaluation and testing, and Bayesian computation via Monte Carlo simulation. Section 5 provides fully developed case studies from a Bayesian perspective. Finally, Section 6 discusses our main findings as well as some alternatives for future research.

2. Statistical inference: Frequentist versus Bayesian

Two paradigms for data analysis coexist in Statistics: frequentist and Bayesian. They differ in many ways, including the interpretation of probability, the nature of the parameters, and the statistical and computational methods required to make inferences. However, both are governed by the axiomatic foundations of probability and use the likelihood function to estimate unknown parameters.

Under either approach, the relationship between data $\bm{y}=(y_{1},\ldots,y_{n})$ and parameters $\theta\in\Theta$ is established by a sampling distribution $\bm{y}\sim p(\bm{y}\mid\theta)$, which fully characterizes the random mechanism that generates $\bm{y}$ for any given value of $\theta$. The nature of $\theta$ is one of the main differences between the Bayesian and the frequentist paradigms. On the one hand, the frequentist approach treats $\theta$ as a fixed but unknown quantity, and any estimate of it is itself a random variable, since it depends on repeated random sampling. Typically, $\theta$ is estimated by maximizing $p(\bm{y}\mid\theta)$ as a function of $\theta$. On the other hand, the Bayesian approach treats $\theta$ as a random variable, so that any estimate of it is fixed and constitutes a realization of such a random quantity. From this point of view, we can directly incorporate into the model any prior beliefs (state of knowledge) about $\theta$ using probabilistic statements. Such a task can be carried out through the prior distribution, $p(\theta)$, whose purpose is to characterize the uncertainty about $\theta$ external to $\bm{y}$. Thus, once $\bm{y}$ is observed, prior beliefs are updated, and the posterior distribution, $p(\theta\mid\bm{y})$, is obtained in order to fully describe the updated state of knowledge about $\theta$, given the empirical evidence provided in $\bm{y}$. In this context, Bayes’ Theorem is the optimal rational method that guarantees coherence and logical consistency for updating prior beliefs about $\theta$ according to the information contained in $\bm{y}$ (Jackman, 2004; Hoff, 2009).

Bayes’ Theorem states that

$\displaystyle p(\theta\mid\bm{y})=\frac{p(\bm{y}\mid\theta)\,p(\theta)}{\int_{\Theta}p(\bm{y}\mid\theta)\,p(\theta)\,\text{d}\theta}\,,$

where $p(\bm{y})=\int_{\Theta}p(\bm{y}\mid\theta)\,p(\theta)\,\text{d}\theta$ is the marginal probability of $\bm{y}$, which does not depend on $\theta$ because it is an integral over all the values of $\theta\in\Theta$, and therefore corresponds to a normalization constant that allows $p(\theta\mid\bm{y})$ to be a valid probability distribution. Since $p(\theta\mid\bm{y})$ and $p(\theta)$ are functions of $\theta$, and $p(\bm{y})$ does not depend on $\theta$, the sampling distribution $p(\bm{y}\mid\theta)$ has to be regarded as a function of $\theta$ (as frequentists do!), and therefore Bayes’ Theorem can be expressed as $p(\theta\mid\bm{y})\propto\ell(\theta)\,p(\theta)$, where $\ell(\theta)=\text{c}\,p(\bm{y}\mid\theta)$ is the so-called likelihood function, for any $\text{c}>0$ (typically chosen as $\text{c}=1$).

The previous expression makes two important aspects evident: (i) the posterior distribution is simply proportional to the likelihood function times the prior distribution, and (ii) some frequentist results can be seen as particular cases of Bayesian analysis. Regarding the first aspect, the influence of prior beliefs and data on the posterior distribution depends on the amount of information provided in the prior distribution and the sample size, respectively. Regarding the second aspect, frequentist and Bayesian inferences are typically equivalent when the prior distribution is non-informative (all possible parameter values have the same prior density) and/or the sample size is large in comparison with the dimension of the parameter space. With such a flat prior, the posterior distribution is proportional to the likelihood function.
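
As a minimal sketch of the proportionality $p(\theta\mid\bm{y})\propto\ell(\theta)\,p(\theta)$, the following Python snippet approximates a posterior on a grid and normalizes it numerically. The binomial data (7 successes in 20 trials), the Beta priors, and the function name `grid_posterior` are illustrative assumptions and are not taken from the case studies.

```python
import math

def grid_posterior(successes, trials, a, b, grid_size=1001):
    """Approximate p(theta | y) on a grid: binomial likelihood, Beta(a, b) prior."""
    # Midpoints of equal-width cells in (0, 1); this avoids log(0) at the endpoints.
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    log_unnorm = []
    for t in grid:
        log_lik = successes * math.log(t) + (trials - successes) * math.log(1 - t)
        log_prior = (a - 1) * math.log(t) + (b - 1) * math.log(1 - t)
        log_unnorm.append(log_lik + log_prior)  # log of likelihood times prior
    # Normalizing plays the role of dividing by p(y): the weights must sum to one.
    peak = max(log_unnorm)
    weights = [math.exp(v - peak) for v in log_unnorm]
    total = sum(weights)
    return grid, [w / total for w in weights]

# Hypothetical data: 7 successes in 20 trials, assumed Beta(2, 2) prior.
grid, post = grid_posterior(successes=7, trials=20, a=2.0, b=2.0)
post_mean = sum(t * p for t, p in zip(grid, post))

# With a flat Beta(1, 1) prior, the posterior mode sits at the MLE 7/20.
grid_flat, post_flat = grid_posterior(successes=7, trials=20, a=1.0, b=1.0)
flat_mode = grid_flat[post_flat.index(max(post_flat))]

print(f"posterior mean (Beta(2, 2) prior): {post_mean:.3f}")
print(f"posterior mode (flat prior):       {flat_mode:.3f}")
```

The normalizing constant never has to be computed analytically here, which is why the unnormalized form of Bayes’ Theorem is usually enough in practice.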

Generally, Bayesian analyses are simple and direct from a conceptual point of view since they mainly rely on a naive application of Bayes’ Theorem. Cox (1946), Cox (1963) and Savage (1972) constitute theoretical support to justify that if $p(\bm{y}\mid\mathbf{\theta})$ and $p(\mathbf{\theta})$ represent the beliefs of a rational person (in a probabilistic sense), then Bayes’ Theorem is the optimal method for updating their beliefs according to the laws of probability. Furthermore, uncertainty quantification under the Bayesian paradigm comes for free! Once the posterior distribution is obtained, any aspect of the model parameters can be described using probabilistic statements. However, not everything is rosy, and there are several challenges to be addressed. Computation under the Bayesian paradigm is quite challenging because, in multiparameter models, the integrals needed to characterize any aspect of the posterior distribution can be hard to compute.

3. Bayesian inference in Social Science

In social research, frequentist methods become unreliable when data do not correspond to a random sample from a larger population. The conventional interpretations of confidence intervals and hypothesis tests, which rely on patterns emerging from repeated sampling, tend to be confusing and inadequate in contexts where uncertainty does not arise from variation in repeated sampling (Western & Jackman, 1994; Jackman, 2009). Consequently, a frequentist notion of probability loses its meaning in the absence of repeatable data. In contrast, the subjective interpretation of probability provided in the Bayesian paradigm offers a coherent and internally consistent tool for statistical inference when data cannot be framed in the context of repeated experimentation.

Social scientists often find themselves working with “small” data gathered from real-life social behavior, where classical experimental-design requirements are typically not met. In such scenarios, subjective judgments regarding the model’s formulation become inevitable and intrinsic to the scientific process (Jackman, 2009). Thus, it is natural for researchers in this context to set prior probabilities about the unknown quantities and interpret them subjectively (depending on the modeler’s state of knowledge). Furthermore, the absence of “large datasets” implies that estimates obtained through a frequentist approach lack robust statistical properties, particularly concerning the asymptotic properties that validate classical inference procedures. Comparative research studies in the social domain have demonstrated that applying frequentist methods to small data may lead to imprecise estimates of the effects of explanatory variables (e.g., Western & Jackman, 1994).

The social field abounds with data grouped over several units or periods. Hence, a key research question is how a causal structure operating at one level of analysis (e.g., individuals) varies over a higher level (e.g., localities or periods). The Bayesian approach to statistical inference is well suited to answer this question since it allows the researcher to formalize assumptions about between-group and within-group heterogeneity by formulating a properly structured set of prior beliefs. Thus, the prior distribution, considered by critics of Bayesian inference as a weakness, provides a way to expand from a simple model to a model involving several sources of heterogeneity, which allows modelers to analyze cases where social research requires understanding the relative weight of unknown quantities at different levels (Jackman, 2009).

Other advantages of Bayesian inference, not exclusive to Social Science, include its conceptual simplicity. As mentioned previously, Bayesian inference does not require considering hypothetical data, as frequentists do when developing confidence intervals and hypothesis tests. This is because the posterior distribution directly represents the most up-to-date information about the parameter, given nothing more than the observed data. Such a straightforward nature inherent in the Bayesian paradigm is quite appealing to social scientists, making it the primary methodology adopted by quantitative researchers in this field (Jackman, 2009).

Nowadays, Bayesian computation has become more feasible than ever before due to current developments in both hardware and software. Such a computational framework makes it possible to solve complex statistical problems that a few decades ago were simply impossible to handle. Specifically, the recent low cost and high speed of computation make it attainable for social scientists to analyze data from a Bayesian simulation-based approach. In this sense, the set of algorithms known as Markov chain Monte Carlo (MCMC, see Section 4.3 for details) allows the Bayesian approach to be a practical reality for applied researchers (Jackman, 2009). These algorithms provide a powerful and flexible way to approximate the posterior distribution for most multiparameter models. For example, estimates of hierarchical models, latent variable models, and estimates based on observations with missing data have become straightforward procedures because the mathematics and computation underlying Bayesian analysis are drastically simplified via Monte Carlo simulation.

4. Fundamentals for Bayesian modeling

Here, we provide essential details for formulating and fitting Bayesian models, from conjugate families to specifics of Bayesian computation based on Monte Carlo simulation. This material is key to understanding and implementing the case studies given in Section 5.

4.1 Prior specification

At their very core, all Bayesian models are hierarchical. The Bayesian paradigm follows a highly useful general idea: suppose we are studying an unknown $y$ and we are having trouble modeling our uncertainty about $y$ directly. Then, the basic hierarchical modeling idea is to find another unknown $\theta$ upon which $y$ depends and model $y$ hierarchically by first modeling $\theta$, and then modeling $y\mid\theta$ (this idea can be applied further to $\theta$ itself). Therefore, modeling $\theta$ is crucial.

The main criticism of the Bayesian approach concerns the inherent subjectivity of the prior distribution. For frequentists, data analysis based on subjective information states (depending on the analyst) lacks scientific rigour. However, the question is, why should we neglect available external information that is consistent with reality when it can contribute to explaining the phenomenon of interest and lead to more accurate inferences and plausible conclusions? Even when external information is not available, we can set the prior distribution to reflect such a state of knowledge. In this regard, a number of options are available to specify the prior distribution in an “objective” fashion through the so-called objective priors (see Reich & Ghosh, 2019, Chap. 2 for details).

Since the state of information may vary depending on the analyst, the choice of the prior distribution and the robustness of the inferences based on this choice are fundamental issues in Bayesian inference. Regarding the prior choice, the literature describes several methods for eliciting prior distributions (e.g., Berger, 2013; Congdon, 2007; Garthwaite et al., 2005; Stuart et al., 1994). For instance, the Jeffreys’ prior of a parameter $\theta$ is given by $p_{J}(\theta)\propto\sqrt{I(\theta)}$, where $I(\theta)$ is the (expected) Fisher’s information (the key feature of this prior is its consistency under reparameterization). However, it may not be easy to formulate the researcher’s prior beliefs mathematically and precisely. For this reason, the prior distribution is often only an approximation of such beliefs and can be chosen for computational convenience (Hoff, 2009).
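
As a brief worked example of the Jeffreys’ prior mentioned above (a standard textbook case, used here only for illustration and not one of the case studies), consider a single binomial observation $y\mid\theta\sim\textsf{Bin}(n,\theta)$. Its log-likelihood is $\log p(y\mid\theta)=\text{const}+y\log\theta+(n-y)\log(1-\theta)$, so that, using $\textsf{E}(y\mid\theta)=n\theta$,

$\displaystyle I(\theta)=-\textsf{E}\left(\frac{\partial^{2}}{\partial\theta^{2}}\log p(y\mid\theta)\,\Big|\,\theta\right)=\frac{n\theta}{\theta^{2}}+\frac{n(1-\theta)}{(1-\theta)^{2}}=\frac{n}{\theta(1-\theta)}\,,$

and therefore $p_{J}(\theta)\propto\theta^{-1/2}(1-\theta)^{-1/2}$, which is the kernel of a $\textsf{Beta}(1/2,1/2)$ distribution.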

From a Bayesian perspective, studying the robustness of inferences means performing a sensitivity analysis. The purpose is to examine how the posterior distribution changes as different values of the hyperparameters are adopted (a hyperparameter is a parameter of the prior distribution, which is set by the analyst; this term is useful to distinguish it from the parameters of the model). This analysis allows us to argue that the conclusions are consistent from both a qualitative and a quantitative point of view. There are two ways to perform a sensitivity analysis in practice, either (i) by weakening or strengthening the adopted prior or (ii) by repeating the analysis with different priors (Jackman, 2004).
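
A sensitivity analysis of this kind can be sketched in a few lines of Python. The example assumes binomial data with a Beta prior, for which the posterior is known in closed form, namely Beta(a + successes, b + failures); the data, the three priors, and the helper names are hypothetical choices made only for illustration.

```python
def beta_posterior(successes, trials, a, b):
    """Beta(a, b) prior + binomial data -> Beta(a + successes, b + failures)."""
    return a + successes, b + (trials - successes)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

successes, trials = 7, 20  # hypothetical data

# Hyperparameters to compare: the same prior mean (0.5), increasing prior strength.
priors = {
    "flat Beta(1, 1)": (1.0, 1.0),
    "weak Beta(2, 2)": (2.0, 2.0),
    "strong Beta(50, 50)": (50.0, 50.0),
}

posterior_means = {
    label: beta_mean(*beta_posterior(successes, trials, a, b))
    for label, (a, b) in priors.items()
}

for label, mean in posterior_means.items():
    print(f"{label:>20s}: posterior mean = {mean:.3f}")
```

If the posterior summaries barely move across reasonable priors, the conclusions can be reported as robust; here the strong prior visibly pulls the estimate toward 0.5, which would need to be justified.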

4.2 Conjugate distributions

Conjugate families are essential in Bayesian statistics because they greatly simplify computation. Specifically, suppose that the prior distribution $p(\theta)$ belongs to a known family of distributions. Then, such a prior is said to be conjugate with respect to the sampling distribution $p(\bm{y}\mid\theta)$ if the posterior distribution $p(\theta\mid\bm{y})$ belongs to the same family of distributions as the prior does (Jackman, 2004, 2009). It is up to modelers to work with them or not, depending on their prior beliefs and their modeling choices.

Under conjugacy, the update from the prior to the posterior distribution merely changes the parameters that define the corresponding conjugate family. This feature is easy to interpret and, besides easing computation, allows us to develop some intuition about Bayesian learning through straightforward examples. However, it is important to acknowledge that conjugate priors have certain limitations. For instance, not all likelihood functions have a known conjugate prior, and most conjugate pairs are applicable only to small-scale examples with a limited number of parameters (Jackman, 2004, 2009). Furthermore, not every state of knowledge about an unknown parameter is easy to express using a conjugate prior distribution.
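
To make the conjugate update concrete, the sketch below uses the Gamma-Poisson pair: a Gamma(a, b) prior (shape a, rate b) for a Poisson rate yields a Gamma(a + sum of counts, b + n) posterior. The counts, the hyperparameters, and the function name are assumptions chosen for illustration.

```python
def gamma_poisson_update(a, b, counts):
    """Gamma(a, b) prior (shape a, rate b) + Poisson counts -> Gamma posterior."""
    return a + sum(counts), b + len(counts)

counts = [2, 0, 3, 1, 4]   # hypothetical counts
a0, b0 = 3.0, 1.0          # assumed prior with mean a0 / b0 = 3

a_n, b_n = gamma_poisson_update(a0, b0, counts)

prior_mean = a0 / b0
sample_mean = sum(counts) / len(counts)
post_mean = a_n / b_n

# The posterior mean is a weighted average of the prior mean and the sample mean,
# with the prior weight shrinking as the sample size grows.
w = b0 / (b0 + len(counts))
weighted = w * prior_mean + (1 - w) * sample_mean

print(f"posterior: Gamma({a_n}, {b_n}), mean = {post_mean:.4f}")
print(f"weighted average check:          {weighted:.4f}")
```

The weighted-average form gives the intuition about Bayesian learning mentioned above: the rate parameter b0 acts as a prior sample size.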

4.3 Bayesian computation via Monte Carlo simulation

In Bayesian inference, we can formulate complex models that require simulation-based methods to explore the posterior distribution, typically based on the Monte Carlo principle together with Markov chains (e.g., Gamerman & Lopes, 2006; Robert & Casella, 2013; Turkman et al., 2019). These methods allow the generation of random samples from the posterior distribution (target distribution) when the underlying computations are either extremely demanding or analytically intractable.

The Monte Carlo principle states that any characteristic of a random variable can be approximated arbitrarily well by generating enough random samples from its probabilistic distribution. On the other hand, Markov chains are first-order stochastic processes (random sequences with serial dependence such that “what happens next depends only on the state of affairs now”) that allow us to explore the posterior distribution when it has an unknown analytic form (see Carlin & Louis, 2008, Jackman, 2009, and Meyn & Tweedie, 2012 for a formal treatment of Markov chains). To put it another way, using Markov chain Monte Carlo (MCMC) algorithms, we can generate correlated random draws from the posterior distribution in order to learn about any aspect of it.

Notice that enough IID samples constitute a direct approximation of the posterior distribution (convergence in probability), which cannot be guaranteed for MCMC samples (convergence in distribution). Given the dependency among samples arising from an MCMC algorithm, there is no absolute certainty that the simulated chain has reached convergence to its intended target distribution (i.e., the posterior distribution). Therefore, evaluating convergence is key. In practice, it is common to diagnose non-convergence through graphical displays (e.g., traceplots) and numerical measures (e.g., the $\hat{R}$ statistic, Gelman et al., 2014). Also, it is highly recommended to run the algorithm for a large number of iterations (typically more than what would be needed using IID sampling, Hoff, 2009).

Finally, to increase the effective sample size (the equivalent sample size under IID sampling) as much as possible, it is customary to discard the initial values of the chain (burn-in period) as well as to take systematic samples of it (thinning) to reduce autocorrelation. It is also highly recommended to run several chains from different starting points of the parameter space to check whether they approach the same stationary distribution.
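
The diagnostics described above can be sketched as follows. The function `r_hat` implements the classic potential scale reduction factor without chain splitting, which is a simplification of the split variant discussed in Gelman et al. (2014); the simulated "chains" are artificial stand-ins for MCMC output, and all names and settings are assumptions.

```python
import random
import statistics

def r_hat(chains):
    """Basic (non-split) potential scale reduction factor for equal-length chains."""
    m, n = len(chains), len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    grand_mean = statistics.fmean(chain_means)
    between = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
    within = statistics.fmean([statistics.variance(c) for c in chains])
    var_plus = (n - 1) / n * within + between / n  # pooled variance estimate
    return (var_plus / within) ** 0.5

def burn_and_thin(chain, burn_in, thin):
    """Drop the first `burn_in` draws, then keep every `thin`-th draw."""
    return chain[burn_in::thin]

rng = random.Random(1)
# Four artificial chains that agree with each other (values near 1 are expected).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four artificial chains stuck around different locations (values well above 1).
stuck = [[rng.gauss(2.0 * j, 1.0) for _ in range(1000)] for j in range(4)]

r_mixed = r_hat(mixed)
r_stuck = r_hat(stuck)
kept = burn_and_thin(mixed[0], burn_in=200, thin=5)

print(f"R-hat, agreeing chains: {r_mixed:.3f}")
print(f"R-hat, stuck chains:    {r_stuck:.3f}")
print(f"draws kept after burn-in and thinning: {len(kept)}")
```

A value of the statistic close to 1 does not prove convergence; it only fails to flag a problem, which is why it is usually read together with traceplots.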

4.3.1 Monte Carlo principle

Let $\theta$ be the parameter of interest and $\bm{y}$ be the observed data, with corresponding posterior distribution $p(\theta\mid\bm{y})$. Suppose that an IID random sample of size $B$ is drawn from $p(\theta\mid\bm{y})$, i.e., $\theta^{(1)},\ldots,\theta^{(B)}\stackrel{\text{i.i.d.}}{\sim}p(\theta\mid\bm{y})$. Then, the empirical distribution induced by $\theta^{(1)},\ldots,\theta^{(B)}$ is known as the Monte Carlo approximation of the target distribution $p(\theta\mid\bm{y})$. Such an empirical distribution gets closer to the true target distribution as $B$ increases. In practice, it is customary to choose $B$ large enough such that the Monte Carlo standard error (i.e., the standard deviation of the Monte Carlo samples divided by the square root of $B$) is less than the desired precision (Hoff, 2009). Additionally, thanks to the law of large numbers, we have that

$\displaystyle\frac{1}{B}\sum_{b=1}^{B}g(\theta^{(b)})\longrightarrow\mathsf{E}(g(\theta)\mid\bm{y})=\int_{\Theta}g(\theta)\,p(\theta\mid\bm{y})\,\text{d}\theta\quad\text{as}\quad B\rightarrow\infty\,,$

where $g(\theta)$ is any function of $\theta$. Consequently, any aspect of the posterior distribution can be approximated arbitrarily well with a large enough Monte Carlo sample (Hoff, 2009).
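
The Monte Carlo principle can be illustrated with a short Python sketch. It assumes, purely for illustration, that the posterior is a Beta(9, 15) distribution (as would arise from a Beta(2, 2) prior and 7 successes in 20 binomial trials), so that the approximations can be compared with exact values; the sample size and seed are arbitrary.

```python
import random
import statistics

rng = random.Random(42)
B = 20000

# Assumed posterior for illustration: theta | y ~ Beta(9, 15).
draws = [rng.betavariate(9.0, 15.0) for _ in range(B)]

# g(theta) = theta: approximates E(theta | y) = 9 / 24 = 0.375.
post_mean = statistics.fmean(draws)

# Monte Carlo standard error of the posterior mean estimate.
mc_se = statistics.stdev(draws) / B ** 0.5

# g(theta) = 1{theta > 0.5}: approximates Pr(theta > 0.5 | y).
prob_gt_half = sum(d > 0.5 for d in draws) / B

# g(theta) = theta / (1 - theta): approximates the posterior mean of the odds,
# whose exact value for a Beta(a, b) with b > 1 is a / (b - 1) = 9 / 14.
odds_mean = statistics.fmean([d / (1.0 - d) for d in draws])

print(f"E(theta | y)        ~ {post_mean:.4f} (MC s.e. {mc_se:.4f})")
print(f"Pr(theta > 0.5 | y) ~ {prob_gt_half:.4f}")
print(f"E(odds | y)         ~ {odds_mean:.4f}")
```

The same set of draws serves every summary, which is the practical appeal of simulation-based inference: no new integral has to be solved for each new function $g$.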

4.3.2 Gibbs sampler

When it is difficult to simulate from the posterior distribution directly, it is recommended to sample iteratively from the full conditional distribution $p(\theta_{i}\mid\theta_{1},\ldots,\theta_{i-1},\theta_{i+1},\ldots,\theta_{k},\bm{y})$ of each parameter $\theta_{i}$, for $i=1,\ldots,k$. The Gibbs sampler allows us to generate samples from the posterior distribution by sequentially updating each component of $\theta$ through its full conditional distribution, given the most recent state of the other model parameters. Specifically, given the current state of the parameters $\theta^{(b-1)}=(\theta_{1}^{(b-1)},\ldots,\theta_{k}^{(b-1)})$, we can generate the next state $\theta^{(b)}$ from $\theta^{(b-1)}$, for $b=1,\ldots,B$, as follows:

1. Draw $\theta_{1}^{(b)}\sim p(\theta_{1}\mid\theta_{2}^{(b-1)},\theta_{3}^{(b-1)},\ldots,\theta_{k}^{(b-1)},\bm{y})$.

2. Draw $\theta_{2}^{(b)}\sim p(\theta_{2}\mid\theta_{1}^{(b)},\theta_{3}^{(b-1)},\ldots,\theta_{k}^{(b-1)},\bm{y})$.

$\vdots$

$k$. Draw $\theta_{k}^{(b)}\sim p(\theta_{k}\mid\theta_{1}^{(b)},\theta_{2}^{(b)},\ldots,\theta_{k-1}^{(b)},\bm{y})$.

This algorithm generates a dependent sequence of values of $\theta$, namely, $\theta^{(1)},\ldots,\theta^{(B)}$. In this random sequence, $\theta^{(b)}$ depends on $\theta^{(0)},\theta^{(1)},\ldots,\theta^{(b-1)}$ only through $\theta^{(b-1)}$, which means that, given $\theta^{(b-1)}$, $\theta^{(b)}$ is conditionally independent of $\theta^{(0)},\theta^{(1)},\ldots,\theta^{(b-2)}$ (this is the so-called Markov property). Finally, the target distribution is reached as $b\longrightarrow\infty$, no matter what starting value $\theta^{(0)}$ is chosen to start the algorithm (although some starting values are more convenient than others).
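
The steps above can be sketched for a simple two-parameter model. The example assumes a Normal sampling distribution with unknown mean and variance and semi-conjugate priors, $\mu\sim\textsf{N}(\mu_{0},\tau_{0}^{2})$ and $1/\sigma^{2}\sim\textsf{Gamma}(\nu_{0}/2,\nu_{0}\sigma_{0}^{2}/2)$, for which both full conditionals are standard distributions. The synthetic data, the hyperparameter values, and the function name are hypothetical.

```python
import random
import statistics

def gibbs_normal(y, mu0, tau0_sq, nu0, sigma0_sq, n_iter, seed=0):
    """Gibbs sampler for (mu, sigma^2) in a Normal model with semi-conjugate priors."""
    rng = random.Random(seed)
    n = len(y)
    ybar = statistics.fmean(y)
    mu, sigma_sq = ybar, statistics.variance(y)  # starting values theta^(0)
    draws = []
    for _ in range(n_iter):
        # 1. Draw mu | sigma^2, y ~ Normal(mu_n, tau_n^2).
        tau_n_sq = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
        mu_n = tau_n_sq * (mu0 / tau0_sq + n * ybar / sigma_sq)
        mu = rng.gauss(mu_n, tau_n_sq ** 0.5)
        # 2. Draw 1/sigma^2 | mu, y ~ Gamma(shape, rate), using the NEW value of mu.
        shape = (nu0 + n) / 2.0
        rate = (nu0 * sigma0_sq + sum((yi - mu) ** 2 for yi in y)) / 2.0
        # random.gammavariate is parameterized by shape and SCALE = 1 / rate.
        sigma_sq = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        draws.append((mu, sigma_sq))
    return draws

# Synthetic data generated only to exercise the sampler.
rng_data = random.Random(7)
y = [rng_data.gauss(10.0, 2.0) for _ in range(50)]

draws = gibbs_normal(y, mu0=0.0, tau0_sq=100.0, nu0=1.0, sigma0_sq=1.0, n_iter=5000)
kept = draws[500:]  # discard a burn-in period
mu_hat = statistics.fmean([d[0] for d in kept])
sigma_sq_hat = statistics.fmean([d[1] for d in kept])

print(f"posterior mean of mu:      {mu_hat:.3f} (sample mean {statistics.fmean(y):.3f})")
print(f"posterior mean of sigma^2: {sigma_sq_hat:.3f} (sample var {statistics.variance(y):.3f})")
```

Note that each draw conditions on the most recent value of the other parameter, which is exactly what makes the sequence a Markov chain.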

4.3.3 Metropolis-Hastings

Again, when it is difficult or even impossible to simulate from the posterior distribution directly, the Metropolis-Hastings algorithm provides a general setting to build a Markov chain through a series of “jumps” that generate a random sequence whose target distribution is the posterior distribution $p(\theta\mid\bm{y})$. Specifically, given the current state of the parameters $\theta^{(b-1)}=(\theta_{1}^{(b-1)},\ldots,\theta_{k}^{(b-1)})$, we can generate the next state $\theta^{(b)}$ from $\theta^{(b-1)}$, for $b=1,\ldots,B$, as follows:

1. Simulate a jump candidate $\theta^{*}$ around $\theta^{(b-1)}$ using a proposal distribution $J(\theta^{*}\mid\theta^{(b-1)})$. Usually, $J(\theta^{*}\mid\theta^{(b-1)})$ is taken to be symmetric, i.e., $J(\theta^{*}\mid\theta^{(b-1)})=J(\theta^{(b-1)}\mid\theta^{*})$ (in this case, the algorithm is simply known as the Metropolis algorithm). For instance, when $\theta$ is univariate, commonly used proposal distributions are $\textsf{N}(\theta^{*}\mid\theta^{(b-1)},\delta)$ and $\textsf{U}(\theta^{*}\mid\theta^{(b-1)}-\delta,\theta^{(b-1)}+\delta)$, where the tuning parameter $\delta$ is chosen to allow the algorithm to run efficiently. In practice, it is common to set $\delta$ in such a way that the proportion of accepted jumps lies roughly between 20% and 50%.

2. Compute the acceptance ratio

$\displaystyle r=\frac{p(\theta^{*}\mid\bm{y})/J(\theta^{*}\mid\theta^{(b-1)})}{p(\theta^{(b-1)}\mid\bm{y})/J(\theta^{(b-1)}\mid\theta^{*})}\,.$

If the proposal distribution is symmetric, then the acceptance ratio becomes

$\displaystyle r=\frac{p(\theta^{*}\mid\bm{y})}{p(\theta^{(b-1)}\mid\bm{y})}\,.$

Typically, $r$ is computed on the logarithmic scale in order to achieve numerical stability.

3. Determine the transition probability $\alpha=\min\{1,r\}$. Thus, if the candidate increases the posterior density, then it is accepted with probability 1. On the other hand, if the candidate does not increase the posterior density, then it is accepted with probability $r$.

4. Simulate $u\sim\textsf{U}(0,1)$.

5. Set $\theta^{(b)}=\theta^{*}$ if $u\leqslant\alpha$, and $\theta^{(b)}=\theta^{(b-1)}$ otherwise.
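
The steps above can be sketched with a random-walk Metropolis sampler for a univariate parameter. The example assumes Poisson counts with a Gamma(a, b) prior on the rate, a model whose exact posterior is Gamma(a + sum of counts, b + n), so the output can be checked; the counts, the hyperparameters, and the reading of $\delta$ as the proposal standard deviation are all assumptions.

```python
import math
import random
import statistics

def log_post(theta, total, n, a, b):
    """Unnormalized log-posterior: Poisson likelihood times Gamma(a, b) prior."""
    if theta <= 0:
        return -math.inf  # outside the support
    return (a - 1 + total) * math.log(theta) - (b + n) * theta

def metropolis(y, a, b, delta, n_iter, theta0=1.0, seed=0):
    """Random-walk Metropolis with a symmetric Normal proposal (std. dev. delta)."""
    rng = random.Random(seed)
    total, n = sum(y), len(y)
    theta = theta0
    draws, accepted = [], 0
    for _ in range(n_iter):
        cand = rng.gauss(theta, delta)                       # step 1: candidate
        log_r = (log_post(cand, total, n, a, b)
                 - log_post(theta, total, n, a, b))          # step 2: log ratio
        alpha = math.exp(min(0.0, log_r))                    # step 3: min{1, r}
        u = rng.random()                                     # step 4
        # Step 5; the strict inequality avoids accepting a zero-density candidate.
        if u < alpha:
            theta = cand
            accepted += 1
        draws.append(theta)
    return draws, accepted / n_iter

y = [2, 0, 3, 1, 4]  # hypothetical counts
draws, rate = metropolis(y, a=3.0, b=1.0, delta=1.5, n_iter=20000)
kept = draws[2000:]  # discard a burn-in period
post_mean = statistics.fmean(kept)

print(f"acceptance rate: {rate:.2f}")
print(f"posterior mean:  {post_mean:.3f} (exact 13/6 = {13 / 6:.3f})")
```

Changing `delta` and watching the acceptance rate is a quick way to see the tuning trade-off: small steps are accepted often but move slowly, large steps are rejected often.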

Again, it can be shown that the algorithm given above, regardless of the proposal distribution $J(\cdot\mid\cdot)$ and the initial value $\theta^{(0)}$, generates a Markov chain whose stationary distribution is the posterior distribution $p(\theta\mid\bm{y})$ (Gamerman & Lopes, 2006). See also our discussion about the Markov property given in the previous section.

4.3.4 Hamiltonian Monte Carlo

A possible inefficiency of the Gibbs sampler and the Metropolis-Hastings algorithm lies in their local random walk behavior (Gelman et al., 2014), which causes the chain to take too long to explore the posterior distribution. Such behavior leads to long convergence times, mainly when dealing with complex models such as those related to high-dimensional posterior distributions (Betancourt, 2019). The Hamiltonian Monte Carlo algorithm is an alternative to overcome such inefficiency.

This algorithm considers a “boost” (momentum) variable $\varphi$ to explore the target distribution more efficiently by moving along different trajectories, suppressing the local random walk motion described by other samplers (Betancourt, 2017, 2019). In a Hamiltonian algorithm, samples are drawn from the joint distribution $p(\theta,\varphi\mid\bm{y})=p(\theta\mid\bm{y})\,p(\varphi)$. However, only simulations of $\theta$ are of interest since $\varphi$ operates as an auxiliary variable. Specifically, given the current state of the parameters $\theta^{(b-1)}=(\theta_{1}^{(b-1)},\ldots,\theta_{k}^{(b-1)})$, we can generate the next state $\theta^{(b)}$ from $\theta^{(b-1)}$, for $b=1,\ldots,B$, as follows:

1. Simulate $\varphi\sim\textsf{N}(0,\mathbf{M})$, where $\mathbf{M}$ is a diagonal matrix representing the covariance matrix associated with the impulse (momentum) distribution $p(\varphi)$. Typically, $\mathbf{M}$ is chosen to be the identity matrix.

2. Update $(\theta,\varphi)$ using $L$ “jumps” scaled by a factor $\epsilon$. Specifically, in a given jump, both $\theta$ and $\varphi$ change relative to each other as follows:

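(a) Update $\varphi$ with a half step:

$\displaystyle\varphi\leftarrow\varphi+\frac{\epsilon}{2}\,\frac{\partial}{\partial\theta}\log p(\theta\mid\bm{y})\,.$

(b) Update $\theta$ with a full step:

$\displaystyle\theta\leftarrow\theta+\epsilon\,\mathbf{M}^{-1}\varphi\,.$

(c) Update $\varphi$ again with a half step, using the gradient evaluated at the new value of $\theta$:

$\displaystyle\varphi\leftarrow\varphi+\frac{\epsilon}{2}\,\frac{\partial}{\partial\theta}\log p(\theta\mid\bm{y})\,.$

(d) Repeat the above steps $L-1$ times.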

3. Let $\theta^{(b-1)}$ and $\varphi^{(b-1)}$ be the initial values of $\theta$ and $\varphi$, respectively, and $\theta^{*}$ and $\varphi^{*}$ the corresponding values after the $L$ steps. Compute the acceptance ratio

$\displaystyle r=\frac{p(\theta^{*}\mid\bm{y})\,p(\varphi^{*})}{p(\theta^{(b-1)}\mid\bm{y})\,p(\varphi^{(b-1)})}\,.$

4. Determine the transition probability $\alpha=\min\{1,r\}$.

5. Simulate $u\sim\textsf{U}(0,1)$.

6. Set $\theta^{(b)}=\theta^{*}$ if $u\leqslant\alpha$, and $\theta^{(b)}=\theta^{(b-1)}$ otherwise.
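
A minimal sketch of the Hamiltonian Monte Carlo steps for a univariate parameter is given below. It uses the standard leapfrog integrator as described in Gelman et al. (2014), which updates the momentum with a half step before and after each position update, and takes $\mathbf{M}=1$. The target is an assumed example: the posterior of $\eta=\log\theta$ when $\theta\mid\bm{y}\sim\textsf{Gamma}(13,6)$, whose log-density is $13\,\eta-6\,e^{\eta}$ up to a constant (Jacobian included). The values of `eps` and `n_leapfrog` are untuned, illustrative choices.

```python
import math
import random
import statistics

A, B_RATE = 13.0, 6.0  # assumed Gamma(13, 6) posterior for theta = exp(eta)

def log_p(eta):
    """Unnormalized log-posterior of eta = log(theta), Jacobian included."""
    return A * eta - B_RATE * math.exp(eta)

def grad_log_p(eta):
    """Derivative of log_p with respect to eta."""
    return A - B_RATE * math.exp(eta)

def hmc(theta0, eps, n_leapfrog, n_iter, seed=0):
    rng = random.Random(seed)
    theta = theta0
    draws, accepted = [], 0
    for _ in range(n_iter):
        phi0 = rng.gauss(0.0, 1.0)            # step 1: momentum with M = 1
        theta_new, phi = theta, phi0
        for _step in range(n_leapfrog):        # step 2: L leapfrog jumps
            phi += 0.5 * eps * grad_log_p(theta_new)   # half step for momentum
            theta_new += eps * phi                     # full step for position
            phi += 0.5 * eps * grad_log_p(theta_new)   # second half step
        # Step 3 on the log scale; log p(phi) = -phi^2 / 2 up to a constant.
        log_r = (log_p(theta_new) - 0.5 * phi ** 2) - (log_p(theta) - 0.5 * phi0 ** 2)
        if rng.random() < math.exp(min(0.0, log_r)):   # steps 4 to 6
            theta = theta_new
            accepted += 1
        draws.append(theta)
    return draws, accepted / n_iter

eta_draws, rate = hmc(theta0=0.0, eps=0.1, n_leapfrog=10, n_iter=5000)
theta_draws = [math.exp(e) for e in eta_draws[500:]]  # back-transform, drop burn-in
post_mean = statistics.fmean(theta_draws)

print(f"acceptance rate: {rate:.2f}")
print(f"posterior mean of theta: {post_mean:.3f} (exact 13/6 = {13 / 6:.3f})")
```

Working on the log scale keeps the parameter unconstrained, so the leapfrog trajectory cannot leave the support; this is a design choice of the sketch, not a requirement of the algorithm.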

The tuning parameters $\epsilon$ and $L$ are chosen to allow the algorithm to run efficiently. In practice, it is common to set them in such a way that the proportion of accepted jumps lies roughly between 60% and 70%. See Gelman et al. (2014) for more details about the choice of $\epsilon$, $L$ and $\mathbf{M}$.

4.4 Goodness of fit

After establishing the structure of the model and approximating the posterior distribution $p(\theta\mid\bm{y})$, it is convenient to evaluate the model’s fit, aiming to detect misleading inferences due to a poor model fit. Formally, the model’s goodness of fit can be assessed through external validation tests, which consist in generating hypothetical replicas of the data, say $\bm{y}^{\text{rep}}$, through the posterior predictive distribution, $p(\bm{y}^{\text{rep}}\mid\bm{y})=\int_{\Theta}p(\bm{y}^{\text{rep}}\mid\theta)\,p(\theta\mid\bm{y})\,\text{d}\theta$. Then, such replicated data are directly compared with the observed data. If the model fits the data well, then the replicated data should behave similarly to the observed data.

Usually, the discrepancy between model and data is examined through a set of test statistics (e.g., measures of trend and variability), say $t(\bm{y})$. These quantities are used as metrics to compare the predictive simulations with their corresponding observed values. In addition, such quantities allow us to identify the relevant aspects of the data that are reasonably reproduced by the proposed model. The lack of fit of the data with respect to the posterior predictive distribution is measured by the posterior predictive $p$ value, $\text{ppp}=\textsf{Pr}(t(\bm{y}^{\text{rep}})>t(\bm{y})\mid\bm{y})$, which can be interpreted as the probability that the replicated data are more extreme than the observed data (in terms of the test statistic). Thus, the model fits the data well regarding the test statistic $t(\bm{y})$ if and only if the corresponding ppp does not take extreme values, i.e., values close to 0 or 1 (Gelman et al., 2014).
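
The posterior predictive $p$ value can be approximated by simulation, as in the following sketch. It assumes a Poisson model with a Gamma(a, b) prior and uses the sample variance as test statistic, which is sensitive to overdispersion. Both datasets are made up for illustration, and the Poisson generator is a simple implementation of Knuth's method because the Python standard library does not provide one.

```python
import math
import random

def rpois(rng, lam):
    """Poisson draw via Knuth's multiplication method (adequate for small lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_variance(x):
    mean = sum(x) / len(x)
    return sum((xi - mean) ** 2 for xi in x) / (len(x) - 1)

def ppp_variance(y, a, b, n_rep, seed=0):
    """Approximate Pr(t(y_rep) > t(y) | y) with t = sample variance."""
    rng = random.Random(seed)
    n, total = len(y), sum(y)
    t_obs = sample_variance(y)
    exceed = 0
    for _ in range(n_rep):
        # Posterior draw from Gamma(a + total, rate b + n); scale = 1 / rate.
        theta = rng.gammavariate(a + total, 1.0 / (b + n))
        y_rep = [rpois(rng, theta) for _ in range(n)]  # replicated dataset
        if sample_variance(y_rep) > t_obs:
            exceed += 1
    return exceed / n_rep

# Made-up counts: one dataset compatible with a Poisson model, one overdispersed.
y_ok = [2, 3, 1, 4, 2, 0, 3, 2, 1, 2, 5, 3, 2, 1, 3, 2, 4, 1, 2, 3]
y_over = [0] * 15 + [8, 9, 10, 11, 12]

ppp_ok = ppp_variance(y_ok, a=1.0, b=1.0, n_rep=2000)
ppp_over = ppp_variance(y_over, a=1.0, b=1.0, n_rep=2000)

print(f"ppp, Poisson-like data:  {ppp_ok:.3f}")
print(f"ppp, overdispersed data: {ppp_over:.3f}")
```

For the overdispersed dataset the replicated variances essentially never reach the observed one, so the value collapses toward 0 and flags that a Poisson model cannot reproduce that aspect of the data.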

4.5 Model comparison

Information criteria allow us to evaluate and compare models through their predictive performance. Popular alternatives include the Deviance Information Criterion (DIC, see Gelman et al., 2014; Spiegelhalter et al., 2002) and the Watanabe-Akaike Information Criterion (WAIC, see Gelman et al., 2014; Watanabe, 2013).

The DIC is defined as

$\displaystyle\text{DIC}=-2\log p(\bm{y}\mid\hat{\theta}_{\text{Bayes}})+2p_{\text{DIC}}\,,$

where $\hat{\theta}_{\text{Bayes}}=\textsf{E}(\theta\mid\bm{y})\approx\frac{1}{B}\sum_{b=1}^{B}\theta^{(b)}$ is the posterior mean of $\theta$, and $p_{\text{DIC}}$ is the effective number of parameters,

$\displaystyle p_{\text{DIC}}=2\left[\log p(\bm{y}\mid\hat{\theta}_{\text{Bayes}})-\textsf{E}(\log p(\bm{y}\mid\theta)\mid\bm{y})\right]\approx 2\left[\log p(\bm{y}\mid\hat{\theta}_{\text{Bayes}})-\frac{1}{B}\sum_{b=1}^{B}\log p\left(\bm{y}\mid\theta^{(b)}\right)\right]\,.$

On the other hand, the WAIC is defined as

$\displaystyle\text{WAIC}=-2\text{lppd}+2p_{\text{WAIC}},$

where

$\displaystyle\text{lppd}=\log\prod_{i=1}^{n}p(y_{i}\mid\bm{y})=\sum_{i=1}^{m}% \log\int_{\Theta}p(y_{i}\mid\theta)\,p(\theta\mid\bm{y})\,\text{d}\theta% \approx\sum_{i=1}^{n}\log\left(\frac{1}{B}\sum_{b=1}^{B}p(y_{i}\mid\theta^{(b)% })\right)$

is the posterior predictive distribution in logarithmic scale, which summarizes the predictive ability of the model fitted to the data. The corresponding effective number of parameters is given by

$\displaystyle p_{\text{WAIC}}=2\sum_{i=1}^{n}\left[\log\left(\textsf{E}(p(y_{i% }\mid\theta)\mid\bm{y})\right)-\textsf{E}\left(\log(p(y_{i}\mid\theta)\mid\bm{% y})\right)\right]\,,$

which in practice can be calculated as

$\displaystyle p_{\text{WAIC}}\approx 2\sum_{i=1}^{n}\left[\log\left(\frac{1}{B}\sum_{b=1}^{B}p(y_{i}\mid\theta^{(b)})\right)-\frac{1}{B}\sum_{b=1}^{B}\log p(y_{i}\mid\theta^{(b)})\right]\,.$

When comparing models, lower DIC and WAIC values imply higher predictive accuracy.
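The Monte Carlo approximations of the DIC and the WAIC given above can be computed directly from posterior draws. The following is a minimal sketch under stated assumptions: the helper names (`log_mean_exp`, `dic`, `waic`) are hypothetical, and the example uses synthetic Normal data with known unit variance and a flat prior on the mean, so that the posterior is $\textsf{N}(\bar{y},1/n)$ and can be sampled directly.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log of the mean of exp(a) along an axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.mean(np.exp(a - a_max), axis=axis))

def dic(loglik_at_post_mean, loglik_draws):
    """DIC from log p(y | theta_hat) and the B values log p(y | theta^(b))."""
    p_dic = 2.0 * (loglik_at_post_mean - np.mean(loglik_draws))
    return -2.0 * loglik_at_post_mean + 2.0 * p_dic, p_dic

def waic(pointwise_loglik):
    """WAIC from a (B, n) matrix with entries log p(y_i | theta^(b))."""
    lppd_i = log_mean_exp(pointwise_loglik, axis=0)            # one term per observation
    p_waic = 2.0 * np.sum(lppd_i - np.mean(pointwise_loglik, axis=0))
    lppd = np.sum(lppd_i)
    return -2.0 * lppd + 2.0 * p_waic, p_waic

# Synthetic example (assumption): y_i ~ N(mu, 1) with a flat prior on mu.
rng = np.random.default_rng(1)
n, B = 100, 4000
y = rng.normal(0.0, 1.0, size=n)
mu = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=B)           # posterior draws of mu

# (B, n) matrix of pointwise log-likelihoods.
pointwise = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu[:, None]) ** 2

mu_hat = mu.mean()                                             # posterior mean estimate
ll_hat = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu_hat) ** 2)

dic_value, p_dic = dic(ll_hat, pointwise.sum(axis=1))
waic_value, p_waic = waic(pointwise)
print(f"DIC = {dic_value:.2f} (p_DIC = {p_dic:.2f}), WAIC = {waic_value:.2f} (p_WAIC = {p_waic:.2f})")
```

In this one-parameter example both penalty terms should be close to 1, which is a convenient sanity check before applying the same functions to more complex models.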

Although the DIC is widely used as a model selection tool, it has several disadvantages compared to the WAIC. Common criticisms are that the penalty term, $p_{\text{DIC}}$ , is not invariant to reparameterization; that the DIC may not be consistent with identical replicates of the same experiment; and that the DIC is not based on a completely Bayesian predictive criterion (see Spiegelhalter et al., 2014, for more details). The WAIC addresses many of these criticisms. In particular, the WAIC is invariant to reparameterizations, which makes it useful in the case of models with hierarchical structures, in which the number of parameters increases with the sample size (Spiegelhalter et al., 2014).

5. Case studies

This section illustrates the Bayesian methodologies described in previous sections with three case studies. First, we exemplify the Monte Carlo principle using IID sampling in the context of a multinomial-Dirichlet model. Then, we illustrate the Metropolis-Hastings algorithm, the Hamiltonian Monte Carlo algorithm, and goodness-of-fit methods in the context of a generalized linear model for count data. Finally, we show the Gibbs sampler and information criteria metrics in the context of hierarchical linear regression models. The interested reader may request the code to reproduce all the examples from any of the authors.

5.1 Political survey: A Multinomial-Dirichlet model

We implement a Multinomial-Dirichlet model to analyze the 2022 Colombian Presidential Consultations. In Colombia, Presidential Consultations are basically open primary elections in which voters (general citizens) can indicate their preference for their party’s candidate in the upcoming presidential elections. For those readers unfamiliar with the Colombian political system, Colombian Presidential Consultations work almost identically to the primary elections in the United States. The Multinomial-Dirichlet model allows us to estimate the population share of votes that each candidate will receive based on the data provided by a national pollster.

Table 1
Invamer’s survey results about party consultations in Colombia 2022

Pacto Histórico
G. Petro	F. Márquez	C. Romero	A. U. Guariyú	A. Saade	$n$
322	56	24	7	1	410
Coalición Equipo por Colombia
F. Gutiérrez	A. Char	E. Peñalosa	D. Barguil	A. Lizarazo	$n$
51	43	33	27	22	176
Coalición Centro Esperanza
S. Fajardo	J. M. Galán	C. Amaya	A. Gaviria	J. E. Robledo	$n$
45	28	18	15	13	119

The independent media company Valora Analitik reported that “after adding up the differences between the latest polls and the results given in Election Day, Invamer is the pollster that was closest in its predictions, followed by Guarumo and EcoAnalítica, and in third place, the CNC. The pollster furthest away from the results was Yanhaas, in fourth place” (https://www.valoraanalitik.com/2022/03/14/ranking-encuestadoras-elecciones-marzo-colombia-2022/). Consequently, we use the Invamer results to illustrate the way a Multinomial-Dirichlet model is implemented. The Invamer survey was conducted at the end of February 2022 (https://es.scribd.com/document/562600199/Invamer-Marzo-2022). It involved the participation of 1504 men and women aged 18 and over, representing diverse socio-economic levels across the country, including urban and rural areas. This survey seeks to gather information about participants’ preferences in the presidential consultations for Colombia’s elections in 2022. In Table 1, data corresponds to respondents who indicated their definite or probable vote for each party’s consultation. It is important to note that this count does not include undecided voters.

Although Invamer uses a particular kind of random sampling without replacement, it is customary to consider such a sample as a simple random sample with replacement, given that the total sample size is very small compared to the size of the Universe. Under the conditions given above and given that our uncertainty about the responses of the 1504 people in the survey is exchangeable, a particular version of De Finetti’s Theorem (Bernardo & Smith, 2000, p. 176) guarantees that the only sampling distribution appropriate for data of this nature is the Multinomial distribution. Below we describe the modeling approach as well as the results in the context of the political landscape in Colombia.

The population of interest consists of items categorized into $k\geqslant 2$ types, where each type $j$ has a proportion denoted by $0<\theta_{j}<1$ , with $j=1,\ldots,k$ . The components of $\bm{\theta}=(\theta_{1},\ldots,\theta_{k})$ are such that $\sum_{j=1}^{k}\theta_{j}=1$ . Now, an IID sample $\bm{y}=(y_{1},\ldots,y_{n})$ of size $n$ is taken from the population. Let $\bm{n}=(n_{1},\ldots,n_{k})$ be the random vector that represents the counts associated with each type of item. Here, $n_{j}$ denotes the number of elements in the random sample that belong to type $j$ , for $j=1,\ldots,k$ . In this context, for each political party, we have $k=5$ categories corresponding to candidates. Take for example the Pacto Histórico data, in which the observed counts are $n_{1}=322$ (G. Petro), $n_{2}=56$ (F. Márquez), $n_{3}=24$ (C. Romero), $n_{4}=7$ (A. U. Guariyú), and $n_{5}=1$ (A. Saade), resulting in a sample size of $n=\sum_{j=1}^{k}n_{j}=410$ . The setting is analogous for the other parties. Under this setting, $\bm{n}$ has a Multinomial distribution with parameters $n$ and $\bm{\theta}$ , which is defined as follows: $\bm{n}\mid n,\bm{\theta}\sim\textsf{Mult}(n,\bm{\theta})$ if and only if

$\displaystyle p(\bm{n}\mid n,\bm{\theta})=\frac{n!}{\textstyle\prod_{j=1}^{k}n_{j}!}\prod_{j=1}^{k}\theta_{j}^{n_{j}}$ (1)

provided that $\sum_{j=1}^{k}n_{j}=n$ and $0\leqslant n_{j}\leqslant n$ , for $j=1,\dots,k$ . To make inferences about $\bm{\theta}$ , we consider the model with sampling distribution $\bm{n}\mid n,\bm{\theta}\sim\textsf{Mult}(n,\bm{\theta})$ and the prior distribution $\bm{\theta}\sim\textsf{Dir}(a_{1},\ldots,a_{k})$ , i.e.,

$\displaystyle p(\bm{\theta})=\frac{\Gamma\left(\textstyle\sum_{j=1}^{k}a_{j}\right)}{\textstyle\prod_{j=1}^{k}\Gamma(a_{j})}\prod_{j=1}^{k}\theta_{j}^{a_{j}-1}\,,$ (2)

where $a_{1},\ldots,a_{k}$ are the hyperparameters of the model.

Using Eqs (1) and (2), a direct application of Bayes’ Theorem states that the posterior distribution of $\bm{\theta}$ is such that

$\displaystyle p(\bm{\theta}\mid\bm{n})\propto p(\bm{n}\mid\bm{\theta})\cdot p(\bm{\theta})\propto\prod_{j=1}^{k}\theta_{j}^{n_{j}}\times\prod_{j=1}^{k}\theta_{j}^{a_{j}-1}=\prod_{j=1}^{k}\theta_{j}^{n_{j}+a_{j}-1}\,,$

which corresponds to the kernel of a Dirichlet distribution with parameters $a_{1}+n_{1},\ldots,a_{k}+n_{k}$ . Therefore, we get that $\bm{\theta}\mid\bm{n}\sim\textsf{Dir}(a_{1}+n_{1},\ldots,a_{k}+n_{k})$ , i.e., the family of Dirichlet distributions is conjugate to the Multinomial sampling distribution (see Section 4.2 for more details). Finally, we illustrate a typical property of conjugate models. Given that the expected value of the $j$ -th component of a random vector with $\textsf{Dirichlet}(c_{1},\ldots,c_{k})$ distribution is $c_{j}/c^{*}$ , with $c^{*}=\sum_{j=1}^{k}c_{j}$ , the posterior mean of $\theta_{j}$ is given by

$\displaystyle\textsf{E}(\theta_{j}\mid\bm{n})=\frac{a_{j}+n_{j}}{\sum_{\ell=1}^{k}(a_{\ell}+n_{\ell})}=\frac{a_{j}+n_{j}}{a^{*}+n}=\frac{a^{*}}{a^{*}+n}\cdot\frac{a_{j}}{a^{*}}+\frac{n}{a^{*}+n}\cdot\frac{n_{j}}{n}\,,$

where $a^{*}=\sum_{j=1}^{k}a_{j}$ and $n=\sum_{j=1}^{k}n_{j}$ , and consequently, the posterior mean of $\theta_{j}$ corresponds to a weighted mean between the prior mean of $\theta_{j}$ and the sample mean of category $j$ , for $j=1,\ldots,k$ .

Since the posterior distribution of $\bm{\theta}$ is Dirichlet, it immediately follows that the marginal posterior distribution of each $\theta_{j}$ is $\textsf{Beta}(a_{j}+n_{j},\,a^{*}-a_{j}+n-n_{j})$ , for $j=1,\ldots,k$ . Thus, any posterior quantity of interest about a single proportion can be computed analytically without going through any sort of Monte Carlo machinery. For illustrative purposes, we present here the results of fitting the Multinomial-Dirichlet model using both Monte Carlo simulation and the exact analytical distribution (Beta). When using Monte Carlo, we draw 50000 IID samples from the posterior distribution of $\bm{\theta}$ to estimate the proportion of votes for the candidates of Pacto Histórico, Coalición Equipo por Colombia, and Coalición Centro Esperanza. Either way, we use $a_{1}=\ldots=a_{k}=\frac{1}{2}$ (this choice of hyperparameters corresponds to Jeffreys’ prior; Gelman, 2009). In Appendix A we present an algorithm to simulate IID samples from the Dirichlet distribution.
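As an illustration, the posterior computations for the Pacto Histórico consultation can be sketched as follows. This is not the authors' code: instead of the algorithm of Appendix A, the sketch relies on NumPy's built-in Dirichlet generator, and the seed and variable names are arbitrary choices. The counts are those reported in Table 1.

```python
import numpy as np

counts = np.array([322, 56, 24, 7, 1])     # Pacto Histórico counts from Table 1
a = np.full(counts.size, 0.5)              # hyperparameters a_1 = ... = a_k = 1/2
post = a + counts                          # parameters of the Dirichlet posterior

# Monte Carlo: 50000 IID draws from the Dirichlet posterior.
rng = np.random.default_rng(2022)          # arbitrary seed (assumption)
draws = rng.dirichlet(post, size=50_000)   # shape (50000, 5)

mc_mean = draws.mean(axis=0)
mc_ci = np.quantile(draws, [0.025, 0.975], axis=0)   # percentile-based 95% intervals

# Exact posterior mean: (a_j + n_j) / (a* + n).
exact_mean = post / post.sum()

# Same quantity written as a weighted mean of the prior mean and the sample proportion.
a_star, n = a.sum(), counts.sum()
weighted = a_star / (a_star + n) * (a / a_star) + n / (a_star + n) * (counts / n)

for j, name in enumerate(["G. Petro", "F. Márquez", "C. Romero", "A. U. Guariyú", "A. Saade"]):
    print(f"{name:14s} MC mean {100 * mc_mean[j]:6.2f}  "
          f"95% CI ({100 * mc_ci[0, j]:.2f}, {100 * mc_ci[1, j]:.2f})  "
          f"exact mean {100 * exact_mean[j]:6.2f}")
```

For G. Petro, the exact posterior mean is $322.5/412.5\approx 0.7818$ , and the weight of the prior mean is only $2.5/412.5\approx 0.006$ , so the estimate is driven almost entirely by the data.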

In Table 2, we compare our results (posterior mean) with the final report of the Registraduría Nacional del Estado Civil, which is the value observed on Election Day (https://resultados.registraduria.gov.co/). We see that for Pacto Histórico and Coalición Centro Esperanza candidates, all credible intervals contain the observed value. On the other hand, for Coalición Equipo por Colombia, none of the intervals, except the one corresponding to candidate D. Barguil, includes the observed value. We strongly believe that this happened because of unexpected political changes prior to Election Day. Recall that according to Colombian law, polling firms can only release polls up to one week before Election Day. During the week right before the elections, both A. Char and E. Peñalosa made several controversial public statements, which ended up affecting their final results in the primary elections.

Table 2

Observed value, posterior mean, and lower ( $2.5\%$ ) and upper ( $97.5\%$ ) limits of a 95% credible interval based on percentiles for each candidate of each political group, using both Monte Carlo simulation and the exact analytical distribution (Beta). Quantities expressed in percentage points (%)

Consultation	Candidate	Observed	Monte Carlo mean	Monte Carlo 2.5%	Monte Carlo 97.5%	Exact mean	Exact 2.5%	Exact 97.5%
Pacto Histórico	G. Petro	80.50	78.18	74.08	82.02	78.47	74.37	82.30
Pacto Histórico	F. Márquez	14.05	13.70	10.55	17.19	13.75	10.59	17.23
Pacto Histórico	C. Romero	4.06	5.94	3.87	8.42	5.96	3.89	8.44
Pacto Histórico	A. U. Guariyú	0.98	1.82	0.76	3.34	1.82	0.77	3.32
Pacto Histórico	A. Saade	0.38	0.36	0.03	1.12	0.36	0.03	1.13
Coalición Equipo por Colombia	F. Gutiérrez	54.18	28.85	22.39	35.68	29.10	22.66	35.98
Coalición Equipo por Colombia	A. Char	17.72	24.37	18.40	30.96	24.58	18.54	31.16
Coalición Equipo por Colombia	E. Peñalosa	5.80	18.77	13.42	24.84	18.93	13.52	25.00
Coalición Equipo por Colombia	D. Barguil	15.77	15.41	10.52	21.09	15.54	10.60	21.21
Coalición Equipo por Colombia	A. Lizarazo	6.51	12.61	8.14	17.87	12.71	8.23	17.99
Coalición Centro Esperanza	S. Fajardo	33.50	37.45	29.11	46.08	37.92	29.48	46.74
Coalición Centro Esperanza	J. M. Galán	22.55	23.46	16.39	31.34	23.75	16.60	31.72
Coalición Centro Esperanza	C. Amaya	20.89	15.23	9.46	22.18	15.42	9.56	22.37
Coalición Centro Esperanza	A. Gaviria	15.58	12.76	7.42	19.18	12.92	7.56	19.44
Coalición Centro Esperanza	J. E. Robledo	7.46	11.11	6.13	17.30	11.25	6.26	17.46

5.2 Population dynamics: A Poisson regression model

In this study, we revisit the investigation conducted by Arcese et al. (1992) on the reproductive activities of $n=52$ female sparrows during the summer. The research was later revisited by Hoff (2009, Chap. 10), who applied the Bayesian approach to analyze the data. We study the number of offspring as a function of age through a Poisson regression model. Although this application is typical of biostatistics, it is also interesting from the point of view of Social Sciences because it is strongly related to reproductive patterns and population dynamics. In this case, we illustrate the Metropolis-Hastings algorithm along with the Hamiltonian Monte Carlo algorithm for obtaining samples from the posterior distribution.

Given that the number of offspring is a count variable, we propose to model this variable as a function of age employing the following model:

$\displaystyle y_{i}\mid\theta_{i}\overset{\text{ind}}{\sim}\textsf{Poisson}(\theta_{i})\,,$ (3)

where $y_{i}$ is the number of offspring of sparrow $i$ , for $i=1,\ldots,n$ , $\eta_{i}=\log(\theta_{i})=\sum_{j=1}^{k}\beta_{j}\,x_{i,j}=\bm{\beta}^{\textsf{T}}\bm{x}_{i}$ is the linear predictor associated with the patterns in the data related to the fixed effects, with $\bm{\beta}=(\beta_{1},\ldots,\beta_{k})$ and $\bm{x}_{i}=(x_{i,1},\ldots,x_{i,k})$ , and finally, $x_{i,j}$ is the predictor $j$ observed in individual $i$ , for $i=1,\ldots,n$ and $j=1,\ldots,k$ . This formulation constitutes a generalized linear model (GLM, McCullagh, 2018) with a logarithmic link function.

A plot of the number of offspring versus age suggests that the number of offspring varies with age according to a concave relationship (Hoff, 2009, p. 172). For this reason, we specify a linear predictor using a quadratic function of the form $\eta_{i}=\beta_{1}+\beta_{2}\,\text{age}_{i}+\beta_{3}\,\text{age}^{2}_{i}$ , so $k=3$ , $\bm{\beta}=(\beta_{1},\beta_{2},\beta_{3})$ and $\bm{x}_{i}=(x_{i,1},x_{i,2},x_{i,3})$ , with $x_{i,1}=1$ , $x_{i,2}=\text{age}_{i}$ , and $x_{i,3}=\text{age}^{2}_{i}$ , for $i=1,\ldots,n$ . In addition, we observe that the distribution (3) may be restrictive since under this formulation, we have that $\textsf{E}(y_{i}\mid\theta_{i})=\textsf{Var}(y_{i}\mid\theta_{i})=\theta_{i}$ . For this reason, we recommend examining the model’s goodness of fit through relevant test statistics (see Section 4.4 for more details). Other popular alternatives to the Poisson distribution are the Negative Binomial distribution (overdispersion: the variance is greater than the expected value) and the Conway-Maxwell-Poisson distribution (underdispersion: the variance is less than the expected value).

To complete the model specification with sampling distribution (3), it is necessary to specify a prior distribution for $\bm{\beta}$ . Except for the Normal regression model, there are generally no conjugate priors for regression parameters when working with GLMs. However, a standard family of prior distributions that works well in practice is the family of multivariate Normal distributions, so we let $\bm{\beta}\sim\textsf{N}(\bm{\beta}_{0},\mathbf{\Sigma}_{0})$ as a random mechanism to specify the external information about $\bm{\beta}$ . Consequently, the model parameters are $\beta_{1},\ldots,\beta_{k}$ and the model hyper-parameters are $\bm{\beta}_{0}$ and $\mathbf{\Sigma}_{0}$ .

In this case, the posterior distribution of $\bm{\beta}$ is

$\displaystyle p(\bm{\beta}\mid\bm{y})\propto\prod_{i=1}^{n}e^{-\theta_{i}}\,\theta_{i}^{y_{i}}\times\exp\left\{-{\textstyle\frac{1}{2}}\bm{\beta}^{\textsf{T}}\mathbf{\Sigma}_{0}^{-1}\bm{\beta}+\bm{\beta}^{\textsf{T}}\mathbf{\Sigma}_{0}^{-1}\bm{\beta}_{0}\right\}\,,$

with $\bm{y}=(y_{1},\ldots,y_{n})$ and $\theta_{i}=\exp{\left(\bm{\beta}^{\textsf{T}}\bm{x}_{i}\right)}$ , for $i=1,\ldots,n$ , or equivalently in logarithmic scale,

$\displaystyle\log p(\bm{\beta}\mid\bm{y})=\bm{\beta}^{\textsf{T}}\sum_{i=1}^{n}y_{i}\bm{x}_{i}-\sum_{i=1}^{n}\exp{\left(\bm{\beta}^{\textsf{T}}\bm{x}_{i}\right)}-{\textstyle\frac{1}{2}}\bm{\beta}^{\textsf{T}}\mathbf{\Sigma}_{0}^{-1}\bm{\beta}+\bm{\beta}^{\textsf{T}}\mathbf{\Sigma}_{0}^{-1}\bm{\beta}_{0}+\mathrm{C}\,,$ (4)

where $\mathrm{C}$ is a constant that does not depend on $\bm{\beta}$ , and consequently the corresponding gradient is

$\displaystyle\frac{\partial}{\partial\bm{\beta}}\log p(\bm{\beta}\mid\bm{y})=\sum_{i=1}^{n}\left(y_{i}-\exp{\left(\bm{\beta}^{\textsf{T}}\bm{x}_{i}\right)}\right)\bm{x}_{i}-\mathbf{\Sigma}_{0}^{-1}\left(\bm{\beta}-\bm{\beta}_{0}\right)\,.$ (5)
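The log-posterior in Eq. (4) and its gradient can be coded in a few lines. The sketch below differentiates Eq. (4) term by term, so the gradient includes both the likelihood score and the contribution of the Normal prior, and it checks the result against central finite differences. The function names, the synthetic data, and the coefficient values used to generate them are assumptions made for illustration only; they are not the sparrow data.

```python
import numpy as np

def log_posterior(beta, X, y, beta0, Sigma0_inv):
    """Log-posterior of Eq. (4), up to the additive constant C."""
    eta = X @ beta
    return (beta @ (X.T @ y) - np.sum(np.exp(eta))
            - 0.5 * beta @ Sigma0_inv @ beta + beta @ Sigma0_inv @ beta0)

def grad_log_posterior(beta, X, y, beta0, Sigma0_inv):
    """Gradient of Eq. (4): likelihood score plus the Normal prior term."""
    eta = X @ beta
    return X.T @ (y - np.exp(eta)) - Sigma0_inv @ (beta - beta0)

# Synthetic data (assumption): ages 1 to 6 and hypothetical coefficients.
rng = np.random.default_rng(3)
age = rng.integers(1, 7, size=52).astype(float)
X = np.column_stack([np.ones_like(age), age, age ** 2])
beta_true = np.array([0.3, 0.7, -0.13])
y = rng.poisson(np.exp(X @ beta_true))

beta0 = np.zeros(3)
Sigma0_inv = np.eye(3) / 10.0              # Sigma_0 = 10 * I_3

# Central finite differences at an arbitrary point.
beta = np.array([0.1, 0.2, -0.05])
eps = 1e-6
numeric = np.array([
    (log_posterior(beta + eps * e, X, y, beta0, Sigma0_inv)
     - log_posterior(beta - eps * e, X, y, beta0, Sigma0_inv)) / (2 * eps)
    for e in np.eye(3)
])
analytic = grad_log_posterior(beta, X, y, beta0, Sigma0_inv)
print(np.max(np.abs(numeric - analytic)))
```

A check of this kind is inexpensive and useful before running Hamiltonian Monte Carlo, since an incorrect gradient typically shows up only indirectly as a poor acceptance rate.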

We note that $p(\bm{\beta}\mid\bm{y})$ does not correspond to any parametric family of standard distributions, which motivates the use of specialized algorithms to explore this posterior distribution through dependent random sequences. In particular, the Metropolis-Hastings algorithm and the Hamiltonian Monte Carlo algorithm allow us to empirically approximate $p(\bm{\beta}\mid\bm{y})$ through a sequence of values $\bm{\beta}^{(1)},\ldots,\bm{\beta}^{(B)}$ generated in a Markovian manner (see Section 4.3 for more details). Details about these algorithms are provided in Appendix B Poisson regression.

In this case, we fit the model using a non-informative prior, by letting $\beta_{j}\overset{\text{IID}}{\sim}\textsf{N}(0,10)$ , for $j=1,2,3$ , i.e., $\bm{\beta}\sim\textsf{N}(\bm{\beta}_{0},\mathbf{\Sigma}_{0})$ , where $\bm{\beta}_{0}=\bm{0}_{3}$ and $\mathbf{\Sigma}_{0}=10\,\mathbf{I}_{3}$ . We choose as initial value $\bm{\beta}^{(0)}=\bm{0}_{3}$ . Then, we run the algorithms using $10000$ iterations after a burn-in period of $1000$ iterations. On the one hand, in order to implement the Metropolis-Hastings algorithm, we use $\mathbf{\Delta}_{0}=\mathrm{c}\,(\mathbf{X}^{\textsf{T}}\mathbf{X})^{-1}$ , with $\mathrm{c}=0.7$ and $\mathbf{X}=[\bm{x}_{1},\dots,\bm{x}_{n}]^{\textsf{T}}$ . On the other hand, in order to implement the Hamiltonian Monte Carlo algorithm, we use $L=100$ , $\epsilon=0.01$ , and $\mathbf{M}=\mathbf{I}_{3}$ . These settings lead to favorable acceptance rates of $38\%$ and $66\%$ , respectively (see Gelman et al., 2014, Chap. 12 for more details about the selection of tuning parameters).
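A random-walk Metropolis sampler with these settings can be sketched as follows. This is an illustrative implementation, not the one given in Appendix B: it assumes that $\mathbf{\Delta}_{0}$ plays the role of the covariance matrix of a symmetric Normal proposal centered at the current state, and it runs on synthetic data generated from hypothetical coefficients rather than on the sparrow data.

```python
import numpy as np

def log_posterior(beta, X, y, beta0, Sigma0_inv):
    """Log-posterior of the Poisson regression model, up to a constant."""
    eta = X @ beta
    return (beta @ (X.T @ y) - np.sum(np.exp(eta))
            - 0.5 * beta @ Sigma0_inv @ beta + beta @ Sigma0_inv @ beta0)

def metropolis(X, y, beta0, Sigma0, c=0.7, n_iter=10_000, burn_in=1_000, seed=0):
    """Random-walk Metropolis with proposal N(beta, c * (X^T X)^{-1})."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    Sigma0_inv = np.linalg.inv(Sigma0)
    chol = np.linalg.cholesky(c * np.linalg.inv(X.T @ X))   # proposal scale
    beta = np.zeros(p)                                      # initial value beta^(0) = 0
    lp = log_posterior(beta, X, y, beta0, Sigma0_inv)
    draws = np.empty((n_iter, p))
    accepted = 0
    for s in range(burn_in + n_iter):
        proposal = beta + chol @ rng.standard_normal(p)
        lp_prop = log_posterior(proposal, X, y, beta0, Sigma0_inv)
        # Symmetric proposal: accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < lp_prop - lp:
            beta, lp = proposal, lp_prop
            if s >= burn_in:
                accepted += 1
        if s >= burn_in:
            draws[s - burn_in] = beta
    return draws, accepted / n_iter

# Synthetic data (assumption): 200 individuals, ages 1 to 6, hypothetical coefficients.
rng = np.random.default_rng(4)
age = rng.integers(1, 7, size=200).astype(float)
X = np.column_stack([np.ones_like(age), age, age ** 2])
y = rng.poisson(np.exp(X @ np.array([0.3, 0.7, -0.13])))

draws, acc_rate = metropolis(X, y, beta0=np.zeros(3), Sigma0=10.0 * np.eye(3))
print("acceptance rate:", round(acc_rate, 3))
print("posterior means:", draws.mean(axis=0).round(3))
```

The constant `c` controls the trade-off between step size and acceptance rate; the acceptance rate obtained here depends on the synthetic data and need not match the one reported for the sparrow data.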

Figure 1 shows the Markov chains associated with $p(\bm{\beta}\mid\bm{y})$ . We observe no evidence of a lack of convergence. Furthermore, we notice that the Hamiltonian algorithm produces chains with better mixing properties than the Metropolis-Hastings algorithm (this is expected given that the Hamiltonian’s convergence rate is higher than Metropolis’s). Finally, both the effective sample sizes and the Monte Carlo errors presented in Table 3 confirm that these chains are appropriate to make inferences about the parameters of interest (again, it is evident that the Hamiltonian algorithm is more efficient in exploring the posterior distribution).

Table 3

Effective sample sizes and Monte Carlo errors corresponding to the Markov chains associated with the posterior distribution $p(\bm{\beta}\mid\bm{y})$ of the Poisson regression model

Parameter	Effective size		MC error
	Metropolis	Hamiltonian	Metropolis	Hamiltonian
$\beta_{1}$	802.2	6429.1	0.016	0.005
$\beta_{2}$	729.1	6407.7	0.013	0.004
$\beta_{3}$	665.7	6233.2	0.002	0.001

Figure 1.

Markov chains associated with the posterior distribution $p(\bm{\beta}\mid\bm{y})$ of the Poisson regression model.

Panels (a), (b), and (c) of Fig. 2 display the posterior distribution of the regression coefficients, accompanied by the respective point estimate and a 95% credible interval based on percentiles. Our findings indicate that age and age-squared effects are significant (credible intervals do not contain 0). Furthermore, the signs of the point estimates of $\beta_{2}$ (positive) and $\beta_{3}$ (negative) confirm that the number of offspring varies with age through a concave relationship. This behavior is clear in panel (d) of Fig. 2, where it is evident that the reproductive pattern of this species has a moderate period of ascent (years 1 and 2), then reaches a peak (year 3), and then has a prolonged period of decline (years 3 to 6).

Finally, the model’s goodness of fit is evaluated by means of the posterior predictive distribution of a predefined set of test statistics (see Section 4.4 for more details). In this case, the mean and variance are chosen as test statistics since they characterize essential aspects of the data (trend and dispersion) that might be misrepresented due to the mean-variance restriction of the Poisson model. Panels (e) and (f) of Fig. 2 suggest that the model fits the data well because the observed values of the test statistics are typical values of the corresponding posterior predictive distributions (i.e., posterior predictive $p$ values are not close to either 0 or 1).

Figure 2.

Panels (a)–(c): posterior distribution of $\beta_{1}$ , $\beta_{2}$ , and $\beta_{3}$ , along with the posterior mean (solid line) and the limits of a 95% credible interval based on percentiles (dotted lines). Panel (d): posterior mean and limits of a 95% credible interval based on percentiles, for age in $\{1,\ldots,6\}$ . Panels (e)–(f): posterior predictive distribution of the mean and variance (test statistics), along with the observed value (solid line) and the corresponding posterior predictive $p$ value.

5.3 Standardized educational testing: Hierarchical linear regression model

In this study, we employed three multiple linear regression models to examine the math score outcomes of the Saber 11 Test during the first semester of 2020 in Colombia. The Instituto Colombiano para la Evaluación de la Educación (ICFES) applies this standardized test periodically to measure the skills of students who finish secondary school. We aim to make inferences about the Colombian student population at the national and departmental levels about their performance in mathematics. We examined the score in mathematics because it is a variable that social researchers usually relate to other important educational factors (e.g., Anis et al., 2016; Živković et al., 2023). This dataset is publicly available (https://www2.icfes.gov.co/data-icfes).

Based on the Saber 11 exam design, the mathematics test is graded on a scale ranging from 0 to 100 (with whole numbers only), and it is calibrated using a 3PL model (3-parameter logistic model characterizing the probability of a correct answer based on ability, item difficulty, item discrimination, and pseudo-chance) in such a way that it has an average score of 50 points together with a standard deviation of 10 points. In our analysis, we treated the score as the response variable, while considering the student’s sex and employment status as covariates. Prior to model fitting, a pre-processing step was conducted, which involved eliminating all records with missing data. Furthermore, the variables “sex” and “employment status” were encoded (sex: 1 if male, 0 if female; employment status: 1 if worked 0 hours during the last week, 0 otherwise). The resulting dataset comprised a total of 14,015 records. Bayesian imputation methods are available (see, for example, Hoff, 2009, Chap. 7). However, since the percentage of records lost by direct deletion is very small, adding this level of complexity to any of the models is not necessary.

We observe that the departmental averages range approximately between 35 and 70. In addition, we do not have any information for eight departments (including the archipelago of San Andrés, Providencia, and Santa Catalina). We also note that the department with the highest average is Quindio, while the lowest is Caquetá. Finally, those departments located in the Orinoquía and Amazonía Regions of the country exhibited the lowest scores nationwide.

Let $y_{i,j}$ and $\bm{x}_{i,j}=(x_{i,j,1},\ldots,x_{i,j,p})$ be the response variable and the vector of covariates corresponding to individual $i$ in group $j$ , respectively, for $i=1,\ldots,n_{j}$ and $j=1,\ldots,m$ . In this case, $y_{i,j}$ corresponds to the mathematics score of student $i$ in department $j$ , where $n_{j}$ is the number of students in department $j$ , and $m$ is the number of departments. In addition, $p=3$ covariates are considered, namely $x_{i,j,1}$ , constant variable equal to 1 associated with the intercept of the linear predictor, $x_{i,j,2}$ , dummy variable associated with the sex of student $i$ in department $j$ , and $x_{i,j,3}$ , dummy variable associated with the employment condition of student $i$ in department $j$ . Three multiple regression models with different characteristics are proposed below to analyze the data. Figure 3 shows the representation of the models using directed acyclic graphs (DAGs). These models can be easily extended to consider spatial information. The Bayesian paradigm is particularly useful in such a case (e.g., Banerjee et al., 2014).

Model 1: Multiple linear regression

•
Sampling distribution:

$\displaystyle y_{i,j}\mid\bm{\beta},\sigma^{2},\bm{x}_{i,j}\overset{\text{IND}}{\sim}\textsf{N}(\bm{x}_{i,j}^{\textsf{T}}\bm{\beta},\sigma^{2})\,,\qquad i=1,\ldots,n_{j}\,,\qquad j=1,\ldots,m\,,$

where $\bm{\beta}=(\beta_{1},\ldots,\beta_{p})$ is the vector of regression coefficients and $\sigma^{2}$ is the variance of the response variable. The sampling distribution is equivalent to

$\displaystyle\bm{y}\mid\bm{\beta},\sigma^{2}\sim\textsf{N}_{n}(\mathbf{X}\bm{\beta},\sigma^{2}\mathbf{I}_{n})\,,$

where $\bm{y}=(\bm{y}_{1},\ldots,\bm{y}_{m})$ , with $\bm{y}_{j}=(y_{1,j},\ldots,y_{n_{j},j})$ , $\mathbf{X}=[\mathbf{X}_{1}^{\textsf{T}},\ldots,\mathbf{X}_{m}^{\textsf{T}}]^{\textsf{T}}$ , with $\mathbf{X}_{j}=[\bm{x}_{1,j},\ldots,\bm{x}_{n_{j},j}]^{\textsf{T}}$ , and $\mathbf{I}_{n}$ is the $n\times n$ identity matrix, with $n=\sum_{j=1}^{m}n_{j}$ .
•
Prior distribution:

$\displaystyle\bm{\beta}\sim\textsf{N}(\bm{\beta}_{0},\mathbf{\Sigma}_{0})\,,\qquad\sigma^{2}\sim\textsf{GI}\left({\textstyle\frac{\nu_{0}}{2}},{\textstyle\frac{\nu_{0}\sigma^{2}_{0}}{2}}\right)\,.$
•
Hyperparameters: $\bm{\beta}_{0}$ , $\mathbf{\Sigma}_{0}$ , $\nu_{0}$ , $\sigma^{2}_{0}$ .

Model 2: Multiple linear regression with random effects

•
Sampling distribution:

$\displaystyle y_{i,j}\mid\bm{\beta},\theta_{j},\sigma^{2}\overset{\text{IND}}{\sim}\textsf{N}(\bm{x}_{i,j}^{\textsf{T}}\bm{\beta}+\theta_{j},\sigma^{2})\,,\qquad i=1,\ldots,n_{j}\,,\qquad j=1,\ldots,m\,,$

where $\theta_{j}$ is the random effect associated with the response variable in group $j$ . The random effects $\theta_{1},\ldots,\theta_{m}$ represent latent (unobserved) group-specific characteristics associated with the response variable. Basically, $\theta_{j}$ quantifies the mean specific effect that we would have observed if the test in department $j$ had been carried out, not just on the students in the sample, but on all the students similar to those in the sample from that particular region. Because the $\theta_{j}$ 's are trying to measure the same thing (the mean specific score), our uncertainty about them before we saw the data was exchangeable, meaning that it is reasonable to model them as conditionally IID from a single distribution, which is Normal in our model. This assumption does not arise from context, but is instead conventional (it is analytically and computationally convenient, and it is easily generalizable to non-IID scenarios, either temporal or spatial). The sampling distribution is equivalent to

$\displaystyle\bm{y}\mid\bm{\beta},\bm{\theta},\sigma^{2}\sim\textsf{N}_{n}(\mathbf{X}\bm{\beta}+\bm{\vartheta},\sigma^{2}\mathbf{I}_{n})\,,$

where $\bm{\theta}=(\theta_{1},\ldots,\theta_{m})$ and $\bm{\vartheta}=(\theta_{1}\bm{1}_{n_{1}},\ldots,\theta_{m}\bm{1}_{n_{m}})$ , with $\bm{1}_{n_{j}}$ the vector of ones of size $n_{j}$ .
•
Prior distribution:

$\displaystyle\theta_{j}\mid\tau^{2}\overset{\text{IID}}{\sim}\textsf{N}(0,\tau^{2})\,,\qquad\tau^{2}\sim\textsf{GI}\left({\textstyle\frac{\eta_{0}}{2}},{\textstyle\frac{\eta_{0}\tau^{2}_{0}}{2}}\right)\,,\qquad\bm{\beta}\sim\textsf{N}(\bm{\beta}_{0},\mathbf{\Sigma}_{0})\,,\qquad\sigma^{2}\sim\textsf{GI}\left({\textstyle\frac{\nu_{0}}{2}},{\textstyle\frac{\nu_{0}\sigma^{2}_{0}}{2}}\right)\,.$
•
Hyperparameters: $\eta_{0}$ , $\tau^{2}_{0}$ , $\bm{\beta}_{0}$ , $\mathbf{\Sigma}_{0}$ , $\nu_{0}$ , $\sigma^{2}_{0}$ .

Model 3: Multilevel multiple linear regression with random effects

•
Sampling distribution:

$\displaystyle y_{i,j}\mid\bm{\beta}_{j},\theta_{j},\sigma_{j}^{2}\overset{\text{IND}}{\sim}\textsf{N}(\bm{x}_{i,j}^{\textsf{T}}\bm{\beta}_{j}+\theta_{j},\sigma_{j}^{2})\,,\qquad i=1,\ldots,n_{j}\,,\qquad j=1,\ldots,m\,,$

where $\bm{\beta}_{j}=(\beta_{1,j},\ldots,\beta_{p,j})$ and $\sigma^{2}_{j}$ are the group-specific vector of regression coefficients and the group-specific variance associated with the response variable in group $j$ , respectively. Unlike previous models, the incorporation of these quantities enables departments to exhibit varying slopes (beyond the intercept) and variances. This flexibility permits explanatory variables to exert unique effects within each group, particularly under specific conditions of heterogeneity. This implies that the relationship between the mathematics score and the covariates may vary across different groups. Since we lack prior information distinguishing groups, we can regard our uncertainty about the group-specific regression parameters as exchangeable. This allows us, once again, to model them as conditionally IID from a single distribution, which we choose as Normal for the same reasons mentioned earlier. The sampling distribution is equivalent to

$\displaystyle\bm{y}_{j}\mid\bm{\beta}_{j},\bm{\theta},\sigma_{j}^{2}\overset{\text{IND}}{\sim}\textsf{N}_{n_{j}}(\mathbf{X}_{j}\bm{\beta}_{j}+\theta_{j}\bm{1}_{n_{j}},\sigma_{j}^{2}\mathbf{I}_{n_{j}})\,,\qquad j=1,\ldots,m\,.$
•
Prior distribution:

$\displaystyle\begin{array}{rlrlrl}\theta_{j}\mid\tau^{2}&\overset{\text{IID}}{\sim}\textsf{N}(0,\tau^{2})\,,&\bm{\beta}_{j}\mid\bm{\beta},\mathbf{\Sigma}&\overset{\text{IID}}{\sim}\textsf{N}_{p}(\bm{\beta},\mathbf{\Sigma})\,,&\sigma_{j}^{2}\mid\nu,\sigma^{2}&\overset{\text{IID}}{\sim}\textsf{GI}\left({\textstyle\frac{\nu}{2}},{\textstyle\frac{\nu\sigma^{2}}{2}}\right)\,,\\ \tau^{2}&\sim\textsf{GI}\left({\textstyle\frac{\eta_{0}}{2}},{\textstyle\frac{\eta_{0}\tau^{2}_{0}}{2}}\right)\,,&\bm{\beta}&\sim\textsf{N}_{p}(\bm{\mu}_{0},\mathbf{\Lambda}_{0})\,,&\nu&\sim e^{-\kappa_{0}\nu}\,,\\ &&\mathbf{\Sigma}&\sim\textsf{WI}(n_{0},\mathbf{S}^{-1}_{0})\,,&\sigma^{2}&\sim\textsf{G}(\alpha_{0},\beta_{0})\,.\\ \end{array}$
•
Hyperparameters: $\eta_{0}$ , $\tau^{2}_{0}$ , $\bm{\mu}_{0}$ , $\mathbf{\Lambda}_{0}$ , $n_{0}$ , $\mathbf{S}_{0}$ , $\kappa_{0}$ , $\alpha_{0}$ , $\beta_{0}$ .

Figure 3.
DAGs for the multiple regression models.

We fit the models using a Gibbs sampler (see Section 4.3 for more details) with $55000$ iterations. The first $5000$ iterations of the algorithm constitute the burn-in period, so they are discarded before carrying out the posterior computations. Details of the Gibbs sampler for each model are given in Appendix C Multiple linear regression. Furthermore, we set the hyperparameters according to an (empirical) unit information prior distribution (Kass & Wasserman, 1996) as follows:

•
Model 1: $\bm{\beta}_{0}=\hat{\bm{\beta}}_{\textsf{ols}}$ , $\mathbf{\Sigma}_{0}=n\,\hat{\sigma}_{\textsf{ols}}^{2}(\mathbf{X}^{\textsf{T}}\mathbf{X})^{-1}$ , $\nu_{0}=1$ , $\sigma^{2}_{0}=\hat{\sigma}_{\textsf{ols}}^{2}$ , where $\hat{\bm{\beta}}_{\textsf{ols}}$ and $\hat{\sigma}_{\textsf{ols}}^{2}$ are the ordinary least squares estimators of $\bm{\beta}$ and $\sigma^{2}$ , respectively, i.e., $\hat{\bm{\beta}}_{\textsf{ols}}=(\mathbf{X}^{\textsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\textsf{T}}\bm{y}$ and $\hat{\sigma}_{\textsf{ols}}^{2}={\textstyle\frac{1}{n-p}}(\bm{y}-\mathbf{X}\hat{\bm{\beta}}_{\textsf{ols}})^{\textsf{T}}(\bm{y}-\mathbf{X}\hat{\bm{\beta}}_{\textsf{ols}})\,.$
•
Model 2: $\bm{\beta}_{0}=\hat{\bm{\beta}}_{\textsf{ols}}$ , $\mathbf{\Sigma}_{0}=n\,\hat{\sigma}_{\textsf{ols}}^{2}(\mathbf{X}^{\textsf{T}}\mathbf{X})^{-1}$ , $\nu_{0}=\eta_{0}=1$ , $\sigma^{2}_{0}=\tau^{2}_{0}=\hat{\sigma}_{\textsf{ols}}^{2}$ .
•
Model 3: $\bm{\mu}_{0}=\hat{\bm{\beta}}_{\textsf{ols}}$ , $\mathbf{\Lambda}_{0}=\mathbf{S}_{0}=n\,\hat{\sigma}_{\textsf{ols}}^{2}(\mathbf{X}^{\textsf{T}}\mathbf{X})^{-1}$ , $n_{0}=5$ , $\eta_{0}=\kappa_{0}=\alpha_{0}=1$ , $\tau^{2}_{0}=\beta_{0}=\hat{\sigma}_{\textsf{ols}}^{2}$ .

The previous prior formulation contains the same amount of information (represented by $\hat{\bm{\beta}}_{\textsf{ols}}$ and $\hat{\sigma}^{2}_{\textsf{ols}}$ ) as would be contained in a single observation (specified by letting $\nu_{0}=\eta_{0}=\kappa_{0}=\alpha_{0}=1$ ). It is also calibrated (tested) against the values provided by the exam design. An exhaustive convergence analysis (not presented here) shows no signs of lack of convergence in any case.
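The hyperparameters for Model 1 can be computed directly from the OLS formulas given above. The following is a minimal sketch in Python with NumPy, not the authors' implementation: the function name `unit_information_hyperparameters` and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np

def unit_information_hyperparameters(X, y):
    """Return (beta0, Sigma0, nu0, sigma2_0) for an empirical unit information prior."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ols = XtX_inv @ X.T @ y                 # OLS estimate of beta
    resid = y - X @ beta_ols
    sigma2_ols = resid @ resid / (n - p)         # OLS estimate of sigma^2
    Sigma0 = n * sigma2_ols * XtX_inv            # prior covariance worth one observation
    return beta_ols, Sigma0, 1.0, sigma2_ols

# Synthetic data (hypothetical, not the exam data): intercept plus two binary covariates.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.integers(0, 2, n)])
y = X @ np.array([50.0, 3.0, 8.0]) + rng.normal(0.0, 10.0, n)
beta0, Sigma0, nu0, sigma2_0 = unit_information_hyperparameters(X, y)
```

Multiplying the OLS sampling covariance by $n$ is what spreads the prior to roughly the information content of one observation.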

Table 4 shows the posterior means and 95% percentile-based credible intervals for the components of $\bm{\beta}$ and $\sigma$ for each model ( $\bm{\beta}$ and $\sigma^{2}$ are part of the first hierarchy in Models 1 and 2, and of the second in Model 3; see Fig. 3). On the one hand, the estimates of $\beta_{1}$ agree with the design of the test. The biggest difference is only $(51.28-50)/50=2.56\%$ with respect to the test design (50 points), which in practical terms is not a substantial difference. On the other hand, the estimates of $\sigma$ indicate that the variability of the scores is significantly higher than in the test design (10 points). The smallest difference is $(11.36-10)/10=13.6\%$ , and the limits of all the credible intervals are greater than 10, which indicates a significantly higher heterogeneity in math scores than initially anticipated. Finally, the estimates associated with $\beta_{2}$ and $\beta_{3}$ indicate that there is a significant effect of gender and employment status on the average math score, since both limits of the corresponding credible intervals are greater than 0 (except for the interval of $\beta_{2}$ in Model 3, which contains 0). Specifically, being a man and working 0 hours a week correspond to estimated increases in the average math score of 3.13 and 8.99 points according to Model 1, 3.05 and 8.17 points according to Model 2, and 2.32 and 6.97 points according to Model 3, respectively.

Table 4
Posterior mean and lower (2.5%) and upper (97.5%) limits of the 95% percentile-based credible interval for the components of $\bm{\beta}$ and $\sigma$ in each model

Parameter	Model 1 Mean	Model 1 2.5%	Model 1 97.5%	Model 2 Mean	Model 2 2.5%	Model 2 97.5%	Model 3 Mean	Model 3 2.5%	Model 3 97.5%
$\beta_{1}$	51.28	50.83	51.74	48.79	45.51	52.02	49.35	44.93	53.70
$\beta_{2}$	3.13	2.71	3.56	3.05	2.64	3.47	2.32	$-0.38$	5.02
$\beta_{3}$	8.99	8.53	9.45	8.17	7.72	8.65	6.97	3.63	10.25
$\sigma$	12.76	12.61	12.91	12.43	12.29	12.58	11.36	10.07	12.66
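Summaries of the kind reported in Table 4 are obtained from the retained posterior draws of each parameter. The sketch below illustrates the computation under stated assumptions: the helper `posterior_summary` is hypothetical, and the chain is a synthetic stand-in, not output from the models fitted here.

```python
import numpy as np

def posterior_summary(draws, burn_in=0, level=0.95):
    """Posterior mean and percentile-based credible interval from MCMC draws."""
    kept = np.asarray(draws, dtype=float)[burn_in:]   # discard the burn-in period
    tail = 100.0 * (1.0 - level) / 2.0                # 2.5 for a 95% interval
    lower, upper = np.percentile(kept, [tail, 100.0 - tail])
    return kept.mean(), lower, upper

# Synthetic stand-in for 55000 draws of a single coefficient (illustrative only).
rng = np.random.default_rng(1)
chain = rng.normal(3.0, 0.2, size=55_000)
post_mean, post_lower, post_upper = posterior_summary(chain, burn_in=5_000)
```

With `burn_in=5_000`, the summary uses the remaining 50000 draws, mirroring the sampler settings described in the text.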

The DIC evaluates the model’s predictive quality while penalizing for the effective number of parameters. The results (Model 1: 111139.6; Model 2: 110435.8; Model 3: 109982.9) show that Model 3 has the best predictive performance according to the DIC, since it attains the lowest value. Unlike Models 1 and 2, Model 3 is a multilevel model with department-specific regression coefficients and variance components, which allows internal characterization of each department’s dynamics and direct comparisons between departments. For this reason, we use Model 3 to analyze behavior and differences between departments.
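To make the DIC computation concrete, the sketch below uses the standard definition of Spiegelhalter et al. (2002): $\text{DIC}=-2\log p(\bm{y}\mid\hat{\bm{\theta}})+2p_{D}$ with $p_{D}=2\,[\log p(\bm{y}\mid\hat{\bm{\theta}})-\text{E}\{\log p(\bm{y}\mid\bm{\theta})\}]$, where $\hat{\bm{\theta}}$ is the posterior mean and the expectation is approximated by averaging over posterior draws. It is illustrative only. The model is a simple Normal model with synthetic data, not one of the three regression models, and the prior $p(\mu,\sigma^{2})\propto 1/\sigma^{2}$ is an assumption chosen because its posterior can be sampled exactly.

```python
import numpy as np

def log_lik(y, mu, sigma2):
    """Log-likelihood of i.i.d. Normal(mu, sigma2) data."""
    n = y.size
    return -0.5 * n * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum((y - mu) ** 2) / sigma2

def dic(y, mu_draws, sigma2_draws):
    """Return (DIC, p_D) from posterior draws, plugging in the posterior means."""
    ll_at_mean = log_lik(y, mu_draws.mean(), sigma2_draws.mean())
    ll_mean = np.mean([log_lik(y, m, s2) for m, s2 in zip(mu_draws, sigma2_draws)])
    p_d = 2.0 * (ll_at_mean - ll_mean)           # effective number of parameters
    return -2.0 * ll_at_mean + 2.0 * p_d, p_d

rng = np.random.default_rng(2)
y = rng.normal(50.0, 10.0, size=100)             # synthetic scores (hypothetical)
n, ybar, s2 = y.size, y.mean(), y.var(ddof=1)
B = 20_000
# Exact posterior under the prior proportional to 1/sigma2:
# 1/sigma2 | y is Gamma with shape (n-1)/2 and rate (n-1)s2/2; mu | sigma2, y is Normal.
sigma2_draws = 1.0 / rng.gamma((n - 1) / 2.0, 2.0 / ((n - 1) * s2), size=B)
mu_draws = rng.normal(ybar, np.sqrt(sigma2_draws / n))
dic_value, p_d = dic(y, mu_draws, sigma2_draws)
```

In this two-parameter model `p_d` comes out close to 2, which is a useful sanity check before applying the same recipe to hierarchical models, where the effective number of parameters is not known in advance.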

Figure 4.
Posterior means and percentile-based credible intervals at the 95% (thick lines) and 99% (thin lines) levels, for each regression coefficient $\bm{\beta}_{k,j}$ , with $k\in\{1,2,3\}$ and $j\in\{1,\cdots,25\}$ . Intervals in blue do not contain the reference value (50 for $\beta_{1}$ and 0 for $\beta_{2}$ and $\beta_{3}$ ).

Figure 4 shows the posterior means and percentile-based credible intervals at the 95% and 99% levels, for each regression coefficient $\bm{\beta}_{k,j}$ , with $k\in\{1,2,3\}$ and $j\in\{1,\cdots,25\}$ . These plots allow us to identify trends and significant differences from the reference values (50 for $\beta_{1}$ and 0 for $\beta_{2}$ and $\beta_{3}$ ) across departments. Intervals in blue do not contain the reference value, indicating a significant difference from it. Panel (a) of Fig. 4 indicates that all the departments behave very similarly concerning the intercept, given that all the posterior means are close to 50 and all the credible intervals contain this value. This confirms the suitability of the test design in terms of central tendency. On the other hand, panel (b) of Fig. 4 indicates significant differences regarding sex with respect to the reference value in eight departments. This empirical evidence is unfortunate in terms of equity because the sex of the individual is not expected to be significantly associated with the individual’s performance on the test. This is the case in Antioquia, Bogotá, Caldas, Cauca, Cundinamarca, Meta, Nariño, and Valle del Cauca, where there is a significant increase in the math score in favor of men. Finally, panel (c) of Fig. 4 once again indicates significant differences with respect to the reference value, this time regarding employment status and in 12 departments.

Interestingly, employment status is significant in those departments where sex is also significant (except in Meta), in all cases in favor of those individuals who did not work the week before taking the test. Other departments with a significant association concerning employment status are Magdalena, Santander, and Risaralda. We observe that the most developed regions of the country, such as Bogotá, Antioquia (Medellín), and Valle del Cauca (Cali), where people commonly migrate for job opportunities, present greater inequality in terms of employment status. Finally, we find an estimated effect on math scores greater than 5 points in some cases, and up to 10 points in others, for those who did not work in the week before taking the test. In particular, in Antioquia, Magdalena, and Santander, not working is associated with a considerably higher math score.
6. Discussion

Our findings reveal that the Multinomial-Dirichlet model works well in scenarios that require estimating proportions of interest from surveys. Specifically, in the context of political polls, we show that the majority (73.3%) of the credible intervals include the values observed after Election Day. On the other hand, the Poisson regression model exemplifies the use of Monte Carlo simulation in scenarios where the researcher has small sample sizes to assess the relationship between variables. In particular, the Bayesian model fits the population dynamics data set reasonably well. Likewise, the linear regression model from a Bayesian point of view allows us to illustrate the usefulness of hierarchical modeling to characterize population groups. Specifically, in the context of standardized test performance, the model makes it possible to identify regions of Colombia with outstanding scores in mathematics, as well as to quantify the association that covariates such as gender and employment status have with the mean math score by geographic area.

As part of the revision process, one of the referees suggested that it would be quite beneficial to demonstrate how the results from applying a Bayesian analysis would differ from those of a frequentist analysis of the same data. For example, the referee argues that the first case study could show the strengths and flexibility of Bayesian methods for integrating disparate sources of information if data from several pre-election polls were integrated in a single analytic framework. Although this is a fascinating idea, we do not follow this path here because our main purpose with this example is to illustrate a simple conjugate analysis together with the Monte Carlo principle. However, we sincerely encourage readers to pursue the referee’s proposal by formulating a multi-stage hierarchical model as in Section 5.3.

In addition, building on the results in the applied contexts, we discuss and provide technical details on conjugate modeling, hierarchical modeling, Monte Carlo simulation, the Gibbs sampler, the Metropolis-Hastings algorithm, the Hamiltonian Monte Carlo algorithm, the evaluation of the model’s goodness of fit through test statistics, and the use of information criteria for model comparison.

Readers should also be aware of the free specialized software currently available for Bayesian computing, which we do not discuss in this document for space reasons. These include BUGS (Bayesian inference Using Gibbs Sampling), JAGS (Just Another Gibbs Sampler), Stan, and NIMBLE (e.g., Kruschke 2014, and McElreath 2020), which are available in both R and Python. Finally, we encourage readers to explore other important topics typical of the Bayesian paradigm. These include exchangeability and De Finetti’s representation theorem, improper priors, objective priors, Bayes factors, model averaging, approximations of the posterior distribution through analytic methods (e.g., variational inference), and Bayesian non-parametric statistics. All of these topics are covered in Gelman et al. (2014), Reich and Ghosh (2019), and Heard et al. (2021).

Finally, as stated by one of the referees, it would be quite useful to demonstrate how the Bayesian approach offers certain advantages over its frequentist counterpart in specific applications. In the words of the referee, it would be quite beneficial for readers to see how frequentist inferences using non-experimental data from social sciences can lead to misleading results; for example, confidence intervals that are too narrow because they neglect to account for all relevant sources of uncertainty. In this regard, we explicitly acknowledge that comparing frequentist and Bayesian methods is core to our research and will be pursued elsewhere.

Appendix

A Multinomial-Dirichlet model

Let $X_{1},\ldots,X_{k}$ be $k$ independent random variables such that $X_{j}\mid\alpha_{j},\beta\mathrel{\overset{\text{IND}}{\sim}}\textsf{Gamma}(\alpha_{j},\beta)$ , for $j=1,\ldots,k$ . It can be shown that the random vector

$\displaystyle\bm{Y}=(Y_{1},\ldots,Y_{k})=\left(\frac{X_{1}}{X_{1}+\ldots+X_{k}},\ldots,\frac{X_{k}}{X_{1}+\ldots+X_{k}}\right)$

has a Dirichlet distribution with parameter $\bm{\alpha}=(\alpha_{1},\ldots,\alpha_{k})$ , i.e., $\bm{Y}\mid\bm{\alpha}\sim\textsf{Dirichlet}(\bm{\alpha})$ . This result leads to the following algorithm to generate random vectors $\bm{\theta}=(\theta_{1},\dots,\theta_{k})$ with Dirichlet distribution with parameter $\bm{\alpha}$ :

1. Choose any value for $\beta>0$ (e.g., $\beta=1$ ).

2. Simulate $g_{1},\ldots,g_{k}$ such that $g_{j}\mathrel{\overset{\text{IND}}{\sim}}\textsf{Gamma}(\alpha_{j},\beta)$ , for $j=1,\ldots,k$ .

3. Compute $\theta_{j}=g_{j}/\sum_{\ell=1}^{k}g_{\ell}$ , for $j=1,\ldots,k$ .
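The three steps above translate almost line by line into code. The following is a minimal sketch with NumPy. The function name `sample_dirichlet` and the values of $\bm{\alpha}$ are illustrative assumptions, and in practice one could call NumPy's built-in Dirichlet generator instead.

```python
import numpy as np

def sample_dirichlet(alpha, size, rng, beta=1.0):
    """Draw `size` Dirichlet(alpha) vectors by normalizing independent Gamma variates."""
    alpha = np.asarray(alpha, dtype=float)
    # NumPy parameterizes the Gamma by shape and scale, so scale = 1 / beta (beta is a rate).
    g = rng.gamma(shape=alpha, scale=1.0 / beta, size=(size, alpha.size))
    return g / g.sum(axis=1, keepdims=True)       # theta_j = g_j / sum_l g_l, row by row

rng = np.random.default_rng(3)
alpha = np.array([2.0, 3.0, 5.0])                 # illustrative, not from the case study
theta = sample_dirichlet(alpha, size=10_000, rng=rng)
```

Each row of `theta` is one simulated probability vector. The common rate $\beta$ cancels in the normalization, which is why any positive value can be used.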

B Poisson regression

Let $\bm{\beta}^{(b)}$ be the state of parameter $\bm{\beta}$ at iteration $b$ of the algorithm, for $b=1,\ldots,B$ . Given an initial value $\bm{\beta}^{(0)}$ , the following algorithms generate a new state $\bm{\beta}^{(b)}$ from the preceding state $\bm{\beta}^{(b-1)}$ .
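As an illustration of how one state is generated from the preceding one, the sketch below implements a single random-walk Metropolis update for a Poisson regression with log link. It is not the authors' algorithm: the independent $\textsf{N}(0,10^{2})$ prior on each coefficient, the proposal step size, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def log_posterior(beta, X, y, prior_sd=10.0):
    """Unnormalized log posterior: Poisson likelihood (log link) plus N(0, prior_sd^2) prior."""
    eta = X @ beta
    return np.sum(y * eta - np.exp(eta)) - 0.5 * np.sum(beta ** 2) / prior_sd ** 2

def metropolis_step(beta, X, y, step, rng):
    """Generate beta^(b) from beta^(b-1) with a symmetric random-walk proposal."""
    proposal = beta + step * rng.standard_normal(beta.size)
    log_ratio = log_posterior(proposal, X, y) - log_posterior(beta, X, y)
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True                     # accept the proposal
    return beta, False                            # reject: the chain stays where it is

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))   # synthetic counts (hypothetical)
B = 5_000
chain = np.empty((B + 1, 2))
chain[0] = 0.0                                    # initial value beta^(0)
accepted = 0
for b in range(1, B + 1):
    chain[b], ok = metropolis_step(chain[b - 1], X, y, step=0.1, rng=rng)
    accepted += ok
acceptance_rate = accepted / B
```

Working on the log scale avoids numerical underflow in the acceptance ratio. In practice the step size would be tuned until the acceptance rate falls in a reasonable range.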

C Multiple linear regression

Let $\mathbf{\Theta}^{(b)}$ be the state of parameter $\mathbf{\Theta}$ at iteration $b$ of the algorithm, for $b=1,\ldots,B$ . Given an initial value $\mathbf{\Theta}^{(0)}$ , the following algorithms generate a new state $\mathbf{\Theta}^{(b)}$ from the preceding state $\mathbf{\Theta}^{(b-1)}$ , by iteratively sampling the elements of $\mathbf{\Theta}$ from the corresponding complete conditional distributions. These distributions are obtained directly from the posterior distribution of $\mathbf{\Theta}$ , taking into account only the expressions that involve the component of $\mathbf{\Theta}$ we are interested in since the other terms can be regarded as constant.
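To illustrate sampling from complete conditional distributions, the sketch below implements a Gibbs sampler for a single-level regression of the form suggested by the Model 1 hyperparameters: $\bm{y}\sim\textsf{N}(\mathbf{X}\bm{\beta},\sigma^{2}\mathbf{I})$, $\bm{\beta}\sim\textsf{N}(\bm{\beta}_{0},\mathbf{\Sigma}_{0})$, $\sigma^{2}\sim\textsf{GI}(\nu_{0}/2,\nu_{0}\sigma_{0}^{2}/2)$. That specification is an assumption inferred from the hyperparameter list, and the function name and synthetic data are hypothetical; the hierarchical Models 2 and 3 require additional update steps not shown here.

```python
import numpy as np

def gibbs_linear_regression(X, y, beta0, Sigma0, nu0, sigma2_0, B, rng):
    """Gibbs sampler alternating beta | sigma2, y and sigma2 | beta, y."""
    n, p = X.shape
    Sigma0_inv = np.linalg.inv(Sigma0)
    XtX, Xty = X.T @ X, X.T @ y
    betas, sigma2s = np.empty((B, p)), np.empty(B)
    sigma2 = sigma2_0                                     # initial value
    for b in range(B):
        # beta | sigma2, y is multivariate Normal with covariance V and mean m
        V = np.linalg.inv(Sigma0_inv + XtX / sigma2)
        m = V @ (Sigma0_inv @ beta0 + Xty / sigma2)
        beta = m + np.linalg.cholesky(V) @ rng.standard_normal(p)
        # sigma2 | beta, y is Inverse Gamma, drawn as the reciprocal of a Gamma variate
        resid = y - X @ beta
        shape = 0.5 * (nu0 + n)
        rate = 0.5 * (nu0 * sigma2_0 + resid @ resid)
        sigma2 = 1.0 / rng.gamma(shape, 1.0 / rate)       # NumPy uses scale = 1 / rate
        betas[b], sigma2s[b] = beta, sigma2
    return betas, sigma2s

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.integers(0, 2, n), rng.integers(0, 2, n)])
y = X @ np.array([50.0, 3.0, 8.0]) + rng.normal(0.0, 10.0, n)   # synthetic data

# Empirical unit information hyperparameters built from OLS
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_ols = np.sum((y - X @ beta_ols) ** 2) / (n - X.shape[1])
Sigma0 = n * sigma2_ols * np.linalg.inv(X.T @ X)

betas, sigma2s = gibbs_linear_regression(X, y, beta_ols, Sigma0, 1.0, sigma2_ols, B=2_000, rng=rng)
burn_in = 500
beta_hat = betas[burn_in:].mean(axis=0)                   # posterior mean after burn-in
```

Each conditional keeps only the terms of the joint posterior that involve the parameter being updated, which is exactly the simplification described in the paragraph above.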

D Notation

The Gamma function is denoted by $\Gamma(\cdot)$ and is given by $\Gamma(x)=\int_{0}^{\infty}u^{x-1}\,e^{-u}\,\text{d}u$ . Matrices and vectors with entries consisting of subscripted variables are denoted by the variable letter in bold. For example, $\bm{x}=(x_{1},\ldots,x_{n})$ denotes an $n\times 1$ column vector with entries $x_{1},\ldots,x_{n}$ . We use $\bm{0}$ and $\bm{1}$ to denote the column vectors whose entries are all equal to 0 and 1, respectively, and we use $\mathbf{I}$ to denote the identity matrix. A subscript in this context indicates the corresponding dimension. For example, $\mathbf{I}_{n}$ denotes the identity matrix of size $n\times n$ . The transpose of a vector $\bm{x}$ is denoted by $\bm{x}^{\textsf{T}}$ , and similarly for matrices. Also, if $\mathbf{X}$ is a square matrix, we use $\text{tr}(\mathbf{X})$ and $|\mathbf{X}|$ to denote the trace and determinant of $\mathbf{X}$ , respectively.

Below we present the probability distributions used in the applications:

•

Gamma:

A random variable $X$ has a Gamma distribution with parameters $\alpha$ and $\beta$ , denoted by $X\mid\alpha,\beta\sim\textsf{G}(\alpha,\beta)$ , if the probability density function is

$\displaystyle p(x\mid\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1}\,\exp{\{-\beta x\}}\,,\quad x>0\,,\quad\alpha>0\,,\quad\beta>0\,.$

•

Inverse Gamma:

A random variable $X$ has an inverse gamma distribution with parameters $\alpha$ and $\beta$ , denoted by $X\mid\alpha,\beta\sim\textsf{GI}(\alpha,\beta)$ , if the probability density function is

$\displaystyle p(x\mid\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,x^{-(\alpha+1)}\,\exp{\{-\beta/x\}}\,,\quad x>0\,,\quad\alpha>0\,,\quad\beta>0\,.$

•

Normal:

A random variable $X$ has a Normal distribution with parameters $\mu$ and $\sigma^{2}$ , denoted by $X\mid\mu,\sigma^{2}\sim\textsf{N}(\mu,\sigma^{2})$ , if the probability density function is

$\displaystyle p(x\mid\mu,\sigma^{2})=\frac{1}{\sqrt{2\pi\sigma^{2}}}\,\exp{\left\{-\frac{1}{2}\,\frac{(x-\mu)^{2}}{\sigma^{2}}\right\}}\,,\quad x\in\mathbb{R}\,,\quad\mu\in\mathbb{R}\,,\quad\sigma^{2}>0\,.$

•

Dirichlet:

A random vector $\bm{X}=(X_{1},\ldots,X_{K})$ has a Dirichlet distribution with parameter $\bm{\alpha}$ , denoted by $\bm{X}\mid\bm{\alpha}\sim\textsf{Dir}(\bm{\alpha})$ , if the probability density function is

$\displaystyle p(\bm{x}\mid\bm{\alpha})=\left\{\begin{array}{ll}\frac{\Gamma\left(\sum_{k=1}^{K}\alpha_{k}\right)}{\prod_{k=1}^{K}\Gamma(\alpha_{k})}\prod_{k=1}^{K}x_{k}^{\alpha_{k}-1},&\text{if $\sum_{k=1}^{K}x_{k}=1$, $\alpha_{1},\ldots,\alpha_{K}>0$;}\\ 0,&\text{otherwise.}\end{array}\right.$

•

Multivariate Normal:

A $d\times 1$ random vector $\bm{X}=(X_{1},\ldots,X_{d})$ has a Multivariate Normal distribution with parameters $\bm{\mu}$ and $\mathbf{\Sigma}$ , denoted by $\bm{X}\mid\bm{\mu},\mathbf{\Sigma}\sim\textsf{N}_{d}(\bm{\mu},\mathbf{\Sigma})$ , if the probability density function is

$\displaystyle p(\bm{x}\mid\bm{\mu},\mathbf{\Sigma})=(2\pi)^{-d/2}\,|\mathbf{\Sigma}|^{-1/2}\,\exp{\left\{-{\textstyle\frac{1}{2}}(\bm{x}-\bm{\mu})^{\textsf{T}}\mathbf{\Sigma}^{-1}(\bm{x}-\bm{\mu})\right\}}\,,\quad\bm{x}\in\mathbb{R}^{d}\,,\quad\bm{\mu}\in\mathbb{R}^{d}\,,\quad\mathbf{\Sigma}>0\,.$

•

Inverse Wishart:

A $d\times d$ random matrix $\mathbf{W}$ has an Inverse Wishart distribution with parameters $\nu$ and $\mathbf{S}^{-1}$ , denoted by $\mathbf{W}\sim\textsf{WI}(\nu,\mathbf{S}^{-1})$ , if the probability density function is

$\displaystyle p(\mathbf{W})\propto|\mathbf{W}|^{-(\nu+d+1)/2}\,\exp{\left\{-{\textstyle\frac{1}{2}}\text{tr}(\mathbf{S}\mathbf{W}^{-1})\right\}}\,,\quad\mathbf{W}>0\,,\quad\nu>0\,,\quad\mathbf{S}>0.$
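A practical point about the Gamma and Inverse Gamma parameterizations above: $\beta$ is a rate in both densities, whereas many software libraries parameterize the Gamma distribution by its scale. Comparing the two densities shows that if $X\sim\textsf{GI}(\alpha,\beta)$ then $1/X\sim\textsf{G}(\alpha,\beta)$, so Inverse Gamma draws can be obtained as reciprocals of Gamma draws. The sketch below illustrates this with NumPy, where the helper name and the numerical values are assumptions chosen for illustration.

```python
import numpy as np

def sample_inverse_gamma(alpha, beta, size, rng):
    """Draw from GI(alpha, beta) as reciprocals of G(alpha, beta) variates (beta is a rate)."""
    # NumPy's gamma takes a scale parameter, so the rate beta enters as scale = 1 / beta.
    return 1.0 / rng.gamma(shape=alpha, scale=1.0 / beta, size=size)

rng = np.random.default_rng(6)
alpha, beta = 5.0, 8.0                            # illustrative values
x = sample_inverse_gamma(alpha, beta, size=200_000, rng=rng)
sample_mean = x.mean()                            # theoretical mean: beta / (alpha - 1), alpha > 1
```

Comparing `sample_mean` with $\beta/(\alpha-1)$ is a quick way to confirm that the rate and scale conventions have not been mixed up.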

References

1. Anis, Krause, J.A., & Blum, E.N. (2016). The relations among mathematics anxiety, gender, and standardized test performance. Research in the Schools, 23(2).

2. Arcese, Smith, J.N., Hochachka, W.M., Rogers, C.M., & Ludwig (1992). Stability, regulation, and the determination of abundance in an insular song sparrow population. Ecology, 73(3): 805-822.

3. Banerjee, Carlin, B.P., & Gelfand, A.E. (2014). Hierarchical Modeling and Analysis for Spatial Data. Chapman and Hall/CRC.

4. Barberá (2015). Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis, 23(1): 76-91.

5. Berger, J.O. (2013). Statistical decision theory and Bayesian analysis. Chapter 3. Prior Information and Subjective Probability. Springer Science and Business Media.

6. Bernardo, J.M., & Smith, A.F. (2000). Bayesian theory. John Wiley and Sons.

7. Betancourt (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.

8. Betancourt (2019). The convergence of Markov Chain Monte Carlo methods: from the Metropolis method to Hamiltonian Monte Carlo. Annalen der Physik, 531(3): 1700214.

9. Carlin, B.P., & Louis, T.A. (2008). Bayesian methods for data analysis. CRC Press.

10. Congdon (2007). Bayesian statistical modelling, 704. John Wiley and Sons.

11. Cox, R.T. (1946). Probability, frequency and reasonable expectation. American Journal of Physics, 14(1): 1-13.

12. Cox, R.T. (1963). The algebra of probable inference. American Journal of Physics, 31(1): 66-67.

13. Draper (2009). Bayesian statistics. Encyclopedia of complexity and system science, 455-475.

14. Fairfield, & Charman (2019). A dialogue with the data: The Bayesian foundations of iterative research in qualitative social science. Perspectives on Politics, 17(1): 154-167.

15. Fairfield, & Charman, A.E. (2022). Social Inquiry and Bayesian Inference. Cambridge University Press.

16. Gamerman, & Lopes, H.F. (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. CRC Press.

17. Garthwaite, P.H., Kadane, J.B., & O’Hagan (2005). Statistical methods for eliciting probability distributions. Journal of the American Statistical Association, 100(470): 680-701.

18. Gelman (2009). Bayes, Jeffreys, prior distributions and the philosophy of statistics. Statistical Science, 24(2): 176-178.

19. Gelman, Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, & Rubin, D.B. (2014). Bayesian Data Analysis. CRC Press.

20. Gill, & Walker, L.D. (2005). Elicited priors for Bayesian model specifications in political science research. The Journal of Politics, 67(3): 841-872.

21. Heard, et al. (2021). An Introduction to Bayesian Inference, Methods and Computation. Springer.

22. Hoff, P.D. (2009). A first course in Bayesian statistical methods, 580. Springer.

23. Jackman (2004). Bayesian analysis for political research. Annu Rev Polit Sci, 7: 483-505.

24. Jackman (2009). Bayesian analysis for the social sciences, 846. John Wiley and Sons.

25. Kass, R.E., & Wasserman (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91(435): 1343-1370.

26. Kruschke (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press.

27. Kruschke, J.K. (2021). Bayesian analysis reporting guidelines. Nature Human Behaviour, 5(10): 1282-1291.

28. Lenhard (2022). A transformation of Bayesian statistics: Computation, prediction, and rationality. Studies in History and Philosophy of Science, 92: 144-151.

29. Lynch, S.M., & Bartlett (2019). Bayesian statistics in sociology: Past, present, and future. Annual Review of Sociology, 45: 47-68.

30. McCullagh (2018). Generalized Linear Models. Routledge.

31. McElreath (2020). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman and Hall/CRC.

32. Meyn, S.P., & Tweedie, R.L. (2012). Markov Chains and Stochastic Stability. Springer Science and Business Media.

33. Moser, Rodríguez, & Lofland, C.L. (2021). Multiple ideal points: Revealed preferences in different domains. Political Analysis, 29(2): 139-166.

34. Reich, B.J., & Ghosh, S.K. (2019). Bayesian Statistical Methods. Chapman and Hall/CRC.

35. Robert, & Casella (2013). Monte Carlo Statistical Methods. Springer Science and Business Media.

36. Savage, L.J. (1972). The Foundations of Statistics. Courier Corporation.

37. Sosa, & Buitrago (2022). Illustrating advantages and challenges of Bayesian statistical modelling: An empirical perspective. Model Assisted Statistics and Applications, 17(3): 1-13.

38. Spiegelhalter, D.J., Best, N.G., Carlin, B.P., & Van der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4): 583-639.

39. Spiegelhalter, D.J., Best, N.G., Carlin, B.P., & Van der Linde (2014). The deviance information criterion: 12 years on. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 485-493.

40. Stuart, Arnold, Ord, J.K., O’Hagan, & Forster (1994). Kendall’s Advanced Theory of Statistics. Wiley.

41. Turkman, M.A.A., Paulino, C.D., & Müller (2019). Computational Bayesian Statistics: An Introduction, 11. Cambridge University Press.

42. Van de Schoot, Depaoli, King, Kramer, Märtens, Tadesse, M.G., Vannucci, Gelman, Veen, Willemsen, et al. (2021). Bayesian statistics and modelling. Nature Reviews Methods Primers, 1(1): 1-26.

43. Van de Schoot, Kaplan, Denissen, Asendorpf, J.B., Neyer, F.J., & Van Aken, M.A. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85(3): 842-860.

44. Walker, L.J., Gustafson, & Frimer, J.A. (2007). The application of Bayesian analysis to issues in developmental research. International Journal of Behavioral Development, 31(4): 366-373.

45. Watanabe (2013). WAIC and WBIC are information criteria for singular statistical model evaluation. In Proceedings of the Workshop on Information Theoretic Methods in Science and Engineering, 90-94.

46. Western, & Jackman (1994). Bayesian inference for comparative research. American Political Science Review, 412-423.

47. Živković, Pellizzoni, Doz, Cuder, Mammarella, & Passolunghi, M.C. (2023). Math self-efficacy or anxiety? The role of emotional and motivational contribution in math performance. Social Psychology of Education, 1-23.
