Abstract
In this manuscript, we discuss the substantial importance of Bayesian reasoning in Social Science research. Particularly, we focus on foundational elements to fit models under the Bayesian paradigm. We aim to offer a frame of reference for a broad audience, not necessarily with specialized knowledge in Bayesian statistics, yet having interest in incorporating this kind of methods in studying social phenomena. We illustrate Bayesian methods through case studies regarding political surveys, population dynamics, and standardized educational testing. Specifically, we provide technical details on specific topics such as conjugate and non-conjugate modeling, hierarchical modeling, Bayesian computation, goodness of fit, and model testing.
Introduction
Bayesian methods refer to data analysis tools derived from the principles of Bayesian inference, i.e., inductive learning through Bayes’ Theorem (Van de Schoot et al., 2021). These methods allow us to estimate parameters with good statistical properties, predicting/imputing future/missing observations, and making optimal decisions according to prespecified utility functions. Moreover, Bayesian techniques are also intrinsically linked to sophisticated computational algorithms for model estimation, selection, and validation (Hoff, 2009). A wide variety of scientific publications in the context of Social Sciences point out that Bayesian methods are a predominant methodology for the analysis of different phenomena since the early 1990s (e.g., Western & Jackman, 1994; Jackman, 2004; Gill & Walker, 2005; Barberá, 2015; Moser et al., 2021; Lynch & Bartlett, 2019; Fairfield & Charman, 2022).
Several reasons justify adopting a Bayesian approach to statistical inference in Social Sciences. Quantitative research shows that the social phenomenon is quite different from its counterpart in the experimental sciences, so its characteristics and methodological requirements are more akin to the Bayesian paradigm, far from the assumptions of the frequentist approach (e.g., Jackman, 2009; Fairfield & Charman, 2019). That is why we discuss advantages (and challenges!) about adopting a Bayesian spirit in Social Science research.
We provide foundations about model fitting under the Bayesian paradigm, from the prior distribution and sampling distribution specification to posterior inference mechanisms, including model verification and evaluation. Additionally, we illustrate the essentials of Bayesian methodologies through case studies in contexts such as political surveys, population dynamics, and standardized educational testing.
Unlike other authors (Lenhard, 2022; Sosa & Buitrago, 2022; Van de Schoot et al., 2021; Kruschke, 2021; Van de Schoot et al., 2014; Draper, 2009; Walker et al., 2007; Jackman, 2004), we provide a review of the Bayesian paradigm focused exclusively on Bayesian machinery framed in Social Science research. We aim to offer a frame of reference for a broad audience, not necessarily with specialized knowledge in Bayesian statistics, yet having interest in incorporating this kind of methods in studying social phenomena.
This paper is structured as follows. Section 2 draws a parallel between Bayesian and frequentist inference. Section 3 presents the rationale for a Bayesian approach to scientific research in Social Science. Section 4 shows the importance of sensitivity analysis, the suitability of conjugate analysis, model evaluation and testing, and Bayesian computation via Monte Carlo simulation. Section 5 provides fully develops case studies from a Bayesian perspective. Finally, Section 6 discusses our main findings as well as some alternatives for future research.
Statistical inference: Frequentist versus bayesian
Two paradigms to data analysis coexist in Statistics: frequentist and Bayesian. They differ in many ways, including the probability interpretation, the parameters’ nature, and the statistical/computational methods required to make inferences. However, both alternatives are governed by axiomatic foundations of probability and use the likelihood function to estimate unknown parameters.
Under any of these approaches, the relationship between data
Bayes’ Theorem states that
where
The previous expression makes evident two important aspects: (i) the posterior distribution is simply proportional to the likelihood function times the prior distribution, and (ii) some frequentist results can be seen as a particular cases of Bayesian analysis. Regarding the first aspect, the influence of prior beliefs and data on the posterior distribution depends on the amount of information provided in the prior distribution and the sample size, respectively. Regarding the second aspect, frequentist and Bayesian are typically equivalent when the prior distribution is non-informative (all the possible parameter values have the same density) and/or the sample size is large in comparison with the dimension of the parameter space. In this case, the posterior distribution takes the same form as the likelihood function.
Generally, Bayesian analysis are simple and direct from a conceptual point of view since they mainly rely on a naive application of Bayes’ Theorem. Cox (1946), Cox (1963) and Savage (1972) constitute theoretical support to justify that if
Bayesian inference in social science
In social research, frequentist methods become unreliable when data do not correspond to a random sample from a larger population. The conventional interpretation of confidence intervals as well as hypothesis tests, which rely on patterns emerging from repeated sampling, tend to be confusing and inadequate in contexts where uncertainty do not arise from variation in repeated sampling (Western & Jackman, 1994; Jackman, 2009). Consequently, adhering to a frequentist notion of probability in the absence of repeatable data loses its meaning. In contrast, the subjective interpretation of probability provided in the Bayesian paradigm offers a coherent and internally consistent tool for statistical inference when data can not be framed in the context of repeated experimentation.
Social scientists often encounter themselves working with “small” data gathered from real-life social behavior, where classical experimental-design requirements are typically not met. In such scenarios, subjective judgments regarding the model’s formulation become inevitable and intrinsic to the scientific process (Jackman, 2009). Thus, it is natural for researchers in this context setting prior probabilities about the unknown quantities and interpret them subjectively (depending on the modeler’s state of knowledge). Furthermore, the absence of “large datasets” implies that estimates obtained through a frequentist approach lack robust statistical properties, particularly concerning the asymptotic properties that validate classical inference procedures. Comparative research studies in the social domain have demonstrated that applying frequentist methods to small data may lead to imprecise estimates of the effects of explanatory variables (e.g., Western & Jackman, 1994).
The social field abounds with data grouped over several units or periods. Hence, a key research question is how a causal structure operating at one level of analysis (e.g., individuals) varies over a higher level (e.g., localities or periods). The Bayesian approach to statistical inference is well suited to answer this question since it allows the researcher to formalize assumptions about between and within between-group heterogeneity by formulation a proper structured of prior beliefs. Thus, the prior distribution, considered by critics of Bayesian inference as a weakness, provides a way to expand from a simple model to a model involving several sources of heterogeneity, which allows modelers analyzing cases where social research requires understanding the relative weight of unknown quantities at different levels (Jackman, 2009).
Other advantages of Bayesian inference, not exclusive of Social Science, include its conceptual simplicity. As mentioned previously, Bayesian inference does not require to consider hypothetical data as frequentists do when developing confidence intervals and hypothesis tests. This is because the posterior distribution directly represents the most up-to-date information about the parameter, given nothing more than the observed data. Such a straightforward nature inherent in the Bayesian paradigm is quite appealing to social scientists, making it the primary methodology adopted by quantitative researchers in this field (Jackman, 2009).
Nowadays, Bayesian computation has become more feasible than ever before due to the current developments in both and hardware and software. Such a computational framework make possible to solve complex statistical problems that a few decades before were just not possible to handle. Specifically, the recent low cost and computational speed make it attainable for social scientists to analyze data from a Bayesian simulation-based approach. In this sense, the set of algorithms known as Markov Chain Monte Carlo (MCMC, see Section 4.3 for details) allows the Bayesian approach to be a practical reality for applied researchers (Jackman, 2009). These algorithms provide a powerful and flexible way to approximate the posterior distribution for most multiparameter models. For example, estimates of hierarchical models, latent variable models, and estimates based on observations with missing data have become straightforward procedures because the mathematics and computation underlying Bayesian analysis is drastically simplified via Monte Carlo simulation.
Fundamentals for bayesian modeling
Here, we provide essential details for formulating and fitting Bayesian models, from conjugate families to specifics in Bayesian computation based on Monte Carlo simulation. This material is key to understand and implement the case studies given in Section 5.
Prior specification
In their very core, all Bayesian models are hierarchical. The Bayesian paradigm follows a highly useful general idea: Suppose we are studying an unknown
The main criticism of the Bayesian approach relies on the inherent subjectivism of the prior distribution. For frequentists, data analysis based on subjective information states (depending on the analyst) lacks scientific rigour. However, the question is, why should we neglect available external information that is consistent with reality when it can contribute to explaining the phenomenon of interest and lead to more accurate inferences and plausible conclusions? Even when external information is not available, we can set the prior distribution to reflect such state of knowledge. In this regard, there are available a number of options to specify the prior distribution in an “objective” fashion thought the so called objective priors (see Reich & Ghosh 2019, Chap. 2 for details).
Since the state of information may vary depending on the analyst, the choice of the prior distribution and the robustness of the inferences based on this choice is a fundamental issue in Bayesian inference. Regarding the prior choice, the literature exposes some methods for eliciting prior distributions (e.g., Berger, 2013; Congdon, 2007; Garthwaite et al., 2005; Stuart et al., 1994). For instance, the Jeffreys’ prior of a parameter
From a Bayesian perspective, studying the robustness of inferences means performing a sensitivity analysis. The purpose is to examine how the posterior distribution changes as different values of the hyperparameters are adopted (a hyperparameter is a parameter of the prior distribution, which is set by the analyst; this term is useful to distinguish them from parameters of the model). This analysis allows us to argue that the conclusions are consistent from both a qualitative and quantitative point of view. There are two ways to perform a sensitivity analysis in practice, either (i) by weakening or strengthening the adopted prior or (ii) by repeating the analysis with different priors (Jackman, 2004).
Conjugate distributions
Conjugate families are essential in Bayesian statistics because they greatly simplify computation. Specifically, suppose that the prior distribution
Under conjugacy, the update from the prior to the posterior distribution merely changes the parameters that define the corresponding conjugate family. This feature is easy to interpret, and besides easing computation, this characteristic allows us to develop some intuition about Bayesian learning through straightforward examples. However, it’s important to acknowledge that conjugate priors have certain limitations. For instance, not all likelihood functions have a known conjugate prior, and most conjugacy pairs are applicable only to small-scale examples with a limited number of parameters (Jackman, 2004, 2009). Furthermore, not every state of knowledge about an unknown parameter is easy to express using a conjugate prior distribution.
Bayesian computation via monte carlo simulation
In Bayesian inference, we can formulate complex models that require simulation-based methods to explore the posterior distribution, typically based on the Monte Carlo principle together with Markov chains (e.g., Gamerman & Lopes, 2006; Robert & Casella, 2013; Turkman et al., 2019). These methods allow the generation of random samples from the posterior distribution (target distribution) when the underlying computations are either extremely demanding or analytically intractable.
The Monte Carlo principle states that any characteristic of a random variable can be approximated arbitrarily well by generating enough random samples from its probabilistic distribution. On the other hand, Markov chains are first-order stochastic processes (random sequences with serial dependence such that “what happens next depends only on the state of affairs now”) that allow us to explore the posterior distribution when it has an unknown analytic form (see Carlin & Louis, 2008, Jackman, 2009, and Meyn & Tweedie, 2012 for a formal treatment of Markov chains). To put it another way, using Markov chain Monte Carlo (MCMC) algorithms, we can generate correlated random draws from the posterior distribution in order to learn about any aspect of it.
Notice that enough IID samples constitute a direct approximation of the posterior distribution (convergence in probability), which can not be guaranteed for MCMC samples (convergence in distribution). Given the dependency among samples arising from a MCMC algorithm, there is no absolute certainty that the simulated chain has reached convergence to its intended target distribution (i.e., the posterior distribution). Therefore, evaluating convergence is key. In practice, it is common to diagnose non-convergence through graphical displays (e.g., traceplots) and numerical measures (e.g.,
Finally, aiming to increase as much as possible the effective sample size (equivalent sample size under IID sampling), it is customary to discard the initial values of the chain (burn-in period) as well as take systematic samples of it (thinning) to remove autocorrelation. It is also highly recommended to run several chains at different starting points of the parameter space to check whether they approach to the same stationary distribution or not.
Monte Carlo principle
Let
where
Gibbs sampler
When it is difficult to simulate from the posterior distribution directly, it is recommended to sample iteratively from the full conditional distribution
1. Draw
2. Draw
This algorithm generates a dependent sequence of values of
Metropolis-hastings
Again, when it is difficult or even possible to simulate from the posterior distribution directly, the Metropolis-Hastings algorithm provides a general setting to build a Markov chain through a series of “jumps” that generate a random sequence, whose target distribution is the posterior distribution
Simulate a jump candidate Compute the acceptance ratio
If the proposal distribution is symmetric, then the acceptance rate becomes
Typically, Determine the transition probability Simulate Set
Again, it can be shown the algorithm given above, regardless of the proposal distribution
A possible inefficiency of the Gibbs sampler and the Metropolis-Hastings algorithm lies in their local random walk behavior (Gelman et al., 2014), which causes the chain to take too long to explore the posterior distribution efficiently. Such a behavior leads to long-time converge times, mainly when dealing with complex models such as those related to high-dimensional posterior distributions (Betancourt, 2019). The Hamiltonian Monte Carlo algorithm is an alternative to overcome such inefficiency.
This algorithm considers a “boost” variable
Simulate Update
Update
Update
Repeat the above steps Let
Determine the transition probability Simulate Set
The tunning parameters
After establishing the structure of the model and approximating the posterior distribution
Usually, the model and data discrepancy is examined through a set of test statistics (e.g., measures of trend and variability), say
Model comparison
Information criteria allow us to evaluate and compare models through their predictive performance. Popular alternatives include the Deviance Information Criterion (DIC, see Gelman et al., 2014; Spiegelhalter et al., 2002) and the Watanabe-Akaike Criterion (WAIC, see Gelman et al., 2014; Watanabe, 2013).
The DIC is defined as
where
On the other hand, the WAIC is defined as
where
is the posterior predictive distribution in logarithmic scale, which summarizes the predictive ability of the model fitted to the data. The corresponding effective number of parameters is given by
which in practice can be calculated as
When Comparing models, lower DIC and WAIC values imply higher predictive accuracy.
Although the DIC is widely used as a model selection tool, it has several disadvantages compared to the WAIC. Common criticisms include the penalty term,
Cases studies
This section illustrates the Bayesian methodologies described in previous sections with three case studies. First, we exemplify the Monte Carlo principle using IID sampling in the context of a multinomial-Dirichlet model. Then, we illustrate the Metropolis-Hastings algorithm, the Hamiltonian Monte Carlo algorithm, and goodness-of-fit methods in the context of a generalized linear model for count data. Finally, we show the Gibbs sampler and information criteria metrics in the context of hierarchical linear regression models. The interested reader may request the code to reproduce all the examples from any of the authors.
Political survey: A Multinomial-Dirichlet model
We implement a Multinomial-Dirichlet model to analyze the 2022 Colombian Presidential Consultations. In Colombia, Presidential Consultations are basically open primary elections in which voters (general citizens) can indicate their preference for their party’s candidate in the upcoming presidential elections. For those readers unfamiliar with the Colombian political system, Colombian Presidential Consultations work almost identically to the primary elections in the United States. The Multinomial-Dirichlet model allows us to estimate the population share of votes that each candidate will receive based on the data provided by a national pollster.
Invamer’s survey results about party consultations in Colombia 2022
Invamer’s survey results about party consultations in Colombia 2022
The independent media company Valora Analitik reported that “after adding up the differences between the latest polls and the results given in Election Day, Invamer is the pollster that was closest in its predictions, followed by Guarumo and EcoAnalítica, and in third place, the CNC. The pollster furthest away from the results was Yanhaas, in fourth place” (
Although Invamer uses a particular kind of random sampling without replacement, it is customary to consider such a sample as a simple random sample with replacement, given that the total sample size is very small compared to the size of the Universe. Under the conditions given above and given that our uncertainty about the responses of the 1504 people in the survey is exchangeable, a particular version of De Finetti’s Theorem (Bernardo & Smith, 2000, p. 176) guarantees that the only sampling distribution appropriate for data of this nature is the Multinomial distribution. Below we describe the modeling approach as well as the results in the context of the political landscape in Colombia.
The population of interest consists of items categorized into
provided that
where
Using Eqs (1) and (2), a direct application of Bayes’ Theorem states that the posterior distribution of
which corresponds to the kernel of a Dirichlet distribution with parameters
where
Since the posterior distribution of
In Table 2, we compare our results (posterior mean) with the final report of the Registraduría Nacional del Estado Civil, which is the observed value in Election Day (
Observed value, posterior mean, and lower (
In this study, we examine the investigation conducted by Arcese et al. (1992) on the reproductive activities of
Given that the number of offspring is a count variable, we propose to model this variable as a function of age employing the following model:
where
A plot of the number of offspring versus age suggests that number of offspring varies with age according to a concave relationship (Hoff, 2009, p. 172). For this reason, we specify a linear predictor using a quadratic function of the form
To complete the model specification with sampling distribution (3), it is necessary to specify a prior distribution for
In this case, the posterior distribution of
with
where
We note that
In this case, we fit the model assuming a non-informative prior information, by letting
Figure 1 shows the Markov chains associated with
Effective sample sizes and Monte Carlo errors corresponding to the Markov chains associated with the posterior distribution
Markov chains associated with the posterior distribution 
Figure (a), (b), and (c) in Fig. 2 display the posterior distribution of the regression coefficients, accompanied by the respective point estimate and a 95% credible interval based on percentiles. Our findings indicate that age and age-squared effects are significant (credible intervals do not contain 0). Furthermore, the signs of the point estimates of
Finally, the model’s goodness of fit is evaluated by means of the posterior predictive distribution of a perdifined set of test statistics (see Section 4.4 for more details). In this case, the mean and variance are chosen as test statistics since they characterize essential aspects of the data (trend and dispersion) that might be overshadowed due to the mean-variance restriction of the Poisson model. Panels (e) and (f) of Fig. 2 suggest that the model fits the data well because the observed values of the data are typical values of the posterior predictive distribution of the corresponding test statistics (i.e., posterior predictive
Panels (a)–(c): posterior distribution of 
In this study, we employed three multiple linear regression models to examine the math score outcomes of the Saber 11 Test during the first semester of 2020 in Colombia. The Instituto Colombiano para la Evaluación de la Educación (ICFES) applies this standardized test periodically to measure the skills of students who finish secondary school. We aim to make inferences about the Colombian student population at the national and departmental levels about their performance in mathematics. We examined the score in mathematics because it is a variable that social researchers usually relate to other important educational factors (e.g., Anis et al., 2016; Živković et al., 2023). This dataset is publicly available (
Based on the Saber 11 exam design, the mathematics test is graded on a scale ranging from 0 to 100 (with whole numbers only), and also, it is calibrated using a 3PL model (3-parameter logistic model characterizing the probability of a correct answer based on ability, item difficulty, item discrimination, and pseudo-chance) in such a way that it has an average score of 50 points together with a standard deviation of 10 points. In our analysis, we treated the score as the response variable, while considering the student’s sex and employment status as covariates. Prior to model fitting, a pre-processing step was conducted, which involved eliminating all records with missing data. Furthermore, the variables “sex” and “employment status” were encoded (sex: 1 if male, 0 if female; employment status: 1 if worked 0 hours during the last week, 0 otherwise). The resulting dataset, formed through these adjustments, comprised a total of 14,015 records. Bayesian imputation methods are available (see for example Ch. 7 Hoff, 2009). However, since the percentage of records lost by direct deletion is very small, adding this level of additional complexity in any of the models is not necessary.
We observe that the averages oscillate between 35 and 70 approximately. In addition, we do not have any information for eight departments (including the archipelago of San Andrés, Providencia, and Santa Catalina). We also appreciate that the department with the highest average is Quindio, while the lowest is Caquetá. Finally, those departments located in the Orinoquía and Amazonía Regions of the country exhibited the lowest scores nationwide.
Let
Sampling distribution:
where
where Prior distribution:
Hyperparameters:
Sampling distribution:
where
where Prior distribution:
Hyperparameters:
Sampling distribution:
where
Prior distribution:
Hyperparameters:
DAGs for the multiple regression models.
We fit the models using a Gibbs sampler (see Section 4.3 for more details) with
Model 1: Model 2: Model 3:
The previous prior formulation contains the same amount of information (represented in
Table 4 shows the estimate and 95% credible intervals based on percentiles for the components of
Posterior mean and lower (2.5%) and upper (97.5%) limits of a credible interval based on 95% percentiles for the
The DIC evaluates the model’s predictive quality penalizing for the effective number of parameters. The results (Model 1: 111139.6; Model 2: 110435.8; Model 3: 109982.9) show that Model 3 has the best predictive capabilities according to the DIC. Unlike Models 1 and 2, Model 3 is a multilevel model with regression coefficients and specific variance components, which allows internal characterization of each department’s dynamics and direct department comparisons. For this reason, we use Model 3 to analyze behavior and differences between departments.
Posterior mean and credible intervals based on percentiles using 95% (thick lines) and 99% (thin line) confidence, for each regression coefficient 
Figure 4 shows the posterior means and credible intervals based on percentiles using 95% and 99% confidence, for each regression coefficient
Interestingly, employment status is significant in those departments where sex is also significant (except in Meta), in all cases, in favor of those individuals who did not work the week before taking the test. Other departments that turned out to have a significant association concerning employment status are Magdalena, Santander, and Risaralda. We observe that the most developed regions of the country, such as Bogotá, Antioquia (Medellín), and Valle del Cauca (Cali), where people commonly migrate to get job opportunities, present a greater inequality in terms of labor condition. Finally, we evidence an estimated effect greater than 5 points in some cases and up to 10 points in others, on math scores, for those who did not work in the previous week to perform the test. In particular, in Antioquia, Magdalena, and Santander, not working increases the math score considerably.
Our findings reveal that implementing the Multinomial-Dirichlet Model works well in scenarios requiring estimating proportions of interest from surveys. Specifically, in the context of political polls, we show that the majority (73.3%) of the credible intervals include the observed observed vales after Election Day. On the other hand, implementing the Poisson regression model exemplifies the use of Monte Carlo simulation in scenarios where the researcher has small sample sizes to assess the relationship between variables. In particular, the Bayesian model reasonably fits the data set in population dynamics. Likewise, the operationalization of the linear regression model from a Bayesian point of view allows us to illustrate the usefulness of hierarchical modeling to characterize population groups. Specifically, in the context of the performance of standardized tests, the model makes it possible to identify regions of Colombia with outstanding scores in mathematics, aside to quantify the association that covariates such as gender and employment status have on the me math score by geographic area.
As part of the revision process, one of the referees suggested that it would be quite beneficial to demonstrate how the results from applying a Bayesian analysis to the same data would be different from their frequentist counterpart. For example, the referee argues that the first case study could provide a good example to show the strengths and flexibility of Bayesian methods for integrating disparate sources of information if data available from several pre-election polls are integrated in a single analytic framework. Although, this is a fascinating idea, we do not follow this path here because our main purpose with this example is to illustrate a simple conjugate analysis together with the Monte Carlo principle. However, we sincerely encourage readers to pursue the referee’s proposal by formulating a multi-stage hierarchical model as in Section 5.3.
In addition, from the results in the applied contexts, we discuss and provide the technical details about conjugate modeling, hierarchical modeling, Monte Carlo simulation, Gibbs sampler, the Metropolis-Hastings algorithm, the Monte Carlo Hamiltonian algorithm, the evaluation of the model’s goodness of fit through test statistics, and the use of information criteria for model comparison.
On the other hand, the reader must be aware of the free-use specialized software alternatives currently available for doing Bayesian computing. However, we do not discuss them in this document for space reasons. These include Bugs (Bayesian inference Using Gibbs Sampling), Jags (Just Another Gibbs Sampler), Stan and Nimble (e.g., Kruschke 2014, and McElreath 2020), which are available in both R and Python. Finally, we encourage readers to inquire about other important topics typical of the Bayesian paradigm. These include exchangeability and De Finetti’s representation theorem, improper priors, objective priors, Bayes factors, model averaging, approximations of the posterior distribution through analytic methods (e.g., variational inference), and Bayesian non-parametric statistics. All of these topics can be found at Gelman et al. (2014), Reich and Ghosh (2019), and Heard et al. (2021).
Finally, as stated by one of the referees, it would be quite useful to demonstrate how the Bayesian approach offers certain advantages over its frequentist counterpart in relation to specific applications. In words of the referee, it would be quite beneficial for the readers to see how frequentist inferences using non-experimental data from social sciences can lead to misleading results; for example, resulting in confidence intervals that are too narrow because they neglect to account for all relevant sources of uncertainty. In this regard, we explicitly acknowledge that comparing frequentist and Bayesian methods is core to our research and would be pursed elsewhere.
Footnotes
Appendix
A Multinomial-Dirichlet model
Let
has a Dirichlet distribution with parameter
Choose any value for Simulate Compute
B Poisson regression
Let
C Multiple linear regression
Let
D Notation
The Gamma function is denoted by
Below we present the probabilistic distributions used in the applications:
Gamma: A random variable
Inverse Gamma: A random variable
Normal: A random variable
Dirichlet: A random vector
Multivariate Normal: A
Inverse Wishart: A
