Abstract
This article presents two alternative methods to null hypothesis significance testing (NHST) for improving inferences from underpowered research designs. Post hoc design analysis (PHDA) assesses whether an NHST analysis generating null findings might otherwise have had sufficient power to detect effects of plausible magnitudes. Bayesian analysis with default priors offers advantages over NHST for assessing null findings and detecting signals in underpowered data. Both methods are illustrated by application to Pager and Quillian’s influential study on attitude-behavior correspondence. PHDA results suggest the original study lacked sufficient power to detect strong associations between employers’ attitudes and behaviors. Bayesian analysis confirms strong attitude–behavior associations cannot be ruled out given the data. Together, these results question a frequently cited conclusion about attitude–behavior incongruence in survey vignettes. Overall, the examples illustrate how these analytical tools can be useful for describing uncertainty surrounding estimates and for improving substantive and theoretical debates across sociology.
Calls for increasing attention to reproduction and replication of scientific knowledge have been present in the social sciences for decades (cf. Freese 2007a, 2007b; Freese and Peterson 2017; King 1995). However, concerns about a “reproducibility crisis” have recently peaked, triggered by numerous failed attempts to replicate high-profile studies across several scientific disciplines (cf. Camerer et al. 2016; Errington et al. 2014; Freese and Peterson 2017; Jasny et al. 2011; Open Science Collaboration 2015). This reproducibility crisis has sparked renewed attention to normative scientific procedures and widespread reporting and publishing biases that may detrimentally affect scientific research (Gelman and Loken 2014; Ioannidis 2005; Simmons, Nelson, and Simonsohn 2011). In response, scholars have called for modifying normative practices and scientific standards to increase the verifiability, robustness, and repeatability of scientific findings (e.g., Freese and Peterson 2017; see also https://www.projecttier.org).
Amid reproducibility concerns, null hypothesis significance testing (NHST) in quantitative research has been a frequent target of criticism, particularly when such analyses lack sufficient power to detect meaningful effects (Cumming 2014; Fraley and Vazire 2014; Maxwell, Lau, and Howard 2015; Baker 2016). Generally, an NHST analysis involves estimating a focal effect or population difference from a set of observations then calculating a p value, which represents the probability of observing a point estimate at least as extreme as that observed over the long run with repeated random draws of comparable samples from the same population and under the assumption that the null hypothesis (e.g., of no effect or no difference) is true. NHST methods and results are not ipso facto problematic; rather, they generate useful information when applied and interpreted appropriately (Cumming 2014; Murtaugh 2014). However, p values and other NHST results are commonly misinterpreted (cf. Greenland et al. 2016; Wasserstein and Lazar 2016).
Moreover, researchers often apply NHST methods in underpowered research designs (Cohen 1992; Cumming 2014; Gill 1999). Such practices, combined with reporting and publishing biases, may contribute to an overabundance of false positive findings in published research (Fanelli 2012; Ioannidis 2005). Consequently, researchers and editors are increasingly encouraged to report and publish null findings as a remedy to this “file drawer” problem (cf. Franco, Malhotra, and Simonovits 2014; Goodchild van Hilten 2015; Grimes, Bauch, and Ioannidis 2018). Yet null findings from NHST studies often are misinterpreted as evidence in favor of the null hypothesis (Gill 1999; Greenland 2011). Such misinterpretations are especially problematic in underpowered designs, where null findings are both expected to occur more frequently and more likely to reflect false negatives or Type II errors.
In response, we showcase two methods that can improve conclusions drawn from NHST methods, particularly when null findings are generated in a potentially underpowered design. Although perhaps unfamiliar, these methods should be accessible to most sociologists trained in standard NHST statistical approaches. First, we present a simple post hoc design analysis (PHDA; Gelman and Carlin 2014) for assessing whether conclusions inferred from null findings generated using NHST are likely to be biased by low statistical power. Second, we present Bayesian analysis with default priors (Bååth 2014; Gelman et al. 2013:51-55) as an important tool for deriving signals from small samples, rare event data, or other underpowered designs.
We illustrate these methods by applying them to published data from Pager and Quillian’s (2005) influential sociological study examining the (non)overlap between employers’ attitudes and behaviors. Specifically, the original study reports a null association between employers’ decisions to hire applicants with criminal records in a hypothetical survey vignette and their actual callback behaviors in an experimental audit. This null finding is frequently cited as strong evidence of the invalidity of survey methods for assessing real-world behaviors (e.g., Jerolmack and Khan 2014).
The remainder of this article is organized around providing practical answers to the following methodological and substantive research questions: How can we know if an NHST analysis generating null findings had sufficient power to detect an effect of a reasonable magnitude? How can we detect a meaningful signal from underpowered data? How likely is it that Pager and Quillian’s (2005) null findings indicate a lack of association between “what employers say” and “what they do?”
PHDA
The first question motivating this research is: How can we know if an NHST analysis generating null findings had sufficient power to detect an effect of a reasonable magnitude? Below, we briefly describe statistical power and some problems stemming from underpowered designs, then describe PHDA as a useful tool for retrospectively calculating meaningful estimates of the power of a statistical test.
Statistical Power and Inference Errors
Statistical power in NHST is the probability that a given test will correctly reject the null hypothesis. It is determined by a study’s (sub)sample size, the estimated effect size, the α level, and measurement variance. Generally, studies examining large expected effects with low variance in large samples have a high probability of correctly rejecting the null hypothesis at a given α level.
In an underpowered study, the probability of accurately inferring the existence of a true nonnull effect is low. As a common rule of thumb (see Cohen 1992), a test with power less than .80 is considered underpowered and, hence, overly prone to such Type II errors. Somewhat paradoxically, low statistical power also increases the chances of overestimating effect sizes for true nonnull effects. This is because only exaggerated estimates can pass an α level threshold in a substantially underpowered study, and “statistically significant” nonnull estimates are more likely to be reported and published than “statistically nonsignificant” estimates (cf. “winner’s curse” discussions in Button et al. 2013; Vasishth and Gelman 2017).
Put simply, underpowered studies are characterized by a low probability of accurately detecting a signal indicating the existence of a true nonnull effect and a low probability of detecting an accurate signal of a true nonnull effect. Certain research designs are notoriously underpowered, such as those investigating small effects or relying on small samples (cf. Bradburn et al. 2007; Oakes 2017; Turner, Bird, and Higgins 2013). However, even studies investigating large expected effects in large samples can be substantially underpowered.
For example, consider the power of a test of group differences in proportions (which we apply later to Pager and Quillian’s [2005] data on differences in employer callback rates). Given the same expected effect size, a fixed α level, and equal sample sizes for both groups, the power of this test increases with the rarity of the outcome (i.e., because variance necessarily decreases as the baseline event rate diverges from p = .50). However, observational data often contain imbalanced groups, and the power is limited by the smallest subgroup frequency. Moreover, when also examining rare events across imbalanced groups, a test may be comparing cells containing zero or near-zero frequencies. Hence, NHST analyses may suffer from issues such as biased estimation and lack of statistical power when data contain only a small number of cases on the rarer of two outcomes (King and Zeng 2001; Ma, Chu, and Mazumdar 2016; Vuolo, Uggen, and Lageson 2016), even if the total sample size and expected proportion difference across groups both are relatively large.
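The two claims in this paragraph can be checked with a short calculation. The following Python sketch uses a standard normal approximation for a two-sided test of two proportions; it is an illustration only, not the exact-test procedure applied later in the article, and all group sizes and rates are hypothetical values chosen for the example.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n1, n2, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in proportions."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
    # Standard error under the null (pooled) and under the alternative.
    se_null = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    se_alt = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = abs(p1 - p2)
    # Probability that the estimate falls beyond either critical value.
    upper = 1 - std_normal.cdf((z_crit * se_null - diff) / se_alt)
    lower = std_normal.cdf((-z_crit * se_null - diff) / se_alt)
    return upper + lower


# Same 10-point difference and same total n (hypothetical values throughout).
rare_balanced = power_two_proportions(0.05, 0.15, 100, 100)
common_balanced = power_two_proportions(0.45, 0.55, 100, 100)
rare_imbalanced = power_two_proportions(0.05, 0.15, 180, 20)

print(f"rare outcome, balanced groups:   {rare_balanced:.2f}")
print(f"common outcome, balanced groups: {common_balanced:.2f}")
print(f"rare outcome, imbalanced groups: {rare_imbalanced:.2f}")
```

Under these assumed inputs, the same absolute difference is easier to detect when the baseline rate is far from .50, and power drops once most cases sit in one group. The normal approximation becomes unreliable when expected cell counts are very small, which is one reason to prefer exact methods for rare event data.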
Post hoc Power Calculations
Given the high likelihood of inference errors in underpowered research, it is crucial to determine whether an NHST analysis has sufficient power to detect an effect of a plausible magnitude. For this reason, an a priori power analysis is routinely recommended before collecting data (e.g., Lakens and Evers 2014). Yet much quantitative research in the social sciences involves analyses of secondary data, and useful insights are often gleaned from exploratory analyses that were not planned in advance and could not have been subjected to an a priori design analysis (Gelman and Loken 2014:464-65). Therefore, meaningful post hoc power estimates would be useful for evaluating many of the NHST designs and results in our field.
Unfortunately, the most common method for retrospectively estimating statistical power is to use a reported or published effect size estimate from a specific test to calculate the power of that same test. Such so-called observed power or post hoc power estimates are fundamentally uninformative (i.e., redundant with the observed effect’s p value) and, worse, are frequently misinterpreted (Hoenig and Heisey 2001; O’Keefe 2007).
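The redundancy is easy to see in the simplest case. As a minimal sketch, assume a two-sided z-test: the observed effect can then be recovered from the p value alone, so "observed power" is just a transformation of p. The function name below is ours, introduced only for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def observed_power(p_value, alpha=0.05):
    """'Observed power' of a two-sided z-test, computed from the p value alone."""
    # The observed |z| statistic is fully determined by the two-sided p value.
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Treat the observed effect as if it were the true effect.
    return (1 - STD_NORMAL.cdf(z_crit - z_obs)) + STD_NORMAL.cdf(-z_crit - z_obs)


for p in (0.01, 0.05, 0.20, 0.50, 0.90):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.3f}")
```

In this setting a result sitting exactly at p = .05 always has observed power of about .50, and any nonsignificant result necessarily has observed power below that. The calculation therefore adds no information beyond the p value itself.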
In contrast, Gelman and Carlin’s (2014) PHDA represents a significant departure from retrospective “observed power” calculations. Centrally, their PHDA calls for the thoughtful consideration of “plausible” effect sizes for a test that are based on external knowledge of substantive relationships (see 2014:647-48). Upon positing these plausible effect sizes, their PHDA method involves calculation of the probability of making errors when drawing inferences from NHST results.
It is important to note that Gelman and Carlin’s (2014) PHDA aims to shift focus away from power calculations (and statistical significance) and toward effect size estimation and precision by encouraging calculation of the probability of sign (type S) and magnitude (type M) errors. 1 With that said, NHST methods are still paradigmatic in the social sciences, statistical power remains a well-known (if often misunderstood) and important concept for NHST analysis, and concerns about potentially underpowered NHST analyses are central to the “reproducibility crisis.” Thus, in our view, attempts to retrospectively generate meaningful estimates of the statistical power of an NHST analysis are defensible as well as informative in certain situations. We argue the process of generating power estimates can be useful for assessing published conclusions from null NHST findings, and it is most valuable if paired with follow-up Bayesian analyses aimed at estimating credible effect sizes and relative probabilities for null and alternative hypotheses.
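For readers unfamiliar with type S and type M errors, the following Python sketch computes them for a simple case. It assumes a normally distributed estimate with known standard error, in the spirit of Gelman and Carlin's (2014) design calculations; it is not their published code, and the effect size and standard error in the usage lines are hypothetical.

```python
import random
from statistics import NormalDist


def retrodesign(true_effect, std_error, alpha=0.05, n_sims=100_000, seed=1):
    """Return (power, type S rate, type M exaggeration ratio) for a z-test."""
    if true_effect <= 0 or std_error <= 0:
        raise ValueError("true_effect and std_error must be positive")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ratio = true_effect / std_error
    # Probability of a significant estimate with the correct and the wrong sign.
    p_right_sign = 1 - std_normal.cdf(z_crit - ratio)
    p_wrong_sign = std_normal.cdf(-z_crit - ratio)
    power = p_right_sign + p_wrong_sign
    type_s = p_wrong_sign / power
    # Type M: average size of significant estimates relative to the true effect.
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > z_crit * std_error:
            significant.append(abs(estimate))
    exaggeration = (sum(significant) / len(significant)) / true_effect
    return power, type_s, exaggeration


# Hypothetical design: the true effect is half the size of its standard error.
power, type_s, exaggeration = retrodesign(true_effect=0.5, std_error=1.0)
print(f"power = {power:.3f}, type S = {type_s:.3f}, type M = {exaggeration:.1f}")
```

With these assumed inputs, power is well below .10 and estimates that reach significance overstate the true effect several times over. As the ratio of effect to standard error grows, the type S rate approaches zero and the exaggeration ratio approaches one.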
Thus, we adopt Gelman and Carlin’s (2014:647-48) practice of identifying plausible effect sizes, or what we refer to as counterfactual expected effects, in PHDA. The identification of counterfactual expected effects is conceptually akin to the processes involved in proper a priori design analysis and the identification of Bayesian informative priors. Moreover, it is this central element of Gelman and Carlin’s PHDA that permits calculation of meaningful retrospective power estimates.
Given their direct relationship to p values, power estimates calculated using counterfactual expected effects can serve as an intuitive indicator for whether an NHST study generating null findings otherwise might have been able to detect an effect of a reasonable magnitude. Our application to Pager and Quillian’s (2005) data illustrates the utility of this power-centered approach to PHDA in the context of null NHST findings. Here, we recommend the following procedures for conducting a post hoc assessment of the power of an NHST study:
(1) Identify the design elements of a given study that are relevant to power calculations, including sample size, variable distributions or measurement variance, and α levels.
(2) Given the study design features listed in #1 (e.g., sample size and variable distributions), hypothesize a range of plausible effect size estimates, or counterfactual expected effects, for the focal association or population difference.
These counterfactual effects should be independent of the study’s observed estimates of effects or population differences. That is, they should be determined using theory and logic in combination with any available external information about the focal effects or population differences (cf. Gelman and Carlin 2014:647-48; Lakens and Evers 2014).
The focal “observed” effect size or population difference and associated p value may be noted for later comparison, but these observed values should not be used in post hoc power calculations.
(3) Calculate a range of meaningful post hoc power estimates by entering the design elements identified in #1 and the counterfactual expected effects identified in #2 into a standard statistical calculator or software program designed to calculate power.
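The three steps above can be sketched as a small simulation. The Python code below is a minimal illustration under stated assumptions: the group sizes, baseline rate, and counterfactual differences are hypothetical, and the test is a pooled two-proportion z-test rather than any particular procedure used in this article.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_rejects(x1, n1, x2, n2, z_crit):
    """Two-sided pooled z-test for a difference in proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:  # no variation in the outcome, so nothing can be rejected
        return False
    return abs(x1 / n1 - x2 / n2) / se > z_crit


def simulated_power(p1, p2, n1, n2, alpha=0.05, n_sims=5000, seed=42):
    """Share of simulated studies that reject the null at the given alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        x1 = sum(rng.random() < p1 for _ in range(n1))
        x2 = sum(rng.random() < p2 for _ in range(n2))
        rejections += z_test_rejects(x1, n1, x2, n2, z_crit)
    return rejections / n_sims


# Step 1: design elements (hypothetical values).
design = {"n1": 90, "n2": 60, "alpha": 0.05}
baseline_rate = 0.05
# Step 2: counterfactual expected effects, chosen without reference to observed results.
counterfactual_differences = [0.05, 0.10, 0.20]
# Step 3: post hoc power for each counterfactual effect.
power_estimates = {}
for diff in counterfactual_differences:
    power_estimates[diff] = simulated_power(baseline_rate, baseline_rate + diff, **design)
    flag = "adequate" if power_estimates[diff] >= 0.80 else "underpowered"
    print(f"difference = {diff:.2f}: power = {power_estimates[diff]:.2f} ({flag})")
```

Simulation is used here because it extends to almost any test statistic; a closed-form or exact calculation from standard power software would serve equally well where one is available.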
Counterfactual Effects: From Testing to (Bayesian) Estimation
Following the procedures above, a PHDA can indicate whether a given NHST study that generated null findings might have been sufficiently powered to detect a range of plausible effects of reasonable magnitudes, thereby answering our first question. Like an a priori power analysis, if the post hoc power estimates from a PHDA are below .80 (Cohen 1992) or an alternative error threshold (e.g., .90; Lakens and Evers 2014), then the test in question might be considered underpowered. A smaller power estimate essentially indicates that, given the study’s design features and plausible counterfactual effect sizes, a comparable NHST analysis would have a low probability of correctly rejecting the null (i.e., of detecting a true nonnull effect). In these cases, one should take extra caution in making inferences or drawing conclusions from the statistical test. As explained earlier, underpowered tests too often result in false negative inferences when failing to reject the null (Type II errors; Cohen 1992) and, when rejecting the null, too often result in false positive inferences (Type I errors; Colquhoun 2014) or in overestimation of true nonnull effects (Type M errors; Gelman and Carlin 2014).
In addition to allowing retrospective calculation of meaningful statistical power estimates, the identification of counterfactual expected effects in PHDA might encourage a focal shift in quantitative sociology from hypothesis testing to effect estimation with quantified uncertainty (cf. Cumming 2014; Gelman and Carlin 2014; Hoenig and Heisey 2001; Kruschke and Liddell 2018b). Moreover, the thoughtful consideration of counterfactual expected effects might serve as a bridge to help scholars trained in NHST methods grasp Bayesian methods that rely on informative priors, since these methods involve a conceptually similar process of identifying plausible or likely effect size estimates (Gelman et al. 2013). As we will demonstrate, these counterfactual effects can also be used as simple heuristic thresholds that enrich interpretations in a simple follow-up Bayesian analysis specifying a default prior distribution.
Bayesian Analysis
Recall the second research question: How might we detect a meaningful signal from underpowered data? Bayesian analysis is well suited for this question, as it provides quantifiable estimates of both the magnitude of an effect and the degree of precision or uncertainty surrounding an effect estimate. Underpowered designs are characterized by imprecise estimation of effects or population differences; a Bayesian analysis provides a quantified summary in intuitive probability terms of the degree of precision or confidence associated with a range of credible estimates. We begin with a brief description of Bayesian analysis, then describe how a Bayesian approach to data analysis is particularly useful when NHST methods generate null findings, as is often the case in underpowered studies.
Much Ado about Priors
A defining feature of the Bayesian approach to data analysis is the specification of a prior distribution for model parameters. In Bayesian analysis, a posterior probability density is estimated as a function of both the data-driven likelihood estimates and a prior distribution for the model parameters. Specification of a prior distribution allows researchers to intentionally build in substantive knowledge about (a range of values for) a focal effect or population difference (i.e., “informative” priors). Subsequently, the analysis permits assessment of the degree to which model parameters estimated from new observations update—that is, modify or confirm—prior knowledge as represented in the prior distribution. Studies with very large samples and low variation would typically result in (posterior density) estimates that primarily reflect the signal in the data (i.e., the likelihood). Conversely, Bayesian estimates (from the posterior distribution) can minimize inference errors (i.e., by relying more heavily on an informative prior distribution) when the signal-to-noise ratio is low, such as when the data (likelihood) generate highly imprecise estimates from small samples with high variability.
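The balance between prior and data described here can be made concrete with the simplest conjugate case. The sketch below assumes a binomial outcome with a Beta prior, for which the posterior is available in closed form; the prior parameters and sample counts are hypothetical and chosen only to contrast a small and a large sample with the same observed rate.

```python
def beta_binomial_update(prior_a, prior_b, successes, trials):
    """Posterior Beta(a, b) parameters after observing binomial data."""
    return prior_a + successes, prior_b + (trials - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical informative prior centered on .20 (worth about 40 prior observations).
prior_a, prior_b = 8, 32

# Both samples show a 10 percent success rate but differ greatly in size.
small_posterior = beta_binomial_update(prior_a, prior_b, successes=1, trials=10)
large_posterior = beta_binomial_update(prior_a, prior_b, successes=100, trials=1000)

print(f"prior mean:                 {beta_mean(prior_a, prior_b):.3f}")
print(f"posterior mean, n = 10:     {beta_mean(*small_posterior):.3f}")
print(f"posterior mean, n = 1,000:  {beta_mean(*large_posterior):.3f}")
```

With only 10 observations the posterior mean stays close to the prior, whereas with 1,000 observations it sits almost on top of the observed rate. This is the sense in which large samples let the likelihood dominate.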
Not all applications of Bayesian analysis involve attempts to specify the subjective “state of knowledge” in an informative prior distribution (Gelman et al. 2013:34). Rather, many Bayesian applications rely on so-called default prior distributions, which are otherwise known as “noninformative,” “reference,” “objective,” “weakly informative,” or “minimalist” priors (cf. Berger 2006; Gelman et al. 2013:51-55; Gelman, Simpson, and Betancourt 2017; Ghosh 2011). Default priors are typically specified to ensure that posterior probability densities are driven primarily by the observed data. An example is the uniform or flat prior distribution for the binomial parameter, which assumes that all plausible values of θ (e.g., underlying “success” rates) are a priori equally probable (see Gelman et al. 2013:29-34).
Here, we recommend Bayesian analysis with default priors for a few reasons. First, because estimates (the posterior) are dominated by the data (likelihood), this approach typically generates comparable results to NHST methods that are more familiar to sociologists. 2 Specifically, with large samples, an NHST point estimate usually approximates the median of the posterior density under a default prior specification, and NHST 95 percent confidence intervals (CIs) similarly approximate Bayesian 95 percent posterior density credible intervals (see Schoot et al. 2014). However, with small samples, Bayesian analysis can improve NHST inferences—even with weakly informative prior distributions—by generating stabilized estimates that are more robust, make more sense, or perform better predictively in certain situations (Gelman et al. 2008).
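Both halves of this argument can be illustrated for a single proportion. The Python sketch below compares a conventional Wald 95 percent confidence interval with a 95 percent credible interval under a flat Beta(1, 1) prior, approximated by sampling from the posterior. The counts are hypothetical, and the Wald interval stands in for "the NHST interval" only as one common choice.

```python
import random
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, trials, level=0.95):
    """Conventional normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = successes / trials
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def flat_prior_credible_interval(successes, trials, level=0.95, draws=50_000, seed=7):
    """Equal-tailed credible interval under a uniform Beta(1, 1) prior."""
    rng = random.Random(seed)
    # With a flat prior, the posterior is Beta(1 + successes, 1 + failures).
    samples = sorted(
        rng.betavariate(1 + successes, 1 + trials - successes) for _ in range(draws)
    )
    lower_index = int((1 - level) / 2 * draws)
    upper_index = int((1 + level) / 2 * draws) - 1
    return samples[lower_index], samples[upper_index]


for successes, trials in [(300, 1000), (0, 10)]:  # hypothetical counts
    wald = wald_interval(successes, trials)
    bayes = flat_prior_credible_interval(successes, trials)
    print(f"{successes}/{trials}: Wald = ({wald[0]:.3f}, {wald[1]:.3f}), "
          f"flat-prior Bayes = ({bayes[0]:.3f}, {bayes[1]:.3f})")
```

In the large sample the two intervals nearly coincide. In the small sample with zero observed events, the Wald interval collapses to a single point at zero, while the flat-prior interval still allows for a nontrivial underlying rate.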
Relatedly, since a Bayesian analysis with default priors essentially “tips the scales” in favor of the data and generates results that are comparable to NHST results, this approach may be useful for assessing the degree of uncertainty surrounding published NHST estimates. Subsequently, scholars can assess whether such data-driven findings make sense in view of prior knowledge, either by specifying informative priors in a follow-up analysis or, as we will illustrate, by simply comparing estimates and credible intervals to a range of plausible counterfactual expected effects identified in PHDA.
Finally, the proper translation of substantive knowledge into mathematical prior distributions can be complex and may require additional training or consultation. In contrast, default priors are generally easier to use and interpret. For instance, in our example below, we rely on the Bayesian First Aid package in R (Bååth 2014), which is specifically designed to ease the transition from NHST methods to Bayesian alternatives. Currently, the package provides alternatives to six classical .test functions in R (e.g., binomial test, t tests, Pearson correlation, test of proportions, Poisson test; for more information, see https://github.com/rasmusab/bayesian_first_aid/blob/master/README.md). Moreover, Bayesian regression alternatives with default prior specifications are built into many popular statistical packages, including R (e.g., bayesglm; see Gelman et al. 2008), SAS (e.g., bayes statement; see Stokes, Chen, and Gunes 2014), SPSS (Bayesian Statistics menu; see https://www.ibm.com/support/knowledgecenter/SSLVMB_25.0.0/statistics_mainhelp_ddita/spss/advanced/idh_bayesian.html), Stata (e.g., bayes prefix; see https://www.stata.com/new-in-stata/bayes-prefix/), and Mplus (estimator = bayes; see Muthén 2010).
Null Findings: No Effect or Weak Signal?
Although widely applicable to many research questions (see Gelman et al. 2013; Kruschke 2015), Bayesian analysis is especially useful in situations where an NHST analysis generates null findings (Kruschke 2011). To illustrate, consider that a null finding in NHST is fundamentally uninformative about the true nature of an effect or population difference. As Gill (1999:661) explains, “Failing to reject the null hypothesis essentially provides almost no information about the state of the world. It simply means that given the evidence at hand one cannot make an assertion about some relationship: all you can conclude is that you can’t conclude that the null was false.”
Compared to NHST, Bayesian analysis provides more information about the strength and precision of a given signal in the data. A key benefit of Bayesian analysis in these cases is the generation of relative probabilities for various alternative hypotheses, which permit researchers “to quantify to what extent null results reflect a real absence of effects or a lack of statistical sensitivity” (Vadillo, Konstantinidis, and Shanks 2016:88; see also Dienes 2014; Kruschke 2011).
Put differently, in cases where NHST generates null findings, Bayesian analysis permits researchers to determine whether the null hypothesis is “more credible” than an alternative hypothesis, which is something “NHST can never do” (Kruschke and Liddell 2018b:196). Kruschke and Liddell note this feature “is highly desirable for theoretical domains in which ‘proving’ the null is the goal” (p. 196), as is arguably the case in Pager and Quillian’s (2005) study.
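As a rough illustration of what such relative probabilities look like, the following Python sketch approximates the posterior distribution of a difference between two proportions under independent flat Beta(1, 1) priors. It is a simplified stand-in, not the Bayesian First Aid implementation; the counts, the ±.05 "practically null" region, and the .10 threshold are all hypothetical choices made for the example.

```python
import random
from statistics import median


def posterior_difference_draws(x1, n1, x2, n2, draws=40_000, seed=11):
    """Draws from the posterior of (theta1 - theta2) under flat Beta(1, 1) priors."""
    rng = random.Random(seed)
    return [
        rng.betavariate(1 + x1, 1 + n1 - x1) - rng.betavariate(1 + x2, 1 + n2 - x2)
        for _ in range(draws)
    ]


def share(values, condition):
    """Proportion of posterior draws that satisfy a condition."""
    return sum(condition(v) for v in values) / len(values)


# Hypothetical data: 9 of 60 events in group 1 versus 6 of 90 in group 2.
diff_draws = posterior_difference_draws(9, 60, 6, 90)

p_positive = share(diff_draws, lambda d: d > 0)
p_in_null_region = share(diff_draws, lambda d: abs(d) < 0.05)
p_large = share(diff_draws, lambda d: d >= 0.10)

print(f"posterior median difference: {median(diff_draws):.3f}")
print(f"P(difference > 0):           {p_positive:.2f}")
print(f"P(|difference| < .05):       {p_in_null_region:.2f}")
print(f"P(difference >= .10):        {p_large:.2f}")
```

A conventional test of these hypothetical counts might well fail to reject the null, yet the posterior places most of its mass on a positive difference and only a minority inside the region treated here as practically null. Statements of this kind are what distinguish "no effect" from "weak signal."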
Example: “Walking the Talk”
Pager’s (2003) experimental audit is among the most well-known studies in contemporary sociology. Among other findings, her audit study showed that employers were less likely to call back applicants with a (randomly assigned) criminal record compared to similar applicants without criminal records.
Pager and Quillian’s (2005; hereafter P&Q) study paired these employer audit data with a self-report survey of the individuals in charge of hiring for audited employers (Pager 2002). Published analyses of the survey data reported stark disparities between self-reported employer attitudes and actual callback practices in the audit study. When presented with a vignette describing an applicant similar to those who had audited the employers, 62 percent of employers surveyed expressed some degree of willingness (“somewhat likely” or “very likely”) to hire a drug offender who had recently served a prison sentence. However, in the audit study, only 17 percent of white and 5 percent of black testers with criminal records received callbacks.
Perhaps more surprising was the reported lack of association between employers’ reported willingness to hire a drug offender and their actual callback behaviors (Kendall’s τ-b = .012, p > .05; see P&Q 2005:367). This discrepancy between employers’ vignette reports and audit behaviors is the central theme of P&Q’s (2005) influential American Sociological Review article reanalyzed here, entitled “Walking the Talk? What Employers Say versus What They Do.”
Legacy of “Walking the Talk”
P&Q’s article has been widely read and cited, accumulating over 500 citations on Google Scholar to date. More important than how frequently the article is cited is how or why it is cited. Despite the original authors’ measured discussions and their cautions against overstating conclusions, this article routinely is cited as evidence of the potential invalidity of survey methods for assessing real-world behaviors. Consider a few recent examples from sources published in 2017 expressing concerns about uncertainty surrounding the validity of surveys:
– “A number of significant concerns regarding vignettes’ real-life relevance come out of the work of Pager and Quillian (2005)…” (McDonald 2017:5)
– “Scholars have shown that in some cases, respondents may seek to give socially appropriate answers to questions, even if this involves distorting the truth (Pager and Quillian, 2005).” (Occhiuto 2017:278)
– “…since Pager and Quillian (2005) unquestionably demonstrated incongruence between what employers say and what they do.” (Reich 2017:129)
While citations of this kind are often defensive, reflecting authors’ attempts to recognize concerns about the use of survey designs, not all citations fit this description. Perhaps the strongest example is Jerolmack and Khan’s (2014) recent article in Sociological Methods and Research, entitled “Talk Is Cheap: Ethnography and the Attitudinal Fallacy.” In this article, the authors draw heavily on P&Q to make a case in favor of adopting ethnographic methods and against using “verbal accounts” like those collected in surveys due to the “fact that what people say is often a poor predictor of what people do” (Jerolmack and Khan 2014:178).
Why Revisit “Walking the Talk?”
P&Q’s study has three particularly desirable characteristics for illustrating the utility of PHDA and Bayesian analysis. First, the original study reports a statistically nonsignificant and near-zero association between attitudes and behaviors. Recall that null findings from NHST are uninformative and that underpowered statistical tests (and other design features) can cause failures to reject the null. Nonetheless, P&Q’s null finding is regularly cited as evidence of attitude–behavior incongruency or as evidence in favor of the null (e.g., Jerolmack and Khan 2014). A PHDA can help determine whether the initial study design might have had sufficient power to detect a reasonably strong attitude–behavior association in the first place.
Second, the focal study’s sample size (n = 156) might seem sufficiently large to suppress obvious concerns about low statistical power. However, the focal study analyzed employer callbacks of applicants with criminal records, which is a relatively rare event. Statistical power is limited by small subgroups and imbalanced cells, and these characteristics are common in observational and rare event data routinely analyzed by sociologists (Bradburn et al. 2007).
Third, the focal study is impressively transparent in reporting data and methods, particularly for research conducted prior to the recent replication crisis and subsequent transparency and open science movements. As a result, it is possible to reproduce the original analyses and verify published conclusions using a Bayesian approach, as well as assess the extent to which researchers’ measurement decisions affect the robustness of reported results using data published in the original article. Thus, to a degree, we can assess how the application of NHST methods and the “garden of forking paths” (Gelman and Loken 2014) might have affected P&Q’s estimate of the association between employers’ attitudes and behaviors.
Data
Pager’s (2003) original audit experiment involved sending same-race matched pairs of white (n = 150 pairs) and black (n = 200 pairs) men of similar age and credentials to apply for advertised, entry-level job openings in Milwaukee, WI. In total, 350 employers each were audited by a pair of same-race applicants (n = 700 applications), with one member of each pair randomly assigned to a “criminal record” condition.
The follow-up employer survey included a vignette closely approximating the audit conditions. Employers were presented with a hypothetical description of a 23-year-old black or white male named Chad (with Chad’s race matching the auditors’ race) applying for an entry-level opening. The vignette described Chad as having good references and interacting well with people, and it indicated Chad was convicted of a drug felony, served 12 months in prison, was released last month, and is now looking for a job. Employers were then asked how likely they are to hire Chad for an entry-level opening. Four response options ranged from “very likely” to “very unlikely.” In all, 177 employers completed the survey for a response rate of 51 percent (Pager 2002). For additional details about the original employer audit or follow-up survey data, see Pager (2002, 2003) and P&Q (2005).
Pooling data from the audit study and the follow-up survey (valid N = 156) permitted P&Q (2005) to examine congruence between employers’ vignette-based hypothetical hiring decisions and their actual callback behaviors. Our reanalysis relies on cross-tabular group frequency data published in the results and appendices of P&Q’s (2005:table 2, p. 367, and appendix B, p. 377) article. Ideally, this reanalysis would employ the original data. However, while the employer survey data were publicly available, the matching employer audit data were neither publicly available nor available upon request. 3
Measurement Variations
In P&Q’s analysis, employers’ audit callbacks were measured dichotomously as “yes” or “no” to indicate whether an audited employer called back the audit-matched applicant with a criminal record. Similarly, P&Q collapsed the very likely and somewhat likely survey responses and collapsed the somewhat unlikely and very unlikely responses to create a dichotomous indicator of whether an employer was “willing” or “unwilling” to hire the comparable hypothetical applicant.
Theoretically, P&Q justified collapsing somewhat likely and very likely responses by claiming this grouping corresponds most closely with the audit’s behavioral callback measure. Specifically, they argued that a callback “may in fact represent a very low bar of approval” when employers contemplate hiring an applicant (p. 364). However, callbacks might also represent a relatively high bar of approval, as employers might only call back those applicants whom they are “very likely” to hire. Moreover, when faced with competing qualified candidates with and without criminal records—as was the case for each employer in the matched-pair audit design—all other employers might at least somewhat favor candidates with comparable qualifications yet no criminal records.
Therefore, our reanalysis compares results obtained using the original coding condition with those produced using three alternative dichotomous employer contrasts: (1) very likely to hire versus at most somewhat likely to hire, (2) very unlikely to hire versus at least somewhat unlikely to hire, and (3) employers who answered very likely versus very unlikely. These contrasts permit assessment of the robustness of published conclusions across measurement specifications.
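The four coding conditions amount to different ways of collapsing a 4 × 2 table into a 2 × 2 table. The Python sketch below shows the mechanics using clearly hypothetical cell counts (they are not P&Q's published frequencies) and computes Kendall's τ-b for each collapsed table, using the fact that τ-b for a 2 × 2 table reduces to the phi coefficient.

```python
from math import sqrt

RESPONSES = ["very likely", "somewhat likely", "somewhat unlikely", "very unlikely"]

# HYPOTHETICAL counts per response: (callback, no callback). Not P&Q's data.
counts = {
    "very likely": (5, 35),
    "somewhat likely": (8, 72),
    "somewhat unlikely": (4, 46),
    "very unlikely": (1, 29),
}


def collapse(more_willing, less_willing):
    """Collapse response categories into a 2 x 2 table (a, b, c, d)."""
    a = sum(counts[r][0] for r in more_willing)  # more willing, callback
    b = sum(counts[r][1] for r in more_willing)  # more willing, no callback
    c = sum(counts[r][0] for r in less_willing)  # less willing, callback
    d = sum(counts[r][1] for r in less_willing)  # less willing, no callback
    return a, b, c, d


def tau_b_2x2(a, b, c, d):
    """Kendall's tau-b for a 2 x 2 table, which equals the phi coefficient."""
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator if denominator else float("nan")


contrasts = {
    "original: likely vs. unlikely": (RESPONSES[:2], RESPONSES[2:]),
    "(1) very likely vs. all others": (RESPONSES[:1], RESPONSES[1:]),
    "(2) all others vs. very unlikely": (RESPONSES[:3], RESPONSES[3:]),
    "(3) very likely vs. very unlikely": (RESPONSES[:1], RESPONSES[3:]),
}

for label, (more, less) in contrasts.items():
    table = collapse(more, less)
    print(f"{label}: table = {table}, tau-b = {tau_b_2x2(*table):.3f}")
```

Note that contrast (3) discards the two middle categories, so it is estimated on a smaller subsample than the other three.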
Example #1: PHDA
Method
A key goal of the PHDA is to assess whether P&Q’s data offer sufficient statistical power to detect a vignette–audit association of a reasonable magnitude. For this analysis, statistical power estimates are calculated at α = .05 with the Fisher method using the power.exact.test procedure in R (version 3.3.3; refer to https://cran.r-project.org/web/packages/Exact/Exact.pdf). In interpreting results, we rely on the commonly recommended power threshold of .80 (Cohen 1992). Analyses falling short of this threshold may be substantially underpowered and prone to Type II errors.
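The logic of an exact power calculation for a two-group comparison of proportions can be sketched briefly. The Python example below is an illustrative stand-in, not the power.exact.test routine used in this analysis: it enumerates every possible pair of callback counts, applies a two-sided Fisher exact test to each, and sums the binomial probability of the outcomes that reject at α. The function names and the example rates (.10 and .03) are hypothetical, and because implementation details differ across software, its output need not match the estimates reported in Table 1.

```python
from math import comb

def fisher_p(x1, n1, x2, n2):
    """Two-sided Fisher exact p-value for a 2 x 2 table with fixed margins."""
    m, n = x1 + x2, n1 + n2
    denom = comb(n, m)
    lo, hi = max(0, m - n2), min(n1, m)
    pmf = [comb(n1, k) * comb(n2, m - k) / denom for k in range(lo, hi + 1)]
    p_obs = pmf[x1 - lo]
    # Sum the probabilities of all tables no more likely than the observed one.
    return min(1.0, sum(p for p in pmf if p <= p_obs * (1 + 1e-7)))

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_power(p1, n1, p2, n2, alpha=0.05):
    """Probability that the Fisher test rejects, given true rates p1 and p2."""
    power = 0.0
    for x1 in range(n1 + 1):
        w1 = binom_pmf(x1, n1, p1)
        if w1 == 0.0:
            continue
        for x2 in range(n2 + 1):
            w = w1 * binom_pmf(x2, n2, p2)
            if w > 0.0 and fisher_p(x1, n1, x2, n2) <= alpha:
                power += w
    return power

# Hypothetical inputs: true callback rates of .10 and .03 in groups of 96 and 60
print(round(exact_power(0.10, 96, 0.03, 60), 2))
```

Because the Fisher test is conditional on the table margins, this enumeration tends to give conservative power values relative to unconditional exact tests.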
Identifying Counterfactual Expected Effects
The greatest challenge faced in a PHDA is determining plausible effect sizes (Gelman and Carlin 2014). In the case of P&Q’s study, this involves making a priori assumptions about how large a difference in audit callbacks of applicants with criminal records one should have expected to observe between employers who say they are more versus less likely to hire such candidates. In other words, to calculate meaningful estimates of statistical power, we must first identify counterfactual expected effects, or a range of estimates of the degree of overlap between vignette and audit data that might have been expected prior to conducting the original study.
As explained previously, counterfactual expected effects must be independent of the observed degree of overlap. Here, we estimate what the joint vignette/audit frequency distributions might have looked like if there were a relationship between what employers say and do, given the marginal probabilities of audit callbacks and of employers’ vignette-reported willingness to hire candidates with criminal records. Since counterfactual estimates are inherently disputable (for a thoughtful discussion, see Gelman and Carlin 2014:647-48), we start by identifying the “maximum possible” association that could have been observed, then calculate two different estimates representing a “moderate” and a “strong” expected vignette/audit association.
How Large an Effect Might Have Been Expected?
It is tempting to assume that a high correlation coefficient is expected between vignette and audit results, such as is found in prior research assessing attitude–behavior correspondence (e.g., r > .50; cf. Glasman and Albarracin 2006; Vaisey 2014). However, this assumption is misleading when estimating expected associations between binary variables measuring rare events, such as job applicant callbacks.
P&Q’s (2005:377) conclusions about group differences in callback rates are based on only 11 total observed callbacks of applicants with criminal records across all 156 employers with valid overlapping survey and audit data. Since the vast majority of employers (93 percent) in both the “more willing” and “less willing” groups do not call back candidates, these shared nonevents suppress the maximum size of observed associations. Furthermore, without a “no criminal record” control condition in the survey vignette, P&Q’s (2005) analyses are limited to examining callbacks of candidates with criminal records—an especially rare event.
As an illustrative example of the effect of imbalanced frequency distributions on the calculation of correlation coefficients, consider the size of the “criminal record effect” on callbacks documented in Pager’s (2003) audit study. The marginal probabilities of callbacks for whites and blacks with and without a criminal record (Pager 2003:958) can be transformed into a frequency distribution, from which a correlation coefficient can be calculated. For instance, it appears that 79 of the 350 applicants in the “no criminal record” condition received callbacks (whites: 150 × .34 = 51; blacks: 200 × .14 = 28), compared to about 36 of the 350 applicants in the “criminal record” condition (whites: 150 × .17 = 25.5; blacks: 200 × .05 = 10). This “criminal record effect” equates to an odds ratio (OR) of 2.54, or a Φ coefficient of association of −0.17.
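The conversion from callback frequencies to an OR and a Φ coefficient can be checked with a short calculation. The Python sketch below uses the approximate counts given above (36 and 79 callbacks out of 350 applicants per condition); the function names are illustrative, and the row ordering (criminal record first) is an assumption chosen so that Φ takes a negative sign.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table with rows (a, b) and (c, d)."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def odds_ratio(a, b, c, d):
    """Odds of the first column in row 1 relative to row 2."""
    return (a * d) / (b * c)

# Rows: criminal record, no criminal record; columns: callback, no callback
record = (36, 350 - 36)
no_record = (79, 350 - 79)

phi = phi_coefficient(*record, *no_record)
or_no_record = odds_ratio(*no_record, *record)

print(round(phi, 2))           # -0.17
print(round(or_no_record, 2))  # 2.54
```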
This coefficient might be mistakenly interpreted as a small effect, for instance, if one were to rely on Cohen’s (1988, 1992) “rules of thumb” for interpreting correlational effect sizes. Such a mistake becomes apparent upon calculating the “maximum possible” Φ coefficient (Φmax) for the observed marginal distribution. This can be done simply by imagining that all 36 of the applicants in Pager’s criminal record condition who received callbacks from employers instead did not receive callbacks. In this counterfactual case, the Φ coefficient representing the correlation between “criminal record” condition and employer callbacks would equal a maximum negative value of −0.35. In other words, this is the lower-bound (i.e., “maximum” negative) value that Φ might have taken given the marginal probabilities. Note that this value deviates substantially from −1.0, the lower bound of a Pearson’s r coefficient.
Thus, as illustrated in this example, rare events suppress the minimum and maximum possible values for Φ, making interpretation of effect sizes such as Φ values (e.g., −0.17) problematic using widely known “rules of thumb” (Cohen 1988). 4 However, dividing observed Φ by maximum Φ (Φ/Φmax) produces a normed coefficient that is conceptually and often empirically comparable to the absolute value of a Pearson’s r (but not equivalent; cf. Davenport and El-Sanhurry 1991; Kaltenhauser and Lee 1976:310; Olivier and Bell 2013). Using this normed statistic, the magnitude of the observed “criminal record effect” in Pager’s (2003) audit is approximately half the maximum possible given the observed marginal probabilities (Φ/Φmax = −.17/−.35 = .49), which is conceptually akin to Cohen’s (1988) threshold for a “large” or strong effect (r = .50).
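The norming step can be sketched in the same way. The Python example below computes Φ for the approximate observed counts, computes Φmax for the counterfactual table in which no applicant with a criminal record receives a callback, and divides the two. It is a rough check under the stated counts rather than a reproduction of the original calculation: working from unrounded values, it yields figures close to, but not identical with, the rounded values reported in the text.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table with rows (a, b) and (c, d)."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Observed table: rows = record / no record, columns = callback / no callback
phi_obs = phi_coefficient(36, 314, 79, 271)

# Counterfactual table: no callbacks at all in the criminal record condition
phi_max = phi_coefficient(0, 350, 79, 271)

# Normed coefficient: share of the maximum attainable association
normed = phi_obs / phi_max
print(round(phi_obs, 3), round(phi_max, 3), round(normed, 3))
```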
As in the above illustration, we begin our PHDA by identifying counterfactual “maximum possible” vignette–audit associations (Φ/Φmax = 1) for each coding condition. Subsequently, based on prior information about meta-analytic average correlations between survey-based and observational measures of behavior (Glasman and Albarracin 2006; Kraus 1995), these “maximum” distributions are modified to produce two counterfactual expected callback distributions approximately corresponding to “moderate” (Φ/Φmax ≈ .38) and “strong” (Φ/Φmax ≥ .52) vignette–audit associations. Estimation of these counterfactual effects relies on knowledge about probabilities of the outcome, or the relative rarity of audit callbacks, as well as information about probabilities of treatment/group allocation, or different levels of vignette-reported willingness to hire applicants with criminal records. (For a discussion of these two types of probabilities and an example of their use in a reanalysis, see Olivier and Bell 2013.) Probabilities of the outcome are derived from observed callback proportions in Pager’s (2003) original audit study; probabilities of group allocation are marginal probabilities derived from survey responses in P&Q’s (2005:377) analytic sample.
Assumptions about Expected Callbacks
We calculate expected callback proportions for employers in the “more willing to hire” groups in each coding condition by multiplying the marginal probability for the “more willing” group by a weighted estimate of the probability of callbacks for that group. First, to calculate the weighted probability of a callback, we assume that employers who report being very likely to hire the hypothetical applicant with a criminal record in the survey vignette will, on average, call back candidates with criminal records at the audit-average callback rate for applicants without criminal records. That probability, combined for whites and blacks, is .226 ([.34 × 150 + .14 × 200]/350 = 79/350 = .226; see estimates from Pager 2003:958). This expected callback probability for employers in the very likely group is an anti-conservative estimate that should result in conservative statistical conclusions favoring the original study. Specifically, it is untenable to assume that employers, when faced with competing candidates, are as likely to call the applicant with a criminal record as they are to call the audit-matched candidate without a record. Hence, the plausible expected effect sizes generated by this analysis should be inflated and, thus, should tip the scales in favor of supporting P&Q’s conclusion that the vignette/audit overlap is weak or null in these data.
In comparison, we assume that employers who report being somewhat likely and somewhat unlikely to hire the hypothetical drug offender will call back applicants with criminal records at the audit-average rate that employers called back candidates in the “criminal record” condition. That callback probability, combined for whites and blacks, is .103 ([.17 × 150 + .05 × 200]/350 = 36/350 = .103). Although perhaps more realistic than the very likely assumption, this assumption should also be somewhat anti-conservative in that it presumes employers reporting some reservations (somewhat likely) and those reporting even greater reservations (somewhat unlikely) about hiring a hypothetical candidate with a criminal record both will call back such applicants at the audit-average rate in real-world competitive hiring conditions.
Based on these assumptions, we might have expected approximately five callbacks of ex-convicts from the 22 employers in the very likely group (.226 × 22 = 4.97), eight callbacks from the 74 employers in the somewhat likely group (.103 × 74 = 7.62), and three callbacks from the 28 employers in the somewhat unlikely group (.103 × 28 = 2.88). This equates to 16 expected callbacks out of 156 applications in the “criminal record” condition, or an expected callback probability of .103. This counterfactual callback rate (.103) is equivalent to the audit-average callback rate observed in Pager’s (2003) study and greater than the observed callback rate in the overlapping vignette–audit sample (11/156 = .071).
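The expected callback counts follow directly from the group sizes and the two assumed callback probabilities. A minimal Python sketch of that arithmetic, using the rounded probabilities (.226 and .103) as they appear in the text; the variable names are illustrative:

```python
# Callback probabilities as reported in the text (derived from Pager 2003)
rate_no_record = 0.226
rate_record = 0.103

# Group size and assumed callback probability for each vignette response
groups = {
    "very likely": (22, rate_no_record),
    "somewhat likely": (74, rate_record),
    "somewhat unlikely": (28, rate_record),
}

expected = {name: n * rate for name, (n, rate) in groups.items()}
rounded = {name: round(value) for name, value in expected.items()}
total = sum(rounded.values())

print(rounded)                       # {'very likely': 5, 'somewhat likely': 8, 'somewhat unlikely': 3}
print(total, round(total / 156, 3))  # 16 0.103
```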
Calculating “Maximum Possible” Associations
After fixing the marginal probabilities for vignette responses and setting assumptions about expected audit callbacks among employers in the “more likely” groups, we can calculate the “maximum possible” association (Φmax) for each vignette coding condition. This is achieved by fixing the expected number of audit callbacks by employers in the “less likely” group to zero. For example, using P&Q’s (2005) coding, this procedure results in 13 expected callbacks of 96 employers in the somewhat/very likely group (somewhat likely callbacks = 8; very likely callbacks = 5) versus zero expected callbacks of 60 employers in the somewhat/very unlikely group (see column 1, panel 1.A, Table 1 in the Results section), for a maximum Φ association (Φmax) equal to .24.
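As a check on this example, Φmax for the original coding can be computed from the counterfactual table described above (13 callbacks among 96 employers versus none among 60). The Python sketch below is illustrative; the function name is not taken from the original analysis.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table with rows (a, b) and (c, d)."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Rows: "more likely" (96 employers), "less likely" (60 employers)
# Columns: callback, no callback
phi_max = phi_coefficient(13, 96 - 13, 0, 60)
print(round(phi_max, 2))  # 0.24
```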
Table 1. Counterfactual Associations between What Employers Say in a Vignette and What They Do in an Audit, by Expected Effect Magnitude and Vignette Coding Condition.
Note: Joint frequency distributions are counterfactual “expected” audit callbacks of applicants with criminal records by employers’ reported willingness to hire such candidates in a survey vignette. Distributions were calculated by starting with the marginal frequencies for employers’ willingness to hire candidates with criminal records in the overlapping vignette/audit sample (Pager and Quillian 2005:377). Next, expected callback rates for employers in the “more likely” group (top row in each panel) were calculated by multiplying very likely and somewhat likely/unlikely subgroup frequencies, respectively, by observed callback probabilities for applicants without and with criminal records (.226 and .103; see Pager 2003). “Maximum possible” associations between vignette reports and audit behaviors (column A) were then obtained by fixing the “less likely” callback rate to zero. Strong and moderate associations (columns B and C) were calculated using mean meta-analytic attitude–behavior correlations as benchmarks (i.e., Φ/Φmax values comparable to r ≥ .52 and r ≈ .38).
“Strong” and “Moderate” Associations
Counterfactual “strong” and “moderate” frequency distributions were estimated by fixing the marginal probabilities for vignette group membership and callback outcomes at the same values in the “maximum possible” distributions for each coding condition, then adjusting expected callback frequencies to identify distributions where Φ/Φmax is, respectively, equal to or greater than .52 (“strong”) or approximately equal to .38 (“moderate”). These values were selected based on prior information about comparable meta-analytic average correlations between survey-based and observational measures of behavior (Glasman and Albarracin 2006; Kraus 1995). This decision is in line with both Cohen and Cohen’s (1983:59, 60) and Gelman and Carlin’s (2014:647) recommendations for hypothesizing an expected effect size. Moreover, these normed values have the added advantage of being conceptually comparable to Cohen’s (1988) “rule of thumb” thresholds for medium (r = .30) and large (r = .50) effect sizes in the social sciences.
This procedure generates counterfactual “strong” distributions (column 2 in Table 1) with Φ/Φmax values ranging from .53 to .70. Examination of the relative odds of expected callbacks for employers in the “more likely” versus “less likely” groups in these “strong” distributions reveals ORs ranging from 3.75 to 10.42 (see Table 1). Each of these values exceeds Olivier and Bell’s (2013) recommended “strong” threshold (OR = 3.0) for 2 × 2 tables.
The counterfactual “moderate” distributions (column 3, Table 1, in Results below) were identified by changing the “strong” distribution by one callback. That is, we moved a single expected callback from “more likely” to “less likely” in each coding condition. This change resulted in Φ/Φmax values ranging from .30 to .40. Compared to “strong” distributions, these “moderate” distributions may be more plausible—yet still anti-conservative or inflated—expected effect size estimates. This is because design features of P&Q’s (2005) study, including the dichotomous outcome variable and dissimilar target behaviors (i.e., hiring versus callbacks), tend to suppress overall attitude–behavior correspondence (for detailed discussions, see Glasman and Albarracin 2006; Kraus 1995).
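To make the one-callback shift concrete, the Python sketch below summarizes two candidate distributions for the original coding condition (96 “more likely” and 60 “less likely” employers, 13 expected callbacks in total). The cell counts used here, 11 versus 2 callbacks for “strong” and 10 versus 3 for “moderate,” are an assumption inferred from the procedure described above and from the OR of 3.75 reported for this condition; they are not copied from Table 1.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table with rows (a, b) and (c, d)."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def summarize(more_calls, less_calls, n_more=96, n_less=60):
    """Normed phi, odds ratio, and proportion difference for one distribution."""
    a, b = more_calls, n_more - more_calls
    c, d = less_calls, n_less - less_calls
    total_calls = more_calls + less_calls
    phi = phi_coefficient(a, b, c, d)
    # Maximum association: every expected callback falls in the "more likely" group.
    phi_max = phi_coefficient(total_calls, n_more - total_calls, 0, n_less)
    return {
        "normed_phi": round(phi / phi_max, 2),
        "odds_ratio": round((a * d) / (b * c), 2),
        "prop_diff": round(a / n_more - c / n_less, 2),
    }

strong = summarize(11, 2)    # inferred cell counts, not copied from Table 1
moderate = summarize(10, 3)  # one callback moved from "more" to "less" likely
print(strong)    # {'normed_phi': 0.6, 'odds_ratio': 3.75, 'prop_diff': 0.08}
print(moderate)  # {'normed_phi': 0.4, 'odds_ratio': 2.21, 'prop_diff': 0.05}
```

Under these assumed counts, moving a single expected callback lowers the normed coefficient from .60 to .40, which illustrates how sensitive effect size estimates are when callbacks are rare.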
With that said, even these “moderate” associations equate to relatively large ORs ranging from 1.91 to 4.37. Hence, our lower-bound expectation sets a relatively high bar for the vignette–audit overlap, as a “moderate” association is defined as the following expectation in this analysis: An employer who reports being “more willing” to hire a hypothetical candidate with a criminal record in a noncompetitive vignette is expected to call back such applicants in a real-world competitive hiring condition—in which the employer faced at least one equally competitive audit-matched candidate without a criminal record—at approximately two to four times the rate of their “less willing” counterparts.
Results
Table 1 presents counterfactual frequency distributions representing “maximum possible” (column 1), “strong” (column 2), and “moderate” (column 3) associations between employers’ vignette reports and audit callbacks for each coding condition. Specifically, panel 1A employs P&Q’s (2005) coding condition, whereas the remaining panels contrast employers who are very likely versus at most somewhat likely (1B), very unlikely versus at least somewhat unlikely (1C), and very likely versus very unlikely (1D) to hire the hypothetical applicant.
Expected vignette–audit associations are summarized using raw (Φ) and normed (Φ/Φmax) Φ coefficients of association as well as proportion differences and ORs contrasting audit callbacks between “more willing” versus “less willing” groups. The expected proportion difference estimates in the “strong” and “moderate” conditions (columns 2 and 3) will be used later as benchmarks for evaluating the observed effects in our reanalysis.
In the PHDA, expected effect size thresholds and corresponding frequency and proportion distributions permit retroactive calculation of statistical power. Hence, Table 1 also includes statistical power estimates for each expected effect magnitude (columns) and vignette coding condition (panels). These estimates reveal a few particularly noteworthy findings.
First, only 2 of the 12 counterfactual distributions show sufficient power to detect an effect of the displayed magnitude. One such sufficiently powered condition involves P&Q’s coding condition (.99; panel 1A, column 1); the other involves a contrast between employers in the very likely and at most somewhat likely groups (.98; panel 1B, column 1). In both cases, these results suggest that future analyses using the same design would be sufficiently powered to detect the “maximum possible” effect size at greater than .80 probability. Specifically, using a two-tailed test at α = .05, an analysis performed with P&Q’s coding should have a 99 percent chance of observing a statistically significant group difference in callbacks of .14, or the maximum possible observable proportion difference given marginal probabilities. In contrast, analyses involving the remaining two coding conditions (panels 1C and 1D) might be insufficiently powered, according to conventional standards, to detect a statistically significant group difference of even the “maximum possible” effect size at greater than .80 probability (power = .66 and .78, respectively). 5
Moving on to our plausible counterfactual expected effects, the PHDA results suggest comparable analyses would be severely underpowered to detect a “moderate” effect (power ranging from .06 to .28) as well as underpowered to detect a “strong” effect (power ranging from .22 to .66) in all four coding conditions. Hence, future studies using the same design would have substantially less than an 80 percent chance of detecting a moderate or strong association between vignette reports and audit behaviors at α = .05, irrespective of measurement. In other words, even if the true vignette–audit associations were strong in magnitude, these low counterfactual power estimates suggest that NHST studies employing the same design would often generate false negatives (and, in cases where the null is accurately rejected, analyses would likely overestimate the true nonnull associations). Overall, results of this PHDA suggest that P&Q’s study likely lacked the necessary statistical power at the outset to detect a reasonable association between what employers say and what they do.
Example #2: Bayesian Analysis
Method
After conducting the PHDA, we verify and assess robustness of P&Q’s (2005:367) NHST results by assessing whether employers’ vignette-stated willingness to hire a hypothetical applicant with a criminal record is statistically independent of employers’ callbacks of such applicants in a matched experimental audit design across multiple measurement conditions. We then extend these NHST analyses by estimating plausible parameter values for the vignette–audit association using the Bayesian First Aid package in R (Bååth 2014).
Specifically, we use the bayes.prop.test command in the Bayesian First Aid package. This procedure relies upon observed data and a default prior distribution to estimate relative frequencies of success across groups, in this case proportion differences in audit callbacks across employers who report being more likely versus less likely to hire an applicant with a criminal record. The package specifies a binomial distribution, with the prior distribution of the relative frequency of success (θ) specified as flat or uniform (θ ∼ β[1,1]; refer to http://www.sumsar.net/blog/2014/01/bayesian-first-aid-binomial-test/).
A choice of default or “minimalist” priors (see Gelman et al. 2017) should be appropriate for these simple binomial models, where both successes (callbacks) and failures (noncallbacks) are possible outcomes and prior expert knowledge on the distribution of callbacks and the specific overlap between employers’ vignette reports and audit behaviors is sparse. 6 More importantly, the use of informative priors—such as knowledge of meta-analytic average correlations between attitudes and behaviors (e.g., r > .50; cf. Glasman and Albarracin 2006; Vaisey 2014)—would cause those previously documented stronger associations from larger overall samples to dominate the posterior distribution, thus tipping the scales toward rejecting P&Q’s (2005) conclusions.
Given the observed data and default prior, the bayes.prop.test command infers a posterior probability density of relative frequencies (proportion callbacks) from which point estimates (e.g., the median of the posterior density) and 95 percent credible intervals (highest posterior density intervals, or HDIs) are calculated. In addition, the analysis estimates a posterior probability density, a point estimate, and credible intervals for the difference in relative group frequencies.
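The estimation logic can be illustrated without specialized software. With a binomial likelihood and a flat Beta(1, 1) prior, the posterior for each group’s callback proportion is Beta(1 + callbacks, 1 + noncallbacks), so the posterior of the group difference can be approximated by direct simulation. The Python sketch below follows that conjugate shortcut rather than the sampler used by bayes.prop.test, and the callback counts (7 of 96 and 4 of 60) are hypothetical values chosen for illustration, not the observed cell frequencies.

```python
import random

def posterior_draws(successes, n, draws, rng):
    """Draws from Beta(1 + successes, 1 + failures), the flat-prior posterior."""
    return [rng.betavariate(1 + successes, 1 + n - successes) for _ in range(draws)]

def hdi(samples, mass=0.95):
    """Shortest interval containing the given share of the samples."""
    s = sorted(samples)
    k = int(mass * len(s))
    width, i = min((s[j + k] - s[j], j) for j in range(len(s) - k))
    return s[i], s[i + k]

rng = random.Random(2005)
# Hypothetical counts: 7 of 96 and 4 of 60 callbacks (illustrative only)
theta1 = posterior_draws(7, 96, 20000, rng)
theta2 = posterior_draws(4, 60, 20000, rng)
diff = sorted(t1 - t2 for t1, t2 in zip(theta1, theta2))

median_diff = diff[len(diff) // 2]
low, high = hdi(diff)
print(round(median_diff, 3), round(low, 3), round(high, 3))
```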
These estimates can then be compared to the counterfactual expected “moderate” and “strong” effects for each coding condition. Specifically, akin to the logic of null hypothesis testing, we can examine whether expected group differences representing “moderate” or “strong” associations fall within the 95 percent credible intervals and, hence, are considered plausible values given the data and default priors.
Finally, and in important contrast to null hypothesis testing, these Bayesian estimations of posterior probability densities allow for calculation of the probability that the underlying relative frequency of callbacks is greater in one group compared to another. Put simply, this probability estimate tells us whether it is a “good bet” to assume that employers who say they are more likely to hire ex-convicts (in the vignette) do indeed call back ex-convicts (in the audit) at a greater rate than do their less likely counterparts. Probabilities close to .50 suggest a poor bet; one is as likely to be wrong as right in betting that employers do what they say given these data (and default priors). Probabilities approaching .99 indicate a good bet, suggesting a high probability that employers who say they are more likely to hire ex-convicts also are more likely to call back applicants with criminal records.
In this analysis, we calculate three different probability values for each coding condition. Specifically, we present probabilities that the difference between the “more” and “less” willing group is (1) greater than zero, (2) equal to or greater than the moderate counterfactual effect estimate, and (3) equal to or greater than the strong counterfactual effect estimate.
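Each of these three probabilities is simply the share of the posterior distribution of the group difference that lies at or above a given threshold. A minimal Python sketch of that calculation follows, again using flat Beta(1, 1) priors and direct simulation. The callback counts and the thresholds (.05 and .08) are illustrative inputs, and the function name is hypothetical.

```python
import random

def prob_diff_at_least(x1, n1, x2, n2, threshold, draws=20000, seed=1):
    """Posterior probability that theta1 - theta2 is at least the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        theta1 = rng.betavariate(1 + x1, 1 + n1 - x1)
        theta2 = rng.betavariate(1 + x2, 1 + n2 - x2)
        if theta1 - theta2 >= threshold:
            hits += 1
    return hits / draws

counts = (7, 96, 4, 60)  # hypothetical callbacks and group sizes
p_zero = prob_diff_at_least(*counts, 0.0)
p_moderate = prob_diff_at_least(*counts, 0.05)
p_strong = prob_diff_at_least(*counts, 0.08)
print(p_zero, p_moderate, p_strong)
```

Because the same seed is reused for each threshold, the three probabilities are computed from identical draws and are therefore guaranteed to decline as the threshold rises.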
Results
Robustness of NHST Null Findings
Despite concerns about statistical power, for comparison purposes, Table 2 reports results of null hypothesis tests (column 3) applied to P&Q’s observed joint frequency data (columns 1 and 2). For each coding condition (panels 2A–D), these tests assess whether we can reject the null hypothesis of no difference in audit callback rates between employers who report being “more willing” versus “less willing” to hire applicants with criminal records. Column 3 of panel 2A in Table 2 reproduces P&Q’s (2005:367) published null findings using their original coding. The next three panels (panels 2B–D) replicate this null finding in all three alternative coding conditions. Given the aforementioned lack of statistical power to detect a realistic association in these data, the null hypothesis test results are unsurprising.
Table 2. Observed Associations between What Employers Say in a Vignette and What They Do in an Audit When Considering Job Applicants with Criminal Records.
Note: OR = odds ratio.
a. “Rel. freq.” refers to the estimated relative frequency of success, or the median of the posterior distribution for employers in the “more likely” (θ1) or “less likely” (θ2) groups; 95% HDI refers to the 95 percent highest density interval.
b. “Est. diff.” is the estimated difference in callback proportions between employers in the “more likely” and “less likely” groups, or the difference in medians of the posterior distributions for “more likely” and “less likely” groups (θ1–θ2).
d. “Observed” frequency distributions (2A–2D) calculated from Pager and Quillian’s appendix (2005:377).
Bayesian Estimated Group Differences
Column 4 of Table 2 also reports estimated callback proportions or relative frequencies (rel. freq., or the median of the posterior distribution) and 95 percent credible intervals for these estimates (95 percent highest density intervals or HDIs) derived from Bayesian reanalysis of these data. In addition, column 5 reports estimates of the difference in relative group frequencies (i.e., est. diff.) and 95 percent HDIs for these estimated group differences. These values reported in column 5 are also displayed in Figure 1, which shows full posterior probability densities for differences in callback proportions between “more willing” and “less willing” employer groups.

Figure 1. Distributions of credible effect sizes for the employer vignette/audit overlap by vignette coding condition, contrasted with counterfactual moderate and strong expected effect sizes.
Panel 2A in Table 2 shows the estimated proportion difference in callbacks between employers who report being somewhat or very likely and those who report being somewhat or very unlikely to hire ex-convicts is 0, with a 95 percent HDI ranging from −.09 to .08. Two things are particularly noteworthy about the findings reported in panel 2A. First, as expected, these estimates essentially reproduce P&Q’s published null findings, which showed a failure to reject the hypothesis that employers’ vignette responses and audit behaviors are statistically independent (proportion difference = .01; 95 percent CI [−.09, .09]; see 2005:376).
Second, the 95 percent credible interval for this estimated group difference (Table 2, panel 2A, column 5) contains both the counterfactual “moderate” (.05) and “strong” (.08) effect sizes. This is also visually apparent in panel 2A of Figure 1, as the vertical dashed lines representing “moderate” (a) and “strong” (b) proportion differences both fall within the solid horizontal line representing the 95 percent credible interval. Hence, although these data cannot rule out the possibility of absolute statistical independence between employers’ vignette reports and audit callbacks, neither can they rule out the possibility of a strong vignette–audit association. Put differently, using conventional confidence thresholds with these data, we cannot reject the possibility of a vignette–audit association equivalent to an OR of nearly four (OR = 3.75; see panel 1A, column 2, Table 1) or comparable in magnitude to strong meta-analytic average correlations between survey-based and observational measures of behavior (i.e., Φ/Φmax = .60 vs. r = .52; see Glasman and Albarracin 2006).
In comparison, results of reanalysis using alternative coding decisions for collapsing employer vignette responses into more likely and less likely employer groups show larger nonzero estimates of the group differences in callback proportions (ranging from .03 to .06; see column 5 of panels 2B–D in Table 2). Similarly, these results show that moderate and strong counterfactual effect sizes also fall well within the 95 percent credible intervals of the estimated group differences for every coding condition (see panels 2B–D in Figure 1).
Estimated Posterior Probabilities
Thus far, reported results should resemble the familiar logic underlying classic null- or point-equivalence hypothesis tests. In contrast, the probabilities reported in column 6 of Table 2 represent a salient departure from this frequentist paradigm, and one that illustrates a key benefit of adopting even a simple Bayesian modeling approach specifying default priors to conduct routine statistical tests. These estimates provide a direct, if tentative, answer to the central question motivating P&Q’s study—whether there is a mismatch between “what employers say versus what they do.” Calculation of these probabilities is made possible by the estimation of the posterior density from the (default) prior distribution and the data-driven likelihood.
Column 6 in panel 2A, which reproduces P&Q’s coding, reports a probability estimate of .52. Consistent with their oft-cited conclusions, this estimate suggests that betting on employers doing what they say is associated with odds that are slightly better than a coin flip. That is, there is an estimated 52 percent chance that employers who say they are somewhat or very likely to hire a hypothetical applicant with a criminal record in the vignette also call back applicants with criminal records in the audit study more often than employers who report they are somewhat or very unlikely to do so. This contrast is visually apparent in panel 2A of Figure 1, where 52 percent of the area under the curve (blue shaded region) falls to the right of zero.
In contrast, when employing alternative coding decisions for employers’ reported willingness to hire applicants with criminal records (see Table 2, panels 2B–D, column 6), the probability that the relative frequency of callbacks is greater among the more likely group compared to the less likely group is higher, ranging from .76 to .81. Likewise, in panels 2B–D of Figure 1, between 76 percent and 81 percent of the area under the posterior density curves falls to the right of zero. In other words, when employer “willingness to hire” is coded differently than in the original study, the probability is greater than 75 percent that employers who say they are more willing to hire applicants with criminal records indeed do call back such candidates at a greater rate than their less willing counterparts.
Finally, for each coding condition, similar posterior probabilities are reported for the counterfactual expected effect thresholds to the right of Figure 1. These values indicate how likely it is, in probability terms, that there is a “moderate” or “strong” association between employer vignette reports and audit behaviors, given these data and no prior substantive knowledge (i.e., flat priors). Probabilities of a “moderate” association vary between .13 for P&Q’s coding condition (panel 2A) and .45 for the very likely/very unlikely contrast (panel 2D). Probabilities of a “strong” association vary between .03 (panel 2A: P&Q’s coding condition) and .16 (panel 2B: very likely/at most somewhat likely condition). While these values offer additional information about the observed vignette–audit associations, we offer them primarily for transparency purposes. We caution against overinterpreting these values, given the anti-conservative or inflated nature of the counterfactual estimates and the unreliability of observed estimates produced by an underpowered study.
Discussion
This article introduces two methods for assessing and addressing issues emerging from the standard application of NHST analyses in potentially underpowered research designs. First, PHDA is presented as a means for retrospectively assessing whether an NHST analysis that generates null findings otherwise might have had sufficient statistical power to detect a range of plausible and potentially true nonnull effects. Second, Bayesian analysis with default priors is described as more informative than NHST for detecting signals in underpowered data. We illustrate the utility of these methods by applying them to P&Q’s (2005) influential study, which documented a mismatch between what employers say and what they do when considering hiring applicants with criminal records. Thus, in addition to showcasing PHDA and Bayesian analysis, a primary goal was to assess the validity and robustness of the frequently cited conclusion that what people say they will do in a survey is incongruent with what they do in a real-world audit.
PHDA, Statistical Power, and Null Findings
First, we conducted a PHDA to identify plausible counterfactual effect sizes and estimate statistical power for the original study. Results from the PHDA (Table 1) suggest that P&Q’s initial study likely lacked sufficient statistical power to detect moderate or even strong associations between employers’ attitudes and behaviors. This lack of power reflects both a modest overlapping audit and survey sample size (N = 156) and the relative rarity of successful callbacks (N callbacks = 11). In short, the signal-to-noise ratio is weak, making sign and magnitude errors likely in estimating the relationship between what employers say and what they do. (For a detailed discussion of power problems in audit designs, see Vuolo et al. 2016.)
Second, we reanalyzed P&Q’s (2005) cross-tabular frequency data using NHST methods across four different coding conditions. These robustness checks confirmed P&Q’s key finding—in all four measurement conditions, NHST tests failed to reject the null hypothesis of no difference in audit callbacks across employer groups.
Although our NHST reanalysis reproduces P&Q’s null result across four measurement conditions, we cannot echo their substantive interpretations of these results. For instance, P&Q (2005:373, 374) conclude that the “low correlation between expressed and observed hiring outcomes presents an epistemological worry” and that “these findings suggest that sociologists may need to reevaluate what is learned from studies that use vignettes of hypothetical situations.” Yet, strictly speaking, these findings only indicate that we are unable to reject the null hypothesis of no relationship between employers’ vignette responses and their audit behaviors—they do not provide evidence in favor of the null hypothesis.
Rather, failures to reject the null routinely occur even when a true nonnull association exists. After all, a typical social scientific research study that specifies an α = .05 error threshold and power = .80 tolerates a 5 percent false positive error rate (α-errors) and a 20 percent false negative error rate (β-errors; power = 1 − β). Thus, the typical design is four times as likely to incorrectly fail to reject the null when a true nonnull effect exists as it is to falsely reject a true null hypothesis. False negatives are even more likely to occur when a study is underpowered. For example, if α = .05 and true power = .50, then false negatives are 10 times as likely as false positives (cf. “threshold asymmetry”; Burt et al. 2017:474). Likewise, our PHDA results suggest P&Q’s research design is particularly susceptible to false negatives: it lacks sufficient power to reliably detect a strong, or in some conditions even the maximum possible, association between vignette reports and audit behaviors.
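The arithmetic behind these ratios is simply β/α, with β = 1 − power. A minimal sketch follows; the function name `error_ratio` is ours and purely illustrative.

```python
def error_ratio(alpha: float, power: float) -> float:
    """Ratio of the false negative rate (beta = 1 - power) to the false positive rate (alpha)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0 <= power <= 1:
        raise ValueError("power must lie between 0 and 1")
    beta = 1.0 - power
    return beta / alpha


# The two scenarios discussed in the text.
for power in (0.80, 0.50):
    ratio = error_ratio(alpha=0.05, power=power)
    print(f"alpha = .05, power = {power:.2f}: beta/alpha = {ratio:.1f}")
```

Note that both rates are conditional: α applies when the null is true and β applies when a nonnull effect exists, so the ratio compares tolerated error rates rather than the overall frequency of each error in a literature.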
Our PHDA application has broad implications for recent debates about the “reproducibility crisis” in science. One concern emerging from these debates is the scientific community’s apparent aversion to publishing null findings. Ferguson and Heene (2012:558) referred to this aversion as “…arguably one of the most pernicious and unscientific aspects of modern social science.” In response, researchers and editors increasingly are encouraged to publish null findings to reduce reporting biases, improve accuracy in meta-analytic results, and ultimately encourage theoretical falsification (Ferguson and Heene 2012).
From this perspective, P&Q’s (2005) study offers a prominent and widely cited counterexample of a null finding published in a high-impact journal (see also Greenland 2011; Vadillo et al. 2016). If researchers and editors begin heeding calls to publish null findings in response to concerns about scientific reproducibility, then studies like P&Q’s might become increasingly common across the social sciences. However, if false positives pose a problem for science (Ioannidis 2005; Simmons et al. 2011), then false negatives pose an equal or greater threat to scientific inquiry. As Fiedler, Kutzner, and Krueger note (2012:663), “[e]very α error [false positive] on a focal hypothesis entails β errors [false negatives] on alternative hypotheses, just as for every falsely convicted person one (or more) true criminals go free.” Also, false negatives are less likely than false positives to stimulate subsequent research—and thus less likely to be corrected by failed replication attempts. By lingering around longer, false negatives can perpetually contaminate scientific reasoning. Moreover, scientific advances often emerge from researchers overcoming false negatives, such as through precise specification and crucial tests of innovative alternative hypotheses. Conversely, though comparatively more common, theoretical innovations “hardly ever arise from abandoning false positives” (Fiedler et al. 2012:666).
Thus, though we support reporting and publishing null findings, our discussions and empirical example should highlight the importance of critically assessing and cautiously interpreting such findings; PHDA can help achieve this goal. With that said, Gelman and Carlin (2014:643) caution against improperly using retrospective power analysis “as an alibi to explain away nonsignificant findings” (cf. Greenland 2012; Hoenig and Heisey 2001). Instead, these authors suggest their version of PHDA is particularly useful for assessing whether strong (statistically significant) nonnull effect estimates might be potentially biased due to an underpowered design (p. 642). These cautions notwithstanding, our application to P&Q’s study illustrates how a power-focused PHDA using counterfactual expected effects can be used to assess the appropriateness of conclusions from published null findings and determine whether a test might have been underpowered in the first place.
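For readers unfamiliar with design calculations in the spirit of Gelman and Carlin (2014), the following sketch computes power, the probability that a significant estimate has the wrong sign, and the expected exaggeration of significant estimates. It is a simplified normal-approximation illustration, not a reproduction of their code or of our PHDA, and the inputs (`true_effect`, `se`) are hypothetical.

```python
import random
from statistics import NormalDist


def retrodesign(true_effect: float, se: float, alpha: float = 0.05,
                n_sims: int = 100_000, seed: int = 1):
    """Return (power, sign error rate, exaggeration ratio) for an assumed true effect.

    Assumes the estimate is normally distributed around a positive true effect
    with standard error `se`.
    """
    if true_effect <= 0 or se <= 0:
        raise ValueError("true_effect and se must be positive")
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    p_hi = 1 - norm.cdf(z_crit - true_effect / se)   # significant, correct sign
    p_lo = norm.cdf(-z_crit - true_effect / se)      # significant, wrong sign
    power = p_hi + p_lo
    sign_error = p_lo / power

    # Simulate replications and keep only the statistically significant ones.
    rng = random.Random(seed)
    significant = [abs(est)
                   for est in (rng.gauss(true_effect, se) for _ in range(n_sims))
                   if abs(est) > z_crit * se]
    exaggeration = (sum(significant) / len(significant) / true_effect
                    if significant else float("nan"))
    return power, sign_error, exaggeration


# Hypothetical inputs: a true effect half the size of its standard error.
power, sign_error, exaggeration = retrodesign(true_effect=0.5, se=1.0)
print(f"power = {power:.2f}, sign error = {sign_error:.2f}, "
      f"exaggeration = {exaggeration:.1f}")
```

When power is low, the exaggeration ratio is large because only estimates far from zero clear the significance threshold.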
Weak Signals and Bayesian “Best” Bets
Considering the inherently uninformative nature of the null findings from NHST analysis, we presented Bayesian analysis with default priors as a useful method for detecting signals in underpowered designs. In our application to P&Q’s data, we showed how a Bayesian approach might uncover additional information about the relationship between what employers say and what they do, even though the underlying signal in the likelihood is weak. The Bayesian reanalysis also assessed the robustness of conclusions across four measurement conditions.
Overall, results of the Bayesian reanalysis suggest that substantive conclusions about whether employers “walk the talk” may depend on how survey vignette responses are coded. Using P&Q’s original coding (i.e., collapsing the very likely and somewhat likely survey responses), our analysis reproduces their primary conclusions and calls into question whether employers act in accordance with their vignette-reported intentions. Even here, though, a caveat is necessary. In the original coding condition, a strong association remains plausible by conventional statistical standards, if perhaps unlikely given Bayesian posterior probabilities.
In contrast, if employer willingness is measured differently, then the findings present a challenge to the central conclusion emerging from P&Q’s study. Results using alternative coding decisions indicate there is between a 76 percent and 81 percent chance that employers who say they are “more willing” to hire indeed call back candidates with criminal records at a greater rate than their less willing counterparts. Hence, even in an underpowered analysis, in which conservative assumptions favoring the original study’s conclusions are applied or prior knowledge about attitude–behavior correspondence is ignored altogether, results in three of four conditions suggest it is likely that employers do “walk the talk,” though to what degree remains unknown.
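A posterior probability of this kind can be illustrated with a simple conjugate model. The sketch below is a stand-in, not the model estimated in this study: it assumes independent flat Beta(1, 1) priors on each group's callback rate, and the cell counts are invented for illustration (they are not P&Q's published frequencies).

```python
import random


def prob_rate_higher(calls_a: int, n_a: int, calls_b: int, n_b: int,
                     threshold: float = 0.0, n_draws: int = 50_000,
                     seed: int = 1) -> float:
    """Monte Carlo estimate of P(rate_a - rate_b > threshold | data).

    With a flat Beta(1, 1) prior and binomial data, each group's posterior
    is Beta(1 + callbacks, 1 + non-callbacks).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        rate_a = rng.betavariate(1 + calls_a, 1 + n_a - calls_a)
        rate_b = rng.betavariate(1 + calls_b, 1 + n_b - calls_b)
        if rate_a - rate_b > threshold:
            hits += 1
    return hits / n_draws


# Hypothetical counts: "more willing" employers (a) vs. "less willing" employers (b).
p_positive = prob_rate_higher(calls_a=5, n_a=50, calls_b=6, n_b=106)
p_sizeable = prob_rate_higher(calls_a=5, n_a=50, calls_b=6, n_b=106, threshold=0.05)
print(f"P(difference > 0)   = {p_positive:.2f}")
print(f"P(difference > .05) = {p_sizeable:.2f}")
```

Raising `threshold` turns the same posterior draws into probabilities that the difference exceeds a counterfactual expected effect rather than merely zero.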
The Bayesian models in this study specify default or minimalist priors, which place a greater burden than informative priors on the observed data to separate signal from noise. As explained previously, informative priors based on average correlations from meta-analyses of attitude–behavior correspondence would have allowed the typically large average prior correlations (e.g., r = .52; Glasman and Albarracin 2006) to dominate the posterior distribution, thus tipping the scales toward concluding that employers’ attitudes and actions substantially overlap. Yet this study raises serious questions about the amount of detectable signal present in these data, given the relatively small sample containing very few “success” events (i.e., only 11 total callbacks of applicants with criminal records were observed out of 156 observations).
Given these concerns, we caution against making strong claims from any analyses of these data. Our “best” bet is that employers who say they are very likely to hire applicants with criminal records probably do call back such applicants more frequently than their counterparts. Since P&Q’s (2005:377) published data alone do not provide strong grounds for making a “good” or “safe” bet of any kind, our bet is informed by our Bayesian reanalysis of these data and our prior knowledge of research on attitude–behavior correspondence (cf. Kim and Hunter 1993; Glasman and Albarracin 2006; Schuman and Johnson 1976; Vaisey 2014). Again, had we relied upon this existing research to specify an informative prior distribution or set of distributions for our Bayesian analysis, the strong priors would have dominated the weak signal in the likelihood, and this is the conclusion we would have reached.
On the broader issues of reproducibility and validity of scientific inquiry, we note that even our simple Bayesian application to underpowered data generated more information than do many complex NHST applications to large data sets. This is because common NHST applications in sociology merely involve a simple dichotomous decision: to reject or not to reject. Such point-null tests provide very little information about a single alternative hypothesis and a (generally untenable) null hypothesis: a difference or effect estimate is either “significant” (and worth reporting) if the p value is less than a specified threshold (e.g., α = .05), or it is not. This “yes/no” logic both parallels and reinforces simplistic “theoretical” debates in sociological literatures, which are often framed around whether two groups differ or whether a construct, variable, or process has an “effect” on an outcome of interest. Such simplistic debates too easily devolve into ideological shouting matches that fail to meaningfully advance substantive knowledge or theoretical precision about important social scientific questions.
In contrast, our Bayesian analysis easily generated an entire posterior probability density of effect estimates that can be used to compare the tenability of various alternative hypotheses. In our example, we showed how Bayesian analysis permits estimation of the probability, given priors and the likelihood, that the association between employers’ attitudes and behaviors is greater than zero; using the same analysis, we were also able to estimate the probability that the association is at least moderate or strong in size. While data limitations preclude precise estimation in this case, as estimation becomes more precise and credible intervals shrink in sufficiently powered analyses, even more informative comparisons are possible. For instance, one can define a “region of practical equivalence” or ROPE (Kruschke 2011:302) around zero by identifying a range of effect sizes that are deemed small enough to be substantively equivalent to zero. Then, if the credible interval around an effect estimate falls entirely outside the ROPE, one can plausibly reject the hypothesis that the effect is practically negligible in magnitude. Alternatively, if the credible interval is entirely inside the ROPE, one can essentially confirm the null—technically an impossibility in NHST—by concluding that the effect in question is likely zero or practically negligible (see Kruschke 2011).
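The ROPE decision rule described above can be written compactly. This is a minimal sketch under simplifying assumptions: it uses an equal-tailed credible interval computed from posterior draws, whereas Kruschke works with highest density intervals, and the ROPE limits in the example are hypothetical.

```python
from typing import Sequence, Tuple


def equal_tailed_interval(draws: Sequence[float], mass: float = 0.95) -> Tuple[float, float]:
    """Approximate equal-tailed credible interval from posterior draws."""
    ordered = sorted(draws)
    tail = (1 - mass) / 2
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = int((1 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


def rope_decision(ci_low: float, ci_high: float,
                  rope_low: float, rope_high: float) -> str:
    """Compare a credible interval with a region of practical equivalence."""
    if ci_low > ci_high or rope_low > rope_high:
        raise ValueError("interval limits must be ordered low to high")
    if ci_high < rope_low or ci_low > rope_high:
        # Interval lies entirely outside the ROPE.
        return "reject practical equivalence"
    if rope_low <= ci_low and ci_high <= rope_high:
        # Interval lies entirely inside the ROPE.
        return "accept practical equivalence"
    return "undecided"


# Hypothetical ROPE of +/- .10 around zero and three illustrative intervals.
for interval in [(0.20, 0.50), (-0.05, 0.05), (-0.05, 0.30)]:
    print(interval, "->", rope_decision(*interval, rope_low=-0.10, rope_high=0.10))
```

With wide intervals, as in underpowered data, the rule will usually return "undecided," which is itself an honest summary of what the data can support.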
Overall, Bayesian analysis is specifically designed to quantify the precision (or credibility) around our estimates. Reporting and interpreting CIs from a classical frequentist analysis—as P&Q (2005) do in parts of their study—similarly shifts emphasis from significant/nonsignificant hypothesis test decisions to estimation and precision (Cumming 2014). While such practices arguably represent improvements over common NHST procedures, results can also diverge sharply from Bayesian posterior distributions and generate misleading interpretations, particularly with small samples (cf. Gelman et al. 2013:91-95; Kruschke and Liddell 2018b).
Moreover, when combined with informative priors or counterfactual expected effects as presented here, Bayesian analysis also requires researchers to be clear and precise about their subjective evaluation of the state of evidence regarding a given phenomenon. Hence, the methods advocated for and demonstrated here can help “nudge” sociologists to more fruitfully advance substantive and theoretical debates in the face of the uncertainty and imprecision that characterize many of our results. First, these methods might help avoid misinterpretations and overreactions to uncertainty stemming from underpowered studies, such as those frequently exemplified in pessimistic interpretations of P&Q’s study (e.g., Jerolmack and Kahn 2014). Second, these methods might generally improve our ability to summarize the uncertainty inherent in forgoing the laboratory and relying on noisy data from people living in an uncertain, changing, and imprecise world. Third, such methods might help us avoid “arrogance” and other problems caused by the failure to recognize or be transparent about what we do and do not know when engaged in theoretical and policy-oriented debates (Tittle 2004). Finally, a focal shift away from NHST and toward positing effect sizes, quantifying uncertainty, and contrasting alternative hypotheses might encourage theoretical movements toward precision and “strong inference.” Hence, in advocating for these alternative methods, we echo Fiedler and colleagues’ (2012:667) conclusions: “The growth of science depends not so much on technical procedures of significance testing, but on clearly articulated theories and upfront debates leading to crucial tests of alternative hypotheses. Real progress can only be attained when clearly spelled-out theories enable and force researchers to predict what empirical results a theory excludes and what evidence might falsify a given theory or, preferably, allow for clear-cut decisions between two or more competing theories.”
Conclusion
The present study introduces PHDA and Bayesian methods as tools that can yield valuable information beyond the results of standard NHST approaches. We illustrate how these methods are especially useful for improving statistical inferences and minimizing errors in conclusions derived from underpowered designs. In our example, we show that pessimistic conclusions stemming from P&Q’s (2005) published null finding about the potential invalidity of survey vignettes, and of surveys more generally, are premature. Finally, we argue that adoption of these analytical tools might help push the field of sociology toward more sophisticated theoretical and substantive debates, which we view as necessary for meaningful scientific progress given the complexity of topics in our field and the impact that many sociologists want to have in the social world.
Acknowledgments
The authors would like to thank Christopher Winship, Jukka Savolainen, Christine Mair, Charles Tittle, Matt VanEseltine, and several anonymous reviewers for comments on previous versions of this manuscript. Also, we learned about the passing of Devah Pager while this article was under review. Her research was ahead of the curve in its thoroughness, transparency, and replicability, and she has inspired us to strive to live up to her scholarly example.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
