Abstract
The internal validity of observational studies is often subject to debate. In this study, we define the counterfactuals as the unobserved sample and intend to quantify its relationship with null hypothesis statistical testing (NHST). We propose the probability of a robust inference for internal validity, that is, the PIV, as a robustness index of causal inference. Formally, the PIV is the probability of rejecting the null hypothesis again based on both the observed sample and the counterfactuals, provided the same null hypothesis has already been rejected based on the observed sample. Under either a frequentist or a Bayesian framework, one can bound the PIV of an inference based on one’s bounded belief about the counterfactuals, which is often needed when the unconfoundedness assumption is dubious. The PIV is equivalent to statistical power when the NHST is thought to be based on both the observed sample and the counterfactuals. We summarize the process of evaluating internal validity with the PIV into a six-step procedure and illustrate it with an empirical example.
Causal inferences are often made based on observational studies, which allow researchers to collect relatively large amounts of data with low cost per research question, compared to randomized experiments (Rosenbaum 2002; Schneider et al. 2007; Shadish, Cook, and Campbell 2002). Internal validity, which refers to whether one can make causal inference between two variables given they are correlated, is frequently challenged and difficult to assess for observational studies (Imai, King, and Stuart 2008; Imbens and Rubin 2015; Murnane and Willett 2011; Rosenbaum 2002, 2010; Rosenbaum and Rubin 1983a; Shadish et al. 2002). To characterize concerns about internal validity in observational studies, we adopt the concept of potential outcome, which refers to the outcome of every subject under every possible treatment (Holland 1986; Rubin 2007, 2008). A fundamental issue is that a subject can only choose one treatment at a time and thus only one potential outcome is observable. This renders all other potential outcomes missing (Imbens and Rubin 2015; Rubin 2005). Essentially, causal inference is treated as a missing data problem where the missing outcomes are assumed to be missing at random (MAR) conditional on a set of covariates, an assumption known as “unconfoundedness” (Imbens 2004; Rosenbaum and Rubin 1983b). Given the difficulty of justifying the unconfoundedness assumption, one may suspect the missing potential outcomes (i.e., counterfactual outcomes) are not MAR conditional on controlled covariates (Heckman 2005; Rosenbaum 1987; Rosenbaum and Rubin 1983a). This implies a missing confounder may exist and consequently the missing potential outcomes may not be comparable to the observed outcomes.
It is noteworthy that observational studies only approximate the missing outcomes based on this assumption; however, if important variables are omitted, such approximation would be misleading. The robustness of a causal inference is defined in this context as whether a causal relationship between two variables can still hold when the unconfoundedness assumption fails. The robustness of a causal inference is evaluated based on one’s belief about counterfactual outcomes or missing confounders in order to make a decision about whether this inference is trustworthy (Frank 2000; Frank et al. 2013). We leverage this logic to quantify the robustness of a causal inference based on one’s belief about the mean counterfactual outcomes for the treated subjects and the controlled subjects. To do this, we first define counterfactual outcomes as the unobserved sample and incorporate this unobserved sample into the observed sample to form the ideal sample, which, as indicated by its name, is ideal for making a causal inference (Frank et al. 2013; Rubin 2004, 2005; Sobel 1996). We focus on the mean counterfactual outcomes (rather than individual values of counterfactual outcomes) because they are sufficient statistics for causal inference in a simple context. We further define the probability of a robust inference for internal validity (henceforth abbreviated as the PIV) based on the ideal sample as the robustness index of internal validity. Our analytical procedure aims to bound the PIV of an inference based on one’s belief and to assess the robustness of a causal inference based on such bound(s). We apply our approach to Hong and Raudenbush (2005), which estimated a negative effect of kindergarten retention on reading achievement. Although Hong and Raudenbush analyzed a nationally representative sample, mitigating concerns about external validity, the treatments (i.e., retained in kindergarten vs. promoted to the first grade) were not randomly assigned in this observational study, raising potential concerns about internal validity (Allen et al. 2009; Frank et al. 2013; Hong 2010; Schafer and Kang 2008).
A Survey of Similar Approaches
Sensitivity Analysis
Sensitivity analysis (Rosenbaum 1986, 1987, 1991, 2002, 2010; Rosenbaum and Rubin 1983a) addresses the influence of a missing confounder on the estimates and inference for regression and nonparametric tests, and more importantly, it connects the violation of the unconfoundedness assumption to the violation of random assignment in matched pairs. Therefore, it informs the internal validity of a matching design. Other literature on sensitivity analysis has a similar orientation toward missing confounders (Copas and Li 1997; Hosman, Hansen, and Holland 2010; Lin, Psaty, and Kronmal 1998; Masten and Poirier 2018; Robins, Rotnitzky, and Scharfstein 2000; VanderWeele 2008). The PIV shares with sensitivity analysis the objective of checking the sensitivity of results to potential violation of the unconfoundedness assumption, but the PIV is not limited to a single type of design (like matching) or estimation (like regression). In fact, the PIV can be employed in any design that is deemed appropriate for observational studies.
Bayesian Sensitivity Analysis
Bayesian sensitivity analysis (BSA; McCandless and Gustafson 2017; McCandless, Gustafson, and Levy 2007; McCandless et al. 2012) carefully parameterizes the models for explaining the outcome and the unmeasured confounder, so that it can identify the key parameters of the confounding effect and examine their impacts on the estimate of the treatment effect under a Bayesian framework. BSA has two main advantages: First, the data augmentation in Bayesian modeling allows one to build a model for the unobserved confounder and repeatedly draw random samples of it. As a result, one would get expected distributions of the confounding and treatment effect parameters. Second, BSA offers modeling flexibility through prior specification. Compared to BSA, the analysis for the PIV is much easier to implement and interpret, as BSA is built on complicated Markov chain Monte Carlo algorithms.
The Robustness Indices of Causal Inferences
The robustness indices of causal inferences (Frank 2000; Frank et al. 2013) quantify the strength of internal validity in terms of the impact of an unmeasured confounding variable or the proportion of observed cases that can be replaced by null cases before an inference is invalidated. The PIV is inherently connected to both papers as it starts with the decision rules and the missing data perspective shared by Frank et al. (2013) and relies on the relationship between the estimate of the average treatment effect and null hypothesis statistical testing (NHST), which has been studied by Frank (2000). The PIV is different from the robustness indices because it requires a bounded belief about counterfactual outcomes (or a missing confounder), and it is a probabilistic index which is shown to be equivalent to statistical power.
Manski’s Bounds of Treatment Effect
Bounding the treatment effect was proposed in acknowledgment of the nonidentification of the average treatment effect due to counterfactual outcomes (Manski 1990, 1995; Manski and Nagin 1998). Different bounds of the treatment effect can be obtained by imposing different assumptions on the counterfactuals, and the bounds tighten as stronger assumption(s) are made. Both the PIV and the bounds of treatment effect proposed by Manski consider situations in which the unconfoundedness assumption is implausible, so that one has to form a belief about counterfactual outcomes. Unlike the PIV, Manski’s bounds are not built on NHST and the parametric (normality) assumption. Rather, the bounds offer insights about the worth of a causal inference through exploring loss-based alternatives rooted in the context of program evaluation. Furthermore, Manski’s bounds leverage nonlinear relationships to determine constraints on parameter values, whereas the PIV is built on comparison of means and quantifies the likelihood that an inference would hold, assuming normality.
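To make the idea of bounding under weak assumptions concrete, the following minimal sketch computes the widely known worst-case (no-assumption) bounds on an average treatment effect when the outcome is only assumed to lie in a known range. The function name, argument names, and all numbers are hypothetical and chosen for illustration; they are not taken from the studies cited above, and the sketch is not a reproduction of any specific analysis in Manski's work.

```python
def no_assumption_bounds(p_treated, mean_treated, mean_control, y_min, y_max):
    """Worst-case bounds on E[Y(1)] - E[Y(0)] for an outcome bounded in [y_min, y_max].

    p_treated    : share of subjects who received the treatment
    mean_treated : observed mean outcome among treated subjects
    mean_control : observed mean outcome among control subjects
    """
    p = p_treated
    # E[Y(1)]: observed for the treated, unknown (anywhere in the range) for controls.
    y1_low = p * mean_treated + (1 - p) * y_min
    y1_high = p * mean_treated + (1 - p) * y_max
    # E[Y(0)]: observed for the controls, unknown for the treated.
    y0_low = (1 - p) * mean_control + p * y_min
    y0_high = (1 - p) * mean_control + p * y_max
    return y1_low - y0_high, y1_high - y0_low


# Hypothetical inputs: 40% treated, outcome scale from 0 to 100.
lower, upper = no_assumption_bounds(0.4, 40.0, 55.0, 0.0, 100.0)
print(lower, upper)
```

Without further assumptions the interval always has width `y_max - y_min`, which is why stronger assumptions about the counterfactuals are needed to tighten it.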
Replication Probability
Various replication probabilities have been proposed for two main reasons: First, they aim to safeguard readers from being misguided by, and from misinterpreting, p values. Second, they are used to emphasize that true scientific significance is about replicability rather than statistical significance (Boos and Stefanski 2011; Greenwald et al. 1996; Killeen 2005; Posavac 2002; Shao and Chow 2002). The PIV is in fact the probability of replicating a significant result in an observational study, and it is more akin to
Counterfactual Outcomes as the Unobserved Sample
Research Setting
This article targets observational studies with two groups, that is, the treatment group and the control group. Furthermore, we only consider observational studies with representative samples so that we can focus on internal validity. This article focuses on the simple group-mean-difference estimator (referred to as the simple estimator henceforth) of an average treatment effect, which computes the difference between the adjusted mean treated outcome and the adjusted mean control outcome. The adjusted means can be calculated based on propensity score matching or stratification and are regarded as valid estimators of the true means of the treated outcome and the control outcome when the unconfoundedness assumption holds.
Definitions
Example: The unobserved sample of Hong and Raudenbush (2005) is the collection of counterfactual reading scores of sampled students in their study. Specifically, this unobserved sample can be decomposed into the unobserved control sample which is the collection of reading scores of retained students had they all been promoted to first grade and the unobserved treated sample which is the collection of reading scores of promoted students had they all been retained in kindergarten.
Figure 1 illustrates the conceptualization of the unobserved sample in Hong and Raudenbush (2005) for the simple estimator. The observed outcome

Figure 1. The unobserved sample in Hong and Raudenbush (2005) for the simple estimator.
Finally, we define the ideal sample as follows:
Drawing on the definitions above, we argue that it is the unobserved sample that induces the bias which undermines internal validity. The unobserved sample can be perceived as the gap between the observed sample and the ideal sample needed for ensuring internal validity. The unconfoundedness assumption implies the unobserved sample is ignorable based on a set of covariates, that is, the unobserved sample will essentially be the same as the observed sample conditional on the set of covariates. Given that this assumption is frequently challenged, our goal is to quantify the robustness of the inference by discovering how the unobserved sample affects the NHST.
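The relationship among the observed, unobserved, and ideal samples can be illustrated with a toy data set. The sketch below is only an illustration under stated assumptions: the five subjects and all outcome values are hypothetical, the counterfactual values are simply assumed (in practice they are never observed), and plain unadjusted means are used in place of the adjusted means discussed in the research setting.

```python
from statistics import mean

# Each hypothetical subject: (treated?, observed outcome, assumed counterfactual outcome).
subjects = [
    (True, 40.0, 47.0),
    (True, 42.0, 50.0),
    (False, 55.0, 49.0),
    (False, 57.0, 50.0),
    (False, 53.0, 45.0),
]


def ideal_sample(rows):
    """Combine observed and counterfactual outcomes so every subject has both."""
    treated_outcomes, control_outcomes = [], []
    for treated, observed, counterfactual in rows:
        if treated:
            treated_outcomes.append(observed)        # observed treated outcome
            control_outcomes.append(counterfactual)  # unobserved control outcome
        else:
            control_outcomes.append(observed)        # observed control outcome
            treated_outcomes.append(counterfactual)  # unobserved treated outcome
    return treated_outcomes, control_outcomes


# Simple estimator on the observed sample only.
observed_diff = (mean(o for t, o, _ in subjects if t)
                 - mean(o for t, o, _ in subjects if not t))

# Same estimator on the ideal sample (observed plus unobserved outcomes).
y_treated, y_control = ideal_sample(subjects)
ideal_diff = mean(y_treated) - mean(y_control)

print(observed_diff, ideal_diff)
```

In this toy case the two estimates differ because the assumed counterfactual means are not the same as the observed means, which is exactly the gap between the observed and ideal samples described above.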
Sample Statistics and Notation
This section introduces the notations of the sample statistics defined based on the observed, unobserved, and ideal samples. In general, the observed sample statistics are all fixed and known quantities since the observed sample is held fixed when we consider using the PIV. The unobserved and ideal sample statistics, on the other hand, are unknown quantities of main interest. In this context, The observed sample statistics (known and fixed): The unobserved sample statistics (focused unknown): The ideal sample statistics (unknown due to the unobserved sample):
We are interested in the distribution of
The PIV
The PIV is rooted in NHST context. To conduct a causal inference, the null hypothesis H0:
Frank et al. (2013) provided the following decision rules on whether a causal inference will be invalidated due to limited internal validity: Given
Likewise, the PIV is defined as follows for a significantly negative
It’s noteworthy that the PIV in equations (1) and (2) are actually the simplified version of
For a significantly positive
For a significantly negative
Note here that equations (3) and (4) will only be approximately true for studies with small sample sizes and typically C is chosen based on the level of significance. For example, C would be 1.96 if
The Relationship Between the PIV and the Mean Counterfactual Outcomes
If the treated outcome and the control outcome are independent and roughly normally distributed, the distribution of
where,
Here we need to conceptualize
It is worth noting that the results in equations (5) and (6) can be derived in either a frequentist or a Bayesian fashion (see derivations in the Online Appendix, which can be found at http://smr.sagepub.com/supplemental/), and therefore, they have both frequentist and Bayesian interpretations (Li 2018). In the frequentist world, the unobserved sample is part of the ideal sample so that
Results equations (5) and (6) show that the distribution of
For a significant positive
For a significant negative
Note that the decision threshold
As a result, the probit function in equation (7) becomes
Likewise, the probit function in equation (8) becomes
Drawing on the results above, one can bound the PIV based on a belief about
Example: The Effect of Kindergarten Retention on Reading Achievement
Overview
Alexander, Entwisle, and Dauber (2003) established kindergarten retention as a widespread phenomenon in the United States and with profound impacts for both promoted children and retained children, and therefore, it has long been a controversial issue. To address such controversy, Hong and Raudenbush (2005) conducted an analysis that combined a multilevel model controlling for logits of propensity scores and propensity score strata to evaluate the effects of kindergarten retention policy and actual kindergarten retention on students’ academic achievement. They used a nationally representative sample that contained about 7,639 students and 1,070 schools. Drawing on this design, Hong and Raudenbush (2005) estimated the effect of kindergarten retention on students’ reading achievement as −9.01 with standard error of 0.68, which amounted to a significant effect whose size is about 0.67. In light of this considerable effect, Hong and Raudenbush (2005) concluded that “children who were retained would have learned more had they been promoted” and therefore “kindergarten retention treatment leaves most retainees even further behind” [page 220].
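The significance of the reported estimate can be checked directly from the two figures quoted above (an estimate of −9.01 with a standard error of 0.68). The following minimal sketch assumes a normal approximation for the test statistic; the variable names are illustrative and the calculation is not taken from Hong and Raudenbush (2005).

```python
from statistics import NormalDist

estimate = -9.01        # reported effect of kindergarten retention on reading achievement
standard_error = 0.68   # reported standard error

# Ratio of the estimate to its standard error (normal approximation assumed).
z_ratio = estimate / standard_error

# Two-sided p value under the standard normal reference distribution.
p_value = 2 * NormalDist().cdf(-abs(z_ratio))

# Compare with the conventional critical value for a significantly negative effect.
critical_value = -1.96
significantly_negative = z_ratio < critical_value

print(round(z_ratio, 2), significantly_negative)
```

The ratio lies far below −1.96, which is consistent with the description of the effect as significantly negative.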
Nevertheless, the internal validity of Hong and Raudenbush (2005) is subject to debate because propensity score analysis is built on the assumption of unconfoundedness, which implies that all confounding variables are observed and controlled in the causal model. However, as argued by Frank et al. (2013), some confounding variables may not be fully measured and controlled, incurring selection bias in the estimate. If an omitted variable were negatively correlated with kindergarten retention and positively correlated with reading achievement, the negative effect of kindergarten retention could be biased, and thus their inference would be invalidated if such a variable were taken into account.
To address the concern about the internal validity of Hong and Raudenbush’s inference, we propose an analytical procedure that employs the PIV and its relationship with the mean counterfactual outcomes. This analytical procedure comprises six steps: (1) get the observed sample statistics, (2) choose the critical value C, (3) obtain the relationship between the PIV and the mean counterfactual outcomes, (4) state a belief about the mean counterfactual outcomes, (5) bound the PIV, and (6) draw a conclusion.
Quantifying the Robustness of the Inference of Hong and Raudenbush (2005)
Get the observed sample statistics: The required observed sample statistics are as follows:
Choose critical value C: Since Hong and Raudenbush (2005) reported the effect of kindergarten retention was significantly negative, we choose C as −1.96 which means
Obtain the relationship between the PIV and the mean counterfactual outcomes: Once the observed sample statistics and C are plugged into the probit model equation (11), the probit model for Hong and Raudenbush can be explicitly written as
State belief about the mean counterfactual outcomes: This step asks one to state and bound one’s belief about
4.1 The first belief: Given the inference of Hong and Raudenbush (2005) mostly informed the mean counterfactual reading score of the retained students (i.e.,
4.2 The second belief: First of all, we believe that the average retention effects for the promoted students and for the retained students should be both negative and the average retention effect for the retained students, which was originally estimated as −9 by Hong and Raudenbush, was overestimated. Therefore, the plausible region is defined based on the bounded belief that
Bound the PIV: For the first belief, the lower bound for the PIV is 0.77 given
Conclusion: To facilitate the decision-making process, one can use a threshold for the PIV such that an inference is deemed robust for internal validity whenever the PIV exceeds this threshold. Since the PIV is the statistical power of retesting the null hypothesis δ = 0 based on the ideal sample, one can use PIV = 0.8 as the threshold, which is often used for strong statistical power (Cohen 1988, 1992). Therefore, the two beliefs we formed in the fourth step would lead to the conclusion that Hong and Raudenbush’s inference is robust for internal validity. We caution readers that this conclusion might not hold if one has a different belief and/or a different threshold for the PIV.
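Because the PIV is interpreted here as the statistical power of retesting the null hypothesis δ = 0 in the ideal sample, the threshold rule can be sketched with a textbook power calculation for a one-sided z test. This is a minimal illustration under assumed normality, not the article's probit model: the function names are hypothetical, and the effect and standard error passed in stand for whatever values one believes hold in the ideal sample.

```python
from statistics import NormalDist


def power_negative_effect(true_effect, standard_error, critical_value=-1.96):
    """Probability that estimate / standard_error falls below critical_value
    when the estimate is normally distributed around true_effect."""
    return NormalDist().cdf(critical_value - true_effect / standard_error)


def is_robust(piv, threshold=0.8):
    """Decision rule: the inference is deemed robust when the PIV exceeds the threshold."""
    return piv > threshold


# Hypothetical ideal-sample values, chosen only for illustration.
no_effect_power = power_negative_effect(true_effect=0.0, standard_error=1.0)
strong_effect_power = power_negative_effect(true_effect=-5.0, standard_error=1.0)

print(is_robust(no_effect_power), is_robust(strong_effect_power))
```

With no true effect, the probability reduces to the one-sided tail area at the critical value, while a large negative effect relative to its standard error pushes the probability toward one.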

The contour plot of the PIV in the plausible region (
There are two key observations in Figure 2: First, in general, the PIV will be more sensitive to
By definition, the PIV is the statistical power of retesting the null hypothesis: δ = 0 versus the alternative hypothesis:

The relationship between the PIV and retesting the hypothesis in the ideal sample for Hong and Raudenbush (2005), assuming
Conclusion
Focusing on the mean counterfactual outcomes for treated and controlled subjects, we began by defining the unobserved sample as the collection of counterfactual outcomes and the ideal sample as the collection of all the potential outcomes of the observed sample. It’s worth emphasizing that the ideal sample is sufficient for securing internal validity, and based on the ideal sample the null hypothesis is thought to be tested against the alternative hypothesis. The PIV is thus defined in this scenario as the probability of rejecting the same null hypothesis again in the ideal sample given it has been rejected in the observed sample. This study recasts the assessment of internal validity as the task of bounding the PIV for an inference based on a bounded belief about the mean counterfactual outcomes.
This article makes three main contributions to the field: First, it promotes counterfactual reasoning by prompting one to conceptualize the mean counterfactual outcomes and form a bounded belief about them. Counterfactual reasoning is a necessary step of causal reasoning as it takes one to an imaginary world of what could have happened, thanks to the human strength in thinking about cause (Pearl and Mackenzie 2018). Through counterfactual reasoning, causal inference really boils down to comparing the means as one explores all potential outcomes (Imbens and Rubin 2015). The PIV informs internal validity by quantifying the likelihood that an inference would still hold under all different scenarios of counterfactual reasoning. Second, the PIV has an intuitive interpretation. It is the statistical power of retesting the hypothesis
Future work should focus on extending this model in two aspects: First, future work should revise the current model for subpopulations that are either non-normal or heterogeneous in nature, as the normality assumption is unlikely to hold in such cases. Second, building on the framework which informs how counterfactuals affect the NHST through the PIV, future work needs to delve deeper into why counterfactuals change, which may be due to missing confounders, the violation of the Stable Unit Treatment Value Assumption (SUTVA), or measurement error.
Supplemental Material
Supplementary_material for The probability of a robust inference for internal validity by Tenglong Li and Ken Frank in Sociological Methods & Research
Footnotes
Authors’ Note
Tenglong Li is now affiliated with Department of Biostatistics, Boston University.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
