Causal Inference With Two Versions of Treatment

Abstract

Causal effects are commonly defined as comparisons of the potential outcomes under treatment and control, but this definition is threatened by the possibility that either the treatment or the control condition is not well defined, existing instead in more than one version. This is often a real possibility in nonexperimental or observational studies of treatments because these treatments occur in the natural or social world without the laboratory control needed to ensure identically the same treatment or control condition occurs in every instance. We consider the simplest case: Either the treatment condition or the control condition exists in two versions that are easily recognized in the data but are of uncertain, perhaps doubtful, relevance, for example, branded Advil versus generic ibuprofen. Common practice does not address versions of treatment: Typically, the issue is either ignored or explicitly stated but assumed to be absent. Common practice is reluctant to address two versions of treatment because the obvious solution entails dividing the data into two parts with two analyses, thereby (a) reducing power to detect versions of treatment in each part, (b) creating problems of multiple inference in coordinating the two analyses, and (c) failing to report a single primary analysis that uses everyone. We propose and illustrate a new method of analysis that begins with a single primary analysis of everyone that would be correct if the two versions do not differ, adds a second analysis that would be correct were there two different effects for the two versions, controls the family-wise error rate in all assertions made by the several analyses, and yet pays no price in power to detect a constant treatment effect in the primary analysis of everyone. Our method can be applied to analyses of constant additive treatment effects on continuous outcomes. 
Unlike conventional simultaneous inferences, the new method is coordinating several analyses that are valid under different assumptions, so that one analysis would never be performed if one knew for certain that the assumptions of the other analysis are true. It is a multiple assumptions problem rather than a multiple hypotheses problem. We discuss the relative merits of the method with respect to more conventional approaches to analyzing multiple comparisons. The method is motivated and illustrated using a study of the possibility that repeated head trauma in high school football causes an increase in risk of early onset cognitive decline.

Keywords

causal effects, closed testing, full matching, intersection-union test, randomization inference, sensitivity analysis, versions of treatment

1. What Are Versions of Treatment?

Commonly, the effect on an individual caused by a treatment is defined as a comparison of the two potential outcomes that this individual would exhibit under treatment and under control (see Neyman, 1923/1990; Rubin, 1974; Welch, 1937). Implicit in this definition is the notion that the treatment and control conditions are each well defined. In particular, it is common to assume that there are “no versions of treatment or control” (see Rubin, 1986).

By definition, versions of treatment are not intended additions to a study design, but rather potential flaws in the study design. Versions of treatment or control are often associated with finding treatments that occur naturally, rather than experimentally manipulating a tightly controlled, uniform treatment. Versions of one treatment should be distinguished from the intentional study of distinct, competing treatments. When an investigator discusses versions of one treatment, she is expressing a preference for the conception that there is a single treatment but is acknowledging the possibility that her preferred conception is mistaken. Branded Advil and generic ibuprofen are versions of one treatment—possibly different, but very plausibly expected to be the same—whereas ibuprofen and aspirin are different competing treatments. The investigator and her audience prefer a primary analysis that does not distinguish versions of treatment, but both would be reassured by evidence that showed their preferred analysis does not embody a consequential error. Versions of control groups should also be distinguished from the deliberate use of two carefully selected control groups intended to reveal unmeasured biases if present (see, for instance, Rosenbaum, 1987). In particular, Campbell (1969) suggested that two control groups should be deliberately selected to systematically vary a specific unmeasured covariate in an effort to demonstrate its irrelevance; however, versions of control are unintended flaws in study design, not purposeful quasi-experimental devices.

There are two versions of either the treatment condition or the control condition if we recognize in available data either two types of treated subjects or two types of controls, but we are uncertain about, or perhaps explicitly doubt, the relevance of this visible distinction. Versions refer to a visible but perhaps unimportant distinction, not to a distinction that is hidden or latent. There are important methodological issues in recognizing treatments that inexplicably affect some people but not others; however, this is practically and mathematically a different problem (Conover & Salsberg, 1988; Rosenbaum, 2007a).

In discussing randomized clinical trials, Peto et al. (1976, pp. 590–591) wrote,

A positive result is more likely, and a null result is more informative, if the main comparison is of only 2 treatments, these being as different as possible.…[I]t is a mark of good trial design that a null result, if it occurs, will be of interest.

This advice is equally relevant for observational studies, and it is part of the reason that we prefer a conception in which there is a single treated condition and a single control condition. Despite this, an investigator may seek some reassurance that the study’s conclusions cannot be undermined by the possibility of two versions of treatment.

In that spirit, our analysis focuses on the main treatment–control comparison and subordinates the study of versions of treatment or versions of control. In particular, the main treatment–control comparison is unaffected by the exploration of versions of treatment—the usual confidence interval for a constant effect is reported—despite controlling the family-wise error rate in multiple comparisons that explore the possibility of versions of treatment with different effects. Two confidence intervals are reported, the usual interval for a constant effect and an interval designed to contain both effects if the two versions differ. If the effect is constant, then both intervals simultaneously cover that one effect with probability $\geq 1 - α$ , but if there are two versions, then the second interval covers both version effects with probability $\geq 1 - α$ . The investigator always reports both intervals, valid under different assumptions. This is an unusual type of simultaneous inference: There is essentially one question, but there are two sets of assumptions underlying the answer, so one question is answered twice, as opposed to answering several different questions. There are multiple assumptions rather than multiple hypotheses. The two intervals together permit an investigator to report the conventional confidence interval for a constant effect, without lengthening it for multiple testing, yet the investigator also provides some information about whether the study’s conclusions depend on the absence of versions of treatment. The two intervals may possibly disagree, say about whether no effect is plausible, and if they do disagree, then they demonstrate that the assumption about versions of treatment is playing an important role in the interpretation of the available data. Importantly, the method does not presume there is a single version of treatment by virtue of failing to reject the null hypothesis that the two versions are equal. 
That is, it entertains the possibility that there are versions of treatment even when the conventional interval assuming a constant treatment effect is not empty.

2. Possible Versions of Control in a Study of Football and Dementia

There is evidence that severe repeated head trauma accelerates the onset of cognitive decline or dementia (Graves et al., 1990; Mortimer et al., 1991), with specific concern about the risks faced by professional football players and boxers (Lehman et al., 2012; McKee et al., 2009). It is unclear whether there is also increased risk from playing football on a team in high school, but there have been several recommendations against tackle football in high school (Bachynski, 2016; Miles & Prasad, 2016). Does high school football accelerate the onset of cognitive decline?

A recent investigation used data from the Wisconsin Longitudinal Study, comparing cognition and mental health measured at ages 65 and 72, recorded in 2005 and 2011, of men who played football on a high school team in the mid-1950s to male controls of similar age who did not play football (Deshpande et al., 2017). Following the practice in clinical trials, and as is recommended for observational studies by Rubin (2007), the design and protocol for this study were published online after matching was completed but before outcomes were examined (Deshpande et al., 2016). The small number of people who engaged in sports other than football with high incidences of head trauma such as soccer, hockey, and wrestling were excluded from both the football and control groups. One outcome was the 0 to 10 score on a 10-item delayed word recall (DWR) test at ages 65 and 72. The DWR test was designed as an inexpensive measure of memory loss associated with dementia (see Knopman & Ryberg, 1989). In this test, a person is asked to remember a list of words that is then read to the person. Attention then shifts to another activity, and after a delay, the person is asked to recall as many words from the list as possible. The DWR score is the number of words remembered. On average, in the Wisconsin Longitudinal Study, performance on the DWR test declined by half a word from age 65 to 72. It is useful to keep that half-word, 7-year decline in mind when thinking about the magnitude of the effect of playing football.

A comparison of football players to all controls is natural and might be conducted without second thought. Among the controls, however, some played a noncollision sport like baseball or track, while others played no sports at all. An investigator might reasonably seek reassurance that this natural comparison has not oversimplified these two versions of “not playing football.” At the same time, the investigator does not want to sacrifice power to detect a constant treatment effect in the main comparison en route to obtaining this reassurance by subdividing the data into many slivers of reduced sample size and correcting for multiple comparisons. The method we propose achieves both of these objectives.

Our question concerns the effects of high school football. It is important to distinguish this question from questions about the effects of severe head trauma in general. It is at least conceivable that high school football is comparatively harmless, while severe head trauma is not, simply because severe head trauma is not common in high school football, and the benefits of exercise for all football players offset the harm of severe but rare trauma. Conversely, severe head trauma from automotive or other accidents may be difficult to prevent, but if high school football had grave consequences, then it could simply be banned, in the same way that most high schools do not have boxing teams. We ask about the effects of playing football in high school on subsequent cognitive function.

The Wisconsin Longitudinal Study describes a specific piece of the United States over a specific period of time, and caution is advised about extrapolating its conclusions to other times and places. High school football may have changed since the 1950s, and the demographic composition of Wisconsin in the 1950s is not the demographic composition of the United States. The Wisconsin Longitudinal Study is primarily a sequence of surveys, and it is impossible to use it to investigate questions not asked in those surveys. For instance, we cannot identify high school students who went on to play professional football, but we suspect they were few in number. Because many young people play high school football, the safety of high school football is an important question apart from the safety of professional football.

3. Full Matching of Football Players and Controls

We matched the 591 male football players to all 1,290 male controls who did not play football and did not play a contact sport. The match controlled for several factors that may affect later-life cognition, including each student’s IQ score in high school, his high school rank-in-class recorded as a percent, his planned years of future education, and binary indicators of whether his teachers rated him as an exceptional student and whether his teachers and parents encouraged him to pursue a college education. We also accounted for aspects of family background such as parental income and education.

The match was a “full match,” meaning that a matched set could contain one football player and one or more controls, or else one control and one or more football players. A full match is the form of an optimal stratification in the sense that people in the same stratum are as similar as possible subject to the requirement that every stratum contain at least one treated subject and one control (see Rosenbaum, 1991). Although the proof of this claim requires some attention to detail, the key idea is simple: If a matched set contained two treated subjects and two controls, it could be subdivided into two matched sets that are at least as close on covariates and are typically closer. See Hansen and Klopfer (2006) for an algorithm for optimal full matching, Hansen (2007) for software, and Hansen (2004) and Stuart and Green (2008) for applications. The match was constructed using Hansen’s optmatch package in R with the ratio of controls to treated units constrained between 1:6 and 6:1 to avoid excessively large matched sets.

In a full match, there are $I$ matched sets, $i = 1, \ldots, I$, and $n_{i}$ individuals, $j = 1, \ldots, n_{i}$, in set $i$. If individual $i j$ played on a football team in high school, write $Z_{i j} = 1$; otherwise, write $Z_{i j} = 0$. The number of football players in set $i$ is $m_{i} = \sum_{j = 1}^{n_{i}} Z_{i j}$, the total number of individuals is $N = \sum_{i = 1}^{I} n_{i}$, and the total number of football players is $M = \sum_{i = 1}^{I} m_{i}$. In a full match, $\min (m_{i}, n_{i} - m_{i}) = 1$ for every $i$.
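As a minimal sketch of this notation, the following Python fragment computes the set sizes, the treated counts, and the totals for three made-up matched sets and checks the full-match condition. The data are hypothetical and serve only to make the symbols concrete.

```python
# Toy illustration of the full-match notation; the data are made up.
# Each inner list holds the indicators Z_ij for one matched set i
# (1 = played football, 0 = control).
matched_sets = [
    [1, 0],           # a 1-1 set
    [1, 0, 0, 0],     # a 1-3 set
    [1, 1, 0],        # a 2-1 set
]

I = len(matched_sets)                    # number of matched sets
n = [len(z) for z in matched_sets]       # n_i, size of set i
m = [sum(z) for z in matched_sets]       # m_i, number treated in set i
N = sum(n)                               # total number of individuals
M = sum(m)                               # total number of football players

# Full-match condition: each set has exactly one treated subject
# or exactly one control, that is, min(m_i, n_i - m_i) = 1.
is_full_match = all(min(m_i, n_i - m_i) == 1 for m_i, n_i in zip(m, n))

print(I, N, M, is_full_match)
```

A set with two treated subjects and two controls would fail the check, which matches the remark above that such a set could be subdivided.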

To explore versions of treatment, we constructed three matched samples. Each sample used all $M = 591$ football players. The first matched sample used all controls, that is, every male who played neither football nor another contact sport. The second matched sample used only controls who did not play any sport. The third matched sample used controls who played a noncollision sport such as baseball. In each match, controls and football players belong to at most one matched set. Table 1 describes the structure of the three matched samples, giving the frequency of sets of size $(m_{i}, n_{i} - m_{i})$ , as well as the number of sets, I; the number of individuals, N; and the number of football players, M. Obviously, the samples overlap extensively because they all use all $M = 591$ football players and no controls were discarded in forming the optimal full matchings; however, the three matches differ in structure, partly because there were only $N - M = 975 - 591 = 384$ controls who played a noncollision sport in the third match. In all three matches, adequate covariate balance was achieved with nearly all standardized differences in baseline covariates between football players and controls less than 0.2. Details of very similar matches can be found in Deshpande et al. (2017).

Table 1.

Distribution of Matched Set Sizes, (m_i, n_i − m_i), in Three Full Matches

Comparison	(Treated Count)-(Control Count)								Totals
Comparison	3-1	2-1	1-1	1-2	1-3	1-4	1-5	1-6	I	N	M
Football versus control	0	0	401	32	26	14	17	101	591	1,881	591
Football versus no sport	70	6	240	29	15	10	3	72	445	1,497	591
Football versus other sport	90	43	227	3	2	3	0	0	368	975	591

Note. A 2-1 set contains two treated individuals and one control, while a 1-2 set contains one treated individual and two controls. There are I matched sets, containing a total of N individuals, and each match includes all $M = 591$ football players.
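The totals in Table 1 can be recomputed from the set-size frequencies. The short Python sketch below does this arithmetic using the counts copied from the table; the function name `totals` is introduced here for illustration only.

```python
# Set sizes as (treated count, control count), in the column order of Table 1.
sizes = [(3, 1), (2, 1), (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6)]

# Frequencies of each set size, copied from Table 1.
table1 = {
    "Football versus control":     [0, 0, 401, 32, 26, 14, 17, 101],
    "Football versus no sport":    [70, 6, 240, 29, 15, 10, 3, 72],
    "Football versus other sport": [90, 43, 227, 3, 2, 3, 0, 0],
}

def totals(freqs):
    """Return (I, N, M): number of sets, individuals, and treated subjects."""
    I = sum(freqs)
    M = sum(f * t for f, (t, c) in zip(freqs, sizes))
    N = sum(f * (t + c) for f, (t, c) in zip(freqs, sizes))
    return I, N, M

for name, freqs in table1.items():
    print(name, totals(freqs))
```

Each row reproduces the reported I and N, and each match uses all 591 football players.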

In studying the effects of a treatment, here high school football, it is typically inappropriate to adjust for events subsequent to the start of treatment, as this may introduce bias even where none existed prior to adjustment because part of the treatment effect may be removed (Rosenbaum, 1984). However, there are certain adult health outcomes that may differ between football players and controls because of disparities in unmeasured baseline health characteristics rather than an effect of playing football. This may threaten the validity of our study if these baseline health characteristics also play a role in later-life cognitive health. Comparing these health outcomes can help, at least partially, to assess the comparability of the baseline health of the comparison groups. We checked on the health status of football players and matched controls at age 65 using the Mantel–Haenszel procedure, failing to find a difference significant at the .05 level for “ever had high blood pressure,” “ever had diabetes,” and “ever had heart problems.” Football players were more likely to report that they had “ever had a stroke,” with a p value of .03 and a 95% confidence interval for the odds ratio of $[1.09, 3.21]$. Extensive comparisons of this kind are reported in Deshpande et al. (2017).

4. Review of Randomization Inference Without Versions of Treatment

If there were a single version of treatment or control, then individual $i j$ would have two potential DWR scores, $Y_{i j} (1)$ if he played football and $Y_{i j} (0)$ if he did not, where we observe only one of these, namely $Y_{i j} = Z_{i j} Y_{i j} (1) + (1 - Z_{i j}) Y_{i j} (0)$ , and the effect caused by playing football, namely $δ_{i j} = Y_{i j} (1) - Y_{i j} (0)$ , is not observed for any individual (see Neyman, 1923/1990; Rubin, 1974). Fisher’s (1935) sharp null hypothesis of no effect says $H_{0} : Y_{i j} (1) = Y_{i j} (0)$ , $i = 1, . . ., I$ , $j = 1, . . ., n_{i}$ , which we henceforth abbreviate as $H_{0} : Y_{i j} (1) = Y_{i j} (0)$ , $\forall i, j$ or as $H_{0} : δ_{i j} = 0$ , $\forall i, j$ . The treatment has an additive constant effect if there exists some constant $τ$ such that $δ_{i j} = Y_{i j} (1) - Y_{i j} (0) = τ$ , $\forall i, j$ . The hypothesis $H_{τ_{0}}$ specifies a particular numerical value $τ_{0}$ for $τ$ and asserts $H_{τ_{0}} : δ_{i j} = τ_{0}$ , $\forall i, j$ , and it is manifested in the observable distribution of $Y_{i j}$ by a within-set shift in the distribution of $Y_{i j}$ by $τ_{0}$ . If $H_{τ_{0}}$ were true, then $Y_{i j} - τ_{0} Z_{i j} = Y_{i j} (0)$ would satisfy Fisher’s hypothesis of no effect, H₀, and it is commonplace to test $H_{τ_{0}}$ by replacing $Y_{i j}$ by $Y_{i j} - τ_{0} Z_{i j}$ and testing H₀.

Until Section 8, we restrict attention to random assignment of treatments within matched sets; however, Section 8 considers sensitivity of inferences to departures from this assumption. Of course, people do not decide to play football at random, so Section 8 is closer to reality than random assignment. Fisher (1935), Pitman (1937), and Welch (1937) used the randomization distribution of the mean difference to test Fisher’s H₀, and we follow this approach with the short-tailed DWR scores, only briefly comparing the mean to a robust M-statistic. The mean is one M-statistic, but not a robust one. Because the matched sets are of unequal sizes, $(m_{i}, n_{i} - m_{i})$, we compute the treated-minus-control mean difference in DWR scores within each set $i$ and combine these differences with efficient weights based on the matched set sizes; see Rosenbaum (2007b, section 4.1) for discussion of these weights, which are implemented in the senfm function of the sensitivityfull package in R with option trim=Inf.
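The logic of testing $H_{τ_{0}}$ by randomization within matched sets can be sketched in a few lines of Python. This is a Monte Carlo illustration under stated assumptions, not the senfm implementation: the function names and toy data are hypothetical, and the weights $m_{i} (n_{i} - m_{i}) / n_{i}$ are chosen only for illustration and are not claimed to be the efficient weights of Rosenbaum (2007b).

```python
import random

def weighted_mean_diff(sets, weights):
    """Weighted average of within-set treated-minus-control mean differences."""
    total = 0.0
    for (z, y), w in zip(sets, weights):
        treated = [yj for zj, yj in zip(z, y) if zj == 1]
        control = [yj for zj, yj in zip(z, y) if zj == 0]
        total += w * (sum(treated) / len(treated) - sum(control) / len(control))
    return total / sum(weights)

def randomization_pvalue(sets, tau0=0.0, draws=2000, seed=0):
    """One-sided Monte Carlo p value for H_tau0 against tau > tau0."""
    rng = random.Random(seed)
    # Under H_tau0, Y - tau0 * Z equals Y(0), which does not change
    # when treatment labels are reassigned within a set.
    adjusted = [(z, [yj - tau0 * zj for zj, yj in zip(z, y)]) for z, y in sets]
    # Illustrative weights only (an assumption): m_i (n_i - m_i) / n_i.
    weights = [sum(z) * (len(z) - sum(z)) / len(z) for z, _ in adjusted]
    observed = weighted_mean_diff(adjusted, weights)
    count = 0
    for _ in range(draws):
        permuted = []
        for z, y in adjusted:
            z_new = z[:]
            rng.shuffle(z_new)          # keeps m_i treated subjects in set i
            permuted.append((z_new, y))
        if weighted_mean_diff(permuted, weights) >= observed - 1e-12:
            count += 1
    return (1 + count) / (1 + draws)

# Made-up matched sets: (treatment indicators, outcomes).
toy_sets = [
    ([1, 0, 0], [5.0, 4.0, 6.0]),
    ([1, 1, 0], [7.0, 6.0, 8.0]),
    ([1, 0], [3.0, 3.0]),
]
p_value = randomization_pvalue(toy_sets, tau0=0.0)
print(p_value)
```

Shuffling the indicators within each set keeps the set structure $(m_{i}, n_{i} - m_{i})$ fixed, which is what random assignment within matched sets requires.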

As usual, a $1 - α$ confidence interval $I_{c}$ for $τ$ is formed by inverting a level-α test, so $I_{c}$ is the shortest interval containing all values of $τ_{0}$ not rejected by the test; see Lehmann and Romano (2005, chapter 3) for general discussion. Typically, a two-sided confidence interval is the intersection of two one-sided $1 - α / 2$ confidence intervals (see Shaffer, 1974).
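Test inversion can be sketched generically: evaluate a p value at each candidate $τ_{0}$ on a grid and keep the shortest interval containing the values that are not rejected. In the Python sketch below, the function `invert_test`, the grid, and the stand-in p value (a two-sided z-test with a made-up estimate and standard error) are all assumptions for illustration; in the football study the p value would come from the randomization test.

```python
import math

def invert_test(pvalue, grid, alpha=0.05):
    """Shortest interval containing every grid value tau0 with pvalue(tau0) > alpha."""
    kept = [t for t in grid if pvalue(t) > alpha]
    if not kept:
        return None
    return min(kept), max(kept)

# Hypothetical stand-in for a randomization p value: a two-sided z-test
# with a made-up point estimate and standard error.
estimate, std_error = -0.10, 0.10

def two_sided_p(tau0):
    z = abs(estimate - tau0) / std_error
    return math.erfc(z / math.sqrt(2))    # equals 2 * (1 - Phi(z))

grid = [k / 1000 for k in range(-1000, 1001)]   # candidate values of tau0
interval = invert_test(two_sided_p, grid)
print(interval)
```

The resolution of the grid limits the precision of the endpoints, so a finer grid or a root-finding step would be needed for more decimal places.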

Ignoring versions of treatment, using the first match in Table 1, and assuming that treatments are randomly assigned within matched sets, we obtain a randomization-based 95% confidence interval of $[- 0.308, 0.099]$ for $τ$ , that is, for a constant effect of playing football on the number of words remembered in the DWR test. Because this confidence interval includes zero, the hypothesis of no effect is not rejected at the 0.05 level. Because this confidence interval excludes all $τ$ with $|τ| \geq 1 / 3$ , constant effects of $\pm 1 / 3$ word remembered have been rejected as too large. It is important that “no effect” is plausible, but equally important that large effects, positive or negative, are implausible values for a constant effect, $τ$ . Our goal is to avoid lengthening this interval for $τ$ as we explore possible versions of the control, while controlling the family-wise error rate at α, conventionally $α = .05$ . This simultaneous inference is possible if the exploration of versions of treatment takes a specific form.

Incidentally, had we built the confidence interval for $τ$ using the default M-estimate in the senfm function, rather than the mean with option trim=Inf, then the 95% randomization interval for $τ$ would have been $[- 0.315, 0.096]$ . The default M-estimate in senfm corresponds to Huber’s $ψ$ -function, that is, $ψ (y) = y$ for $| y | \leq 1$ and $ψ (y) = sign (y)$ for $| y | > 1$ . Generally, use of robust procedures is advisable, but we do not do so in this example to simplify its presentation, as the robust procedures give similar answers in this short-tailed example.
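For concreteness, the $ψ$-function stated above can be written as a one-line Python function. The name `huber_psi` is introduced here for illustration and is not part of the sensitivityfull package.

```python
def huber_psi(y):
    """psi(y) = y for |y| <= 1 and sign(y) for |y| > 1, as stated in the text."""
    return max(-1.0, min(1.0, y))

# Values inside [-1, 1] pass through; values outside are clipped to -1 or 1.
print([huber_psi(v) for v in (-2.0, -0.5, 0.0, 0.5, 3.0)])
```

Clipping limits the influence of extreme observations, whereas the mean corresponds to $ψ (y) = y$ with no clipping.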

5. Inference With Versions of Treatment

5.1. Structure of the Problem

With two versions of control, say “playing no sport” and “playing a noncollision sport” like baseball, each person has two potential control responses, $Y_{i j} (0, a)$ and $Y_{i j} (0, b)$, and hence two treatment effects, $δ_{i j}^{a} = Y_{i j} (1) - Y_{i j} (0, a)$ and $δ_{i j}^{b} = Y_{i j} (1) - Y_{i j} (0, b)$. If $Y_{i j} (0, a) = Y_{i j} (0, b)$, $\forall i, j$, then the two versions of control yield the same effects, $δ_{i j}^{a} = δ_{i j}^{b}$, and so the versions need not be distinguished. This notation for potential outcomes under versions of control follows VanderWeele and Hernán (2013), where potential outcomes are fixed by both treatment and version. If there are versions, the implied randomization distribution will have three arms when matched sets include both versions of control, which might complicate inference. However, we only use the matching with sets including both versions of control to conduct inference under the assumption that the versions are irrelevant, and thus the three-arm randomization distribution collapses to the simpler two-arm design.

Consider the two null hypotheses about additive effects for the two versions of control, $H_{τ_{0}}^{a} : δ_{i j}^{a} = τ_{0}$, $\forall i, j$ and $H_{τ_{0}}^{b} : δ_{i j}^{b} = τ_{0}$, $\forall i, j$. Here, $H_{τ_{0}}^{a}$ might be true when $H_{τ_{0}}^{b}$ is false, or conversely. Define $H_{τ_{0}}$ to be the hypothesis that both $H_{τ_{0}}^{a}$ and $H_{τ_{0}}^{b}$ are true, that is, $H_{τ_{0}} : δ_{i j}^{a} = δ_{i j}^{b} = τ_{0}$, $\forall i, j$, so the two versions of control yield the same effect $τ_{0}$ and need not be distinguished. By the definition of $H_{τ_{0}}$, if either $H_{τ_{0}}^{a}$ or $H_{τ_{0}}^{b}$ is false, then $H_{τ_{0}}$ is false; that is, if there are two versions of treatment or control with different effects, then there is not a constant effect.

It is straightforward to test $H_{τ_{0}}^{a}$ or $H_{τ_{0}}^{b}$ using the methods in Section 4 simply by restricting attention to controls of one type or the other. These tests will be based on a smaller sample size than the test of $H_{τ_{0}}$ in Section 4 because not all of the controls are used. Moreover, if $H_{τ_{0}}$, $H_{τ_{0}}^{a}$, and $H_{τ_{0}}^{b}$ are each tested at level α, then the chance of at least one false rejection would typically exceed α unless something is done to control the family-wise error rate. Understandably, an investigator would like to avoid weakening the inference about $H_{τ_{0}}$ by virtue of considering $H_{τ_{0}}^{a}$ and $H_{τ_{0}}^{b}$, and the question is how to achieve the investigator’s goals.

Suppose there are two versions of a constant additive treatment effect, $δ_{i j}^{a} = τ^{a}$ and $δ_{i j}^{b} = τ^{b}$ for every $i, j$. Let $τ_{min} = \min (τ^{a}, τ^{b})$ and $τ_{max} = \max (τ^{a}, τ^{b})$. If $τ^{a} = τ^{b} = τ$, then $τ_{min} = τ$ and $τ_{max} = τ$, so the versions do not matter. Our approach in Subsection 5.2 is to build two confidence intervals, one interval for $τ$ and another interval designed to contain $[τ_{min}, τ_{max}]$. If there is no need to consider versions of treatment or control because $Y_{i j} (0, a) = Y_{i j} (0, b)$, implying that $τ^{a} = τ^{b} = τ$, then with probability at least $1 - α$, both intervals simultaneously cover the true $τ$. If $τ^{a} \neq τ^{b}$, then $H_{τ_{0}}$ is false for every $τ_{0}$, but with probability at least $1 - α$ the second interval covers the interval $[τ_{min}, τ_{max}]$. Moreover, the first interval for $τ$ is the interval reported in Section 4 ignoring versions of treatment, so the investigator has ensured that under either the assumption of a constant treatment effect or the assumption of versions, the two intervals she reports control the family-wise error rate at α, while paying no additional price in power to detect a constant effect for consideration of versions of treatment. The inference is simultaneous in that the two intervals provide the correct coverage for $τ$ under the assumption of a constant effect and also the correct coverage for $τ^{a}$ and $τ^{b}$ under the presence of versions.

5.2. Inference When There May or May Not Be Two Versions of Treatment

The theory in this section is derived for one-sided intervals but, as we will see shortly, is easily extended to the two-sided $1 - α$ intervals described in Section 4. There is a valid one-sided p value, say $P_{τ_{0}}^{a}$, testing $H_{τ_{0}}^{a}$ against $τ^{a} > τ_{0}$, so that $Pr (P_{τ_{0}}^{a} \leq α) \leq α$ if $H_{τ_{0}}^{a}$ is true. In parallel, there is a valid one-sided p value, $P_{τ_{0}}^{b}$, testing $H_{τ_{0}}^{b}$ against $τ^{b} > τ_{0}$, and a valid one-sided p value, $P_{τ_{0}}$, testing $H_{τ_{0}}$ against $τ > τ_{0}$. With a slight abuse of notation, write the probability that a random interval $I$ contains a fixed real number $τ$ as $Pr (I \supseteq τ)$. Under the assumption, perhaps incorrect, that there is a single version of the treatment, $τ^{a} = τ^{b} = τ$, let $I_{c}^{-}$ be the usual one-sided $1 - α$ confidence interval for $τ$ formed by inverting the test of $H_{τ_{0}}$, so $I_{c}^{-}$ is the smallest set of the form $[\tilde{τ}, \infty)$ containing $\{τ_{0} : P_{τ_{0}} > α\}$. If there are no versions of treatment, so $τ^{a} = τ^{b} = τ$ for some $τ$, then $Pr (I_{c}^{-} \supseteq τ) \geq 1 - α$ by the familiar duality of tests and confidence intervals (see Lehmann & Romano, 2005, chapter 3). The investigator would like to report this standard interval $I_{c}^{-}$ for a constant effect, without lengthening it for multiple testing, yet would also like to say something about the possibility that there are versions of treatment with $τ^{a} \neq τ^{b}$. Of course, if there are versions of treatment with $τ^{a} \neq τ^{b}$, then $H_{τ_{0}}$ is false for every $τ_{0}$ and there is no true value of $τ$ for $I_{c}^{-}$ to contain or omit. The smallest set of the form $[\tilde{τ}, \infty)$ containing $\{τ_{0} : P_{τ_{0}} > α \text{ or } P_{τ_{0}}^{a} > α \text{ or } P_{τ_{0}}^{b} > α\}$ will be denoted $I_{v}^{-}$. Of course, $I_{v}^{-} \supseteq I_{c}^{-}$.

The investigator does not know whether there are two versions of treatment, that is, whether $τ^{a} = τ^{b}$ or $τ^{a} \neq τ^{b}$, and would like to make inferences appropriate to this state of ignorance, one for each of the two situations. The investigator says,

I do not know whether there are two versions of treatment, whether or not $τ^{a} = τ^{b}$ ; however, (i) if there are not versions of treatment so that $τ^{a} = τ^{b} = τ$ , then $I_{c}^{-} \supseteq τ$ , and (ii) whether or not there are two versions of treatment, even if $τ^{a} \neq τ^{b}$ , then $I_{v}^{-} \supseteq τ_{min}$ ; moreover, this method produces two true hypothetical statements with probability at least $1 - α$ .

Statement (ii) costs nothing, in the sense that $I_{c}^{-}$ is the usual one-sided confidence interval for $τ$ assuming there are not versions of treatment, yet both statements hold jointly without multiplicity correction. This is established in the following proposition:

Proposition 1. (i) If there is only one version of treatment, $τ^{a} = τ^{b} = τ$ , then $Pr (I_{v}^{-} \supseteq I_{c}^{-} \supseteq τ) \geq 1 - α$ . (ii) In any event, whether there are two versions of treatment, $τ^{a} \neq τ^{b}$ , or only a single version, $τ^{a} = τ^{b} = τ$ , we have $Pr (I_{v}^{-} \supseteq τ_{min}) \geq 1 - α$ .

Proof. By the definitions of $I_{v}^{-}$ and $I_{c}^{-}$ , we have $I_{v}^{-} \supseteq I_{c}^{-}$ . Then, (i) follows because, if there are not versions of treatment, $τ^{a} = τ^{b} = τ$ , then $I_{c}^{-}$ is a $1 - α$ confidence interval for $τ$ and $Pr (I_{v}^{-} \supseteq I_{c}^{-} \supseteq τ) \geq 1 - α$ . If there are not versions of treatment, $τ^{a} = τ^{b} = τ$ , then $τ_{min} = τ$ , so $Pr (I_{v}^{-} \supseteq τ_{min}) \geq 1 - α$ , as required for (ii). So suppose there are two versions of treatment. If $τ^{a} = τ_{min} < τ_{max} = τ^{b}$ , then $τ_{min} \notin I_{v}^{-}$ implies $P_{τ^{a}}^{a} \leq α$ which occurs with probability at most α. If $τ^{b} = τ_{min} < τ_{max} = τ^{a}$ , then $τ_{min} \notin I_{v}^{-}$ implies $P_{τ^{b}}^{b} \leq α$ which occurs with probability at most α. So in all three cases, $τ^{a} = τ^{b}$ or $τ^{a} < τ^{b}$ or $τ^{a} > τ^{b}$ , we have $Pr (I_{v}^{-} \supseteq τ_{min}) \geq 1 - α$ , proving (ii).▪
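The two coverage claims in Proposition 1 can be illustrated by a small simulation. The Python sketch below replaces randomization inference with a simple normal model, which is an assumption made only to keep the example short: two independent normal estimates with a common, made-up standard error stand in for the version-specific analyses, and their equally weighted average stands in for the analysis that uses everyone. The function name `simulate` and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist

def simulate(tau_a, tau_b, se=0.15, alpha=0.05, draws=4000, seed=1):
    """Estimate how often the lower ends of I_c^- and I_v^- fall at or below tau_min."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha)        # one-sided critical value
    tau_min = min(tau_a, tau_b)
    cover_c = cover_v = 0
    for _ in range(draws):
        est_a = rng.gauss(tau_a, se)           # estimate using version a only
        est_b = rng.gauss(tau_b, se)           # estimate using version b only
        est = (est_a + est_b) / 2              # pooled estimate, equal weights
        low_c = est - z * se / 2 ** 0.5        # lower end of I_c^-
        # Lower end of I_v^-: the smallest of the three one-sided bounds.
        low_v = min(low_c, est_a - z * se, est_b - z * se)
        cover_c += low_c <= tau_min
        cover_v += low_v <= tau_min
    return cover_c / draws, cover_v / draws

same = simulate(0.0, 0.0)        # single version: tau^a = tau^b
different = simulate(0.0, 0.3)   # two versions with different effects
print(same, different)
```

In the first call both proportions refer to coverage of the common $τ$; in the second call only the proportion for $I_{v}^{-}$ is covered by the proposition, because there is no constant $τ$ for $I_{c}^{-}$ to cover.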

By a parallel argument, we obtain analogous $1 - α$ upper intervals, $I_{c}^{+}$ and $I_{v}^{+}$, of the form $(- \infty, \tilde{τ})$, for $τ$ if $τ^{a} = τ^{b} = τ$ or, without restrictions, for $τ_{max}$. Taking the intersections, $I_{c} = I_{c}^{-} \cap I_{c}^{+}$ and $I_{v} = I_{v}^{-} \cap I_{v}^{+}$, of two one-sided $1 - α / 2$ intervals yields analogous two-sided $1 - α$ intervals for $τ$ if $τ^{a} = τ^{b} = τ$ or, without restrictions, for the interval $[τ_{min}, τ_{max}]$. In most cases, $I_{v}$ can be constructed by taking the union of $I_{c}$ and the two two-sided intervals constructed from the matched sets using each separate version of control. When the three intervals do not overlap, so that their union is not itself an interval, $I_{v}$ is the shortest interval that contains all three intervals.
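The construction of $I_{v}$ from three two-sided intervals can be sketched as follows. The helper `hull` is a name introduced here for illustration; the interval for the constant effect is the one reported in Section 4, while the two version-specific intervals are made-up numbers, not results from the football study.

```python
def hull(*intervals):
    """Shortest interval containing every interval given as (low, high)."""
    return min(lo for lo, _ in intervals), max(hi for _, hi in intervals)

I_c = (-0.308, 0.099)   # interval for a constant effect, from Section 4
I_a = (-0.40, 0.15)     # hypothetical interval for tau^a (made up)
I_b = (-0.35, 0.20)     # hypothetical interval for tau^b (made up)

# I_v is the shortest interval containing all three intervals.
I_v = hull(I_c, I_a, I_b)
print(I_c, I_v)
```

By construction the result always contains $I_{c}$, so the interval for a constant effect is reported unchanged alongside it.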

In Case (ii), the proof above that $Pr (I_{v} \supseteq τ_{min}) \geq 1 - α$ is similar to, but not quite identical to, results in Lehmann (1952), Berger (1982), and Laska and Meisner (1989). These authors proposed tests that would invert to yield as a confidence interval the shortest interval $I_{*}$ containing $\{τ_{0} : P_{τ_{0}}^{a} > α \text{ or } P_{τ_{0}}^{b} > α\}$, whereas $I_{v}$ is the shortest interval containing $\{τ_{0} : P_{τ_{0}} > α \text{ or } P_{τ_{0}}^{a} > α \text{ or } P_{τ_{0}}^{b} > α\}$, thereby ensuring $I_{v} \supseteq I_{c}$. Of course, $I_{v} \supseteq I_{*}$, but unlike $I_{*}$, our method ensures that $I_{v}$ and $I_{c}$ both simultaneously cover $τ^{a} = τ^{b} = τ$ at rate $1 - α$ when there is actually only a single version of treatment. Because $I_{c}$ is built using all of the data and under stronger assumptions, it is unlikely that $I_{*}$ will be much shorter than $I_{v}$; however, this logical possibility is the price for reporting the usual interval, $I_{c}$, without multiplicity correction.

Why not report a single robust interval like $I_{*}$ instead of two intervals, $I_{c}$ and $I_{v}$ , that have slightly more nuanced coverage properties? A simple example may help illustrate the advantage of reporting $I_{c}$ and $I_{v}$ over $I_{*}$ . Suppose that $I_{*}$ and $I_{v}$ both contain zero but $I_{c}$ does not. If we choose to report $I_{*}$ , then we have little additional information to determine why $I_{*}$ contains zero—Is Fisher’s sharp null true or is the smaller of the two versions of effect very close to zero? Or, are there versions of the effect with different signs? However, if we report $I_{c}$ and $I_{v}$ , we have evidence against Fisher’s sharp null, narrowing the plausible explanations for why zero is contained in $I_{v}$ , which we’ve noted will tend to be similar to $I_{*}$ .

5.3. Interval Estimates in the Football Study

The upper third of Figure 1, marked $Γ = 1$ , shows 95% intervals for the football study, assuming that treatments are randomly assigned within matched sets. First, there are the three conventional intervals for $τ$ , $τ^{a}$ , and $τ^{b}$ , corresponding to the three comparisons in Table 1. Each of these intervals is a 95% confidence interval on its own, but each one runs a 5% chance of error, so the chance that at least one interval fails to cover its corresponding parameter is greater than 5%. Obviously, we could make the three intervals longer, say using the Bonferroni inequality, so that the simultaneous coverage is 95%, but many investigators would find this unattractive because it would reduce the power of the conventional, primary analysis focused on $τ$ that uses all of the controls; that is, it would make the first interval longer.

Figure 1.

Comparison of interval estimates for the effect of high school football on the delayed word recall score. Note. The intervals for $Γ = 1$ assume that there is no bias from unmeasured covariates, while $Γ > 1$ permits unmeasured biases of unknown form but limited magnitude. The top three intervals for “all controls,” “no sport,” and “other sport” are conventional confidence intervals lacking simultaneous coverage. The bottom two intervals are $I_{c}$ and $I_{v}$ . The “all controls” interval equals $I_{c}$ and the union of the first three intervals equals $I_{v}$ .

In contrast, the intervals $I_{c}$ and $I_{v}$ in the top panel of Figure 1 have simultaneous coverage of 95% in the sense of Proposition 1. Notably, $I_{c} = [- 0.308, 0.099]$ is the interval for $τ$ from Section 4, so consideration of $I_{v}$ has not reduced power to detect a constant effect. The versions interval $I_{v} = [- 0.357, 0.219]$ is slightly longer than $I_{c}$ , but both intervals are compatible with no effect and both intervals are quite incompatible with an effect of half a word, $\pm 0.5$ . For comparison, recall from Section 2 that average performance on the DWR test declined by half a word from age 65 to 72. In Figure 1, the 95% interval for “all controls” equals $I_{c}$ , while $I_{v}$ is the union of the three intervals for “all controls,” “controls who played no sport,” and “controls who played another sport.”

6. Comparison to Conventional Approaches to Multiple Versions: F Tests and Bonferroni Correction

The method described in Subsection 5.2 makes an important trade-off: It prioritizes the primary comparison against all controls under the assumption of a constant treatment effect, allowing us to report the corresponding confidence interval with no correction, in exchange for the ability to distinguish between versions if they do, in fact, exist. In other words, our method emphasizes detection of nonzero, constant treatment effects over detecting different versions of the effect. Conventional approaches to multiple comparisons are often less focused and are designed to detect a broader range of departures from the null. For example, a simple Bonferroni correction places the primary comparison and the two versioned comparisons on equal footing. If the investigator does not suspect a priori that a particular alternative hypothesis is most likely, they may conduct an omnibus F test, whose power is distributed over a broad range of alternative hypotheses.

The researcher’s scientific aim should determine which method for multiple comparisons is most appropriate. With this guidance in mind, we compare our method to the omnibus F test and Bonferroni corrected intervals.

6.1. The Omnibus F Test

In exploratory analyses, the F test can be a useful “prelude to subsequent examinations of unplanned contrasts” (Steiger, 2004). However, in many studies, the researcher will have a particular contrast in mind. Several authors have argued that the omnibus hypothesis in analysis of variance studies be replaced with hypotheses that focus on a substantive research question, often involving just a single contrast (Rosenthal et al., 2000; Steiger, 2004). In the football study, we suspect a priori that versions are not terribly consequential and proceed first with our primary investigation of whether playing high school football accelerates the onset of cognitive decline. The hypotheses about versions are secondary to our main inquiry and are treated as such in our method.

The F test does not lend itself to effect size estimates, but we can compare it to our method by evaluating its power to reject Fisher’s sharp null hypothesis and the probability that $I_{c}$ excludes $τ = 0$ under a variety of alternative hypotheses with varying degrees of “versioning.” If the primary goal of the study is less directed, so that detecting any departure from the null of no effect is of interest, and the researcher suspects that versions may play an important role, the F test may be more suitable. However, in a simulation study described in the Appendix, we find that when the versions differ by less than $\sim 30\%$ to $40\%$ in magnitude and have the same sign, our method has better power to reject Fisher’s sharp null than the F test. Because it is an omnibus test, the power of the F test is much less sensitive to the specific pattern of the alternative hypothesis.

6.2. Bonferroni Corrected Intervals

In Figure 2, we compare the intervals returned by our version method to the three intervals returned by a Bonferroni correction (BC) to the comparison of the football players against all controls and each version of control. In the top panel, the BC interval using all controls is 22% longer than $I_{c}$ . In the bottom panel, $I_{v}$ is only 3% longer than the BC interval using only “no sport” controls and 15% shorter than the BC interval using “other sport” controls. In this particular case, where versions appear to be relatively innocuous, the cost of lengthening the primary interval is significant for what appears to be little if any gain from investigating each version separately. Bonferroni correction may be more appropriate when the investigator suspects the versions are important, and they may even replace the primary comparison with a comparison of the two versions of control themselves. However, if the versions are only modestly different, the setting for which our method is designed, it is unlikely that Bonferroni would have sufficient power to detect such modest differences in the versions of control.
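As a rough check on the cost of Bonferroni correction, the sketch below computes how much longer a two-sided interval becomes when three comparisons share a 5% error rate. It assumes a normal approximation with a fixed standard error, which is not how the paper's randomization-based intervals are computed, and the function name is hypothetical. Under that assumption the lengthening is close to the 22% reported for the “all controls” comparison.

```python
from statistics import NormalDist

def bonferroni_length_ratio(alpha: float, k: int) -> float:
    """Length of a Bonferroni-corrected two-sided normal interval (k comparisons)
    divided by the length of the uncorrected interval, for a fixed standard error."""
    z = NormalDist().inv_cdf
    return z(1 - alpha / (2 * k)) / z(1 - alpha / 2)

ratio = bonferroni_length_ratio(alpha=0.05, k=3)
print(f"Bonferroni interval is about {100 * (ratio - 1):.0f}% longer")
```

The ratio depends only on the critical values, so under this approximation the same relative lengthening would apply to each of the three corrected intervals.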

Figure 2.

Comparison of $I_{c}$ (top panel, dark line) and $I_{v}$ (bottom panel, dark line) to Bonferroni corrected (BC) intervals comparing football players versus “all controls” (top panel, light line), versus “no sport” controls (bottom panel, topmost light line), and versus “other sport” controls (bottom panel, bottom-most light line). Note. $I_{v}$ is very similar in length to the BC interval using “no sport” controls and is noticeably shorter than the BC interval using “other sport” controls.

7. Versions, Effect Modification, and Insufficient Overlap

In the study of the effects of playing high school football on cognitive decline, the two versions of control and the football players have significant overlap in observed covariates. An anonymous reviewer suggested the following situation. Suppose that athletes who did not play football and nonathletes who did not play football differ noticeably on observed covariates such that a different set of football players were matched in the matchings using the different versions of control. If the treatment is heterogeneous, say it is modified by some observed covariate that differs between the two versions of control, does “this mean there are two versions of treatment or that there are two different groups of individuals that are involved in the two comparisons?” In this setting, the potential for treatment heterogeneity is aliased with the potential for versions of the treatment effect. But does this matter? The logic of Subsection 5.2 is agnostic to why there may be different treatment effects between versions. Thus, $I_{v}$ can be interpreted as assessing how robust our primary analysis is to the existence of versions of treatment or to the existence of effect heterogeneity between the two groups defined by “versions” of control.

8. Sensitivity to Departures From Random Assignment

So far, we have drawn inferences under the assumption that treatments are randomly assigned within matched sets. In an observational study, this assumption lacks support and is typically doubtful if not implausible. We examine sensitivity to bias from nonrandom assignment by assuming that two individuals with the same observed covariates may differ in their odds of treatment by at most a factor of $Γ \geq 1$ due to differences in unobserved covariates (see Rosenbaum, 2007b, 2017, section 9). This yields hypothesis tests that falsely reject a true null hypothesis with probability at most α when the bias in treatment assignment is at most $Γ$ . Then, $Γ$ is varied to display the magnitude of bias that would need to be present to alter the conclusions of a study. How much bias, measured by $Γ$ , would need to be present to lead us to fail to reject the null hypothesis of no effect of football when, in fact, football causes substantial harm? In the current example, where there is no evidence of a harmful effect of football, we may ask a parallel question that is related to equivalence testing: How much bias would need to be present to mask a substantial true effect of football on memory, for example, an increase or decrease in the DWR score of at least one word?

Aids to interpreting values of $Γ$ are discussed by Rosenbaum and Silber (2009) and Hsu and Small (2013). In particular, in a matched pair with $n_{i} = 2$ , the value $Γ = 1.25$ corresponds with an unobserved covariate that doubles the odds of playing football and doubles the odds of a worse memory score, while $Γ = 1.5$ corresponds with an unobserved covariate that doubles the odds of playing football and quadruples the odds of a worse memory score (see Rosenbaum 2017, section 9; Rosenbaum & Silber, 2009). Proposition 1 applies to the intervals obtained from upper bounds on p values from sensitivity analyses, providing the bias in treatment assignment is at most $Γ$ .
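The correspondences quoted above can be reproduced with a short calculation. The sketch assumes the matched-pair amplification formula $Γ = (Λ Δ + 1) / (Λ + Δ)$ from Rosenbaum and Silber (2009), where $Λ$ multiplies the odds of treatment and $Δ$ multiplies the odds of a worse outcome; the formula is not stated in this section, and the function names are hypothetical.

```python
from fractions import Fraction

def gamma_from_amplification(lam, delta) -> Fraction:
    """Gamma implied by an unobserved covariate that multiplies the odds of
    treatment by lam and the odds of a worse outcome by delta (matched pairs)."""
    lam, delta = Fraction(lam), Fraction(delta)
    return (lam * delta + 1) / (lam + delta)

def delta_for(gamma, lam) -> Fraction:
    """Solve the same formula for delta, given gamma and lam (requires lam > gamma)."""
    gamma, lam = Fraction(gamma), Fraction(lam)
    if lam <= gamma:
        raise ValueError("lam must exceed gamma for a finite positive delta")
    return (gamma * lam - 1) / (lam - gamma)

print(float(gamma_from_amplification(2, 2)))  # 1.25: doubles both odds
print(float(gamma_from_amplification(2, 4)))  # 1.5: doubles one, quadruples the other
print(delta_for(Fraction(3, 2), 2))           # 4
```

Exact fractions are used so that the round trip between $Γ$ and $(Λ, Δ)$ is free of floating-point error.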

Figure 1 shows the expansion of $I_{c}$ and $I_{v}$ as $Γ$ increases from $Γ = 1$ for randomization inferences to $Γ = 1.25$ and $Γ = 1.5$ . For $Γ = 1.25$ , the intervals are $I_{c} = [- 0.534, 0.328]$ and $I_{v} = [- 0.574, 0.464]$ . For $Γ = 1.5$ , the intervals are $I_{c} = [- 0.716, 0.517]$ and $I_{v} = [- 0.771, 0.666]$ . A bias of $Γ = 1.5$ together with two versions of not playing football would be insufficient to mask an effect of one word on the memory test, $\pm 1$ . At $Γ = 2$ , not shown in Figure 1, effects of $\pm 1$ word start to be included in the confidence intervals, with $I_{c} = [- 0.997, 0.817]$ and $I_{v} = [- 1.082, 0.986]$ . A bias of $Γ = 2$ corresponds with an unobserved covariate that triples the odds of playing football and increases the odds of worse memory performance fivefold.

In brief, there is no sign of an effect of high school football on memory scores. Could the absence of any sign of an effect reflect a substantial effect and bias in who plays football? To mask a true effect of $\pm 1$ word, an unobserved bias would have to be moderately large, $Γ = 2$ . Even allowing for both moderate confounding due to unmeasured covariates and versions of treatment, large effects of high school football on memory scores are not consistent with the data.
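The reasoning in this section can be checked directly against the reported intervals. The sketch below copies the values of $I_{c}$ and $I_{v}$ given in the text for each $Γ$ (the $Γ = 1$ values come from Section 5.3) and finds the smallest tabulated $Γ$ at which an effect of a given size, in either direction, is no longer excluded. The names `reported`, `contains`, and `smallest_gamma_including` are hypothetical.

```python
# (I_c, I_v) as reported in the text for each value of Gamma.
reported = {
    1.0: ((-0.308, 0.099), (-0.357, 0.219)),
    1.25: ((-0.534, 0.328), (-0.574, 0.464)),
    1.5: ((-0.716, 0.517), (-0.771, 0.666)),
    2.0: ((-0.997, 0.817), (-1.082, 0.986)),
}

def contains(interval, value):
    lo, hi = interval
    return lo <= value <= hi

def smallest_gamma_including(effect):
    """Smallest tabulated Gamma at which I_c or I_v includes +effect or -effect."""
    for gamma in sorted(reported):
        i_c, i_v = reported[gamma]
        if any(contains(i, sign * effect) for i in (i_c, i_v) for sign in (1, -1)):
            return gamma
    return None  # effect excluded at every tabulated Gamma

print(smallest_gamma_including(1.0))  # 2.0
print(smallest_gamma_including(0.5))  # 1.25
```

At $Γ = 2$ only the lower end of $I_{v}$ reaches one word, which matches the statement that effects of $\pm 1$ word only start to be included at that level of bias.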

9. Discussion: Simultaneous Inference About One Question Under Different Assumptions

Investigators sometimes candidly report two or more statistical analyses valid under different assumptions. In the process, they often lose the several advantages of a single, simple, primary analysis, that is, a single analysis with high power against a nonzero constant effect because it uses everyone and avoids needed corrections for multiple testing when several statistical tests are performed. With less candor, investigators sometimes perform several analyses and report some but not all analyses, a perhaps common practice that no one would publicly advocate.

Versions of treatment arise in observational studies when treatment or control conditions found in available data may not be uniform, as they would be in a tightly controlled experiment. The investigator would like to follow the practice of clinical trials and report a single, primary analysis using everyone without multiplicity correction. Nonetheless, the investigator would like to speak to the possibility that there are versions of treatment or control conditions. The proposed method always reports two interval estimates. The first, shorter interval, $I_{c}$ , is precisely the interval that would be reported in a single primary analysis without versions of treatment. The second longer interval, $I_{v}$ , attempts to cover both treatment effects if there are two versions of treatment or two versions of control. If there is, in fact, only a single treatment effect, the same for both versions, then the probability that both $I_{c}$ and $I_{v}$ simultaneously cover that one effect is the stated rate of $1 - α$ . If there are, in fact, two treatment effects that differ with the two versions, then the second interval, $I_{v}$ , covers both effects with the stated rate of $1 - α$ . In that sense, the added information provided by reporting two intervals, $I_{c}$ and $I_{v}$ , is free: The interval $I_{c}$ is clarified but not lengthened by examining $I_{v}$ . Although $I_{v}$ is always somewhat longer than $I_{c}$ , in the football example, it is only slightly longer, thereby suggesting that the primary analysis is not greatly distorted by the two versions of the control condition.

Appendix

Simulation Comparing the Power of the Omnibus F Test to $I_{c}$

In this appendix, we compare the power of the omnibus F test to the power of our version method to detect departures from Fisher’s sharp null hypothesis under varying degrees of “versioning.” The results can be found in Table A1. When the degree of versioning is modest, the version method is more powerful than the F test.

Table A1.

Comparison of the Power of the Version Method (Columns 2 and 4) and the Power of the F Test (Columns 3 and 5) to Reject Fisher’s Sharp Null Hypothesis Under Alternative Hypotheses With Varying Degrees of “Versioning.”

Degree of versioning	$τ^{b} = 0.25$	$τ^{b} = 0.25$	$τ^{b} = 0.4$	$τ^{b} = 0.4$
	Version method, $P (0 \notin I_{c})$	F test, $P (F > c_{α})$	Version method, $P (0 \notin I_{c})$	F test, $P (F > c_{α})$
$τ^{a} = τ^{b}$	.61	.49	.94	.91
$τ^{a} = 0.95 \times τ^{b}$	.58	.47	.93	.88
$τ^{a} = 0.9 \times τ^{b}$	.56	.47	.93	.89
$τ^{a} = 0.75 \times τ^{b}$	.51	.43	.89	.85
$τ^{a} = 0.65 \times τ^{b}$	.44	.42	.83	.84
$τ^{a} = 0.6 \times τ^{b}$	.44	.44	.83	.86
$τ^{a} = 0.5 \times τ^{b}$	.37	.41	.75	.84
$τ^{a} = 0.25 \times τ^{b}$	.27	.55	.58	.95

Note. Significance level of the tests is $α = .05$ .
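To read the crossover point off Table A1, the sketch below transcribes the table and reports, for each value of $τ^{b}$, the largest ratio $τ^{a} / τ^{b}$ at which the F test has strictly higher power than the version method. The data structure and function name are hypothetical; the numbers are copied from the table.

```python
# Rows: (tau_a / tau_b, power of version method, power of F test), from Table A1.
table_a1 = {
    0.25: [(1.00, .61, .49), (0.95, .58, .47), (0.90, .56, .47), (0.75, .51, .43),
           (0.65, .44, .42), (0.60, .44, .44), (0.50, .37, .41), (0.25, .27, .55)],
    0.40: [(1.00, .94, .91), (0.95, .93, .88), (0.90, .93, .89), (0.75, .89, .85),
           (0.65, .83, .84), (0.60, .83, .86), (0.50, .75, .84), (0.25, .58, .95)],
}

def largest_ratio_where_f_wins(rows):
    """Largest tau_a / tau_b at which the F test has strictly higher power."""
    winners = [ratio for ratio, version_power, f_power in rows if f_power > version_power]
    return max(winners) if winners else None

for tau_b, rows in table_a1.items():
    print(tau_b, largest_ratio_where_f_wins(rows))
# 0.25 0.5
# 0.4 0.65
```

In these simulations the F test overtakes the version method once the versions differ by roughly 35% to 50%, in line with the description in Section 6.1.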

Simulation Settings

Let there be $I = 100$ matched sets, with $m_{i} = 1$ treated and $n_{i} - m_{i} = 4$ controls in each set. The design is balanced over versions, that is, there are two controls of each version in each matched set. In each set, we generate outcomes as follows: $Y_{i j} = τ + X_{i} + ε_{i j}$ if the $j$th subject in set $i$ receives treatment, $Y_{i j} = X_{i} + ε_{i j}$ if the $j$th subject receives control version b, and $Y_{i j} = δ + X_{i} + ε_{i j}$ if the $j$th subject receives control version a. We let the individual and matched-set level terms, $ε_{i j}$ and $X_{i}$ , be distributed as independent standard normals. Finally, let $τ^{b} = τ$ and $τ^{a} = τ - δ$ . If there are no versions, then $δ = 0$ . When conducting the F test of the null of no treatment effect, we model $X_{i}$ as a linear effect.
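The data-generating process above can be sketched as follows. This is a minimal illustration, not the simulation code used for Table A1: the function names, the dictionary layout, and the simple mean-difference summaries are assumptions, and the power calculations for $I_{c}$ and the F test are not reproduced.

```python
import random
from statistics import mean

def simulate_sets(n_sets=100, tau=0.25, delta=0.0, seed=0):
    """Generate matched sets with 1 treated subject and 2 controls of each version."""
    rng = random.Random(seed)
    sets = []
    for _ in range(n_sets):
        x = rng.gauss(0.0, 1.0)  # matched-set term X_i, standard normal

        def outcome(shift):
            return shift + x + rng.gauss(0.0, 1.0)  # individual term eps_ij

        sets.append({
            "treated": [outcome(tau)],                       # Y = tau + X + eps
            "control_a": [outcome(delta), outcome(delta)],   # Y = delta + X + eps
            "control_b": [outcome(0.0), outcome(0.0)],       # Y = X + eps
        })
    return sets

def mean_differences(sets):
    """Average treated-minus-control differences by version (a crude summary)."""
    d_a = mean(s["treated"][0] - mean(s["control_a"]) for s in sets)
    d_b = mean(s["treated"][0] - mean(s["control_b"]) for s in sets)
    return d_a, d_b

# Example: tau_b = 0.25 and tau_a = 0.75 * tau_b, so delta = 0.25 * 0.25.
sets = simulate_sets(n_sets=100, tau=0.25, delta=0.0625, seed=1)
print(mean_differences(sets))
```

Because $X_{i}$ is shared within a set, it cancels in each within-set difference, so the two summaries estimate $τ^{a} = τ - δ$ and $τ^{b} = τ$ respectively.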

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Sameer K. Deshpande

References

Bachynski, K. E. (2016). Tolerable risks? Physicians and tackle football. New England Journal of Medicine, 374, 405–407.

Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24, 295–300.

Campbell, D. T. (1969). Prospective: Artifact and control. In Rosenthal & R. L. Rosnow (Eds.), Artifact in behavioral research (pp. 351–382). Academic Press.

Conover, W. J., & Salsburg, D. S. (1988). Locally most powerful tests for detecting treatment effects when only a subset of patients can be expected to respond to treatment. Biometrics, 44, 189–196.

Deshpande, S. K., Hasegawa, R. B., Rabinowitz, A. R., Whyte, Roan, C. L., Tabatabaei, Baiocchi, Karlawish, J. H., Master, C. L., & Small, D. S. (2017). High school football and later life cognition and mental health: An observational study. JAMA Neurology, 74, 909–918. arXiv preprint arXiv:1607.01756.

Fisher, R. A. (1935). The design of experiments. Oliver and Boyd.

Graves, A. B., White, Koepsell, Reifler, B. V., Van Belle, Larson, E. B., & Raskind (1990). The association between head trauma and Alzheimer’s disease. American Journal of Epidemiology, 131, 491–501.

Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99, 609–618.

Hansen, B. B. (2007). Flexible, optimal matching for observational studies. R News, 7, 18–24. (R package optmatch)

Hansen, B. B., & Klopfer, S. O. (2006). Optimal full matching and related designs via network flows. Journal of Computational and Graphical Statistics, 15, 609–627. (R package optmatch)

Hsu, J. Y., & Small, D. S. (2013). Calibrating sensitivity analyses to observed covariates in observational studies. Biometrics, 69, 803–811.

Knopman, D. S., & Ryberg (1989). A verbal memory test with high predictive accuracy for dementia of the Alzheimer type. Archives of Neurology, 46, 141–145.

Laska, E. M., & Meisner, M. J. (1989). Testing whether an identified treatment is best. Biometrics, 45, 1139–1151.

Lehman, E. J., Hein, M. J., Baron, S. L., & Gersic, C. M. (2012). Neurodegenerative causes of death among retired national football league players. Neurology, 79, 1970–1974.

Lehmann, E. L. (1952). Testing multiparameter hypotheses. Annals of Mathematical Statistics, 23, 541–552.

Lehmann, E. L., & Romano (2005). Testing statistical hypotheses (3rd ed.). Springer.

McKee, A. C., Cantu, R. C., Nowinski, C. J., Hedley-Whyte, Gavett, B. E., Budson, A. E., Santini, V. E., Lee, H.-S., Kubilius, C. A., & Stern, R. A. (2009). Chronic traumatic encephalopathy in athletes: Progressive tauopathy after repetitive head injury. Journal of Neuropathology and Experimental Neurology, 68, 709–735.

Miles, S. H., & Prasad (2016). Medical ethics and school football. American Journal of Bioethics, 16, 6–10.

Mortimer, J. A., van Duijn, C. M., Chandra, Fratiglioni, Graves, A. B., Heyman, Jorm, A. F., Kokmen, Kondo, Rocca, W. A., Shalat, S. L., Soininen, & the Eurodem Risk Factors Research Group. (1991). Head trauma as a risk factor for Alzheimer’s disease: A collaborative re-analysis of case-control studies. International Journal of Epidemiology, 20, S28–S35.

Neyman (1990). On the application of probability theory to agricultural experiments. Statistical Science, 5, 463–480. (Original work published 1923)

Peto, Pike, Armitage, Breslow, N. E., Cox, D. R., Howard, S. V., Mantel, McPherson, Peto, & Smith, P. G. (1976). Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. British Journal of Cancer, 34, 585–612.

Pitman, E. J. (1937). Statistical tests applicable to samples from any population. Journal of the Royal Statistical Society, 4, 119–130.

Rosenbaum, P. R. (1984). The consequences of adjustment for a concomitant variable that has been affected by the treatment. Journal of the Royal Statistical Society: Series A, 147, 656–666.

Rosenbaum, P. R. (1987). The role of a second control group in an observational study. Statistical Science, 2, 292–306.

Rosenbaum, P. R. (1991). A characterization of optimal designs for observational studies. Journal of the Royal Statistical Society: Series B, 53, 597–610.

Rosenbaum, P. R. (2007a). Confidence intervals for uncommon but dramatic responses to treatment. Biometrics, 63, 1164–1171.

Rosenbaum, P. R. (2007b). Sensitivity analysis for m-estimates, tests and confidence intervals in matched observational studies. Biometrics, 63, 456–464. (R package sensitivitymult; demonstration at https://rosenbap.shinyapps.io/learnsenShiny/)

Rosenbaum, P. R. (2017). Observation and experiment: An introduction to causal inference. Harvard University Press.

Rosenbaum, P. R., & Silber, J. H. (2009). Amplification of sensitivity analysis in observational studies. Journal of the American Statistical Association, 104, 1398–1405. (amplify function in the R package sensitivitymult)

Rosenthal, Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioral research: A correlational approach. Cambridge University Press.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.

Rubin, D. B. (1986). Comment: Which ifs have causal answers. Journal of the American Statistical Association, 81, 961–962.

Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine, 26, 20–36.

Shaffer, J. P. (1974). Bidirectional unbiased procedures. Journal of the American Statistical Association, 69, 437–439.

Steiger, J. H. (2004). Beyond the F test: Effect size confidence intervals and tests of close fit in the analysis of variance and contrast analysis. Psychological Methods, 9, 164–182.

Stuart, E. A., & Green, K. M. (2008). Using full matching to estimate causal effects in nonexperimental studies: Examining the relationship between adolescent marijuana use and adult outcomes. Developmental Psychology, 44, 395–406.

VanderWeele, T. J., & Hernan, M. A. (2013). Causal inference under multiple versions of treatment. Journal of Causal Inference, 1, 1–20.

Welch, B. L. (1937). On the z-test in randomized blocks and Latin squares. Biometrika, 29, 21–52.
