Design-Based Covariate Adjustments in Paired Experiments

Abstract

In paired experiments, participants are grouped into pairs with similar characteristics, and one observation from each pair is randomly assigned to treatment. The resulting treatment and control groups should be well-balanced; however, there may still be small chance imbalances. Building on work for completely randomized experiments, we propose a design-based method to adjust for covariate imbalances in paired experiments. We leave out each pair and impute its potential outcomes using any prediction algorithm such as lasso or random forests. This method addresses a unique trade-off that exists for paired experiments. By addressing this trade-off, the method has the potential to improve precision over existing methods.

Keywords

paired experiments; covariate adjustment; causal inference

1. Introduction

In randomized controlled trials, we expect the treatment and control groups to be similar in every respect except the treatment itself. However, there will often be small imbalances in baseline covariates due to chance variation in treatment assignment, which can be addressed in multiple ways. One way is to adjust for these imbalances during the analysis, which can improve the precision of the treatment effect estimate. Alternatively, it might be possible to balance covariates through the design of the experiment. For example, in paired experiments, participants are organized into pairs prior to treatment assignment, and then one participant in each pair is randomly assigned to treatment. Ideally, the two participants in each pair would be as similar as possible.

Paired designs are commonly used when the sample size is small. For example, Pane et al. (2014) discuss a randomized trial involving schools in Texas testing the effectiveness of a computer program, the Cognitive Tutor Algebra 1 curriculum. In this trial, schools were organized into 22 pairs and then pair randomized.

While a paired design is often effective at balancing covariates between the treatment and control groups, it may still be helpful to make adjustments for remaining covariate imbalances. Similar situations can occur with other study designs; for example, covariate adjustments may be helpful in rerandomized trials (see Li & Ding, 2020). Perhaps in part because covariate balance is addressed through experimental design, covariate adjustment methods in paired experiments are relatively understudied. Covariate adjustment methods can be model-based or design-based (for a discussion, see Imai et al., 2009; Imbens, 2010). Model-based estimators have the potential to improve efficiency; however, incorrect modeling assumptions can result in bias and increased mean squared error. Design-based estimators rely only on randomization as the basis for inference, diminishing the concern of model misspecification. Hierarchical linear models (see Raudenbush & Bryk, 2002; Woltman et al., 2012) are an example of a model-based approach for blocked experiments including paired experiments. Pinheiro and Bates (2000) and Dixon (2016) note that hierarchical linear models are a common way to analyze blocked experiments. However, the use of such models requires one to make various modeling decisions, potentially raising concerns about model misspecification. For example, Dixon (2016) notes that there is some debate as to whether block effects should be modeled as fixed or random.

As noted above, covariate adjustments in paired experiments are relatively understudied, and design-based methods are even more so. Imbens and Rubin (2015) and Fogarty (2018) discuss regression-based adjustments. Imbens and Rubin work under a superpopulation model, assuming that the pairs within the experiment are drawn at random from an infinite population, and focus on the population average treatment effect. Fogarty examines the use of regression adjustments in paired experiments under a design-based framework, building on the work of Freedman (2008) and Lin (2013), who discuss regression adjustments in completely randomized experiments. More recently, covariate adjustment methods have been proposed for completely randomized and Bernoulli randomized experiments that involve the use of sample splitting and machine learning methods to impute potential outcomes. These include Aronow and Middleton (2013), Wager et al. (2016), Chernozhukov et al. (2018), E. Wu and Gagnon-Bartsch (2018), Spiess (2018), and Rothe (2018). Some of these methods can be used in more general designs including blocked experiments, for example, Aronow and Middleton (2013). However, unlike the case of regression adjustments, there is not currently an analogue to these methods which is specifically for paired experiments.

In this article, we present an analogous approach to these machine learning methods for paired experiments. The method is design-based; however, it also allows for the use of models to improve performance. We leave out each pair and impute the potential outcomes using information from the remaining observations. This imputation can be done with any prediction method such as linear regression or random forests. Regardless of the imputation method, the resulting treatment effect estimate is unbiased and randomization is the basis for inference. This flexibility has several advantages. For example, one issue when making covariate adjustments is choosing which and how many covariates to use. We can address this issue by choosing an imputation method that allows for automatic variable selection. An alternative approach is to use targeted maximum likelihood estimation, which Moore and van der Laan (2009) note allows for automatic variable selection when making covariate adjustments. Balzer, Laan, et al. (2016) and Balzer, Petersen, et al. (2016) propose the use of targeted maximum likelihood estimation in paired experiments.

Our method also addresses an issue that is specific to paired experiments, which we will call the pair inclusion trade-off. In paired experiments, the performance of a covariate adjustment method can suffer if it fails to properly account for the pair assignments. If the relationship between the covariates and outcome within pairs is the opposite of the relationship overall, that is, if a Simpson’s paradox occurs, then omitting the pair assignments will hurt precision relative to the unadjusted estimator. However, in cases where the pair assignments are not predictive of the outcome, it is better to ignore the pairing. Both Aronow and Middleton (2013) and E. Wu and Gagnon-Bartsch (2018) present versions of their methods that allow for block randomizations; however, neither of these methods directly addresses the pair inclusion trade-off. We discuss the pair inclusion trade-off further in Section 4. The framework we present allows us to address the trade-off. We impute two sets of potential outcomes, one in which we account for the pair assignments and one in which we ignore them. We then interpolate between the two sets by minimizing the cross-validated mean squared error. By addressing this trade-off, we protect against Simpson’s paradox but retain the potential for improvements in precision if the pairing is not informative.

Covariate adjustment methods have also been proposed for matched-pair cluster randomized trials. For example, Small, Ten Have, and Rosenbaum (2008) propose a design-based estimator, while Z. Wu et al. (2014) propose a method that assumes a superpopulation.

This article is organized as follows. In Section 2, we discuss the model and introduce notation. In Section 3, we present the estimator and derive a variance estimate. We discuss the pair inclusion trade-off further and present an imputation method to address it in Section 4. In Section 5, we apply the estimator to simulated data. In Section 6, we use the method to estimate the effect of the Cognitive Tutor Algebra 1 curriculum mentioned above. Section 7 concludes.

2. Background and Notation

2.1. Estimating the Average Treatment Effect

In this article, we work under the Neyman–Rubin model (see Rubin, 1974; Splawa-Neyman et al., 1990), a nonparametric model that is often used to analyze randomized experiments. Consider a randomized experiment in which there are $2 N$ individuals, indexed by $i = 1, 2, \dots, 2 N$ . We let $T_{i} = 1$ if the participant is assigned to treatment and $T_{i} = 0$ if control. Each of the $2 N$ participants has two fixed (nonrandom) potential outcomes, t_i and c_i . We observe t_i if participant i is assigned to treatment and c_i otherwise. That is, the observed outcome Y_i for participant i is

Y_{i} = T_{i} t_{i} + (1 - T_{i}) c_{i} .

We define the individual treatment effect for each participant as $t_{i} - c_{i}$ and the average treatment effect as

\bar{τ} = \frac{1}{2 N} \sum_{i = 1}^{2 N} (t_{i} - c_{i}) .

We first consider a case where the treatment assignments are not pair randomized. Suppose the T_i are independent Bernoulli random variables with probability $p = .5$ and that we wish to estimate the average treatment effect. One estimate is obtained by taking the average observed outcome of the treatment group and subtracting the average observed outcome of the control group (the “simple difference estimator”). This estimator is unbiased, conditional on both the treatment and control groups containing at least one participant. However, for each participant, suppose we observe a q-dimensional vector of baseline covariates Z_i prior to treatment assignment. It may be possible to use these covariates to improve the precision of the estimate over the simple difference estimator. For example, we could estimate the average treatment effect as

\frac{1}{2 N} \sum_{i = 1}^{2 N} \{2 (Y_{i} - {\hat{m}}_{i}) T_{i} - 2 (Y_{i} - {\hat{m}}_{i}) (1 - T_{i})\}, \qquad (1)

where ${\hat{m}}_{i}$ is a function of Z_i . Several authors have noted an estimator of this form can be used to incorporate covariate information (e.g., see Robins, 2000; Robins et al., 1994; Scharfstein et al., 1999; Tsiatis et al., 2008). Aronow and Middleton (2013) use this estimator in a design-based framework and note that if ${\hat{m}}_{i}$ is predictive of the observed outcome Y_i , then the resulting estimate will improve over the unadjusted estimator. E. Wu and Gagnon-Bartsch (2018) build on this work and suggest estimating the quantity $m_{i} = (t_{i} + c_{i}) / 2$ . In addition, Aronow and Middleton (2013) note that this estimate is unbiased if T_i and ${\hat{m}}_{i}$ are independent. One way to ensure this independence is by obtaining ${\hat{m}}_{i}$ through a sample splitting procedure. For example, one could leave out the ith observation and calculate ${\hat{m}}_{i}$ using the remaining observations. As noted by Aronow and Middleton (2013), sample splitting is especially natural in the case of block-randomized experiments, where treatment assignments in one block are independent of treatment assignments in the remaining blocks. See Wager et al. (2016), Chernozhukov et al. (2018), Spiess (2018), and Rothe (2018) for similar estimators.
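The unbiasedness property can be checked numerically on a toy example. The following is a minimal sketch, not the authors' implementation: the potential outcomes are made-up numbers, `loo_estimate` is a hypothetical helper, `n` denotes the total number of units ($2 N$ in the notation above), and the leave-one-out predictor (the mean of the other units' observed outcomes) is a deliberately crude choice assumed only for illustration. Because ${\hat{m}}_{i}$ never uses unit i's own data, averaging the estimate over all equally likely Bernoulli(0.5) assignments should recover the true average treatment effect.

```python
import itertools
import numpy as np

def loo_estimate(Y, T):
    """Adjusted estimator with p = 0.5 and a leave-one-out m_hat.

    m_hat[i] is the mean of the OTHER units' observed outcomes
    (an illustrative assumption), so it is independent of T[i].
    """
    Y = np.asarray(Y, dtype=float)
    T = np.asarray(T, dtype=int)
    n = len(Y)
    m_hat = (Y.sum() - Y) / (n - 1)   # excludes unit i's own outcome
    U = 2 * T - 1                     # +1 if treated, -1 if control
    # 2(Y - m_hat)T - 2(Y - m_hat)(1 - T) simplifies to 2(Y - m_hat)U
    return np.mean(2 * (Y - m_hat) * U)

# Hypothetical fixed potential outcomes for four units.
t = np.array([3.0, 5.0, 4.0, 8.0])
c = np.array([1.0, 4.0, 4.0, 5.0])
true_ate = np.mean(t - c)

# Enumerate every assignment vector; each is equally likely under Bernoulli(0.5).
estimates = []
for assignment in itertools.product([0, 1], repeat=len(t)):
    T = np.array(assignment)
    Y = T * t + (1 - T) * c
    estimates.append(loo_estimate(Y, T))
mean_estimate = np.mean(estimates)
```

Here `mean_estimate` agrees with `true_ate` up to floating-point error, which is the exact-unbiasedness property; a better predictor would change the variance of the estimate but not its expectation.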

2.2. Notation for Paired Experiments

We now consider the case where the participants are pair randomized. Suppose that the $2 N$ participants are organized into N pairs. We index the pairs by $i = 1, 2, \dots, N$ , each with two participants indexed by $j = 1, 2$ , and the quantities defined in subsection 2.1 are reindexed by i and j. For example, for participant j in pair i, we denote the potential outcomes as $t_{i j}$ and $c_{i j}$ and define the observed outcome, treatment indicator, and covariates as $Y_{i j}$ , $T_{i j}$ , and $Z_{i j}$ , respectively.

For each pair, one of the two participants is randomly chosen to be assigned to treatment and the other is assigned to control. That is, $T_{i 1} \sim Bern (0.5)$ and $T_{i 2} = 1 - T_{i 1}$ . The $T_{i j}$ s are not mutually independent because exactly one participant in each pair must be assigned to treatment. However, we assume the $T_{i 1}$ s are mutually independent. We can therefore essentially convert our paired experiment to a Bernoulli randomized experiment by treating each pair as an experimental unit as we describe next.

When treating each pair as a unit, we can draw direct analogues between the notation of paired and Bernoulli randomized experiments. We denote each pair’s treatment assignment by T_i , where $T_{i} = T_{i 1}$ . For each pair, we also observe a response variable W_i defined below and a $2 q$ -dimensional vector of baseline covariates $(Z_{i 1}, Z_{i 2})$ . As with a Bernoulli randomized experiment, each pair has two “potential outcomes”: We observe $a_{i} = t_{i 1} - c_{i 2}$ if $T_{i} = 1$ and $b_{i} = t_{i 2} - c_{i 1}$ if $T_{i} = 0$ . To differentiate these outcomes from those of the individual participants, we will refer to a_i and b_i as “potential differences.” We define the observed difference W_i as

W_{i} = T_{i} a_{i} + (1 - T_{i}) b_{i} .

We define the pair-level treatment effect $τ_{i}$ as

τ_{i} = \frac{(t_{i 1} - c_{i 1}) + (t_{i 2} - c_{i 2})}{2} = \frac{1}{2} (a_{i} + b_{i}),

and the average treatment effect $\bar{τ}$ as

\bar{τ} = \frac{1}{N} \sum_{i = 1}^{N} τ_{i},

which is our primary parameter of interest.

We can obtain an unbiased estimate for the average treatment effect in paired experiments by averaging the observed differences

{\hat{τ}}_{s d} = \frac{1}{N} \sum_{i = 1}^{N} W_{i} .

We will refer to this estimator as the simple difference estimator for paired experiments, as it is exactly equal to the difference in means between the treatment and control groups. However, the variance estimation of the simple difference estimator will be different under a paired design than it is in completely or Bernoulli randomized experiments. For more details, see Imai (2008), who analyzes ${\hat{τ}}_{s d}$ under the Neyman–Rubin model in a paired design.
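As a small illustration of this notation, the sketch below computes the observed differences W_i and the simple difference estimator for three made-up pairs, and confirms that the result matches the difference in group means. The function name `observed_differences` and the data are assumptions introduced here for illustration.

```python
import numpy as np

def observed_differences(Y, T):
    """Return W_i = (treated outcome) - (control outcome) for each pair.

    Y, T have shape (N, 2); exactly one entry of each row of T equals 1.
    """
    Y = np.asarray(Y, dtype=float)
    T = np.asarray(T, dtype=int)
    treated = np.sum(Y * T, axis=1)
    control = np.sum(Y * (1 - T), axis=1)
    return treated - control

# Hypothetical observed outcomes and assignments for N = 3 pairs.
Y = np.array([[5.0, 3.0], [2.0, 4.0], [7.0, 6.0]])
T = np.array([[1, 0], [0, 1], [1, 0]])

W = observed_differences(Y, T)      # pairwise treated-minus-control differences
tau_sd = W.mean()                   # simple difference estimator
diff_in_means = Y[T == 1].mean() - Y[T == 0].mean()
```

The point estimates `tau_sd` and `diff_in_means` coincide; only the variance estimation differs between designs.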

As in the case of completely or Bernoulli randomized experiments, it may be possible to use covariates to improve precision over the simple difference estimator. We propose such a covariate adjustment method for paired experiments in the next section.

3. A Design-Based Covariate Adjustment Procedure

3.1. Estimating the Average Treatment Effect

We now present an estimator that is analogous to the estimator given in Equation (1) but for paired experiments. Define the quantity

d_{i} = m_{i 1} - m_{i 2} = \frac{1}{2} (a_{i} - b_{i}),

where $m_{i j} = (t_{i j} + c_{i j}) / 2$ , and let

{\hat{τ}}_{i} = (W_{i} - {\hat{d}}_{i}) T_{i} + (W_{i} + {\hat{d}}_{i}) (1 - T_{i}),

where ${\hat{d}}_{i}$ is an estimate for d_i . This estimator differs from Equation (1) as d_i involves a difference of potential differences, while m_i in Equation (1) involves a sum of potential outcomes.

Recall that for Bernoulli randomized experiments, Equation (1) is an unbiased estimate of the average treatment effect if ${\hat{m}}_{i}$ and T_i are independent. An identical argument can be used for paired experiments to show that ${\hat{τ}}_{i}$ will be unbiased if ${\hat{d}}_{i}$ and T_i are independent.
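For concreteness, the argument can be written out in a few lines. This is a sketch under the stated independence assumption, using the shorthand $U_{i} = 2 T_{i} - 1$ (which takes the value 1 if $T_{i} = 1$ and $- 1$ if $T_{i} = 0$ , so that $E (U_{i}) = 0$ ):

```latex
\begin{aligned}
{\hat{τ}}_{i} &= (W_{i} - {\hat{d}}_{i}) T_{i} + (W_{i} + {\hat{d}}_{i}) (1 - T_{i}) = W_{i} - {\hat{d}}_{i} U_{i}, \\
E ({\hat{τ}}_{i}) &= E (W_{i}) - E ({\hat{d}}_{i}) E (U_{i}) \\
&= \frac{1}{2} (a_{i} + b_{i}) - E ({\hat{d}}_{i}) \cdot 0 = τ_{i} .
\end{aligned}
```

The factorization $E ({\hat{d}}_{i} U_{i}) = E ({\hat{d}}_{i}) E (U_{i})$ is exactly where independence of ${\hat{d}}_{i}$ and T_i is used.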

We define an estimate of the average treatment effect as

\hat{τ} = \frac{1}{N} \sum_{i = 1}^{N} {\hat{τ}}_{i} = \frac{1}{N} \sum_{i = 1}^{N} \{(W_{i} - {\hat{d}}_{i}) T_{i} + (W_{i} + {\hat{d}}_{i}) (1 - T_{i})\}, \qquad (2)

in which we estimate d_i by using a leave-one-out procedure. We will refer to this sample splitting estimator as the paired leave-one-out potential outcomes (P-LOOP) estimator. For each pair i, we drop both observations and use the remaining $N - 1$ pairs to impute a_i and b_i using any method (such as a random forest or linear regression). We then set ${\hat{d}}_{i} = \frac{1}{2} ({\hat{a}}_{i} - {\hat{b}}_{i})$ and repeat this procedure for all N pairs to obtain $\hat{τ}$ . This leave-one-out procedure ensures that the estimate will be unbiased, as ${\hat{d}}_{i}$ and T_i are independent.

To see why this estimator is generally an improvement over the simple difference estimator, we consider the baseline approach where we set ${\hat{d}}_{i} = 0$ for all i. In this case, Equation (2) will exactly equal the simple difference estimator. As we will show in subsection 3.3, the variance of Equation (2) depends directly on how well one estimates d_i . So long as one estimates the values d_i better than setting ${\hat{d}}_{i} = 0$ , the estimator will perform better than the simple difference estimator.
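The leave-one-out procedure and the baseline case above can be sketched as follows. This is an illustrative skeleton rather than the authors' software: `p_loop`, the `impute` callback signature, and the toy potential differences are all assumptions. The callback receives the index of the held-out pair and a boolean mask of the pairs it is allowed to use. The `oracle` imputer is infeasible in practice (it reads the unobserved potential differences) and is included only to show that a perfect ${\hat{d}}_{i}$ reproduces each $τ_{i}$ exactly.

```python
import numpy as np

def p_loop(W, T, impute):
    """Average of tau_hat_i, with d_hat_i obtained by leaving out pair i.

    W: observed differences, shape (N,)
    T: pair-level assignments, shape (N,)
    impute(i, keep) -> (a_hat, b_hat); should rely only on pairs where keep is True.
    """
    W = np.asarray(W, dtype=float)
    T = np.asarray(T, dtype=int)
    N = len(W)
    tau_hat_i = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i            # drop both observations of pair i
        a_hat, b_hat = impute(i, keep)
        d_hat = 0.5 * (a_hat - b_hat)
        tau_hat_i[i] = (W[i] - d_hat) * T[i] + (W[i] + d_hat) * (1 - T[i])
    return tau_hat_i.mean()

# Hypothetical potential differences and one realized assignment.
a = np.array([3.0, 1.0, 4.0, 2.0])
b = np.array([1.0, 2.0, 0.0, 2.0])
T = np.array([1, 0, 0, 1])
W = T * a + (1 - T) * b

def baseline(i, keep):
    # No covariates: impute both potential differences by the leave-one-out mean.
    w_bar = W[keep].mean()
    return w_bar, w_bar

def oracle(i, keep):
    # Infeasible reference point: the true (unobserved) potential differences.
    return a[i], b[i]

tau_baseline = p_loop(W, T, baseline)   # d_hat = 0, so this is the simple difference estimator
tau_oracle = p_loop(W, T, oracle)       # recovers the true average treatment effect
```

Any prediction method can be dropped in as `impute`, provided it does not touch the held-out pair's outcomes or assignment.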

3.2. Asymptotic Normality

In this section, we demonstrate that Equation (2) is asymptotically normally distributed under certain regularity conditions. Consider an infinite sequence of pairs $i = 1, 2, 3, \dots$ . As before, the potential outcomes and covariates for all pairs are fixed quantities. For a given sample size N, we observe the first N pairs in the sequence, and we will consider the behavior of Equation (2) as N increases.

We first define some additional notation. Let $U_{i} = 2 T_{i} - 1$ (i.e., $U_{i} = 1$ if $T_{i} = 1$ and $- 1$ if $T_{i} = 0$ ) and note that U_i has expectation 0. For a given sample size N, let ${\hat{d}}_{i}^{(N)}$ be the estimate for d_i as calculated using the remaining $N - 1$ pairs in the sample and define the quantities $d_{0 i}^{(N)} = E ({\hat{d}}_{i}^{(N)})$ and ${\tilde{d}}_{i}^{(N)} = {\hat{d}}_{i}^{(N)} - d_{0 i}^{(N)}$ . For simplicity, we will often suppress the superscript $(N)$ within an equation.

For some intuition as to why Equation (2) converges to a normal distribution, consider the following decomposition:

\hat{τ} = \frac{1}{N} \sum_{i = 1}^{N} (W_{i} - {\hat{d}}_{i} U_{i}) = \frac{1}{N} \sum_{i = 1}^{N} (W_{i} - d_{0 i} U_{i}) - \frac{1}{N} \sum_{i = 1}^{N} {\tilde{d}}_{i} U_{i} .

The $W_{i} - d_{0 i} U_{i}$ terms are independent random variables and are generally of order 1 (i.e., they will not shrink to 0 as N increases). So long as none of the variances of the $W_{i} - d_{0 i} U_{i}$ dominate the variances of the remaining terms, we would expect the appropriately centered and scaled mean of these terms to converge to a normal distribution. We generally expect ${\tilde{d}}_{i}$ to shrink to 0 as the sample size increases. We might therefore expect the ratio

\frac{\sum_{i = 1}^{N} {\tilde{d}}_{i} U_{i}}{\sum_{i = 1}^{N} (W_{i} - d_{0 i} U_{i})}

to converge to 0, in which case the average of the “remainder terms” $\sum_{i = 1}^{N} {\tilde{d}}_{i} U_{i} / N$ would be asymptotically negligible. It would follow that $\hat{τ}$ is asymptotically normally distributed.

In order for this intuition to hold, the data and the imputation method used must be sufficiently well-behaved. We next present general assumptions regarding their behavior. Note that we do not necessarily prove that these assumptions hold for any specific prediction algorithm.

Assumption 1: There exists some $0 < C < \infty$ and $q > 0$ such that for all i,

Var ({\tilde{d}}_{i}) = Var ({\hat{d}}_{i}) \leq C / N^{q} .

That is, as we observe more units, the variation of ${\hat{d}}_{i}$ across randomizations will shrink to 0. For example, suppose we were imputing the potential outcomes using OLS. Under appropriate regularity conditions, such as those in Freedman (2008), we would have that for any fixed i, $Var ({\hat{d}}_{i})$ goes to 0 at a rate $1 / N$ , and in addition, if we assume the covariates are bounded, then a uniform bound C on $N \times Var ({\hat{d}}_{i})$ would follow. Note that it should be possible to relax this condition and allow C to vary across pairs (see E. Wu & Gagnon-Bartsch, 2018, for an analogous condition), but we assume a fixed C here for simplicity.

Assumption 2: Let $ρ_{i j}$ be the correlation of ${\tilde{d}}_{i} U_{i}$ and ${\tilde{d}}_{j} U_{j}$ , and $\bar{ρ} = \frac{\sum_{i \neq j} ρ_{i j}}{N (N - 1)}$ . We assume that

N^{1 - q} \bar{ρ} \to 0.

In the case where $q = 1$ , the average correlation would only need to go to 0 at any rate for Assumption 2 to hold. We would expect the correlation between ${\tilde{d}}_{i} U_{i}$ and ${\tilde{d}}_{j} U_{j}$ to be weak, as the only dependence between these terms comes from the inclusion of U_i in ${\tilde{d}}_{j}$ (and U_j in ${\tilde{d}}_{i}$ ). That is, if U_i and ${\tilde{d}}_{j}$ were independent, then we would have

Cov ({\tilde{d}}_{i} U_{i}, {\tilde{d}}_{j} U_{j}) = E ({\tilde{d}}_{i} U_{i} {\tilde{d}}_{j} U_{j}) = E ({\tilde{d}}_{i} {\tilde{d}}_{j} U_{j}) E (U_{i}) = 0.

Moreover, the influence of pair j on ${\tilde{d}}_{i}$ (and hence the correlation) should decrease as the number of pairs increases. Even if ${\tilde{d}}_{i}$ and ${\tilde{d}}_{j}$ are themselves highly correlated, we would expect the correlation between ${\tilde{d}}_{i} U_{i}$ and ${\tilde{d}}_{j} U_{j}$ to be weak. As an extreme example, suppose ${\tilde{d}}_{i} = {\tilde{d}}_{j}$ exactly. Then,

Cov ({\tilde{d}}_{i} U_{i}, {\tilde{d}}_{j} U_{j}) = E ({\tilde{d}}_{i} U_{i} {\tilde{d}}_{j} U_{j}) = E ({\tilde{d}}_{i}^{2} U_{i} U_{j}) = E ({\tilde{d}}_{i}^{2} U_{i}) E (U_{j}) = 0.

Assumption 3: Recall that $d_{0 i}^{(N)} = E ({\hat{d}}_{i}^{(N)})$ for some fixed N. For each pair i, we assume that the limit of $d_{0 i}^{(N)}$ exists and denote the limit as $d_{\infty i}$ . We also assume

\frac{1}{N} \sum_{i = 1}^{N} {(d_{0 i}^{(N)} - d_{\infty i})}^{2} \to 0.

In other words, the expected value of each ${\hat{d}}_{i}$ converges pointwise to some limit, and the $d_{0 i}$ also converge to these limits in mean square across pairs. This does not necessarily mean that $d_{0 i}$ will converge to the true value of d_i . In most cases, we would expect that the imputation method will not be able to perfectly estimate d_i on average, and we characterize this in the next assumption.

Assumption 4: Let $V_{N} = \sum_{i = 1}^{N} {(d_{i} - d_{\infty i})}^{2}$ . There exists $0 < K < \infty$ such that

\frac{V_{N}}{N} \to K,

and

\max_{i = 1, \dots, N} \frac{{(d_{i} - d_{\infty i})}^{2}}{V_{N}} \to 0.

That is, the mean squared error of the imputation method converges to a value K. In addition, no single term of the mean squared error dominates the remaining terms.

When Assumptions 1 through 4 hold, $N (\hat{τ} - \bar{τ}) / \sqrt{V_{N}}$ converges in distribution to a standard normal random variable. For a proof, see Appendix 1 in the online version of the journal.

3.3. Variance

We now estimate the variance of Equation (2). Let ${\hat{W}}_{i} = {\hat{a}}_{i} T_{i} + {\hat{b}}_{i} (1 - T_{i})$ , and define the mean squared errors of ${\hat{d}}_{i}$ and ${\hat{W}}_{i}$ as $MSE ({\hat{d}}_{i}) = E \{(d_{i} - {\hat{d}}_{i})^{2}\}$ and $MSE ({\hat{W}}_{i}) = E \{(W_{i} - {\hat{W}}_{i})^{2}\}$ . In online Appendix 2, we show

Var ({\hat{τ}}_{i}) = MSE ({\hat{d}}_{i}),

and thus that the variance is

Var (\hat{τ}) = \frac{1}{N^{2}} \{\sum_{i = 1}^{N} MSE ({\hat{d}}_{i}) + \sum_{i \neq j} γ_{i j}\},

where $γ_{i j} = Cov ({\hat{τ}}_{i}, {\hat{τ}}_{j})$ . In online Appendix 3, we show that

\frac{\sum_{i \neq j} γ_{i j}}{\sum_{i = 1}^{N} MSE ({\hat{d}}_{i})} \to 0,

under the conditions outlined in subsection 3.2. Because $\sum_{i \neq j} γ_{i j}$ is negligible relative to $\sum_{i = 1}^{N} MSE ({\hat{d}}_{i})$ , we suggest that the variance be estimated without the covariance terms in practice. For this reason, we focus on estimating $MSE ({\hat{d}}_{i})$ .
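One way to see why $Var ({\hat{τ}}_{i}) = MSE ({\hat{d}}_{i})$ is the short calculation below. It is offered as a sketch consistent with the definitions in this article and may differ from the argument given in the online appendix. It uses $U_{i} = 2 T_{i} - 1$ , the fact that $U_{i}^{2} = 1$ , and the unbiasedness of ${\hat{τ}}_{i}$ :

```latex
\begin{aligned}
a_{i} &= τ_{i} + d_{i}, \qquad b_{i} = τ_{i} - d_{i}, \qquad W_{i} = τ_{i} + d_{i} U_{i}, \\
{\hat{τ}}_{i} - τ_{i} &= W_{i} - {\hat{d}}_{i} U_{i} - τ_{i} = (d_{i} - {\hat{d}}_{i}) U_{i}, \\
Var ({\hat{τ}}_{i}) &= E \{ {(d_{i} - {\hat{d}}_{i})}^{2} U_{i}^{2} \} = E \{ {(d_{i} - {\hat{d}}_{i})}^{2} \} = MSE ({\hat{d}}_{i}) .
\end{aligned}
```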

In online Appendix 4, we show that the mean squared error of ${\hat{d}}_{i}$ is no greater than the mean squared error of ${\hat{W}}_{i}$ and thus that

\frac{1}{N^{2}} \sum_{i = 1}^{N} MSE ({\hat{d}}_{i}) \leq \frac{1}{N^{2}} \sum_{i = 1}^{N} MSE ({\hat{W}}_{i}) .

We can obtain an unbiased estimate for this upper bound, which we use to estimate the variance of $\hat{τ}$ :

\hat{Var} (\hat{τ}) = \frac{1}{N^{2}} \sum_{i = 1}^{N} {(W_{i} - {\hat{W}}_{i})}^{2} .

To compare this variance estimator to the variance estimator for the simple difference estimator, consider a special case where we estimate the average treatment effect without using covariates. In the absence of any covariate information, it would be logical to set ${\hat{a}}_{i} = {\hat{b}}_{i} = {\bar{W}}^{(- i)}$ , where ${\bar{W}}^{(- i)} = \sum_{j \neq i} W_{j} / (N - 1)$ . In this baseline approach, the P-LOOP estimator would exactly equal the simple difference estimator as ${\hat{d}}_{i} = 0.5 ({\bar{W}}^{(- i)} - {\bar{W}}^{(- i)}) = 0$ for all i. In addition, we show in online Appendix 5 that the variance estimate for the P-LOOP estimator would equal

\frac{1}{{(N - 1)}^{2}} \sum_{i = 1}^{N} {(W_{i} - {\hat{τ}}_{s d})}^{2},

which is equal to $N / (N - 1)$ times the standard variance estimate in a paired t test (e.g., see Imai, 2008).
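The baseline identity above is easy to confirm numerically. In the sketch below, `p_loop_variance` is a hypothetical helper implementing the variance estimate $\frac{1}{N^{2}} \sum_{i} {(W_{i} - {\hat{W}}_{i})}^{2}$ , and the observed differences are made-up values.

```python
import numpy as np

def p_loop_variance(W, W_hat):
    """Variance estimate: sum of squared prediction errors for W, divided by N^2."""
    W = np.asarray(W, dtype=float)
    W_hat = np.asarray(W_hat, dtype=float)
    N = len(W)
    return np.sum((W - W_hat) ** 2) / N ** 2

# Hypothetical observed differences for N = 5 pairs.
W = np.array([2.0, 2.0, 1.0, 4.0, -1.0])
N = len(W)

# Baseline imputation: a_hat_i = b_hat_i = leave-one-out mean, so W_hat_i equals that mean.
W_loo = (W.sum() - W) / (N - 1)
var_ploop = p_loop_variance(W, W_loo)

# Closed form stated in the text, and the usual paired t test variance estimate.
var_closed = np.sum((W - W.mean()) ** 2) / (N - 1) ** 2
var_ttest = np.sum((W - W.mean()) ** 2) / (N * (N - 1))
```

With these inputs, `var_ploop` matches `var_closed`, and both equal `N / (N - 1)` times `var_ttest`.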

Importantly, because $Var ({\hat{τ}}_{i}) = MSE ({\hat{d}}_{i})$ , the performance of the estimator depends directly on how well we estimate d_i . If we improve the estimate of d_i over setting ${\hat{a}}_{i} = {\hat{b}}_{i} = {\bar{W}}^{(- i)}$ , we will be able to improve precision relative to the simple difference estimator. However, improving the estimate of d_i is not necessarily trivial. Because we are interested in estimating the difference between $m_{i 1}$ and $m_{i 2}$ , it does not suffice to reduce the mean squared error for the imputed potential outcomes as in the estimator of E. Wu and Gagnon-Bartsch (2018). For example, it is possible to obtain estimates of the potential outcomes (the ts and cs) that are reasonably close to the true values while having ${\hat{d}}_{i}$ of the incorrect sign. On the other hand, we could have estimates for the potential outcomes that are far from the true values that result in ${\hat{d}}_{i}$ being close to the true d_i . We discuss imputation methods to address this concern in the next section.

4. Imputation Methods of Potential Differences in Paired Experiments

4.1. The Pair Inclusion Trade-Off

We next present an imputation method to address the pair inclusion trade-off discussed in Section 1. We first discuss this trade-off further and then propose a method for addressing the trade-off within the P-LOOP estimator. The pair inclusion trade-off is perhaps easiest to understand in the context of a linear model rather than the Neyman–Rubin model. Consider the following standard linear regression model

Y = α + T τ + P β + Z γ + ε,

where Y is the observed outcome, T is the treatment assignment vector, Z is a covariate, and P is a $2 N \times (N - 1)$ matrix of indicator variables that encodes the pair assignments. Suppose that there are pair effects (i.e., $β \neq 0$ ) and that Z is correlated with both P and T. If we were to omit P and regress Y onto T and Z, then we would bias the estimate of $τ$ . On the other hand, suppose that the pairing is not informative ( $β = 0$ ). In this case, including P in the regression would inflate the variance for $\hat{τ}$ , and it would be preferable to omit P from the regression.

Several authors have compared the variance of the simple difference estimator for completely and pair randomized designs under the Neyman–Rubin model (e.g., see Imai, 2008; Pashley & Miratrix, 2017). The difference in variance under these designs can be either positive or negative. Similarly, it may be possible to reduce the variance of our estimate when making covariate adjustments by ignoring the pair assignments. However, Imai (2008) cautions against analyzing paired experiments as if they were completely randomized, noting that this can result in biased confidence intervals and hypothesis tests. Fortunately, this is not an issue with the P-LOOP estimator, as we always account for the paired design. We always drop both observations in each pair when estimating d_i , and the decision to ignore or include the paired structure for the remaining observations only affects the adjustment term ${\hat{d}}_{i}$ .

When we discuss the inclusion or exclusion of the paired structure when imputing potential outcomes, we refer specifically to how we treat the remaining pairs when building a prediction model. If we ignore the paired structure when imputing potential outcomes, this means we fit a model to the remaining observations as individual units. If we include the paired structure when imputing potential outcomes, this means we fit a model to the remaining observations, treating each pair as a unit. Regardless of which approach we choose, the estimator remains design-based. For a given pair i, we always leave out both observations, and we wish to use the remaining observations such that we obtain the best estimate for d_i .

Suppose we ignore the paired structure of the data when we train our imputation model for the potential outcomes. In this case, we model the relationship between the covariates and the outcome overall rather than the relationship within pairs. However, if the relationship between the covariates and outcome within pairs is sufficiently different from the relationship overall, we could obtain a ${\hat{d}}_{i}$ that is far from the truth. One situation where this could happen is when a Simpson’s paradox occurs, and the relationship between the covariates and outcome within pairs is the opposite of the overall relationship.

Consider a hypothetical experiment in which a blood pressure medication is being tested on pairs of twins, and each pair belongs to either Ethnicity A or Ethnicity B. For each participant, we record a single covariate, an indicator for the presence of a genetic mutation. On average, participants with this mutation have blood pressure that is 5 units lower. Suppose this mutation is common in Ethnicity A and rare in Ethnicity B. However, for reasons unrelated to the mutation, Ethnicity A has a baseline blood pressure that is on average 10 units higher than the baseline for Ethnicity B. In this case, the presence of the mutation would be associated with higher blood pressure as Ethnicity A is more likely to have the mutation and also has a higher baseline blood pressure. However, within pairs, the presence of the mutation will be associated with lower blood pressure. If we ignore pair assignments when estimating d_i , we would infer that the presence of the mutation is associated with a higher value of blood pressure. For a given pair, we would want the presence of the mutation to predict lower blood pressure. Thus, the prediction of the difference $d_{i} = m_{i 1} - m_{i 2}$ would be of the wrong sign, resulting in poorer performance relative to the simple difference estimator. On the other hand, if the paired structure is not predictive of the outcome, then it would be better to omit the pair assignments when imputing the potential differences.
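The sign reversal in this example can be reproduced with a few made-up numbers. The sketch below is purely illustrative: the baseline levels (130 and 120), the number of pairs, and the pattern of mutations are assumptions chosen to match the verbal description, and the outcomes are noiseless. It compares the slope from a pooled regression that ignores pairs with the slope implied by within-pair differences.

```python
import numpy as np

# Rows are pairs; columns are the two units in each pair.
# First four pairs: Ethnicity A (mutation common); last four: Ethnicity B (mutation rare).
mutation = np.array([[1, 1], [1, 1], [1, 0], [1, 1],
                     [0, 0], [0, 0], [0, 1], [0, 0]])
baseline = np.array([130.0] * 4 + [120.0] * 4)   # hypothetical group baselines, A is 10 higher

# Noiseless blood pressure: group baseline minus 5 units if the mutation is present.
pressure = baseline[:, None] - 5.0 * mutation

# Pooled slope: regress pressure on mutation, treating all 16 individuals as units.
pooled_slope = np.polyfit(mutation.ravel(), pressure.ravel(), 1)[0]

# Within-pair slope: regress within-pair outcome differences on covariate differences
# (no intercept); only discordant pairs contribute.
dz = mutation[:, 0] - mutation[:, 1]
dy = pressure[:, 0] - pressure[:, 1]
within_slope = np.sum(dz * dy) / np.sum(dz ** 2)
```

In this toy data set the pooled slope is positive while the within-pair slope is negative, so a model that ignores pairs would predict the difference between the two units of a discordant pair with the wrong sign.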

It can be unclear whether we should account for the pair assignments when imputing the potential differences. To avoid data snooping, we propose an imputation method in the rest of this section that automatically addresses the trade-off. We first propose methods for calculating ${\hat{a}}_{i}$ and ${\hat{b}}_{i}$ that do and do not account for the pair assignments in the prediction model, producing two sets of potential differences. Having produced two estimates for each a_i and b_i , we propose a method to automatically interpolate between them.

4.2. Estimating d_i When Pairs Are Not Predictive: Impute Potential Outcomes Separately

We first estimate d_i without accounting for the pair assignments for the observations outside of pair i. To do this, we drop both observations in pair i, then fit a model on the individual observations for the remaining pairs and separately impute all four potential outcomes (i.e., $t_{i 1}, c_{i 1}, t_{i 2},$ and $c_{i 2}$ ) for pair i. Although we ignore pair assignments for the observations outside of pair i, we must drop both observations in the pair when estimating d_i to ensure that the treatment effect estimate is unbiased.

More specifically, for each pair i, we drop both observations in the pair. We then fit a prediction algorithm on the remaining observations, ignoring the pair assignments and treating each individual as a unit. For example, we could regress $Y_{k j}$ onto $T_{k j}$ and $Z_{k j}$ for $k \neq i$ . We then use this model to impute $t_{i 1}, c_{i 1}, t_{i 2},$ and $c_{i 2}$ . To obtain ${\hat{t}}_{i 1}$ , we would plug in the covariates for the first observation in pair i and a treatment indicator of 1. We would obtain estimates for the remaining potential outcomes similarly and set

{\hat{d}}_{i} = \frac{1}{2} ({\hat{t}}_{i 1} + {\hat{c}}_{i 1}) - \frac{1}{2} ({\hat{t}}_{i 2} + {\hat{c}}_{i 2}) .
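The procedure in this subsection might be sketched as follows, using ordinary least squares as the prediction algorithm. The function name `impute_d_outcomes`, the array layout, and the toy data are assumptions made for illustration; any other prediction algorithm could take the place of the regression.

```python
import numpy as np

def impute_d_outcomes(y, t, z):
    """Leave-one-pair-out OLS imputation of d_i, ignoring the pair structure.

    y, t, z: arrays of shape (N, 2) holding the observed outcome, treatment
    indicator, and a single covariate for the two units in each pair.
    """
    n_pairs = y.shape[0]
    d_hat = np.empty(n_pairs)
    for i in range(n_pairs):
        keep = np.arange(n_pairs) != i          # drop both units of pair i
        y_fit = y[keep].ravel()
        x_fit = np.column_stack(
            [np.ones(y_fit.size), t[keep].ravel(), z[keep].ravel()]
        )
        beta, *_ = np.linalg.lstsq(x_fit, y_fit, rcond=None)

        def predict(treat, cov):
            return beta[0] + beta[1] * treat + beta[2] * cov

        t1, c1 = predict(1, z[i, 0]), predict(0, z[i, 0])   # unit 1 under T and C
        t2, c2 = predict(1, z[i, 1]), predict(0, z[i, 1])   # unit 2 under T and C
        d_hat[i] = 0.5 * (t1 + c1) - 0.5 * (t2 + c2)
    return d_hat

# Toy data (hypothetical): 30 pairs, one covariate, noiseless linear outcomes.
rng = np.random.default_rng(1)
z = rng.normal(size=(30, 2))
first_treated = rng.integers(0, 2, size=30)
t = np.column_stack([first_treated, 1 - first_treated])
y = 1 + 2 * t + 3 * z
d_hat = impute_d_outcomes(y, t, z)
```

With an additive linear model such as this one, the treatment coefficient cancels in the formula for the imputed difference, so the result depends only on the covariate coefficient and the within-pair covariate difference.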

4.3. Estimating d_i When Pairs Are Predictive: Impute Potential Differences Directly

Next, we propose a method that accounts for pair assignments when estimating d_i . Rather than imputing the potential outcomes ( $t_{i 1}, c_{i 1}, t_{i 2},$ and $c_{i 2}$ ), we impute a_i and b_i directly, treating each pair as an observational unit. Recall from Section 3 that a_i and b_i are analogous to the potential outcomes in an experiment with Bernoulli randomization. We can therefore apply a procedure to the paired units which is similar to the leave-one-out procedure described earlier for estimating m_i in Equation (1). For Bernoulli experiments, we would only use the control units when imputing c_i and the treatment units when imputing t_i . However, for paired experiments, a_i and b_i are determined by which unit is arbitrarily labeled $j = 1$ and are therefore effectively interchangeable. As an example, for the ith pair, we have $a_{i} = t_{i 1} - c_{i 2}$ . However, if we had instead recorded the second unit in the pair first, then the values of a_i and b_i would be switched and a_i would be $t_{i 2} - c_{i 1}$ . We can take advantage of this fact to use all observations (except those in pair i) when imputing each potential difference.

When treating the pairs as units, we have $2 q$ covariates rather than q covariates for each unit. We start by leaving out pair i. We then wish to impute a_i and b_i using the $2 q$ covariates for the remaining pairs. One way to do this would be to simply concatenate the covariate vectors for the two observations in each pair. In this case, we define $Z_{i}^{a}$ as the vector of covariates where the covariates for the treated units come first. That is, $Z_{i}^{a} = (Z_{i 1}, Z_{i 2})$ if $T_{i} = 1$ and $Z_{i}^{a} = (Z_{i 2}, Z_{i 1})$ if $T_{i} = 0$ . For example, suppose $Z_{i 1} = (1, 2)$ and $Z_{i 2} = (3, 4)$ . Then $Z_{i}^{a}$ would be $(3, 4, 1, 2)$ if $T_{i} = 0$ , and $(1, 2, 3, 4)$ if $T_{i} = 1$ . In other words, Z_i is the concatenated vector of covariates as it is ordered in the original data, while $Z_{i}^{a}$ is the concatenated vector where the covariates for the treated unit come first.

Alternatively, we may wish to transform the covariates in some way; for example, we could take the means and differences of the covariates. This is similar to the approaches used by Imbens and Rubin (2015) and Fogarty (2018). In this case, define Z_i as

(\frac{Z_{i 1} + Z_{i 2}}{2}, Z_{i 1} - Z_{i 2}) .

That is, Z_i is the vector where the first q entries are the averages of each covariate for the pair, and the second q entries are the differences (Observation 1 − Observation 2). In analogy to the concatenation example, we define $Z_{i}^{a}$ to be the means and the treatment minus control differences.
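As a small illustration of the two covariate constructions, the sketch below builds both Z_i and $Z_{i}^{a}$ for a single pair. The function name `pair_covariates` and its `transform` argument are hypothetical, introduced only for this example; the input values are the ones used in the concatenation example above.

```python
import numpy as np

def pair_covariates(z1, z2, t_i, transform="concat"):
    """Return (Z_i, Z_i^a) for one pair, where t_i = 1 if unit 1 is treated."""
    z1, z2 = np.asarray(z1, float), np.asarray(z2, float)
    zt, zc = (z1, z2) if t_i == 1 else (z2, z1)    # treated / control covariates
    if transform == "concat":
        # Original order, then treated-first order.
        return np.concatenate([z1, z2]), np.concatenate([zt, zc])
    if transform == "mean_diff":
        mean = (z1 + z2) / 2
        # Means with (unit 1 - unit 2), then means with (treated - control).
        return np.concatenate([mean, z1 - z2]), np.concatenate([mean, zt - zc])
    raise ValueError(f"unknown transform: {transform}")

_, za_t0 = pair_covariates([1, 2], [3, 4], t_i=0)
_, za_t1 = pair_covariates([1, 2], [3, 4], t_i=1)
z_md, za_md = pair_covariates([1, 2], [3, 4], t_i=0, transform="mean_diff")
print(za_t0, za_t1, z_md, za_md)
```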

We can now estimate d_i using these combined covariates and the observed differences. After leaving out pair i, we impute a_i by creating a model using the observed outcomes W_k (for $k \neq i$ ) as our response variable and the covariates $Z_{k}^{a}$ as our predictors. This model incorporates all of the remaining $N - 1$ pairs and predicts the value of a for a given set of covariates. We plug the covariates Z_i into this model to obtain ${\hat{a}}_{i}$ . The same model can be used to impute b_i . If we had labeled the second participant in the pair as the first participant, then a_i and b_i would be reversed. We therefore use the same model to impute b_i but reverse the order of the covariates for pair i. In the concatenation example, we would plug $(Z_{i 2}, Z_{i 1})$ into the model instead of $(Z_{i 1}, Z_{i 2})$ . In the transformation example, we would plug

(\frac{Z_{i 1} + Z_{i 2}}{2}, Z_{i 2} - Z_{i 1}),

into the model. Having obtained estimates ${\hat{a}}_{i}$ and ${\hat{b}}_{i}$ , we set

{\hat{d}}_{i} = \frac{1}{2} ({\hat{a}}_{i} - {\hat{b}}_{i}) .
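A minimal sketch of this subsection's procedure, using the concatenated covariates and ordinary least squares, is given below. The function name `impute_d_differences`, the array layout, and the toy data are assumptions for illustration, and a single covariate per unit is assumed for brevity.

```python
import numpy as np

def impute_d_differences(y, t_first, z):
    """Leave-one-pair-out OLS imputation of a_i, b_i, and d_i = (a_i - b_i) / 2.

    y, z: arrays of shape (N, 2); t_first[i] = 1 if unit 1 of pair i is treated.
    """
    n_pairs = y.shape[0]
    first = np.asarray(t_first) == 1
    # Observed treated-minus-control differences W_k and covariates Z_k^a
    # (treated unit first).
    w = np.where(first, y[:, 0] - y[:, 1], y[:, 1] - y[:, 0])
    z_treat = np.where(first, z[:, 0], z[:, 1])
    z_ctrl = np.where(first, z[:, 1], z[:, 0])
    x = np.column_stack([np.ones(n_pairs), z_treat, z_ctrl])

    a_hat, b_hat = np.empty(n_pairs), np.empty(n_pairs)
    for i in range(n_pairs):
        keep = np.arange(n_pairs) != i             # leave out pair i
        beta, *_ = np.linalg.lstsq(x[keep], w[keep], rcond=None)
        a_hat[i] = beta @ np.array([1.0, z[i, 0], z[i, 1]])   # unit 1 treated
        b_hat[i] = beta @ np.array([1.0, z[i, 1], z[i, 0]])   # order reversed
    return a_hat, b_hat, 0.5 * (a_hat - b_hat)

# Toy data (hypothetical): 30 pairs with noiseless linear outcomes.
rng = np.random.default_rng(1)
z = rng.normal(size=(30, 2))
t_first = rng.integers(0, 2, size=30)
t = np.column_stack([t_first, 1 - t_first])
y = 1 + 2 * t + 3 * z
a_hat, b_hat, d_hat = impute_d_differences(y, t_first, z)
```

Because the same fitted model is used for both potential differences, every remaining pair contributes to the imputation of a_i and of b_i; only the order of the covariates for pair i changes.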

4.4. Interpolating Between Imputation Methods

We have proposed two methods for imputing potential outcomes. However, we often do not know ahead of time which method will perform better. We therefore adaptively interpolate between the two methods.

For each pair i, we have two estimates of a_i obtained using the two imputation methods described above. We refer to these estimates as ${\hat{a}}_{i}^{(1)}$ and ${\hat{a}}_{i}^{(2)}$ . We wish to obtain the value $α_{i}$ that minimizes the distance between a_i and the interpolation ${\hat{a}}_{i} = α_{i} {\hat{a}}_{i}^{(1)} + (1 - α_{i}) {\hat{a}}_{i}^{(2)}$ . However, we want ${\hat{a}}_{i}$ to be independent of T_i . We therefore use a leave-one-out procedure to calculate $α_{i}$ . For each i, we leave out pair i and set $α_{i}$ to the value that minimizes the mean squared error for the remaining observations. In other words, we have

α_{i} = \underset{x \in [0, 1]}{\operatorname{argmin}} \sum_{k \in A \setminus i} {\{a_{k} - (x {\hat{a}}_{k}^{(1)} + (1 - x) {\hat{a}}_{k}^{(2)})\}}^{2} .

Taking the derivative with respect to x and setting equal to 0, we have

α_{i} = \frac{\sum_{k \in A \setminus i} (a_{k} - {\hat{a}}_{k}^{(2)}) ({\hat{a}}_{k}^{(1)} - {\hat{a}}_{k}^{(2)})}{\sum_{k \in A \setminus i} {({\hat{a}}_{k}^{(1)} - {\hat{a}}_{k}^{(2)})}^{2}},

which we then restrict to be in the interval $[0, 1]$ . We then set our final estimate of a_i to be ${\hat{a}}_{i} = α_{i} {\hat{a}}_{i}^{(1)} + (1 - α_{i}) {\hat{a}}_{i}^{(2)}$ . We use a similar procedure for ${\hat{b}}_{i}$ .
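The closed-form weight and the restriction to $[0, 1]$ could be computed along the following lines. This sketch assumes that A denotes the set of pairs for which a_k is observed; the function name `loo_alpha`, the mask argument `in_a`, and the fallback value of 0.5 when the denominator is zero are choices made here for illustration, not part of the article.

```python
import numpy as np

def loo_alpha(a_obs, a1_hat, a2_hat, in_a):
    """Leave-one-out interpolation weights alpha_i, restricted to [0, 1].

    a_obs:  observed a_k (used only where in_a is True)
    a1_hat: first set of imputations of a_k
    a2_hat: second set of imputations of a_k
    in_a:   boolean mask, True where a_k is observed (assumed meaning of A)
    """
    diff = a1_hat - a2_hat
    num_terms = np.where(in_a, (a_obs - a2_hat) * diff, 0.0)
    den_terms = np.where(in_a, diff ** 2, 0.0)
    # Sums over A without pair i: subtract pair i's own term from the total.
    num = num_terms.sum() - num_terms
    den = den_terms.sum() - den_terms
    # Fallback of 0.5 when the denominator is zero (an arbitrary choice).
    alpha = np.divide(num, den, out=np.full(num.shape, 0.5), where=den > 0)
    return np.clip(alpha, 0.0, 1.0)

# Toy check (hypothetical data): the observed values are a 30/70 blend.
rng = np.random.default_rng(2)
a1_hat = rng.normal(size=20)
a2_hat = rng.normal(size=20)
in_a = np.arange(20) % 2 == 0
a_obs = np.where(in_a, 0.3 * a1_hat + 0.7 * a2_hat, np.nan)
alpha = loo_alpha(a_obs, a1_hat, a2_hat, in_a)
a_hat = alpha * a1_hat + (1 - alpha) * a2_hat
```

Since pair i's own term is removed from both sums, the weight for pair i does not use that pair's observed difference.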

5. Simulation Results

We present two simulations in the next two subsections. The first simulation illustrates the pair inclusion trade-off, while the second considers a scenario with a nonlinear relationship between the covariate and potential outcomes. In both cases, we compare the performance of P-LOOP with the simple difference estimator and the estimators discussed in Fogarty (2018), which we will refer to as Regression 1 and Regression 2. Regression 1 involves the treatment minus control outcomes regressed onto the treatment minus control covariates, while Regression 2 is the same regression with the addition of the mean of the covariates in each pair. For P-LOOP, recall from earlier that we are excluding the pair assignments in our imputation method if we impute the potential outcomes ( $t_{i 1}, c_{i 1}, t_{i 2},$ and $c_{i 2}$ ) separately, while we are including the pair assignments if we impute the potential differences (a_i and b_i ) directly. We show results using each of these imputation strategies as well as the interpolation method. We use both random forests and ordinary least squares as prediction methods.

For each of the scenarios described below, we generate a single set of potential outcomes. Next, we generate 10,000 treatment assignment vectors. For each of these, we obtain a treatment effect estimate and the nominal variance (i.e., the estimated variance) using each estimator. This results in 10,000 point estimates and 10,000 variance estimates for each method, which we can use to estimate the true variance and the expectation of nominal variance for that method. We estimate the true variance as the variance of the 10,000 point estimates and the expectation of the nominal variance as the mean of the 10,000 nominal variances.
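The Monte Carlo procedure can be sketched for the simple difference estimator as follows. The function name, the toy potential outcomes, and the reduced number of replications are assumptions for illustration; the nominal variance used here is the usual sample variance of the observed differences divided by N, which is assumed rather than taken from the article.

```python
import numpy as np

def simulate_simple_difference(t_po, c_po, n_reps=2000, seed=0):
    """Randomization distribution of the simple difference estimator.

    t_po, c_po: fixed potential outcomes under treatment and control, shape (N, 2).
    Returns (true_var, expected_nominal_var, mean_estimate).
    """
    rng = np.random.default_rng(seed)
    n_pairs = t_po.shape[0]
    a = t_po[:, 0] - c_po[:, 1]          # observed difference if unit 1 is treated
    b = t_po[:, 1] - c_po[:, 0]          # observed difference if unit 2 is treated
    estimates = np.empty(n_reps)
    nominal = np.empty(n_reps)
    for r in range(n_reps):
        first_treated = rng.integers(0, 2, size=n_pairs) == 1
        w = np.where(first_treated, a, b)         # observed differences W_i
        estimates[r] = w.mean()                   # point estimate
        nominal[r] = w.var(ddof=1) / n_pairs      # assumed nominal variance
    return estimates.var(ddof=1), nominal.mean(), estimates.mean()

# Toy potential outcomes (hypothetical), generated once and then held fixed.
rng = np.random.default_rng(3)
z = rng.uniform(0, 10, size=(50, 1)) + rng.normal(size=(50, 2))
c_po = 80 + 5 * z + rng.normal(0, 2, size=(50, 2))
t_po = c_po - 10                                  # constant effect of -10
true_var, exp_nom, mean_est = simulate_simple_difference(t_po, c_po)
```

Only the treatment assignment is redrawn in each replication, so the variation across replications reflects the randomization alone.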

5.1. The Pair Inclusion Trade-Off

We consider a hypothetical experiment based on the scenario described in subsection 4.1, where we are interested in the effect of a blood pressure medication. We generate $N = 50$ pairs of twins, half of which are of ethnicity $E_{i} = 0$ and the other half $E_{i} = 1$ . We randomly assign one participant in each pair to treatment and assign the other to control. That is, $T_{i 1} \sim Bernoulli (0.5)$ and $T_{i 2} = 1 - T_{i 1}$ . Next, suppose there exists a genetic mutation $Z_{i j}$ . For each participant, we set $Z_{i j} \sim Bernoulli (p_{k})$ for $E_{i} = k$ . We set $p_{1} = .9$ and $p_{0} = .5$ . That is, participants of ethnicity $E_{i} = 1$ are more likely to have the mutation. We assume that only the observed outcome $Y_{i j}$ , as well as $T_{i j}$ and $Z_{i j}$ , are recorded; the ethnicity $E_{i}$ is not. Suppose that Ethnicity 1 has a higher baseline blood pressure than Ethnicity 0 (for reasons unrelated to the mutation) but that the presence of the mutation is causally associated with lower blood pressure. We generate the outcome as

Y_{i j} = 80 - 10 T_{i j} - 5 Z_{i j} + 10 E_{i} + ∊_{i j},

where $∊_{i j}$ are independent $N (0, 4)$ random variables. Because participants for ethnicity $E_{i} = 1$ have higher baseline blood pressure, $Z_{i j}$ is positively correlated with blood pressure across all participants. Thus a Simpson’s paradox occurs: Overall, $Z_{i j}$ has a positive association with blood pressure, while within pairs, $Z_{i j}$ has a negative association with blood pressure. We summarize the results of this simulation in Table 1 under the column Simpson’s Paradox.

Table 1.

Simulation 1 Results.

Method	Simpson’s Paradox			Uninformative Pairs
Method	True Var	E(Nom)	Cov Pr	True Var	E(Nom)	Cov Pr
Simple difference	.343	.342	.943	.361	.365	.947
P-LOOP RF (differences)	.154	.167	.951	.151	.168	.952
P-LOOP RF (outcomes)	.440	.462	.952	.146	.154	.949
P-LOOP RF (interpolated)	.152	.170	.953	.148	.156	.948
P-LOOP OLS (differences)	.152	.160	.950	.148	.160	.950
P-LOOP OLS (outcomes)	.442	.462	.953	.146	.154	.949
P-LOOP OLS (interpolated)	.152	.164	.952	.148	.156	.949
Regression 1	.151	.150	.943	.148	.149	.944
Regression 2	.153	.148	.942	.149	.148	.940

Note. True Var is the estimate of the true variance. E(Nom) is the estimate of the expected value of the nominal variance. For P-LOOP, these are estimates of the expression in Equation (3) and of the expected value of Equation (4), respectively. Cov Pr is the estimated coverage proportion. The Monte Carlo estimates of the true variances have standard errors ranging from .002 to .007, while the Monte Carlo estimates of the expected values of the nominal variances all have standard errors below .0002. We provide these standard errors, along with further details on how we obtain the estimates, in online Appendix 6. P-LOOP = paired leave-one-out potential outcomes; RF = random forest; OLS = ordinary least squares.

We also generate a set of potential outcomes in which the pairs contain no additional information (beyond their association with the covariate $Z_{i j}$ ). We generate the observed outcome as

Y_{i j} = 80 - 10 T_{i j} + 5 Z_{i j} + ∊_{i j},

where $∊_{i j}$ are independent $N (0, 4)$ random variables. In this case, E_i is associated with the outcome because it is associated with $Z_{i j}$ but otherwise has no effect on the outcome. We summarize the results of this simulation in Table 1 under the column Uninformative Pairs.

We see that in the Simpson’s paradox case, imputing the potential outcomes separately (not accounting for pairs when estimating a_i and b_i ) causes inflated variance relative to the simple difference estimator, while imputing potential differences directly (accounting for pairs) results in improved performance. However, in the case where the pair assignments are uninformative, it is better to impute the potential outcomes separately. The gains in this example are relatively minor; however, we show in the later sections that the improvements can be more substantial.

5.2. A Nonlinear Scenario

In the previous example, the potential outcomes were generated from a linear model with independent, normally distributed noise. We examine a more complex scenario in this section. Consider a hypothetical experiment in which we are testing the effect of a drug on recovery time for an illness. We generate $N = 50$ pairs. For each participant, we observe a single covariate, Z, corresponding to the baseline health score for that participant. To obtain this health score, we generate $Z_{0 i} \sim Unif (0, 10)$ for each pair i. We then set $Z_{i j} = Z_{0 i} + ∊_{i j}$ , where $∊_{i j}$ are independent $N (0, 1)$ random variables. The outcome in this example will be time to recovery.

The mean recovery time under treatment and control will be determined by the following logistic functions:

μ_{c} (Z_{i j}) = 3 + \frac{10}{1 + exp (2 Z_{i j} + 12)},

and

μ_{t} (Z_{i j}) = 3 + \frac{10}{1 + exp (2 Z_{i j} + 8)} .

We then generate the control potential outcomes as gamma random variables with shape parameter $30$ and rate parameter $30 / μ_{c} (Z_{i j})$ . We generate the treatment potential outcomes analogously. A higher health score is associated with quicker recovery under both treatment and control; however, this recovery is expected to occur more quickly for treated units. We show the results of this simulation in Table 2. P-LOOP with random forests outperforms the other methods. This is not surprising, as the potential outcomes are obtained using a nonlinear data generating process. In addition, all of the methods are conservative, although P-LOOP with random forests is much less conservative than the other methods. We also observe a considerable benefit from excluding the pair assignments when random forests are used as the imputation method.
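As a brief check on the gamma parameterization above (a standard property of the gamma distribution, not a result from the article), a gamma random variable with shape $k$ and rate $r$ has mean $k / r$ and variance $k / r^{2}$ . Writing $X$ for a control potential outcome (a symbol introduced here only for illustration), the stated parameters give

E (X) = \frac{30}{30 / μ_{c} (Z_{i j})} = μ_{c} (Z_{i j}), \qquad Var (X) = \frac{30}{{\{30 / μ_{c} (Z_{i j})\}}^{2}} = \frac{{μ_{c} (Z_{i j})}^{2}}{30} .

Under this reading, the mean recovery time follows the logistic curve, and the standard deviation is $μ_{c} (Z_{i j}) / \sqrt{30}$ , or roughly 18% of the mean, so the noise scales with the mean rather than being constant.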

Table 2.

Simulation 2 Results

Method	True Var	E(Nom Var)	Cov Pr
Simple difference	.094	.373	1
P-LOOP RF (differences)	.069	.160	0.996
P-LOOP RF (outcomes)	.046	.097	0.991
P-LOOP RF (interpolated)	.046	.097	0.992
P-LOOP OLS (differences)	.068	.371	1
P-LOOP OLS (outcomes)	.062	.364	1
P-LOOP OLS (interpolated)	.065	.363	1
Regression 1	.066	.351	1
Regression 2	.066	.358	1

Note. True Var is the estimate of the true variance. E(Nom Var) is the estimate of the expected value of the nominal variance. For P-LOOP, these are estimates of the expression in Equation (3) and of the expected value of Equation (4), respectively. Cov Pr is the estimated coverage proportion. The Monte Carlo estimates of the true variances have standard errors ranging from .0006 to .0013, while the Monte Carlo estimates of the expected values of the nominal variances all have standard errors below .0004. We provide these standard errors, along with further details on how we obtain the estimates, in online Appendix 6. P-LOOP = paired leave-one-out potential outcomes; RF = random forest; OLS = ordinary least squares.

5.3. Remainder Terms

In this subsection, we investigate the quantity

\frac{1}{N} E \{{(\sum_{i = 1}^{N} {\tilde{d}}_{i} U_{i})}^{2}\}, \tag{5}

for each of the data generating procedures used in the simulations discussed above. This quantity is of interest for several reasons. The convergence of Equation (5) to 0 plays an important role in proving the central limit theorem discussed in subsection 3.2. This convergence also implies that $\sum_{i \neq j} γ_{i j}$ is negligible relative to $\sum_{i = 1}^{N} MSE ({\hat{d}}_{i})$ as discussed in subsection 3.3. In online Appendix 6, we show that

| \frac{1}{N} \sum_{i \neq j} γ_{i j} | \leq \frac{1}{N} E \{{(\sum_{i = 1}^{N} {\tilde{d}}_{i} U_{i})}^{2}\} .

It follows that the convergence of Equation (5) to 0 implies the convergence of $\sum_{i \neq j} γ_{i j} / N$ to 0.

For each of the three data generating processes discussed in subsections 5.1 and 5.2, we generate potential outcomes and covariates for 1,000 pairs. We then consider each of the first $N = 50, 100, \ldots, 1{,}000$ of these pairs. For a given N, we generate 1,000 treatment assignment vectors, which we use to estimate Equation (5) for both random forest and OLS imputation. For more details on the simulation procedure, see online Appendix 6.

In Figure 1, we plot the estimated values of Equation (5) against the sample size N (both on a log base 10 scale). For each of the data generating procedures (and for both imputation methods), we can see that the estimated values of Equation (5) shrink to 0 as N increases. For the nonlinear data generating process, this decrease occurs more slowly when using random forest imputation. Note that Equation (5) contains terms relating to both the variances and covariances of the ${\tilde{d}}_{i} U_{i}$ . With this particular data generating process, the variance of ${\tilde{d}}_{i}$ shrinks more slowly with random forest imputation. For a further discussion, see online Appendix 6.

Figure 1.

We plot the estimated values of the quantity in Equation (5) (i.e., $E \{{(\sum_{i = 1}^{N} {\tilde{d}}_{i} U_{i})}^{2}\} / N$ ) against the sample size N. Both values are plotted on a log base 10 scale. The top two charts show the estimates of Equation (5) corresponding to the data generating procedures in subsection 5.1. The bottom chart shows the estimates corresponding to subsection 5.2. The values of Equation (5) are estimated for both random forest imputation (solid line) and OLS imputation (dashed line).

6. Cognitive Tutor Impact Study

We apply our method to estimate the effect of an intervention in a randomized trial involving schools in Texas. This trial (discussed in Pane et al., 2014) tested the effectiveness of a computer program, the Cognitive Tutor Algebra 1 curriculum, and included 22 pairs of schools. The outcome of interest is the passing rate of the schools on the math section of the Texas Assessment of Knowledge and Skills in 2008. Available covariates included the school type (middle or high school) and a pretest score, the passing rate from 2007. We estimate the average treatment effect using either just the pretest score or both the pretest score and school type as covariates. In Table 3, we compare the performance of P-LOOP with the simple difference estimator and the estimators discussed in Fogarty (2018). We use random forests and linear regression as imputation methods in the P-LOOP estimator. As in the case of the simulations, we show the results imputing potential differences (accounting for pairs), imputing potential outcomes separately (ignoring the pair assignments), and the interpolation between the two. Note that P-LOOP imputing potential differences with OLS most closely matches the Regression 2 method, as both methods account for pairing and use the differences and averages of the covariates for making adjustments.

Table 3.

Comparison of Methods

Method	Pretest		Pretest and School Type
Method	Point Est	Nominal Var	Point Est	Nominal Var
Simple difference	−6.82	9.82	−6.82	9.82
P-LOOP RF (differences)	−4.41	7.06	−5.62	7.86
P-LOOP RF (outcomes)	−2.82	5.72	−4.94	5.60
P-LOOP RF (interpolated)	−3.53	6.39	−5.10	5.75
P-LOOP OLS (differences)	−2.79	6.56	−2.17	4.38
P-LOOP OLS (outcomes)	−2.04	5.66	−1.81	4.13
P-LOOP OLS (interpolated)	−2.08	5.85	−2.06	4.00
Regression 1	−2.61	6.18	−2.61	6.18
Regression 2	−2.60	6.56	−2.27	4.57

Note. Point Est and Nominal Var refer to the point estimates and nominal variances for each method, respectively. P-LOOP = paired leave-one-out potential outcomes.

Both P-LOOP and the methods of Fogarty (2018) have smaller nominal variance than the unadjusted estimator. Regression 1 has lower variance than Regression 2 when the pretest score is the only covariate, but Regression 2 has lower variance when the school type is included. Both regression methods always account for the pair assignments. For the P-LOOP estimator, we see that it is better to impute the potential outcomes separately and that the interpolation method imputes values closer to the potential outcomes imputation. With the interpolation method, we do not lose out on the precision gains from ignoring the pairs in our imputation, but we are still protected against a potential Simpson’s paradox.

7. Discussion

In paired experiments, the design of the experiment helps to enforce covariate balance between the treatment and control groups. While this design is often effective, it can be useful to make covariate adjustments to further improve precision. Covariate adjustments in paired experiments share many of the issues that arise in completely randomized experiments; for example, it can be unclear ahead of time which covariates to use. An issue unique to paired experiments is the pair inclusion trade-off, so we must take particular care when making adjustments in paired experiments. Failing to account for the pair assignments can harm performance (e.g., when a Simpson’s paradox occurs), while including the paired structure when the pair assignments are not predictive can needlessly inflate variance.

We present a design-based method for paired experiments, the P-LOOP estimator. This estimator is guaranteed to be unbiased by design. Nonetheless, the pair inclusion trade-off is still relevant because it affects the variance of the estimator. To the best of our knowledge, this method is the first to directly address the pair inclusion trade-off. Generally, other methods account for the pairing, which protects against Simpson’s paradox and other situations where the within-pair trends differ from the overall trend. However, our method imputes two sets of potential outcomes, one excluding and one including the pair assignments, and automatically interpolates between the two. As we see in the Texas Schools data, this allows for improved precision. The P-LOOP estimator is also the first method specifically for paired experiments that involves sample splitting and the use of machine learning methods to impute potential outcomes, building on the flexible approaches used in completely randomized experiments. This flexibility can be beneficial in several ways such as allowing for automatic variable selection or high-dimensional covariates. However, the leave-one-out approach can also be computationally intensive. If computation time is an issue, one can modify the procedure to leave out multiple pairs instead of single pairs at a time.

Finally, logical extensions to the P-LOOP estimator include block-randomized experiments and experiments with multiple treatments. As with paired experiments, it can be unclear whether to include the block assignments when making covariate adjustments. However, while paired experiments can be treated essentially as Bernoulli randomized experiments, this is not the case for blocked experiments, and the variance estimation procedure outlined in this article would need to be modified.

Supplemental Material

Supplemental Material, Appendix for Design-Based Covariate Adjustments in Paired Experiments by Edward Wu and Johann A. Gagnon-Bartsch in Journal of Educational and Behavioral Statistics

Footnotes

Acknowledgments

We would like to thank John Pane and Adam Sales for providing the data set used in Section 6.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This article has been funded by Division of Mathematical Sciences (grant no: 1646108).

ORCID iD

Edward Wu

References

Aronow, P. M., & Middleton, J. A. (2013). A class of unbiased estimators of the average treatment effect in randomized experiments. Journal of Causal Inference, 1(1), 135–154.

Balzer, L. B., van der Laan, M. J., Petersen, M. L., & the SEARCH Collaboration. (2016). Adaptive pre-specification in randomized trials with and without pair-matching. Statistics in Medicine, 35(25), 4528–4545.

Balzer, L. B., Petersen, M. L., van der Laan, M. J., & the SEARCH Collaboration. (2016). Targeted estimation and inference for the sample average treatment effect in trials with and without pair-matching. Statistics in Medicine, 35(21), 3717–3732.

Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, & Robins (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68.

Dixon (2016). Should blocks be fixed or random? [Conference session]. 28th Annual Conference on Applied Statistics in Agriculture, Manhattan, KS.

Fogarty, C. B. (2018). Regression-assisted inference for the average treatment effect in paired experiments. Biometrika, 105(4), 994–1000.

Freedman, D. A. (2008). On regression adjustments to experimental data. Advances in Applied Mathematics, 40(2), 180–193.

Imai (2008). Variance identification and efficiency analysis in randomized experiments under the matched-pair design. Statistics in Medicine, 27(24), 4857–4873.

Imai, King, & Nall (2009). Rejoinder: Matched pairs and the future of cluster-randomized experiments. Statistical Science, 24(1), 65–72.

Imbens, G. W. (2010). Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009). Journal of Economic Literature, 48(2), 399–423.

Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.

Li & Ding (2020). Rerandomization and regression adjustment. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(1), 241–268.

Lin (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics, 7(1), 295–318.

Moore, K. L., & van der Laan, M. J. (2009). Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation. Statistics in Medicine, 28(1), 39–64.

Pane, J. F., Griffin, B. A., McCaffrey, D. F., & Karam (2014). Effectiveness of Cognitive Tutor Algebra I at scale. Educational Evaluation and Policy Analysis, 36(2), 127–144.

Pashley, N. E., & Miratrix, L. W. (2017). Insights on variance estimation for blocked and matched pairs designs. arXiv preprint arXiv:1710.10342. https://arxiv.org/abs/1710.10342

Pinheiro, J. C., & Bates, D. M. (2000). Linear mixed-effects models: Basic concepts and examples. In J. Chambers, W. Eddy, W. Haerdle, S. Sheather, & L. Tierney (Eds.), Mixed-effects models in S and S-Plus. Statistics and Computing (pp. 3–56). Springer.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (Vol. 1). Sage.

Robins, J. M. (2000). Robust estimation in sequentially ignorable missing data and causal inference models. Proceedings of the American Statistical Association Section on Bayesian Statistical Science, 1999, 6–10.

Robins, J. M., Rotnitzky, & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427), 846–866.

Rothe (2018). Flexible covariate adjustments in randomized experiments. (Working Paper). http://www.christophrothe.net/papers/fca_apr2020.pdf

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701.

Scharfstein, D. O., Rotnitzky, & Robins, J. M. (1999). Rejoinder. Journal of the American Statistical Association, 94(448), 1135–1146.

Small, D. S., Ten Have, T. R., & Rosenbaum, P. R. (2008). Randomization inference in a group-randomized trial of treatments for depression: Covariate adjustment, noncompliance, and quantile effects. Journal of the American Statistical Association, 103(481), 271–279.

Spiess (2018). Optimal estimation when researcher and social preferences are misaligned. Technical Report Job Market Paper.

Splawa-Neyman, Dabrowska, D. M., & Speed, T. P. (1990). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, 5(4), 465–472.

Tsiatis, A. A., Davidian, & Zhang (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine, 27(23), 4658–4677.

Wager, Taylor, & Tibshirani, R. J. (2016). High-dimensional regression adjustments in randomized experiments. Proceedings of the National Academy of Sciences, 113(45), 12673–12678.

Woltman, Feldstain, MacKay, J. C., & Rocchi (2012). An introduction to hierarchical linear modeling. Tutorials in Quantitative Methods for Psychology, 8(1), 52–69.

Gagnon-Bartsch, J. A. (2018). The LOOP estimator: Adjusting for covariates in randomized experiments. Evaluation Review, 42(4), 458–488.

Frangakis, C. E., Louis, T. A., & Scharfstein, D. O. (2014). Estimation of treatment effects in matched-pair cluster randomized trials by calibrating covariate imbalance between clusters. Biometrics, 70(4), 1014–1022.

