Abstract
Background:
An important question in the design of experiments is how to ensure that the findings from the experiment are generalizable to a larger population. This concern with generalizability is particularly important when treatment effects are heterogeneous and when selecting units into the experiment using random sampling is not possible—two conditions commonly met in large-scale educational experiments.
Method:
This article introduces a model-based balanced-sampling framework for improving generalizations, with a focus on developing methods that are robust to model misspecification. Additionally, the article provides a new method for sample selection within this framework: First, units in an inference population are divided into relatively homogeneous strata using cluster analysis, and then the sample is selected using distance rankings.
Result:
In order to demonstrate and evaluate the method, a reanalysis of a completed experiment is conducted. This example compares samples selected using the new method with the actual sample used in the experiment. Results indicate that even under high nonresponse, balance is better on most covariates and that fewer coverage errors result.
Conclusion:
The article concludes with a discussion of additional benefits and limitations of the method.
In the social, educational, and medical sciences, evaluations of interventions are typically conducted using randomized experiments. Randomized experiments are preferred since they have high internal validity, ensuring that the treatment effect estimated within the experiment is the causal effect of the treatment. This random assignment to treatment conditions, however, does not help when generalizations about the effect of the treatment for units not in the experiment are desired. Since experiments very rarely select units using probability sampling from a well-defined population (Shadish, Cook, and Campbell 2002), any generalization must typically be based on qualitative judgments regarding how similar a particular population of interest is to the composition of units in the experiment (Cornfield and Tukey 1956).
Recently, statisticians have begun developing new methods for improving generalizations from completed experiments. Stuart et al. (2011) introduced propensity score-based methods for quantitatively evaluating the degree of similarity between a population and an experimental sample, while Hedges and O’Muircheartaigh (2011) developed a method for adjusting the estimate and standard errors to account for these differences using a propensity score poststratification estimator. Tipton (2013) further developed the assumptions necessary for causal generalization using propensity score methods and properties of the poststratification estimator. Furthermore, Tipton showed that these propensity score-based methods perform best when there is no coverage error. Coverage errors arise when particular segments of the population do not have relevant comparison units “like” them in the sample used in the experiment.
In contrast to these retrospective approaches, this article provides a new framework and method for sample selection in experiments that improve causal generalizations prospectively. The goal of this framework is for the sample selected for inclusion in the experiment to be compositionally similar to the inference population on a variety of important covariates that possibly explain variation in potential treatment effects. The approach can be used broadly and does not require (nor preclude) random sampling. To achieve these goals, the method uses cluster analysis techniques to classify the population into nearly homogenous strata and then provides a simple distance-based approach for selecting units within each stratum into the experiment. The method is similar to that proposed by Tipton et al. (2014) but differs in that here eligibility criteria differentiating between the inference population and the units eligible to be in the study are not required.
Since in most practical cases the units selected into an experiment are in fact clusters or aggregates of individuals (for example, schools or school districts), the method developed here is a version of a stratified cluster sampling design. The goal is to first divide the clusters (e.g., schools) into strata using cluster analysis methods and then to select clusters within each of these strata into the experiment. Note that the language used here can be confusing, since the word cluster is used differently in the field of cluster analysis than in the fields of experimental design and survey sampling. To differentiate clearly, we use the word cluster throughout to mean a group of aggregated individuals, for example, a school or school district. This follows practice common in the design and analysis of large-scale experiments. We then refer to the groups of these clusters that are created using cluster analysis methods as strata, since these will be used for the creation of a stratified sampling plan for generalization.
Overall, the article is organized as follows. In the first section, we frame the problem of sample selection in the model-based sampling literature and introduce the goals for our approach. In the second section, we develop a stratified sample selection method using cluster analysis to meet these goals. In the third section, we apply and evaluate this method using an example. Finally, in the last section, the article concludes with a discussion of additional benefits and extensions to the method.
Generalizations From Experiments
The Role of Models in Generalization
In developing a method for sample selection, it is helpful to begin by reviewing what would happen in an ideal study aimed at causal generalization. In this ideal study, first a well-defined inference population P of size N would be carefully enumerated and defined. For example, in a large-scale educational experiment, this might be a list of the 65,134 regular middle schools in the United States obtained from the Common Core of Data (National Center for Education Statistics). Second, a sample S of n sites would be randomly selected from this population. For example, in an educational experiment, the sample would typically include between 20 and 60 schools or districts, depending on the study design (Spybrook 2012). Third, within these n schools, units would be randomly assigned to treatment conditions. Depending upon the design of the study, this randomization might happen at the cluster level (e.g., schools) or at a lower level (e.g., classrooms or students). This dual randomization would ensure that both the site and the treatment selection processes were ignorable or noninformative, where an ignorable selection process is one in which unobserved covariates have no effect on the conditional distribution of outcomes given the observed covariates (Rubin 1976; Smith and Sugden 1988). The result would be an “experiment within a survey,” which would clearly enable causal generalizations (Smith and Sugden 1988; Shadish, Cook, and Campbell 2002).
In practice, this dual randomization procedure is generally infeasible (Bloom 2005; Rubin 1974; Shadish, Cook, and Campbell 2002). In fact, Olsen et al. (2013) found that random site selection was implemented in only 7 of the 273 experiments reported in the Digest of Social Experiments (Greenberg and Schroder 2004). Instead, study designers and analysts often choose only one level of randomization, resulting in either a probability survey or an experiment (Fienberg and Tanur 1987; Imai, King, and Stuart 2008). In experiments, while treatment is assigned randomly, the typical practice is to select a convenience sample of n sites, where here by “convenience” we mean without clear reference to a well-defined population (Shadish, Cook, and Campbell 2002).
Given the infeasibility of random site selection, causal generalizations in experiments are typically made through the use of statistical models. To see why models are necessary, note that we can decompose a population average treatment effect (PATE) as follows (Imai, King, and Stuart 2008),
$$\text{PATE} = \frac{n}{N}\,\text{SATE} + \frac{N - n}{N}\,\text{NATE},$$
where the sample average treatment effect (SATE) and nonsample average treatment effect (NATE) are the average treatment effects for those in the sample and not in the sample, respectively; the sample includes n units, and there are N total units in the population. While we can estimate the SATE directly from the n units in the experiment, estimating the average treatment effect for units not in the experiment (NATE) is difficult. If the sites were selected using probability sampling, the NATE could be estimated by taking into account the selection probabilities for the units in the sample. Since the sites in an experiment are typically not selected randomly, the NATE must be predicted based upon a model relating the sample to the population.
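The weighting in this decomposition can be made concrete with a small sketch. The function name `pate_from_parts` and all numbers are hypothetical; the point is only that the SATE receives weight n/N and the NATE receives weight (N − n)/N, so with a small sample the PATE is dominated by the part that the experiment does not observe.

```python
def pate_from_parts(sate: float, nate: float, n: int, N: int) -> float:
    """Combine the sample and nonsample average effects into the population average."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return (n / N) * sate + ((N - n) / N) * nate


# Hypothetical values: 40 sampled schools out of 1,000, effects in SD units.
pate = pate_from_parts(sate=0.30, nate=0.10, n=40, N=1000)
print(round(pate, 3))  # the SATE contributes only 4% of the total weight
```

In this made-up case the PATE sits much closer to the NATE than to the SATE, which is why a model for the nonsampled units matters.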
In experiments, when generalizations beyond the sample of n sites are of interest, the results are typically analyzed using random effects models (Kirk 1995; Raudenbush 1993); these multilevel models are generalizations of the analysis of variance models typically used to analyze single-site experiments. These super-population models make generalizations through use of random effects (with an assumed distribution), not based upon selection or random assignment probabilities. Specifically, two super-population models are common: the hierarchical model (for cluster-randomized designs) or the random block model (for multisite trials). Whether estimated using traditional hierarchical linear model methods (Raudenbush and Bryk 2002) or the Neyman model (Schochet 2013), the average treatment effect estimated is therefore considered generalizable to schools or districts that are “similar” to those in the study. These “bottom-up” generalizations are difficult since what is meant by similar is typically vaguely defined.
Recent work in the analysis of experiments has focused on how to improve generalizations within this super-population framework. The methods begin by carefully defining an inference population of interest, using a population frame like the Common Core of Data or a state longitudinal data system (e.g., Stuart et al. 2011; Tipton 2013). An important feature of these population frames is that they both enumerate all units in the population and include a rich set of covariate information on these units. Next, the N schools in the population and the n schools in the sample used in the experiment are compared on a set of p covariates using a propensity score (Rosenbaum and Rubin 1983). These covariates are selected so as to meet a sample selection ignorability condition; here, ignorability is met if this set includes all covariates associated with both the site selection process and the treatment effect variability (Stuart et al. 2011; Tipton 2013). Propensity score methods are then used to reweight the sample and the population so that the two groups are balanced on this set of p covariates, where balanced means that the means (and higher order moments) of these covariates are similar for the two groups. These reweighting estimators include subclassification or poststratification estimators used in combination with the multilevel models given earlier (e.g., Hedges and O’Muircheartaigh 2011). The result is an estimate of the average treatment effect given for a well-defined population based upon a super-population model.
Retrospective generalizations, like those given earlier, are helpful in that they shift generalizations from vaguely defined to well-defined populations. However, as Tipton (2013) shows, sometimes the effectiveness of these methods is limited, particularly when there are coverage errors. Coverage errors arise when there exist segments of the population for whom there are no similar units in the experiment; this, we argue, is the lasting effect of the bottom-up generalization approach. In this article, we focus instead on developing a method for site selection that makes these generalizations “top-down.”
Instead of carefully defining the inference population after the study is completed, our goal here is to begin the study with a well-defined inference population and then design a site-selection process that makes model-based generalizations possible.
Balanced Sampling
Just as in the ideal study, the goal here is to begin by first carefully defining and enumerating an inference population P of size N and then to select the n units in the sample S strategically, using a more formal sample selection plan. The goal is for the sample selection process to be noninformative or ignorable (Rubin 1976) by which we mean that the resulting sample and population are compositionally similar on the set of covariates that explain treatment effect variability (Stuart et al. 2011; Tipton 2013).
The approach we develop here builds on theory and results found in model-based sampling in the sample survey literature (Valliant, Dorfman, and Royall 2000). Model-based sampling is a purposive sampling alternative to design-based random sampling methods; while random selection can be used within the model-based framework, it is not required. The model-based sampling approach, while not as commonly used in survey sampling, has much in common with the model-based super-population approach commonly used to analyze experiments (Fienberg and Tanur 1987; Rao 2005). As such, this method will allow inferences from a sample to super-population using the same random effects approach currently used in the analysis of large-scale experiments; the key difference here is that now the definition of the super-population will be clearly and carefully defined.
In order to develop this strategy, we first define the causal effects of interest in the potential outcomes framework. For each unit in the population P, let W = 1 if a unit is assigned to the treatment condition. Then assume that for each unit, there exist two potential outcomes, Y(1) = Y(W = 1) and Y(0) = Y(W = 0), where Y(1) is the unit’s potential outcome under treatment and Y(0) is the unit’s potential outcome under some specified alternative condition. Now for each unit in the population P, let the potential treatment effect Δ = Y(1) − Y(0). Note that as a result of the Fundamental Problem of Causal Inference, both potential outcomes (and by extension the potential treatment effect) can never be observed for a particular unit (Holland 1986). However, the goal of an experiment is generally to estimate the PATE τP = E[Δ], where the expectation is across all units in the inference population P. Finally, let Z = 1 if a unit in the population P is selected into the sample S. Since only units in the sample (Z = 1) are in the experiment, we focus here on causal impact estimators that are based only on the n units in S.
An important question is under what conditions an estimator T that is unbiased for the SATE, τS = E(Δ|Z = 1), is also unbiased for the PATE, τP = E(Δ). Stuart et al. (2011) and Tipton (2013) show that bias arises in relation to covariates that explain variation in potential treatment effects. Since we do not know, a priori, what these covariates are (since we have yet to conduct the experiment), this requires us to propose a model. We might posit a simple model
$$\Delta = \beta_0 + \beta_1 X + \varepsilon, \tag{1}$$
where X is a single covariate with known values for all units in the population. For example, based on theoretical or previous empirical findings, we may believe that the effect of a school-based reading intervention (Δ) linearly increases or decreases in relation to last year’s school average reading test scores (X).
If in fact the potential treatment effects do vary in relation to X, then by selecting the sample so that the average value of X in the sample and population is the same (i.e., E(X|Z = 1) = E(X)), an estimator T of the SATE τS would be model unbiased for the PATE τP. We call this a balanced sample since the sample and population are balanced on the covariate X. Importantly, this idea of balance is similar to the goal of balance found in retrospective methods; the key difference is that retrospective balance is achieved through reweighting, while here balance is achieved through the strategic selection of the sample.
The idea of a balanced sample is in many ways similar to the naive sense of a “representative sample” (Royall and Herson 1973), where here by representative we mean that the sample is like a “miniature” of the population (Kruskal and Mosteller 1980). Importantly, while random sampling will result in a balanced sample on average, it is not the only or best method for achieving such balance, particularly when the sample is small. Developing a method that performs well in small samples is of particular concern here since in most cluster-randomized or multisite studies in education, the number of higher level units (e.g., schools or districts) tends to be between 20 and 60 (Spybrook 2012).
Bias-Robust Balanced Sampling
Based on Model (1), we might propose to develop a sample selection plan that results in a sample that is balanced on the covariate X (i.e., E(X|Z = 1) = E(X)). This necessarily leads us to ask: What happens if our model is wrong? For example, suppose that potential treatment effects actually vary in relation to the model
$$\Delta = \beta_0 + \beta_1 X + \beta_2 X^2 + \gamma W + \varepsilon. \tag{2}$$
For example, it may be that the potential treatment effects vary nonlinearly in relation to school average pretest scores (X) and vary also in relation to the proportion of the school that is minority (W). Now, for a sample balanced on X, the bias of an estimator T for τP can be written as
$$E(T) - \tau_P = \beta_2\left[E(X^2 \mid Z = 1) - E(X^2)\right] + \gamma\left[E(W \mid Z = 1) - E(W)\right].$$
Clearly, T is unbiased for τP only if E(X²|Z = 1) = E(X²) and E(W|Z = 1) = E(W). This means that a sample that is balanced on X is adequate only if Model (1) holds, not under the new Model (2).
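A small numerical sketch can illustrate this point. Everything here is hypothetical: the nine-unit population, the coefficient values, and the names `mean_effect` and `bias` are invented for illustration, and the effect model is assumed to be linear in X, X², and W with a mean-zero error that is omitted. The sample matches the population mean of X but not the mean of X² or of W.

```python
from statistics import fmean


def mean_effect(units, b0, b1, b2, g):
    """Average of Delta = b0 + b1*x + b2*x**2 + g*w over a list of (x, w) units."""
    return fmean([b0 + b1 * x + b2 * x**2 + g * w for x, w in units])


# Hypothetical population: x = pretest score, w = proportion minority.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ws = [0.9, 0.1, 0.5, 0.2, 0.3, 0.4, 0.8, 0.6, 0.7]
population = list(zip(xs, ws))
# This sample has the same mean of x as the population (5), and nothing more.
sample = [u for u in population if u[0] in (4, 5, 6)]


def bias(b0, b1, b2, g):
    """Sample average effect minus population average effect."""
    return mean_effect(sample, b0, b1, b2, g) - mean_effect(population, b0, b1, b2, g)


bias_linear = bias(b0=0.2, b1=0.05, b2=0.0, g=0.0)       # linear model holds
bias_misspecified = bias(b0=0.2, b1=0.05, b2=0.01, g=0.5)  # quadratic term plus w
print(bias_linear, bias_misspecified)
```

Under the linear model the bias is zero up to floating-point error; once the quadratic term and W enter, the bias equals the coefficient-weighted gaps in E(X²) and E(W) between sample and population.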
In model-based sampling, the goal is to select the sample so that the method is bias robust, where bias robust is shorthand for “bias-robust-against-model-failure” (Valliant et al. 2000). For causal generalization, this requires us, first, to propose multiple models relating the potential treatment effects to possible moderators and, second, to develop a sample selection plan that guards against selection of the wrong model. Here the key tool guarding against model failure is the selection of a balanced sample. We define this more formally below.
Definition: balanced sample (of order R)
Let X = (X1, …, Xp) be a set of p covariates with known values for all units in the population. A sample is balanced (of order R) on X if
$$E(X_h^r \mid Z = 1) = E(X_h^r) \quad \text{for } h = 1, \dots, p \text{ and } r = 1, \dots, R.$$
When a balanced sample (of order R) is selected on p covariates, it is easy to show that an unbiased estimator T of the SATE τS is unbiased for the PATE τP if the true model is of the form
$$\Delta = \beta_0 + \sum_{h=1}^{p} \sum_{r=1}^{R} \delta_{hr}\,\beta_{hr}\,X_h^r + \varepsilon,$$
where for h = 1, …, p, βhr is the regression coefficient associated with the covariate Xh^r, and δhr = 1 if the coefficient βhr is included in the model and zero otherwise. For example, if a sample is balanced (of Order 2), then T is unbiased when the true relationship is linear (δh1 = 1 and δh2 = 0 for all h), quadratic (δh1 = δh2 = 1 for all h), when it includes only a subset of the covariates (δh1 = δh2 = 0 for some h), or any other combination of the models. Since we cannot and do not know the true model, in the language of model-based sampling, the goal is to select a sample using a bias-robust strategy. In causal generalization, this is a strategy that leads to an unbiased estimate of τP = E(Δ) under a variety of possible models.
A sample selection strategy can be bias robust in two ways. First, the sample becomes more bias robust as the dimension p of X increases. This is because in practice when balance is achieved on a wide variety of covariates it is often approximately achieved on other covariates, including those that may have been omitted from the model (Stuart 2010; Smith and Sugden 1988; Royall and Herson 1973; Brewer 1999; Rubin and Thomas 1996). Second, for a fixed set of covariates X, a sample becomes more bias robust as the order of R increases. For example, a sample which is balanced on not just the first moments but also the second moments of X—a balanced sample of Order 2—will lead to an unbiased estimate whether the true model is linear or quadratic (Valliant, Dorfman, and Royall 2000).
The idea that the sample should be selected using a bias-robust strategy with the goal of achieving balance on both the means and the higher moments of multiple covariates is familiar—it is exactly the post hoc approach used in the propensity score literature (Rosenbaum and Rubin 1983). Propensity scores are commonly used to achieve balance under a variety of models when random processes are not possible (e.g., quasi-experiments; post hoc generalizations from experiments) or when they fail (e.g., attrition in experiments; nonresponse in surveys). While the method we develop here does not use a propensity score approach, the goals of the procedure are the same—to replace a random process with a model and to design the study in such a way that the results do not depend heavily on any one model.
Once the sample of n units is selected, the robustness of the sample to model misspecification can be evaluated by comparing the R moments of the p covariates under study in the realized sample and population. Ideally, these differences will be small, enabling the use of the simple multilevel random effects estimator commonly used in experimental analysis. If differences remain, post hoc methods like regression adjustment or those introduced by Hedges and O’Muircheartaigh (2011) and Tipton (2013) can be used to decrease bias. Importantly, because the sample was selected with an eye toward balance, the achieved sample and population are likely to be more similar than if a bias-robust selection method had not been used. The fact that the sample and population are more similar will mean that there will be fewer coverage errors and that, if adjustment is needed, the cost in terms of variance inflation will be smaller (Tipton 2013).
Stratified Sampling as a Tool for Generalization
Defining a Stratified Estimator
Recall that in the balanced-sampling framework, the goal is simply to select the sample so that it is like a “miniature” of a well-defined population. To do so, we begin by positing a model explaining variation in potential treatment effects (e.g., Model (1)), and then propose alternative models (a bias-robust strategy). To this end, we need a method for sample selection that allows for balance on orders greater than one (i.e., not only on first but also on second or higher moments) and enables X to include a large and varied set of covariates. While many possible methods exist, in model-based sampling, the simplest method that achieves this goal is that of stratified sampling with proportional allocation (Valliant, Dorfman, and Royall 2000).
Stratified sampling is already widely used in both survey sampling and in large-scale experiments. In probability sampling, strata are used to reduce the variance of an estimate; since the focus is on variance reduction (not bias), here it is common for the strata to be created on only one or two covariates (Lohr 1999). The use of one or two covariates for strata creation is also sometimes used for nonrandom site selection in experiments. For example, experiments sometimes attempt to include sites in rural and urban locations (i.e., urbanicity) or in regions throughout the country (e.g., northeast, southeast, midwest, west). Here while strata are created in order to improve generalizations, the method and framework for making these generalizations are typically informal. In contrast, in this article we propose to create strata with the goal of reducing bias through the inclusion of many covariates—hopefully all covariates that explain treatment effect heterogeneity—and in order to balance the sample and population on both means and higher order moments. In contrast to prevailing practice, our goal is to explicitly state, develop, and evaluate a bias-robust strategy for sample selection aimed at generalization.
If a simple Model (1) is of interest, three steps would be involved in creating a stratified sample with proportional allocation. First, the values of the covariate X would be divided into strata. If X is categorical with k categories, then k strata could be naturally created. If X is continuous, many possible strategies for strata creation could be used; one of the simplest is to define the j = 1 … k strata so that each contains an equal portion (i.e., wpj = 1/k = Nj/N) of the population units (Cochran 1968). Second, under proportional allocation, stratum j would be allocated nj = wpj × n units in the sample. This means that in stratum j, a fraction nj/Nj of the units would need to be selected into the experiment. Third, in stratum j, the sample would need to be selected so that E(X|Z = 1, j = j) = E(X|j = j), which is to say that the stratum-specific mean of X is the same in the population and the sample.
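As a rough illustration of the first step for a continuous X, the sketch below sorts the population on X and cuts it into k strata of (nearly) equal size. The function name `equal_size_strata` and the pretest scores are hypothetical, and this is only one of the many possible strata-creation strategies mentioned above.

```python
def equal_size_strata(x_values, k):
    """Label each unit 0..k-1 so that the strata hold (nearly) equal shares of units."""
    n_units = len(x_values)
    # Indices of the units, ordered from the smallest to the largest value of X.
    order = sorted(range(n_units), key=lambda i: x_values[i])
    labels = [0] * n_units
    for rank, i in enumerate(order):
        labels[i] = rank * k // n_units
    return labels


# Hypothetical school-average pretest scores for nine schools.
scores = [50, 72, 61, 45, 80, 66, 58, 90, 49]
strata = equal_size_strata(scores, k=3)
print(strata)  # → [0, 2, 1, 0, 2, 1, 1, 2, 0]
```

With nine units and k = 3, each stratum holds three schools, so each stratum weight is 1/3.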
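The main benefit of using a stratified sampling approach with proportional allocation is that the resulting sample is self-weighting. This means that the sample and population are balanced on the covariate X, since
$$E(X \mid Z = 1) = \sum_{j=1}^{k} \frac{n_j}{n}\,E(X \mid Z = 1, j = j) = \sum_{j=1}^{k} \frac{N_j}{N}\,E(X \mid j = j) = E(X).$$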
The fact that the sample is self-weighting means that no additional adjustments are needed and that the usual multilevel random effects model can be used for estimating the average treatment effect and making generalizations. Since the standard estimator can be used, this means that the sample selection process does not impact the power analysis (used to determine the sample size n). As a result, the issues of statistical power and generalization can be separated, which is logistically helpful in designing the experiment. It is also helpful that stratified sampling with proportional allocation is conceptually easy to understand and explain, making it appealing in the policy context where many of the results of large-scale experiments are used and interpreted.
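The self-weighting property can be checked numerically. The sketch below uses hypothetical stratum sizes and stratum means and assumes that the sample mean of X within each stratum matches the stratum's population mean; the helper `pooled_mean` is an illustrative name, not part of the method.

```python
def pooled_mean(sizes, means):
    """Mean of X over all units, given the size and the mean of X in each stratum."""
    return sum(s * m for s, m in zip(sizes, means)) / sum(sizes)


N_j = [500, 300, 200]            # hypothetical population stratum sizes
stratum_means = [40.0, 55.0, 70.0]  # hypothetical stratum means of X
n = 50

# Proportional allocation: n_j / n equals N_j / N in every stratum.
n_j = [n * Nj // sum(N_j) for Nj in N_j]   # 25, 15, 10

population_mean = pooled_mean(N_j, stratum_means)
sample_mean = pooled_mean(n_j, stratum_means)           # unweighted sample mean
unbalanced_mean = pooled_mean([20, 20, 10], stratum_means)  # non-proportional allocation
print(population_mean, sample_mean, unbalanced_mean)
```

The proportionally allocated sample reproduces the population mean without any reweighting, while the non-proportional allocation with the same total n does not.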
When X is a single covariate, k strata can be easily created leading to a balanced sample. However, when the dimension of
In the survey sampling literature, the problem of stratified sampling under a multivariate and continuous
Cluster Analysis Method
The goal of cluster analysis is to divide units in the inference population into strata so that units in the same stratum are more similar than units in different strata. We focus here on the k-means partitioning method, though other methods are available (Everitt et al. 2011). The basic idea of k-means clustering is to create k strata and then to assign each unit to one stratum so that a measure of similarity is maximized. There are two main steps involved in a cluster analysis, and we briefly review these here.
Choosing a Distance Metric
In this article, we use cluster analysis to group units into strata that are as close to homogeneous as possible. This means that a measure of distance or similarity is required to define “close.” There are two common distance measures that we argue are useful for our purposes, though others are available (Everitt et al. 2011). The decision to use one of these metrics over the other will largely depend on the type of covariates included in
When all of the covariates in X are continuous, the weighted Euclidean distance can be used,
$$d^{e}_{ii'} = \sqrt{\sum_{h=1}^{p} w_h \left(X_{ih} - X_{i'h}\right)^2},$$
where each covariate Xh has weight wh, and Xih and Xi’h are the values of the hth covariate for units i and i’. One option for weights is to set wh = 1 for all covariates h, in which case the distance is the ordinary Euclidean distance; this gives the most weight to covariates with the largest variances. Alternatively, if there is no information regarding which covariate is a more important predictor of treatment effect heterogeneity, then a natural choice is to use inverse-variance weights, where wh = 1/V(Xh). In this framework, the weighted covariates have a common variance of one, and each covariate contributes equally to the distance metric (though other weighting methods are possible).
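To make the effect of the weights concrete, here is a minimal sketch of a weighted Euclidean distance between two units, assuming the distance is the square root of the weighted sum of squared covariate differences. The function names and the three-school data set are hypothetical, and the variance is computed as a population variance because the frame enumerates every unit.

```python
from math import sqrt
from statistics import pvariance


def inverse_variance_weights(rows):
    """One weight per covariate: 1 / V(X_h), computed over all units in the frame."""
    return [1.0 / pvariance(col) for col in zip(*rows)]


def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance between two units' covariate vectors."""
    return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


# Hypothetical frame: (average pretest score, enrollment) for three schools.
schools = [(50.0, 200.0), (60.0, 400.0), (70.0, 600.0)]

unit_weights = [1.0, 1.0]
iv_weights = inverse_variance_weights(schools)

d_plain = weighted_euclidean(schools[0], schools[1], unit_weights)
d_scaled = weighted_euclidean(schools[0], schools[1], iv_weights)
print(d_plain, d_scaled)
```

With unit weights the distance is driven almost entirely by enrollment, the covariate with the larger variance; with inverse-variance weights the two covariates contribute equally.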
Alternatively, when
where dii’h is the similarity between units i and i’ on the covariate Xh. Note that this measure, dii’h, can differ for different variable types. For dummy and categorical variables, dii’h = 1 if the two units i and i’ have the same value and 0 otherwise. For continuous variables, it is standard to use
$$d_{ii'h} = 1 - \frac{\left|X_{ih} - X_{i'h}\right|}{R_h},$$
where |.| indicates absolute value and Rh is the range of observations for the covariate Xh. Using measures defined this way ensures that dii’h ∊ [0,1] for covariates of all types. Additionally, this general measure allows for missing data. For example, the weights can be determined so that wii’h = 0 if the value of Xh is missing for either or both of units i and i’. Again, other weighting schemes can also be used, particularly if information on the importance of particular covariates in explaining potential treatment effect heterogeneity is available.
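The per-covariate rules above can be sketched in code. The way the per-covariate similarities are combined is an assumption here: the sketch takes a weighted average with weights of 1 for observed pairs and 0 when either value is missing, as in Gower's general similarity coefficient. The function name, the `kinds` and `ranges` arguments, and the data are all hypothetical.

```python
def general_similarity(a, b, kinds, ranges):
    """Average per-covariate similarity; a covariate missing on either unit gets weight 0."""
    total = 0.0
    weight_sum = 0.0
    for x, y, kind, r in zip(a, b, kinds, ranges):
        if x is None or y is None:
            continue  # weight 0: skip covariates that are missing for either unit
        if kind == "categorical":
            s = 1.0 if x == y else 0.0
        else:
            s = 1.0 - abs(x - y) / r  # r is the range of the covariate in the frame
        total += s
        weight_sum += 1.0
    if weight_sum == 0.0:
        raise ValueError("no covariate is observed for both units")
    return total / weight_sum


kinds = ("categorical", "continuous", "continuous")
ranges = (None, 40.0, 1.0)  # range is unused for categorical covariates

school_a = ("urban", 50.0, 0.2)
school_b = ("urban", 70.0, None)  # third covariate missing
similarity = general_similarity(school_a, school_b, kinds, ranges)
print(similarity)  # → 0.75
```

Here the categorical covariate matches (similarity 1), the pretest scores differ by half the range (similarity 0.5), and the missing covariate is ignored.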
Determining Strata
In k-means clustering, once a set of covariates
One difficulty with this procedure is that the number of strata k must be determined in order to generate the strata. One solution to this is to use other cluster analysis methods—for example, hierarchical cluster analysis—to explore the structure of the population. A second solution is to generate and evaluate the strata created using different values of k, where k = 1, 2, … q for some maximum number of strata q. While it is obvious that q ≤ n, it may also be desirable to choose a manageable number of strata as the maximum, such as q = 10 or 20. After determining q, for each value of k, the optimization algorithm is used to divide the population of units into k strata.
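The second solution, running the partitioning for k = 1, 2, …, q and comparing the results, might be sketched as follows. This is a bare-bones version of Lloyd's k-means algorithm written for illustration only: the deterministic farthest-point initialization is an assumption made for reproducibility, the data are invented, and in practice an established implementation with multiple random starts would normally be used.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))


def farthest_point_init(points, k):
    """Start from the first unit, then repeatedly add the unit farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(squared_distance(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Return (stratum labels, within-stratum sum of squares) for k strata."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: squared_distance(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a stratum empties out
                centers[j] = centroid(members)
    wss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wss


# Hypothetical standardized covariates for nine schools.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
q = 4
wss = {k: kmeans(points, k)[1] for k in range(1, q + 1)}
# Share of total variability that lies between strata, for each candidate k.
between_share = {k: 1 - wss[k] / wss[1] for k in wss}
print(between_share)
```

For k = 1 the within-stratum sum of squares equals the total variability, so the between-strata share starts at zero and can be compared across values of k when choosing the number of strata.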
Once results have been generated for various values of k, the results are then compared to determine which value of k is best for the particular population and experiment. One common evaluation method is simply to partition the total variability in the covariates in
In practice, choosing the optimal number of strata involves both statistical and practical criteria. Statistically speaking, the ideal number of strata can be large, since this results in more homogeneous strata and a more bias-robust sample. Practically, however, it can be difficult to achieve an adequate sample if some of the strata are too small (since the response rate for recruitment in experiments is often small). Additionally, the amount of resources (in terms of time, people, money) aimed at recruitment can be small, which leads to a desire for fewer strata.
Sample Allocation, Selection, and Evaluation
Once the number of strata k has been selected, a strategy for sample selection within each stratum must be developed. In this section, we detail the steps involved in this process.
Allocate the Sample to the Strata
In each of the k strata developed in the previous section, there are Nj units in the population, where N1 + N2 + … + Nk = N. Using proportional allocation, the sample is then allocated so that nj = [(Nj/N)n], where we use [.] to signify that each value must be rounded to the nearest integer.
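A minimal sketch of this allocation rule follows. The function name `allocate` and the stratum sizes are hypothetical. Rounding is done with floor(x + 0.5) rather than Python's built-in `round`, which rounds exact halves to the nearest even integer.

```python
from math import floor


def allocate(stratum_sizes, n):
    """Proportional allocation: n_j = (N_j / N) * n, rounded to the nearest integer."""
    N = sum(stratum_sizes)
    return [floor(n * Nj / N + 0.5) for Nj in stratum_sizes]


allocation = allocate([520, 310, 170], n=40)
print(allocation, sum(allocation))  # → [21, 12, 7] 40

# Rounding each stratum separately does not always reproduce n exactly.
short_allocation = allocate([333, 333, 334], n=10)
print(short_allocation, sum(short_allocation))  # → [3, 3, 3] 9
```

As the second case shows, the rounded values may sum to slightly more or less than n, so the total should be checked and adjusted by hand if needed.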
Calculate and Rank Within Stratum Distances
Since the overall goal is to select a balanced sample (of some order R), a method for sample selection within each stratum is needed. The goal is to simply select a balanced sample (of Order 1) within each stratum. This means selecting a sample such that in each stratum j = 1 … k, E(Xh|Z = 1, j = j) = E(Xh|j = j) for each covariate Xh in
An alternative method for selection is as follows. First, in each stratum, calculate E(Xh|j = j) for each covariate Xh. Then for each of the i = 1 … Nj units in stratum j, a measure of distance is calculated. One strategy is to use the weighted Euclidean distance
$$d_{ij} = \sqrt{\sum_{h=1}^{p} w_h \left(X_{ijh} - E(X_h \mid j = j)\right)^2},$$
where wh is the weight given to covariate Xh, E(Xh|j = j) is the average value of the hth covariate in stratum j, and Xijh is the value of the hth covariate for unit i in stratum j. Thus, each unit i in stratum j has one total combined distance measure dij. Again, just as discussed previously, different weights could be used, particularly if it is hypothesized that one covariate matters more in terms of explaining potential treatment effect heterogeneity than another covariate. Note that to the degree that the strata are homogeneous (having small within-stratum variances), we would expect these distances dij not to vary by much.
Based on these within-stratum distance measures, each of the Nj population units within a particular stratum j can be ranked from smallest to largest. This ranked list can then be used for selecting the nj units into the experiment. For example, a recruiter might start at the top of the list with the unit ranked “1,” then if the unit does not agree to be in the experiment, move on to the unit ranked “2,” and so on until nj units agree to be in the experiment.
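The ranking and recruitment procedure described above can be sketched as follows. The `agrees` callable stands in for the real recruitment outcome and is purely hypothetical, as are the distances and refusals in the example.

```python
def rank_units(distances):
    """Return unit indices ordered from smallest to largest distance."""
    return sorted(range(len(distances)), key=lambda i: distances[i])


def recruit(ranked_units, n_j, agrees):
    """Walk down the ranked list until n_j units agree to participate.

    agrees: hypothetical callable returning True if a unit consents.
    Returns fewer than n_j units if the list is exhausted first.
    """
    recruited = []
    for unit in ranked_units:
        if len(recruited) == n_j:
            break
        if agrees(unit):
            recruited.append(unit)
    return recruited


# Illustrative distances for five units in one stratum (made-up values).
distances = [0.9, 0.1, 0.5, 0.3, 0.7]
ranked = rank_units(distances)

# Suppose units 1 and 2 refuse; the recruiter moves on to the next-ranked units.
refusers = {1, 2}
sample = recruit(ranked, 2, lambda unit: unit not in refusers)
print(ranked, sample)
```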
Nonresponse (Refusals)
As noted earlier, it is assumed that many units will not agree to be in the experiment. For example, after ranking units within a stratum, it is possible that the first unit to successfully enter the experiment is not “1” but instead “14.” Here the concern is that the nj units that agree to be in the sample are different than the Nj units in the population on either the covariates in
The second concern is that units that refuse to take part in the experiment are different than those that agree to be in the experiment. This is a question regarding an omitted variable X*. Note that differences in relation to X* will only cause bias if the potential treatment effects are a function of X* (conditional on the other
Evaluation of the Sample
In practice, it is unlikely that the nj “best” units (those with the smallest distances dij) in each stratum will all agree to be in the study. As discussed earlier, when the strata are sufficiently homogenous, the inclusion of later-ranked units will typically not be problematic. Regardless, once the sample is selected, it is important to evaluate the degree of balance between the final sample S and population P. To do so, balance (of Order 1) can be assessed in each stratum j by comparing E(Xh |Z = 1, j = j) to E(Xh |j = j) for each covariate Xh. Then the overall balance of the final sample S and the population P can be evaluated by comparing E(Xh^r |Z = 1) to E(Xh^r) for various values of r. Whether this degree of balance is adequate can be judged using both substantive criteria regarding the importance of each covariate and statistical criteria such as standardized mean differences or t-tests. Finally, if large residual differences are detected, these can be reduced through post hoc strategies like regression adjustment or reweighting (e.g., Hedges and O’Muircheartaigh 2011).
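One of the statistical criteria mentioned above, the standardized mean difference, can be sketched as follows. Standardizing by the population standard deviation is an assumption made here for illustration, and the function name and data are hypothetical.

```python
import statistics


def standardized_mean_difference(sample_values, population_values):
    """Difference between sample and population means, in population SD units.

    Assumption: the difference is standardized by the population standard
    deviation; other standardizations are possible.
    """
    pop_mean = statistics.fmean(population_values)
    pop_sd = statistics.pstdev(population_values)
    return (statistics.fmean(sample_values) - pop_mean) / pop_sd


# Illustrative covariate values (made up).
population = [1.0, 2.0, 3.0, 4.0, 5.0]
balanced_sample = [2.0, 3.0, 4.0]
unbalanced_sample = [4.0, 5.0]

smd_balanced = standardized_mean_difference(balanced_sample, population)
smd_unbalanced = standardized_mean_difference(unbalanced_sample, population)
print(smd_balanced, smd_unbalanced)
```

The same function can be applied within a stratum by passing only the sample and population units that belong to that stratum.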
Example
In order to illustrate the implementation of this method and its benefits compared to the conventional bottom-up approach to generalization, in this section we present a reanalysis of an experiment evaluating a middle-school mathematics program, SimCalc. The original study included 73 schools that, while selected with an eye toward generalization, did not use any formal method for doing so (Roschelle et al. 2010). In order to better generalize from these schools to the population of noncharter schools serving seventh graders in Texas (N = 1,713), Tipton (2013) reanalyzed these data using a propensity score subclassification estimator based on 26 covariates from the state academic excellence system. Here we present a new analysis in which we ask, what would happen if we could go back in time and instead collect the sample using the sample selection method developed here? How different would this sample be from the sample actually collected in the experiment?
All analyses presented here were conducted in the statistical program
Table 1. Comparison of Population Means by Variable and Stratum.
Note. LEP = limited English proficiency. Bold-italic, dark gray values indicate the maximum across strata for a particular covariate, while boldface, light gray values indicate the row minimums.

Figure 1. Elbow plot of the proportion of total variance between strata.
Figure 1 is an elbow plot illustrating the proportion of variance between strata. Note that at first, adding strata dramatically increases ρk, but eventually these changes become smaller. Based on this figure, we selected the k = 9 strata solution, since for this number over 80% of the total variation in the 26 variables in
Table 1 also includes, for each stratum, the proportion of the population (wpj = Nj/N), the number of units in the sample allocated using proportional allocation with rounding (nj ), and the number of units in the actual experiment (nej ). This reveals that while the actual experiment represented the population fairly well, it did greatly overrepresent schools from Stratum 1 and underrepresent schools in Stratum 3. The fact that the experiment did not include any schools from Stratum 3 is an example of a coverage error.
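The comparison of planned and actual stratum counts can be sketched as follows. The counts are made up for illustration and are not the values from Table 1; a stratum is flagged as a coverage error when it was allocated units but none appear in the actual sample.

```python
def allocation_report(planned, actual):
    """Compare planned (n_j) and actual (ne_j) sample counts by stratum."""
    report = []
    for j, (n_j, ne_j) in enumerate(zip(planned, actual), start=1):
        report.append({
            "stratum": j,
            "planned": n_j,
            "actual": ne_j,
            # Positive values mean the stratum is overrepresented.
            "difference": ne_j - n_j,
            # Units were allocated to the stratum but none were sampled.
            "coverage_error": n_j > 0 and ne_j == 0,
        })
    return report


# Hypothetical counts for three strata (not the article's data).
planned = [4, 3, 3]
actual = [7, 3, 0]
report = allocation_report(planned, actual)
uncovered = [row["stratum"] for row in report if row["coverage_error"]]
print(uncovered)
```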
In order to evaluate the overall balance and bias robustness that would be achieved using this approach, in Table 2 we report E(Xh^r) for r = 1, 2, 3 for each of the 26 covariates in the population. We then use our distance-based method to order the units in each stratum j, where lower ranks indicate smaller distances from the stratum mean for the population. Since we do not have a sense of which covariate is likely to have the largest impact, we set the weights so that wh = 1/V(Xh), therefore weighting each covariate equally. In order to illustrate the usefulness of the method under both ideal conditions and high nonresponse, we include two possible samples. The first sample selects the nj top-ranked schools (those with the smallest distances) in each stratum; this is the “ideal” sample. The second (“nonresponse”) sample instead assumes that the first 50 units in each stratum refused to participate in the experiment and that the next nj schools agreed. Note that this would mean that in total 450 schools refused participation before any agreed to participate, which corresponds to a nonresponse rate of at least 83%. In each sample of schools, we then calculated E(Xh^r |Z = 1) for r = 1, 2, 3 for each covariate. We also calculated these moments for the actual sample used in the experiment.
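The construction of the two planned samples and the moment calculations can be sketched as follows. All function names are hypothetical, and the ranked list is a placeholder. In the nonresponse rate example, the figure of 73 agreeing schools is an assumption borrowed from the size of the original study, since the total planned sample size after rounding is not restated here.

```python
def select_sample(ranked_units, n_j, n_refusals=0):
    """Take n_j units from a ranked list after skipping the first n_refusals."""
    return ranked_units[n_refusals:n_refusals + n_j]


def raw_moment(values, r):
    """Average of the r-th powers of the values, the analogue of E(X^r)."""
    return sum(v ** r for v in values) / len(values)


def nonresponse_rate(n_refused, n_agreed):
    """Share of approached units that refused to participate."""
    return n_refused / (n_refused + n_agreed)


# Placeholder ranked list for one stratum: unit 0 is closest to the stratum mean.
ranked = list(range(100))
ideal = select_sample(ranked, 5)                      # first n_j units
nonresponse = select_sample(ranked, 5, n_refusals=50)  # next n_j after 50 refusals

# 9 strata x 50 refusals = 450 refusals; 73 agreeing schools is an assumption.
rate = nonresponse_rate(9 * 50, 73)
print(ideal, nonresponse, round(rate, 3))
```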
Table 2. Comparison of Balance of Orders 1, 2, 3 for the Population, Planned Sample, and Completed Experiment.
Note. LEP = limited English proficiency. % bias is the absolute difference from the population, standardized by the population value. Boldface values indicate the smallest bias when comparing the planned sample to the completed experiment. The “ideal” planned sample selects the first nj units from each stratum (see Table 1), while in each stratum in the “nonresponse” planned sample, the first 50 schools refused participation and the next nj units agreed.
In order to compare the ideal and nonresponse samples, as well as the “actual” sample used in the experiment, to the population, in Table 2 we report the percentage of absolute bias, where
$$\%\,\text{bias} = \left|\frac{E(X_h^r \mid Z = 1) - E(X_h^r)}{E(X_h^r)}\right|$$
for each r = 1, 2, 3, and where, depending upon the column, Z = 1 indicates the actual sample or the proposed sample based on the method developed here. Bolded values indicate variables for which the balance achieved is better using either the ideal or nonresponse sample selection strategies developed here than that achieved in the actual experiment. This balance is better for 19 or more variables (of 26) in terms of first and second moments and 14 or more in terms of third moments. Additionally, the maximum absolute relative difference is 0.253 for the ideal sample (over all 26 covariates) and 0.297 for the nonresponse sample, compared to 0.695 for the actual experiment. In general, balance is better for both Orders 1 and 2 than Order 3, indicating that the ideal and nonresponse samples lead to approximately balanced samples of Order 2.
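The bias measure can be sketched in Python as follows, assuming it is the absolute difference between the sample and population moments divided by the population moment, as described in the note to Table 2. The function names and data are hypothetical.

```python
def raw_moment(values, r):
    """Average of the r-th powers of the values, the analogue of E(X^r)."""
    return sum(v ** r for v in values) / len(values)


def relative_absolute_bias(sample_values, population_values, r):
    """|E(X^r | Z = 1) - E(X^r)| / |E(X^r)| for one covariate and one order r."""
    pop = raw_moment(population_values, r)
    smp = raw_moment(sample_values, r)
    return abs(smp - pop) / abs(pop)


# Illustrative covariate values (made up).
population = [1.0, 2.0, 3.0, 4.0]
sample = [2.0, 3.0]

bias_by_order = {r: relative_absolute_bias(sample, population, r) for r in (1, 2, 3)}
print(bias_by_order)
```

In this toy example the sample is exactly balanced on the first moment but not on the second or third, which mirrors the pattern that balance tends to deteriorate at higher orders.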
Since there are some imbalances remaining, post hoc propensity score methods could be used for further adjustments. In these methods, the achieved sample of 73 schools is compared to the population of 1,713 schools on the 26 covariates using a logistic regression to estimate the propensity score. Comparing the empirical densities of propensity score logits indicates how well the results might generalize (Stuart et al. 2011). When these densities are similar, generalizations are easier since reweighting approaches lead to large reductions in bias with only small increases in variance; when they differ, particularly when the region of common support is small (indicating coverage errors), reweighting approaches are less effective, leading to estimates with remaining bias and larger variance inflations (Tipton 2013).
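A minimal sketch of the propensity score step is given below, under several assumptions that are not from the article: a single covariate rather than 26, sample rows (Z = 1) stacked on top of population rows (Z = 0), and a plain gradient-ascent fit written with the standard library only. In practice an established statistical routine would be used; all names and data here are hypothetical.

```python
import math


def fit_logistic(X, z, lr=0.1, n_iter=2000):
    """Fit logit P(Z = 1 | x) = b0 + b'x by gradient ascent on the mean log-likelihood."""
    p = len(X[0])
    n = len(X)
    beta = [0.0] * (p + 1)  # beta[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, zi in zip(X, z):
            eta = beta[0] + sum(b * x for b, x in zip(beta[1:], xi))
            resid = zi - 1.0 / (1.0 + math.exp(-eta))
            grad[0] += resid
            for h in range(p):
                grad[h + 1] += resid * xi[h]
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def logit_scores(X, beta):
    """Estimated propensity score logits for each row of X."""
    return [beta[0] + sum(b * x for b, x in zip(beta[1:], xi)) for xi in X]


# Made-up data: one standardized covariate for 41 population units and 4 sample units.
population = [[x / 10] for x in range(-20, 21)]
sample = [[0.5], [1.0], [1.5], [-0.5]]

X = sample + population
z = [1] * len(sample) + [0] * len(population)
beta = fit_logistic(X, z)

sample_logits = logit_scores(sample, beta)
population_logits = logit_scores(population, beta)
print(beta)
```

The two lists of logits are what would be compared, for example through their empirical densities. Here the sample sits higher on the covariate than the population, so its logits are shifted upward.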
In Figure 2, we compare the empirical densities of the propensity scores in the sample and population for each of the three samples under comparison (actual, ideal, and nonresponse). As the shapes of the densities and the lines marking the population quintiles indicate, the samples selected using methods from this article are more similar to the inference population than the actual sample used in the experiment. Importantly, the long tail in the actual sample indicates a large coverage error (Tipton 2013); this long tail is not present in either of the samples produced using the methods developed here, making additional adjustments using post hoc methods more effective.

Figure 2. Comparison of propensity score logit densities for three samples versus population.
Discussion and Conclusion
The purpose of this article has been twofold. First, it proposes a framework for site selection in large-scale randomized experiments (model-based sampling) that can be used in conjunction with the super-population models commonly used to estimate PATEs. Second, it provides a method for selecting a self-weighting bias-robust balanced sample within this framework through the creation of strata based on cluster analysis methods. In this section, we conclude by briefly discussing the benefits, limitations, and possible extensions of the method.
Subgroup Analyses
When treatment effects do in fact vary, a single PATE is clearly inadequate for summarizing the effectiveness of a program or an intervention. One solution is to additionally report subgroup average treatment effects. A benefit of our site-selection approach is that a separate average treatment effect can be calculated for each of the k strata. This strategy is similar to the method for subgroup creation proposed by Peck (2005) in general, and, when the covariates contained in
Eligibility Problems
It is possible that some of the N units in the population P are not eligible for inclusion in the experiment. For example, certain schools may already use the program under study, or when resources are limited, travel outside a particular area may be infeasible. The goal of bias-robust sample selection is still useful here, particularly since it requires study planners to determine whether the reasons for ineligibility explain variations in treatment effects (warranting inclusion in
Nonresponse Analysis
One of the biggest practical concerns in sample selection in experiments is the fact that many sites will not agree to be in the experiment. A benefit of this approach is that by articulating at the outset a set of covariates (
Relationship to Post Hoc Methods
The goals and framework for balanced sampling have much in common theoretically with the goals of propensity score matching for post hoc adjustments for generalization. Even when using the stratified sampling method introduced here, differences may remain between the achieved sample and the population, particularly when nonresponse is high. However, as illustrated in the example, these remaining imbalances can be more easily adjusted if generalization is planned for. This is both because any remaining imbalances are typically smaller (Table 2) and because coverage errors are greatly reduced (Figure 2), each of which makes reweighting procedures more effective.
Random Selection and Design-Based Inference
Given the infeasibility of random sampling in experiments, this article has provided a method for site selection that is strategic, model based, and nonrandom. However, the method we develop for site selection—stratified sampling with proportional allocation based on cluster analysis—can also be used in a design-based framework with probability sampling.
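As a rough illustration of the design-based alternative, the sketch below draws a simple random sample within each stratum using proportional allocation. The strata, unit identifiers, seed, and function name are all hypothetical.

```python
import random


def stratified_random_sample(strata, n, seed=0):
    """Draw a simple random sample within each stratum, allocated proportionally.

    strata: dict mapping a stratum label to the list of its population units.
    n: total target sample size (rounding may shift the realized total).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    total = sum(len(units) for units in strata.values())
    sample = {}
    for label, units in strata.items():
        n_j = int(len(units) * n / total + 0.5)  # proportional allocation, rounded
        sample[label] = rng.sample(units, min(n_j, len(units)))
    return sample


# Made-up population of 100 units split into two strata.
strata = {"A": list(range(60)), "B": list(range(60, 100))}
drawn = stratified_random_sample(strata, 10)
print({label: sorted(units) for label, units in drawn.items()})
```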
Data Frame Concerns
A potential weakness of the stratified selection method developed here is that it requires an available sampling frame. This results in two limitations to our approach. First, like the post hoc methods available for generalization, this sampling frame needs to include a rich set of covariates on all units in the population, where here the covariates that matter are those that explain variation in treatment effects. Since most censuses of schools and districts focus on demographics, using any model-based approach could result in omitted variable bias. An important feature of this method is that it requires a thoughtful discussion of the benefits and limitations of the achieved sample for generalizations and to whom, where, and under what conditions or assumptions these generalizations are most warranted.
Second, the fact that the sampling frame must enumerate all N units in the population means that the method we have developed here is most useful when the sampling units are aggregates, since data on aggregates (e.g., schools) are typically publicly available, while data on individuals are not. However, while the strata creation method developed here may not be practical when the unit of analysis is the individual, we argue that the bias-robust framework is still useful. Stratified sampling is only one method for creating a balanced sample in the bias-robust framework. Future research should investigate the practicality of methods that do not require such detailed population frames, such as quota sampling and respondent-driven sampling (e.g., Smith 1983; Watters and Biernacki 1989).
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The author received NSF grant 1118978.
