Analysis of covariance (ANCOVA) is commonly used to adjust for potential confounders in observational studies of intervention effects. Measurement error in the covariates used in ANCOVA models can lead to inconsistent estimators of intervention effects. While errors-in-variables (EIV) regression can restore consistency, it requires surrogacy assumptions for the error-prone covariates that may be violated in practical settings.
Objectives:
The objectives of this article are (1) to derive asymptotic results for ANCOVA using EIV regression when measurement errors may not satisfy the standard surrogacy assumptions and (2) to demonstrate how these results can be used to explore the potential bias from ANCOVA models that either ignore measurement error by using ordinary least squares (OLS) regression or use EIV regression when its required assumptions do not hold.
Results:
The article derives asymptotic results for ANCOVA with error-prone covariates that cover a variety of cases relevant to applications. It then uses the results in a case study of choosing among ANCOVA model specifications for estimating teacher effects using longitudinal data from a large urban school system. It finds evidence that estimates of teacher effects computed using EIV regression may have smaller bias than estimates computed using OLS regression when the data available for adjusting for students’ prior achievement are limited.
Knowledge of the causal effects of interventions, programs, treatments, and even individuals (e.g., teachers or physicians) is of great value to policy and decision makers. Learning about such effects from data often involves observational studies, in which naturally occurring variation in exposure to treatments1 is used to estimate the effects of those treatments on outcomes of interest. A well-known limitation of such studies is the potential for units with differing levels of the exposure also to differ on other factors that are related to the outcomes. These other factors can confound the estimated causal effects. A common method to reduce the potential for confounding is to make adjustments for “covariates,” observed measures of the factors that are related to the outcomes and that might vary among units with different exposures.
Historically, a primary method for adjusting for covariates has been ordinary least squares (OLS) linear regression of the outcome on the covariates and indicator (dummy) variables for treatment status. This method is commonly referred to as “analysis of covariance” (ANCOVA). Although there has been substantial research on alternatives to ANCOVA for the estimation of causal effects (e.g., Imbens & Rubin, 2015; Pearl, Glymour, & Jewell, 2016; Shadish, Cook, & Campbell, 2002), it remains widely used. For example, it is commonly used to estimate treatment effects from data arising from pretest–posttest designs (e.g., Yang & Tsiatis, 2001). It is also an appealing choice in settings with a potentially large number of treatment groups, each with a relatively small number of units, which poses practical difficulties for alternatives to ANCOVA such as matching or weighting estimators. A recent, high-profile example of such an application is so-called value-added modeling for estimating the effects of teachers or schools on student achievement using longitudinal test score data (Harris, 2011; McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004).
A key assumption required for ANCOVA to yield unconfounded estimates of treatment effects is the assumption of “selection on observables” or “no unobserved confounders” (Imbens, 2004; Rosenbaum & Rubin, 1983). This assumption means that conditional on the covariates included in the ANCOVA model, there is no variable that both predicts the outcomes and has different distributions across treatment groups. Provided this assumption holds, and provided the functional form assumptions of the ANCOVA model are correct and other standard regularity conditions hold, then treatment effect estimators from the ANCOVA model will be consistent.
Since as early as 1959, measurement error in covariates has been recognized as a complication for making accurate inferences from ANCOVA models (Scheffe, 1959). Specifically, if the selection-on-observables assumption would hold conditional on a set of adjustment variables , but the observed covariates measure any of the components of with error, then the selection-on-observables assumption generally will not hold conditional on (e.g., Steiner, Cook, & Shadish, 2011; Yi, Ma, & Carroll, 2012). Thus, an ANCOVA model in which observed covariates are used in place of their corresponding latent variables can lead to confounded treatment effect estimators (e.g., Culpepper & Aguinis, 2011; Lockwood & McCaffrey, 2014).
A standard method for restoring consistent estimation in linear models with error-prone covariates is errors-in-variables (EIV) regression (Carroll, Ruppert, Stefanski, & Crainiceanu, 2006; Fuller, 2006). If information about the variances and covariances of the measurement errors in error-prone covariates is known, or can be estimated, then EIV regression can yield consistent estimators of treatment effects when covariates are measured with error, provided that other assumptions hold. One of the critical assumptions involves what is called “surrogacy” in the measurement error literature (Buonaccorsi, 2010; Carroll et al., 2006; Lockwood & McCaffrey, 2016). In ANCOVA models, the surrogacy assumption is that measurement errors are uncorrelated with the latent values of error-prone covariates, any other error-free covariates in the model, the treatment group indicators, and the residual error term of the model.
Unfortunately, practical settings pose various opportunities for failure of the surrogacy assumption. First, measurement errors may be related to the levels of treatment individuals receive. For example, test scores from achievement assessments measure latent achievement with error (Lord, 1980), and school staff might use test scores, along with other variables correlated with achievement, to select students for a reading intervention or to assign students to advanced or remedial classes. This can result in correlations between measurement errors and treatment status. Second, outcomes might depend on the measurement errors. For example, blood pressure is known to be routinely measured with error (Burkard et al., 2018), and a particularly high reading might encourage people to change their diets and increase exercise in ways that could affect their future health status. Both of these scenarios violate the surrogacy assumption and generally will cause EIV regression to yield biased estimates of treatment effects. Although there are multiple sources for analytic derivations of the bias in ANCOVA using error-prone covariates without a correction for measurement error (e.g., Culpepper & Aguinis, 2011), we are unaware of the corresponding results for the bias when using EIV regression in ANCOVA models when the required assumptions do not hold.
This study has two objectives. First, we derive asymptotic results for ANCOVA using EIV regression when measurement errors may not satisfy the standard surrogacy assumptions. We show that the magnitude of asymptotic bias in estimated treatment effects depends on a combination of true model coefficients, the degree of sorting of units to treatment groups with respect to covariates, and the relationships of measurement errors to various quantities. Second, using a case study of estimating teacher effects on student achievement, we demonstrate a method for using data to explore the potential bias from using either OLS or EIV to estimate ANCOVA models when the assumptions for both methods may be violated. The results of the method can help to guide which estimation approach is likely to have smaller bias in a given setting. We conclude with a discussion of the findings, their limitations, and areas for future work.
Asymptotic Results for EIV Regression With Surrogacy Violations
This section first defines the structural model, which allows for the possibility that measurement errors in error-prone covariates may not satisfy the standard surrogacy assumptions. It then defines the EIV regression estimator of the parameters of the structural model, derives the probability limit for this estimator, and derives a scalar summary of the asymptotic bias across treatment groups.
Structural Model
Consider the ANCOVA model:
for independent and identically distributed (IID) samples from some target population. The variable is an observed outcome of interest. The vector contains dummy variables for a fixed number of groups, with component equal to 1 if unit is in group and 0 otherwise. We assume that each unit is linked to exactly one group, so that for each . Thus, the model excludes an intercept, making the parameters for the groups equal to adjusted group means. In typical settings, “treatment effects” would be defined by contrasts among the components of , such as the comparison of each group to a baseline group or to the average of all groups. Our focus on the adjusted group means directly rather than contrasts simplifies the analytical results and does not affect the substantive conclusions.
The variable in Equation 1 is a vector of length . The scalar in Equation 1 represents a variable that may belong in the model, and generally may be correlated with elements of and/or , but is not used in the estimation and thus is a potential source of bias. We refer to as an “omitted variable” since it is not included in the analysis. There is no loss of generality by assuming is a scalar. If there were multiple omitted variables, could equal the inner product of the variables and their associated model coefficients, and would then equal to 1. The error term in Equation 1 is assumed to satisfy . Without the loss of generality, we assume that and because this simplifies some later expressions. We assume that standard regularity conditions apply, so that if were observed and if , then OLS regression of on and would consistently estimate . One such regularity condition that is important for our derivations is that the probability that component of is equal to 1 satisfies for all .
This article focuses on the case where is not directly observed2 but rather is measured with error by . We assume that the vector of measurement errors satisfies . To make the notation more compact, we allow components of to be identically equal to 0, corresponding to components of that are measured without error by the corresponding components of . Thus, generally may consist of both error-free and error-prone components.
The assumptions about the measurement errors are the focus of this study. The surrogacy assumption made in standard EIV regression is that , , , and all equal . This assumption means that is uncorrelated with all quantities determining the outcomes under Model 1, including the treatment assignments, the latent covariates , and all other unobserved or unmodeled variables that affect the outcomes. This is a strong assumption, and as previously noted, there are often reasons to doubt this assumption in practical applications. The primary purpose of this study is to explore the implications of the failure of the assumption. We focus on the case where , , and equal , but where may not equal .3 This assumption corresponds to the case where measurement errors have no direct relationship to outcomes but may be related to treatment assignments. Lockwood and McCaffrey (2016) use the term “weak surrogacy” to refer to such a case. One mechanism for a relationship between measurement errors and treatment assignments would be if directly influenced the assignments of units to treatment groups, which can induce correlation between elements of and elements of . As noted previously, such a situation is plausible in many applications with error-prone covariates, so that understanding estimator behavior in this situation is valuable.
We denote the variance–covariance matrix of for unit by , a positive semidefinite matrix that is allowed to vary across units to account for the possibility of heteroscedastic measurement errors. Diagonal and corresponding off-diagonal elements of may be 0 due to degeneracy of some components of for error-free covariates. We assume that exists and is positive semidefinite. Note that because we are assuming that for all , the matrix is the marginal variance–covariance matrix of the measurement errors in the population of units. Let be the positive definite variance–covariance matrix of , let be the vector of covariances between and , and let be the variance of . Then, the marginal variance–covariance matrix of for a unit randomly sampled from the population is the matrix:
The zero blocks follow from the assumption that is uncorrelated with both and . This implies that the marginal variance–covariance matrix of for a unit randomly sampled from the population is .
The EIV Regression Estimator
We assume that the goal is to estimate the parameters in Equation 1 from observed data using EIV regression of on and . Let , and let denote the design matrix with rows . Let be a matrix with all zero entries, except for the lower right-hand block equal to , where is a working variance–covariance matrix of the measurement errors, discussed in additional detail below. Then, the EIV regression estimator of is
Specifying the working variance–covariance matrix of the measurement errors, , is the critical step of applying EIV regression. The specification of depends on what is known about the magnitude of measurement errors, and the superscripting by is used to indicate that may be a function of the observed data. This accommodates several scenarios that commonly arise in applications involving error-prone covariates, and we itemize some of those scenarios here. First, when is known for , then could be . This also covers the case of homoscedastic measurement error where is a known constant matrix and . Second, in some applications, such as those involving achievement tests, may equal for a known function (e.g., Lord, 1980). In such applications, even though is not observed for any because is latent, may be specified as an estimator of computed from the observed data using deconvolution methods (Lockwood & McCaffrey, 2014, 2017). Third, in many applications, analysts may know the “reliability” of each error-prone covariate (e.g., Crocker & Algina, 1986). Letting , , and , the reliability of for is . Note that , with for any component of that is measured without error. Also note that for each , under the assumption that . Thus, when is known and it is assumed that measurement errors for unit are uncorrelated across components , it is common to specify as a diagonal matrix with diagonal element equal to , where is an estimator of (e.g., Fuller & Hidiroglou, 1978). Finally, note that setting corresponds to OLS regression, so that EIV regression can be viewed as a generalization of OLS regression, and the asymptotic results presented later cover both estimators.
Probability Limits
We now derive the probability limit of in Equation 3. Some additional notation is needed. We assume that exists and denote this limit by . Note that may or may not equal the true population variance–covariance matrix of the measurement errors . Different combinations of and cover four situations: (1) OLS regression when all covariates are measured without error , (2) OLS regression with measurement error that is ignored , (3) EIV regression using an asymptotically correct error variance–covariance matrix , and (4) EIV regression using a misspecified error variance–covariance matrix .
We denote the conditional means of , , and for groups by , , and , respectively. We use to denote the matrix with row equal to , and analogously use to denote the matrix with row equal to . We let . We let , where is the probability that component of is equal to 1. Assuming the necessary conditions so all the limits exist, algebraic manipulations yield that:
and
Additional details on this derivation, and other derivations in this section, are provided in the Appendix. Provided the inverse of Equation 4 exists, is equal to the product of this inverse and Equation 5. This can be evaluated using the standard formula for the inverse of a block matrix. After some simplification, this yields that equals:
where
Further algebraic manipulation then yields:
There are some special cases where consistency holds, that is, where and .
Case 1: The estimator is consistent when there is no measurement error, OLS is used, and , so that there is no omitted variable. Under these conditions, , , and reduces to . Then Equation 8 reduces to , and Equation 9 reduces to .
Case 2: The estimator is consistent when there is measurement error , EIV regression is applied with a working value satisfying , the required surrogacy assumptions hold so that and . Under these conditions, expressions for the probability limits reduce as in Case 1.
Case 3: The estimator is consistent when there is measurement error, measurement error is ignored by using OLS (i.e., is set equal to a matrix of zeros), for some matrix , and . This is a special case where selection on observables holds given and where is linear in and . Thus, the fact that measures with error is inconsequential because itself is sufficient to account for all relevant differences among treatment groups.4
Outside of these special cases, consistency generally will not hold. It is clear from Equations 8 and 9 that even when , the asymptotic bias in depends on a combination of the true model coefficients, the degree of sorting of units to treatment groups with respect to the latent variables, and the extent to which violations of surrogacy lead to . The asymptotic behavior can be nonintuitive when surrogacy fails to hold. For example, it is possible for , but . This would occur if the groups were balanced on the observed, error-prone variable but not the latent variable, that is, the group means for all equal 0 but the group means for do not. In such cases, it is possible to have no omitted variables, correctly estimate the coefficients for all covariates, yet still have due to the term in Equation 9.
Asymptotic Bias in Group Effects
In this study, we are primarily interested in the bias in because this translates directly into bias in estimated treatment effects. A scalar summary of the magnitude of the asymptotic bias of across the ensemble of groups can be constructed as follows. From Equation 9, the asymptotic bias of is , which can be expressed as the product of the matrix and the vector of length given by:
Thus, the mean of the squared asymptotic bias is
where is the “between-group” variance–covariance matrix of under a probability distribution for the groups with probabilities . The matrix can be expressed as:
which follows from the assumption that the marginal means of , , and are each a conforming , implying that , , and are each a conforming .
Not all of the elements of the matrix in Equation 12 can be identified from the observed data without additional assumptions. For example, the elements corresponding to and are not identified without additional assumptions about how was used to assign units to groups. However, is subject to three constraints without additional assumptions. First, must be positive semidefinite. Second, , the “within-group” variance–covariance matrix of , must be positive semidefinite. The third constraint follows from the fact that . Specifically, define to be the matrix , so that . Denote the between-group variance–covariance matrix of by , and denote the between-group variance–covariance matrix of by , the upper left block of in Equation 12. Then, and are related by the constraint that .
In the following section, using a case study of estimating teacher effects on student achievement, we demonstrate how these constraints can be used to explore the potential magnitude of bias in estimates of treatment effects obtained using either OLS or EIV ANCOVA models.
Application to Exploring Bias in Teacher Value-Added
As noted previously, variants of ANCOVA are commonly used to estimate the effects of individual teachers on student achievement (e.g., Clotfelter, Ladd, & Vigdor, 2006; Goldhaber & Hansen, 2010; Goldhaber, Quince, & Theobald, 2018; Harris & Sass, 2011; Isenberg et al., 2013; Sass, Hannaway, Xu, Figlio, & Feng, 2012). For example, a common approach to estimating teacher effects uses ANCOVA where the outcome variable is a student’s standardized test score at the end of a given school year, the “treatment” is the teacher who taught the student that year, and the covariates include test scores from prior school years as well as other student background variables such as race or ethnicity, gender, free or reduced-price meal eligibility, special education status, or indicators for participation in other programs (McCaffrey et al., 2004).
Test scores measure student achievement with error, and the literature on estimating teacher effects contains conflicting recommendations about whether to account for this measurement error when using prior test scores as adjustment variables in ANCOVA models. Current achievement is likely to depend on students’ prior achievement, not students’ prior test scores, which motivates adjusting for measurement error in prior test scores using EIV regression or related methods (Andrabi, Das, Khwaja, & Zajonc, 2011; Lockwood & McCaffrey, 2014). On the other hand, prior test scores may be among the variables used to assign students to current-year teachers. This would violate the surrogacy assumptions required for consistency of the EIV regression estimator and could cause estimators that ignore measurement error, such as OLS regression, to have smaller bias than methods that try to account for it (Sengewald, Steiner, & Pohl, 2019; Steiner & Kim, 2016). Lockwood and McCaffrey (2019) develop empirical tests about what information is used to assign students to teachers and find evidence that assignments are influenced by both prior test scores and unobserved variables related to prior achievement. Such assignment mechanisms are likely to violate the assumptions required for consistent estimation of teacher effects using either EIV regression or OLS regression. However, because these methods are straightforward to apply using standard software, they are likely to remain common in research and policy applications of teacher value-added estimation. Thus, finding ways to facilitate the choice among these estimators in settings where both may be biased is valuable.
In this section, we use the analytical results derived in the previous section to explore the average magnitude of the asymptotic bias in teacher effects estimated using different ANCOVA models applied to data from a large urban school district. We first consider whether EIV or OLS regression appears to be preferred, that is, which method tends to yield smaller asymptotic bias, when there are multiple prior years of test score data available for adjustment. We then consider whether the preference changes as the data available for adjustment become more limited. Specifically, both research and policy applications demand the ability to estimate teacher effects for elementary-grade teachers (e.g., at least teachers in Grades 4 and 5) where there may be test scores from only a single prior year available for adjustment. Guarino, Reckase, Stacy, and Wooldridge (2013) propose modeling with a single prior year, but other authors (e.g., Lockwood & McCaffrey, 2014; Rothstein, 2009) suggest more prior years are beneficial for reducing bias. Practitioners are likely to use whatever data are readily available to estimate teacher effects, even if only a single prior-year score is available, so it is useful to provide guidance about the choice among estimators under different assumptions about what data are available for adjustment.
Data
Our analyses use longitudinal test score data from a large urban school system. The data are from groups of students who were in one of three grades (Grades 6, 7, and 8) and in one of four school years (2008–2009 to 2011–2012) for a total of 12 grade/year groups. Each group of students has math and English language arts (ELA) test score data from its corresponding school year (the “target year”) and three previous school years (the “Lag-1,” “Lag-2,” and “Lag-3” years). For example, for students in the grade/year group corresponding to Grade 6 in 2008–2009, the target year is the 2008–2009 school year, the Lag-1 year is the 2007–2008 school year when the students were in Grade 5, the Lag-2 year is the 2006–2007 school year when the students were in Grade 4, and the Lag-3 year is the 2005–2006 school year when the students were in Grade 3. Test scores are linked to students across years by a unique student identifier.
For each student, the data also contain background characteristics consisting of gender, race/ethnicity, home language, English language learner status, special education status, disability status, free and reduced-price lunch status, and an indicator of excessive absence ( of school year). Gender, race/ethnicity, and home language are time-invariant in our data, whereas the remaining background variables can vary across years within students. Finally, the students are linked to their schools and teachers of core math courses in the target year. Our analyses consider estimating the effects of math teachers on math achievement in the target year.
Table 1 summarizes the data for the 12 grade/year groups, providing the numbers of students, teachers, and schools5 as well as the reliability (coefficient ; Cronbach, 1951) of each of the math and ELA tests from the Lag-1, Lag-2, and Lag-3 years. These values were obtained from the technical reports for the state’s testing program and will be used to compute EIV regression estimators of teacher effects, discussed later in this section.
Structural Model for Teacher Effects
We conducted analyses separately for each of the 12 grade/year groups. For each grade/year group, we follow Rothstein (2009) and Lockwood and McCaffrey (2019) by focusing on estimating the effects of math teachers within schools, that is, comparing the effects of math teachers in the target grade and year relative to other math teachers in the same school, grade, and year.6 Let index students in a given grade/year group. For each student , let denote the math test score in the target year, and let denote the vector of length 6 consisting of math and ELA test scores for the Lag-1, Lag-2, and Lag-3 years. We assume that is an error-prone measure of a corresponding vector of unobserved prior achievement attributes . Let be the vector of other background variables which are assumed to be measured without error. To reduce the dimension of the space of unknown parameters in our explorations, we used to create a scalar composite equal to a linear combination of all of the time-invariant student background variables and the time-varying background variables from the target and Lag-1 years in , standardized to have mean 0 and variance 1, and oriented such that it is positively correlated with achievement. The weights in the linear combination were taken as the estimated coefficients on the components of in a OLS regression of on , , and teacher dummy variables. The replacement of by is likely to have minimal consequence on the substantive results because in the regression for current year scores, has extremely small incremental (less than .01) conditional on prior-year test scores.
Let , a vector of length , and let be the corresponding vector of observed variables. Assume that where is a vector of mean-zero measurement errors with fixed values of zero for the final component corresponding to . Let be a vector of teacher dummy variables, equal to 1 for the student’s target-year math teacher and 0 otherwise, and let be a vector of school dummy variables indicating the student’s target-year school.7 We then assume the following linear structural model for the student outcomes, similar to models commonly used in applications of estimating teacher effects on student achievement:
We assume that are IID with mean 0, finite variance, and uncorrelated with the remaining terms in the model. Note that the model assumes that current achievement depends on the latent prior achievement attributes because . Thus, outside of special cases, OLS regression in which the error-prone measures are used in place of (i.e., where is plugged in for ) will lead to biased estimators of the teacher parameters . Alternatively, EIV regression that accounts for the error in as a measure of using the test reliabilities will lead to bias if test scores contributed to the target-year teacher assignments, violating surrogacy. Lockwood and McCaffrey (2019) show that the assumptions required for either OLS or EIV to yield consistent estimation for the model in Equation 13 do not hold for these data. Thus, it is valuable to evaluate which method might yield smaller bias. In the following sections, we use the data and asymptotic results to explore the possible magnitudes of the biases for OLS and EIV regression.
Our analyses assume that the model in Equation 13 is properly specified, so that there is no relevant omitted variable , as appears in Model 1. That is, we assume that in a hypothetical situation in which of dimension , consisting of as well as three prior years of math and ELA latent achievement attributes , were observed for each student without measurement error, then the parameters of Model 13 could be estimated consistently using OLS regression with . In practice, rather than is observed, and either OLS or EIV regression using may lead to biased estimates of teacher effects even when three prior years of test scores are available. We explore the potential magnitudes of these biases in this section. We also explore the potential magnitudes of the biases that may arise with either OLS or EIV regression when some of the covariates are omitted from the model, as would be the case if only a single year of prior data were available or if only math scores were used as covariates. In those cases, is determined by the subset of the seven covariates that we omit when fitting the model. For instance, if we omit ELA scores, then will equal a linear combination of the three prior years of ELA achievement.
Procedures
In this section, we describe the procedures that we used to explore the average asymptotic bias in teacher effects estimated with either OLS or EIV regression. Separately for each of the 12 grade/year groups, we executed the steps below using code written for the R environment (R Core Team, 2019), available from the authors by request.
Step 1
We set to be a diagonal matrix of measurement error variances, computed by multiplying one minus the reliabilities in Table 1 by the marginal variance of each component of , and a fixed zero for the diagonal element corresponding to . We further assume that the test measurement errors have mean 0 in each school, so that is the variance–covariance matrix of measurement errors in each school. This assumption is based on the fact that the vast majority of students in our data attended neighborhood schools, so that family housing choices largely determine school assignments. For this process to create nonzero school means of measurement errors of test scores used as covariates, it would be necessary for a nonnegligible number of individual families to make housing decisions on the basis of their own children’s prior test scores in such a way that the group of students who ended up in school in a given year would have measurement errors of their prior test scores with a nonzero mean. It seems implausible that selection of this form would be widespread enough to create more than negligible differences in measurement error means by school. This is in contrast to teacher-within-school assignments, where the decisions of one or a small number of people can simultaneously affect the assignments of many students and where there are substantive reasons to believe that prior test scores of individual students may influence the assignment decisions.
Summary of Student, Teacher and School Counts, as well as Reliabilities of Prior Test Scores, for the 12 Grade/Year Groups Used in the Empirical Explorations.a
School Year
Grade
# Students
# Teachers
# Schools
Reliabilities
M1
M2
M3
E1
E2
E3
2008–2009
6
37,188
1,333
360
.90
.94
.89
.84
.89
.85
2008–2009
7
39,086
1,189
304
.92
.90
.94
.88
.84
.88
2008–2009
8
43,046
1,183
300
.90
.91
.90
.88
.88
.82
2009–2010
6
44,333
1,424
370
.90
.94
.88
.83
.90
.86
2009–2010
7
44,288
1,289
315
.91
.90
.94
.86
.84
.89
2009–2010
8
45,655
1,252
322
.90
.92
.90
.88
.88
.84
2010–2011
6
46,033
1,484
396
.90
.94
.89
.83
.88
.86
2010–2011
7
44,534
1,272
333
.90
.90
.94
.87
.83
.90
2010–2011
8
45,023
1,226
311
.90
.91
.90
.88
.86
.84
2011–2012
6
26,583
751
221
.91
.94
.88
.90
.86
.86
2011–2012
7
25,663
693
205
.93
.90
.94
.92
.83
.88
2011–2012
8
25,317
620
176
.92
.90
.90
.92
.87
.83
a Reliabilities for the math tests from the Lag-1, Lag-2, and Lag-3 years are labeled “M1,” “M2,” and “M3,” respectively. “E1,” “E2,” and “E3” denote the analogous reliabilities for the English language arts tests.
The matrix is treated as the known, true measurement error variance–covariance matrix given the large student samples available, and the working measurement error variance–covariance matrix for EIV regression is set equal to .
Step 2
We estimate in Model 13 using EIV regression. The resulting estimated value of , which we denote by , is used as the working value of in subsequent evaluations of average squared bias for the group effects. Of course, the average squared bias for the group effects depends on the unknown . We show that substantive findings are not sensitive to using obtained from OLS rather than EIV regression.
Step 3
We use the data to compute three variance–covariance matrices: (a) , an estimate of the within-school variance–covariance matrix of ; (b) , an estimate of the within-school, between-teacher variance–covariance matrix of ; and (c) , an estimate of the within-school variance–covariance matrix of . The matrix has upper left block from Step 1 corresponding to , lower-right block − corresponding to , and a block of zeros corresponding to the covariances between elements of and . The matrices and computed in this step are used in Step 4 to create bounds on unknown variance–covariance parameters and also in Step 5 to evaluate asymptotic bias.
Step 4
We generate feasible values of , the within-school, between-teacher variance–covariance matrix of , using empirical versions of the three constraints on noted previously. Those constraints are (1) is positive semidefinite, (2) is positive semidefinite, and (3) . We apply empirical versions of these constraints where the estimates and from Step 3 are plugged in place of their corresponding unknown parameters, so that a potential value of is considered to be feasible if (1) is positive semidefinite, (2) () is positive semidefinite, and (3) . An important consideration in this step is that in general, contains unknown parameters subject to multiple linear and nonlinear constraints. With , this equals 105 parameters. It is slightly lower in our case because is assumed to have no measurement error but is still too large to feasibly explore or visualize without additional restrictions. To make the exploration of feasible values of manageable, we placed restrictions on the parameters of the matrix. We considered three different sets of restrictions each corresponding to a different set of assumptions about how prior scores contributed to student assignments to teachers and the correlations of teacher-level means of and . By making these restrictions, the matrices depend on two or three parameters and we then conducted a grid search of these parameters to explore potential values for . Details are provided in the Appendix.
Step 5
For each feasible , we compute the average asymptotic squared bias for the adjusted teacher means estimated with OLS, and estimated with EIV, applying the formulas derived earlier. These formulas depend on , the estimated matrices from Step 3 and from Step 2. We then repeated these calculations using selected subsets of the covariates to address the issue of how bias for these estimators changes as fewer prior years of data are available for modeling. In models where fewer than the full set of variables are used, the remaining variables combine to form the omitted variable defined in the analytical derivations. Specifically, the components of corresponding to subsets of the variables that are omitted define a linear combination of omitted variables that can be treated as with coefficient .
Results
Per Step 2 above, for each of the 12 grade/year groups, the working value of dimension obtained using EIV regression was used in place of the unknown true parameter . Figure 1 presents the values of for each of the 12 grade/year groups, showing that most of the coefficients are positive. The coefficients on prior math achievement decay with further lags but are positive even at Lag 3. The corresponding coefficients for prior ELA achievement also tend to be positive at Lags 1 and 2. Thus, there is the potential for omitted variable bias when some of the variables are excluded from the model, making our comparisons between OLS and EIV regression meaningful as we consider restricting the amount of data available for the adjustment in the ANCOVA model.
Estimated standardized coefficients on lagged achievement and the composite background variable from errors-in-variables (EIV) regression, for each of the 12 grade/year groups. Estimates from the same group are connected by line segments.
Figure 2 summarizes the average squared asymptotic bias for the adjusted teacher means estimated using OLS and EIV regression, for each feasible value of and configuration of the adjustment variables used in the estimation. On the vertical axis in each frame is the percentage of feasible scenarios in which the average squared bias for the teacher effects is larger for OLS than for EIV. Specifically, the denominator of the percentage is the number of feasible values of that we considered across the three selection scenarios and grids of unknown parameters noted previously (with details in the Appendix), and the numerator is the number of these feasible scenarios in which the average squared asymptotic bias for OLS regression exceeds that of EIV regression. The upper-left frame presents results for models that use one, two, or three lags of only math scores, ignoring both ELA scores and in the model. The upper-right frame is for models that use one, two, or three lags of math scores as well as . The lower-left frame is for models that use one, two, or three lags of both math and ELA scores, not using . Finally, the lower-right frame is for models that use one, two, or three lags of both math and ELA scores as well as . Each frame has 12 triplets of points connected by a line, one for each of the 12 grade/year groups. This configuration of models where either , ELA scores, or both are excluded was motivated by our experiences working with states and school districts to specify growth models, where depending on the purposes and intended audience, there may be substantive motivations to exclude either or both of these types of adjustment variables.
Summary of percentages of feasible scenarios where the average asymptotic bias in group effects estimated by ordinary least squares exceeds that for errors in variable (EIV). The four frames indicate different configurations of which adjustment variables are included in the models, and within each frame, results are presented for models that use only Lag-1 data, Lag-1 and Lag-2 data, and data from all three lags. Each frame presents results for all 12 grade/year groups, with the results for the same group connected by line segments.
The results show that feasible values of the unknown parameters contain cases where EIV would be preferred and cases where OLS would be preferred. EIV is preferred over more of the feasible space, particularly when only 1 year of prior data are used, but the results are mixed. The choice between OLS and EIV becomes less clear when more variables are included in the model. One possible explanation for this finding is that including additional covariates may reduce omitted variable bias8 and the potential impact of uncontrolled measurement error, in which case the potential benefits of EIV over OLS are reduced, but the risks of using EIV when surrogacy fails remain. The basic patterns in Figure 2 are replicated using an alternative value of obtained using OLS regression rather than EIV regression (AppendixFigure A1). Thus, the findings that OLS tends to have larger asymptotic bias than EIV over more of the feasible space of unknown parameters is not a consequence of using EIV to compute .
The results to this point favor EIV, but only modestly. It is thus useful to consider ways of reducing the space of unknown parameters, with the hope that comparisons among different estimators may be less ambiguous over a more restricted space. To develop a way to restrict the space, we begin with the observation that many of the feasible scenarios consist of cases where from EIV regression differs substantially from . That is, there are many feasible scenarios where the surrogacy violations are severe enough to cause the probability limit of estimated from EIV regression to deviate substantially from the assumed working value of . This arises when student–teacher assignments are strongly dependent on prior test scores, causing the average measurement error to vary substantially across teachers within schools. If there were evidence to suggest that such extreme selection mechanisms are not likely, it would provide a way to restrict the space of unknown parameters and perhaps sharpen the comparison of the bias for different estimators.
To explore this, we return to the assumption that the average measurement error in prior test scores should be approximately zero for each school. Under this assumption, prior test scores satisfy surrogacy with respect to dummy variables for school groupings, so that surrogacy violations should not be a source of bias for EIV regression that includes dummy variables for schools but excludes teacher dummy variables. We thus used EIV regression to fit a version of the model in Equation 13 that includes school dummy variables but excludes teacher dummy variables. We did this for each of the 12 grade/year groups. Figure 3 plots the estimated coefficients on the variables from the models that include only school dummy variables versus the estimated coefficients from the models that include both teacher and school dummy variables. The coefficients are remarkably insensitive to whether dummy variables for teacher groupings are included or excluded from the model. The simplest explanation for these results is that the average measurement error does not vary much across teachers; if it did, the resulting surrogacy violations for teacher dummy variables should cause larger discrepancies in estimated coefficients depending on whether teacher dummies are included or excluded from the model. While other explanations for the insensitivity of coefficients to model specification are possible, they involve distinct sources of bias happening to compensate in exactly the right ways to make the coefficients from the two model specifications line up. These results suggest that values of the unknown parameters determining the bias in estimated teacher effects that would also lead to very large distortions in from EIV regression do not appear to be likely with our data.
Comparison of estimated using errors-in-variables (EIV) regression using only school dummy variables (vertical axis), versus those obtained from EIV regression using school and teacher dummy variables (horizontal axis). Each point in the figure is for one of the seven model predictors, for one of the 12 grade/year groups, for a total of 84 points.
We thus can consider how the results in Figure 2 change if we restrict to scenarios that do not involve differing substantially from . This restriction is analogous to using an informative prior distribution over the space of unknown parameters, where we restrict attention to a subset of the space consisting of values that we believe to be more likely. For each of the 12 grade/year groups, let denote the coefficients on the adjustment variables obtained using EIV regression that includes school dummy variables but excludes teacher dummy variables. Figure 4 is analogous to Figure 2 but is restricted to feasible scenarios where the difference between and is about the same magnitude as we observe between and . Specifically, we computed the median Euclidean distance between and across the 12 grade/year groups and then restricted the feasible scenarios to those where the Euclidean distance between and did not exceed this median discrepancy. Under these restrictions, the conclusions are somewhat more clearly in favor of EIV regression, particularly when only Lag-1 data are used in the adjustment. This provides some evidence that EIV may be preferred over OLS in settings where only limited prior data are available for adjustment. There is more ambiguity when more lagged data are used in the model, although EIV is still favored in a majority of scenarios.
Analogous to Figure 2, but for a subset of feasible scenarios in which the difference between plim and is restricted to be about the same magnitude as we observe between and .
Discussion
While EIV regression, and latent variable models more generally, are popular methods for estimating causal effects with observational data, the asymptotic results make clear that assessing the magnitude and direction of biases resulting from surrogacy violations can be difficult due to dependence on a potentially large number of unidentified parameters. Although the use of the analytical results in the empirical application seemed to provide some insights regarding the choice of estimators for teacher effects, the findings are still limited by the need to reduce the number of free, unknown parameters arising from potential surrogacy violations, the need to condition on values of unknown structural parameters, and the need to decide how to aggregate results across the many feasible configurations of unknown parameters. The findings are also contingent on the specific distributions of variables and measurement errors in our data and thus may not generalize. Our application also benefited from having multiple prior years of data, which allowed us to study the impacts of explicitly defined omitted variables when teacher effects were estimated using fewer prior years of data for adjustment. In other applications of the empirical methods, it may be necessary to make assumptions about the properties of omitted variables that are not informed by the data in order to examine the trade-offs of various EIV versus OLS regression model specifications, which introduces even more unknown parameters. Also, if there are large discrepancies between estimates of the coefficients from OLS and EIV regression, then the results about bias may be sensitive to which is used to compute . In these cases, additional sensitivity analyses may be necessary to explore the impact of different values for .
In pretest–posttest design settings where the pretest and posttest are measured on the same scale, a common alternative to ANCOVA is the analysis of change or “gain scores” (e.g., Allison, 1990). A standard application of this type of analysis would regress the difference between the outcomes measured before and after the intervention on treatment group indicators and possibly other covariates. Situations in which the resulting OLS regression estimator will or will not be consistent could be assessed using developments analogous to those used here. Violations of the surrogacy assumption, such as when treatment assignments are in part determined by the error-prone pretest score, can cause treatment effect estimators from gain score models to be inconsistent.
We focused on the impacts of surrogacy violations in linear models. Recent research has considered how to adapt other common estimators of causal effects, such as weighting, matching, and doubly robust estimation, to the case of error-prone confounders (Kuroki & Pearl, 2014; Lockwood & McCaffrey, 2015, 2016; McCaffrey, Lockwood, & Setodji, 2013; Sengewald et al., 2019; Webb-Vargas, Rudolph, Lenis, Murakami, & Stuart, 2017; Yi et al., 2012). Most of these methods also rely on the assumption of surrogacy of the error-prone measures. While the analytical results derived here apply only to the linear model, it may be possible to extend the basic logic of using data to explore potential biases to these alternative estimators. Another worthwhile extension may be to linear models with multiple levels of nesting. For example, while we focused estimating teacher effects within schools, many policy applications of teacher value-added estimation compare teachers both within and among schools. Extending the analytical results to that case would require considering a three-level variance decomposition for , , and into within-teacher, between-teacher within-school, and between-school components. This case would be more complicated to evaluate than the case considered here due to both additional moment constraints and additional unknown parameters but would be worth future consideration, given its relevance to teacher evaluation policies.
This article focused on EIV regression as a method for addressing measurement error in covariates in ANCOVA models. EIV regression requires that the variance–covariance matrix of the measurement errors is known or can be estimated. There are settings in which such information may not be easily obtained. An alternative to EIV regression that is commonly applied in the case of error-prone covariates is instrumental variables (IVs) estimation (e.g., Greene, 2003; Wooldridge, 2002). IV estimation does not require knowing the variance–covariance matrix of the measurement errors and can yield consistent treatment effect estimators with error-prone covariates provided appropriate IVs are available and other assumptions hold. IV can yield consistent estimators of treatment effects under some violations of surrogacy considered in this article. The primary limitation is finding IVs that satisfy the appropriate assumptions required for consistent estimation, which generally can be difficult, and may be particularly difficult in the setting considered here where error-prone measures may directly influence treatment assignments. For example, in cases where there is concern that an error-prone measure may have been used in treatment assignments, a variable that provided an alternative measurement of the same latent variable but was known not to be used in treatment assignments could potentially meet the requirements to be an IV. However, situations in which such a variable exists and can be obtained for estimation may be rare.
The analytical results derived here cover two distinct types of surrogacy violations: one in which measurement error influences treatment assignments and one in which measurement error influences outcomes (in the Appendix). A third type of violation can occur when measurement errors are related to other covariates in the ANCOVA model. For example, there is evidence that people in different demographic groups have different proclivities to choose extreme values on survey items even when they have equal values of the latent construct(s) being measured by the survey (Weech-Maldonado, Elliott, Oluwole, Schiller, & Hays, 2008). More generally, “differential item functioning” (e.g., Holland & Wainer, 1993) exists when people with the same value of a latent construct being measured by a given item, but who are in different demographic groups, do not have the same distribution of item responses. These situations can lead to correlation between measurement errors and indicator variables for demographic groups that may be included in the ANCOVA model. Extensions to the analytical results presented here to cover this case would be straightforward. However, the methods for exploring the asymptotic bias with real data would be more tenuous because the correlations between measurement errors and other covariates introduce additional unknown parameters that affect the asymptotic bias of different estimators.
Finally, our asymptotic results and associated empirical methods focused on bias, but in applications, errors in treatment effect estimates due to variance in parameter estimates also are relevant. Thus, the choice among estimators may involve criteria such as mean squared error (MSE) for true treatment effects that combine bias and variance. Model misspecification for either OLS or EIV regression may lead to inconsistent variance estimators under some circumstances, and the bias may depend on unknown parameters just as it does for the bias in estimators of the regression model coefficients. Future work could consider analytical results for variance estimators and associated empirical methods for comparing MSE for different model specifications.
Footnotes
Appendix
Authors’ Note
The opinions expressed are those of the authors and do not represent views of ETS, the Institute of Education Sciences, or the U.S. Department of Education.
Acknowledgments
We thank James Carlson, Katherine E. Castellano, Matthew S. Johnson, the Editor, and two anonymous referees for helpful comments on earlier drafts.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported in part by the Institute of Education Sciences, U.S. Department of Education, through Grant No. R305D140032 to ETS.
ORCID iD
J. R. Lockwood
Notes
References
1.
AllisonP. D. (1990). Change scores as dependent variables in regression analysis. Sociological Methodology, 20, 93–114. doi:10.2307/271083
2.
AndrabiT.DasJ.KhwajaA.ZajoncT. (2011). Do value-added estimates add value? Accounting for learning dynamics. American Economic Journal: Applied Economics, 3, 29–54. doi:10.1257/app.3.3.29
3.
BallouD.SandersW.WrightP. (2004). Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics, 29, 37–66. doi:10.3102/10769986029001037
4.
BuonaccorsiJ. (2010). Measurement error: Models, methods, and applications. Boca Raton, FL: Chapman and Hall/CRC Interdisciplinary Statistics.
5.
BurkardT.MayrM.WinterhalderC.LeonardiL.EcksteinJ.VischerA. S. (2018). Reliability of single office blood pressure measurements. Heart, 104, 1173–1179. doi:10.1136/heartjnl-2017-312523
6.
CarrollR.RuppertD.StefanskiL.CrainiceanuC. (2006). Measurement error in nonlinear models: A modern perspective (2nd ed.). London, England: Chapman and Hall.
7.
ClotfelterC. T.LaddH. F.VigdorJ. L. (2006). Teacher-student matching and the assessment of teacher effectiveness. Journal of Human Resources, 41, 778–820. doi:10.3368/jhr.XLI.4.778
8.
CrockerL.AlginaJ. (1986). Introduction to classical and modern test theory. Orlando, FL: Holt, Rinehart and Winston.
9.
CronbachL. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334. doi:10.1007/bf02310555
10.
CulpepperS.AguinisH. (2011). Using analysis of covariance (ANCOVA) with fallible covariates. Psychological Methods, 16, 166–178. doi:10.1037/a0023355
11.
FullerW. (2006). Measurement error models (2nd ed.). New York, NY: John Wiley.
12.
FullerW.HidiroglouM. (1978). Regression estimation after correcting for attenuation. Journal of the American Statistical Association, 73, 99–104. doi:10.1080/01621459.1978.10480011
13.
GoldhaberD.HansenM. (2010). Using performance on the job to inform teacher tenure decisions. American Economic Review, 100, 250–255. doi:10.1257/aer.100.2.250
14.
GoldhaberD.QuinceV.TheobaldR. (2018). Has it always been this way? Tracing the evolution of teacher quality gaps in us public schools. American Educational Research Journal, 55, 171–201. doi:10.3102/0002831217733445
GuarinoC.HamE. H.ReckaseM.StacyB.WooldridgeJ. (2013). Sending teacher value-added into a tailspin: A simulation study of measurement error and nonrandom sorting. Unpublished manuscript.
17.
HarrisD. (2011). Value-added measures in education: What every educator needs to know. Cambridge, MA: Harvard Education Press.
18.
HarrisD.SassT. (2011). Teacher training, teacher quality and student achievement. Journal of Public Economics, 95, 798–812. doi:10.1016/j.jpubeco.2010.11.009
19.
HollandP.WainerH. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
20.
ImbensG. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. The Review of Economics and Statistics, 86, 4–29. doi:10.1162/003465304323023651
21.
ImbensG. W.RubinD. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge, England: Cambridge University Press.
22.
IsenbergE.MaxJ.GleasonP.PotamitesL.SantillanoR.HockH.HansenM. (2013). Access to effective teaching for disadvantaged students (Technical Report). Washington, DC: Institute of Education Sciences, U.S. Department of Education, NCEE 2014-4001. Retrieved fromhttps://ies.ed.gov/ncee/pubs/20144001/pdf/20144001.pdf
23.
JohnsonM. T.LipscombS.GillB. (2015). Sensitivity of teacher value-added estimates to student and peer control variables. Journal of Research on Educational Effectiveness, 8, 60–83. doi:10.1080/19345747.2014.967898
24.
KurokiM.PearlJ. (2014). Measurement bias and effect restoration in causal inference. Biometrika, 101, 423–437. doi:10.1093/biomet/ast066
25.
LockwoodJ. R.McCaffreyD. F. (2014). Correcting for test score measurement error in ANCOVA models for estimating treatment effects. Journal of Educational and Behavioral Statistics, 39, 22–52. doi:10.3102/1076998613509405
26.
LockwoodJ. R.McCaffreyD. F. (2015). Simulation-extrapolation for estimating means and causal effects with mismeasured covariates. Observational Studies, 1, 241–290.
27.
LockwoodJ. R.McCaffreyD. F. (2016). Matching and weighting with functions of error-prone covariates for causal inference. Journal of the American Statistical Association, 111, 1831–1839. doi:10.1080/01621459.2015.1122601
28.
LockwoodJ. R.McCaffreyD. F. (2017). Simulation-extrapolation with latent heteroskedastic error variance. Psychometrika, 82, 717–736. doi:10.1007/s11336-017-9556-y
29.
LockwoodJ. R.McCaffreyD. F. (2019). Using hidden information and performance-level boundaries to study student-teacher assignments: Implications for estimating teacher causal effects. Unpublished manuscript.
30.
LordF. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
31.
McCaffreyD.LockwoodJ.KoretzD.LouisT.HamiltonL. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29, 67–101. doi:10.3102/10769986029001067
32.
McCaffreyD.LockwoodJ.SetodjiC. (2013). Inverse probability weighting with error-prone covariates. Biometrika, 100, 671–680. doi:10.1093/biomet/ast022
33.
PapayJ.MurnaneR.WillettJ. (2016). The impact of test score labels on human-capital investment decisions. Journal of Human Resources, 51, 357–388. doi:10.3368/jhr.51.2.0713-5837R
34.
PearlJ.GlymourM.JewellN. P. (2016). Causal inference in statistics: A primer. Hoboken, NJ: Wiley.
35.
R Core Team. (2019). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved fromhttps://www.R-project.org/
36.
RosenbaumP.RubinD. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41–55. doi:10.1093/biomet/70.1.41
37.
RothsteinJ. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4, 537–571. doi:10.1162/edfp.2009.4.4.537
38.
SassT. R.HannawayJ.XuZ.FiglioD. N.FengL. (2012). Value added of teachers in high-poverty schools and lower poverty schools. Journal of Urban Economics, 72, 104–122. doi:10.1037/e721792011-001
39.
ScheffeH. (1959). The analysis of variance. New York, NY: Wiley.
40.
SengewaldM.-A.PohlS. (2019). Compensation and amplification of attenuation bias in causal effect estimates. Psychometrika, 84, 589–610. doi:10.1007/s11336-019-09665-6
41.
SengewaldM.-A.SteinerP.PohlS. (2019). When does measurement error in covariates impact causal effect estimates? Analytical derivations of different scenarios and an empirical illustration. British Journal of Mathematical and Statistical Psychology, 72, 244–270. doi:10.1111/bmsp.12146
42.
ShadishW.CookT.CampbellD. (2002). Experimental and quasi-experimental designs for generalized causal inference. New York, NY: Houghton Mifflin.
43.
SteinerP.CookT.ShadishW. (2011). On the importance of reliable covariate measurement in selection bias adjustments using propensity scores. Journal of Educational and Behavioral Statistics, 36, 213–236. doi:10.3102/1076998610375835
44.
SteinerP.KimY. (2016). The mechanics of omitted variable bias: Bias amplification and cancellation of offsetting biases. Journal of Causal Inference, 4, 20160009. doi:10.1515/jci-2016-0009
45.
Webb-VargasY.RudolphK.LenisD.MurakamiP.StuartE. (2017). An imputation-based solution to using mismeasured covariates in propensity score analysis. Statistical Methods in Medical Research, 26, 1824–1837. doi:10.1177/0962280215588771
46.
Weech-MaldonadoR.ElliottM. N.OluwoleA.SchillerK. C.HaysR. D. (2008). Survey response style and differential use of CAHPS rating scales by hispanics. Medical Care, 46, 963. doi:10.1097/mlr.0b013e3181791924
47.
WooldridgeJ. (2002). Econometric analysis of cross section and panel data. Cambridge: MIT Press.
48.
YangL.TsiatisA. (2001). Efficiency study of estimators for a treatment effect in a pretest-posttest trial. The American Statistician, 55, 314–321. doi:10.1198/000313001753272466
49.
YiG.MaY.CarrollR. (2012). A functional generalized method of moments approach for longitudinal studies with missing responses and covariate measurement error. Biometrika, 99, 151–165. doi:10.1093/biomet/asr076