On the Hedges Correction for a t-Test

Abstract

When cluster randomized experiments are analyzed as if units were independent, test statistics for treatment effects can be anticonservative. Hedges proposed a correction for such tests by scaling them to control their Type I error rate. This article generalizes the Hedges correction from a posttest-only experimental design to more common designs used in practice. We show that for many experimental designs, the generalized correction controls its Type I error while the Hedges correction does not. The generalized correction, however, necessarily has low power due to its control of the Type I error. Our results imply that using the Hedges correction as prescribed, for example, by the What Works Clearinghouse can lead to incorrect inferences and has important implications for evidence-based education.

Keywords

model misspecification, cluster randomized trials, multilevel models, meta-analysis

Model misspecification occurs when group randomized studies are analyzed as if units were independent. This kind of model misspecification is known to lead to test statistics that can be anticonservative (see, e.g., Raudenbush & Bryk, 2002). Recently, a series of papers considering this problem in the context of meta-analysis has sought to correct these test statistics by scaling them to control their Type I error (Hedges, 2007, 2009; Hedges & Rhoads, 2011). The usefulness of these corrections is potentially limited, however, because none of them was developed to accommodate the case where the model from the original study included covariates other than the fixed effect of treatment. This could be an important limitation in practice. For example, the What Works Clearinghouse (WWC), the government agency responsible for reviewing and rating education studies, has a policy that requires applying these types of corrections to any study that randomized by group but did not account for group membership in the analysis (WWC, 2014).

To assess how frequently these types of Hedges corrections are being used, we reviewed every WWC intervention report published on its website (http://ies.ed.gov/ncee/wwc/) as of mid-May 2015. The WWC has attempted to review 558 interventions in areas ranging from math to literacy to dropout prevention by examining 10,585 research studies of those interventions. Due to the rigorous nature of the WWC research standards, only 160 of the 558 interventions have at least one study that passes the WWC research standards. Table 1 breaks down these 160 intervention reports by the nine topic areas considered by the WWC. Across all the topic areas, over two in five intervention reports (68 of the 160) rely on at least one corrected study and one in five (34 of the 160) rely entirely on the results of corrected studies. Furthermore, the studies corrected by the WWC are almost always of the type where, as we will show, the Hedges correction is expected to perform poorly. In a random sample of 20 corrected studies, for example, 19 included covariates other than the fixed effect of treatment in their analyses. Finally, these misspecified studies are not merely an artifact of the past: Over one third of all studies corrected by the WWC were published in the last 10 years.

Table 1.

Number of WWC Intervention Reports With at Least One Study (160 Reports) Cross-Classified by Topic Area and Extent of Corrected Studies Present Within the Intervention Report

Topic Area	0 Corrected	Some Corrected	All Corrected	Total # Reports
Literacy	21	12	14	47
Math	11	5	9	25
Early childhood education	9	9	3	21
Dropout prevention	18	1	0	19
Children/youth with disabilities	13	3	1	17
Student behavior	9	2	2	13
English-language learners	7	2	2	11
Science	2	0	3	5
Postsecondary education	2	0	0	2
Overall	92	34	34	160

Note. The “Some Corrected” category includes intervention reports with at least one corrected study but does not include intervention reports where all studies are corrected.

This article will extend the Hedges correction to a more general class of models in order to (a) characterize the performance of the Hedges correction when it is applied to situations with additional fixed effects and (b) explore the limits of correcting significance tests with the generalized correction strategy. This article is organized as follows: The next section introduces the general mixed effects model, which we will use to extend the Hedges correction. The second section presents several new results: a derivation of a generalized Hedges correction, its distributional properties, and a limited-information test statistic based on achievable bounds of the correction that can be used in practical situations. The third section explores the relationships between the Hedges, WWC, and generalized corrections, and the fourth section considers the power implications of correcting significance tests. The fifth section outlines the implication of these new results for education policy and the sixth section closes with a discussion.

A Mixed Effects Model and the Hedges Correction

Consider the following normal mixed effects model:

$$Y = X β + Z u + e, \qquad u \sim N(0, D), \qquad e \sim N(0, R), \tag{1}$$

where X and Z are fixed design matrices, β is a vector of regression coefficients for the fixed effects, u is a vector of random effects with covariance matrix D, and e is a vector of the residual errors with covariance matrix R. When u and e are assumed to be independent, the covariance of Y is $Σ = Var [Y] = Z D Z^{⊤} + R$.

Consider this model for the case where there are two treatment conditions i = 1, 2 and J_i schools per treatment condition. Let the number of students in school (i, j) be denoted as K_ij and let y_ij be a K_ij × 1 vector of responses from students in school j in treatment i. The vector of student responses is then:

$$y_{ij} = X_{ij} β + 1_{ij} s_{ij} + e_{ij}, \qquad s_{ij} \overset{iid}{\sim} N(0, σ_{b}^{2}), \qquad e_{ij} \overset{iid}{\sim} N(0, I σ_{w}^{2}), \tag{2}$$

where X_ij is a K_ij × p design matrix, β is a p × 1 parameter vector, 1_ij is a K_ij × 1 vector of ones, s_ij is the random effect for membership in school j in treatment condition i, and e_ij is the K_ij × 1 vector of student-level residuals. Assuming s_ij and e_ij are independent for all (i, j) pairs, the variance of response vector y_ij for school j in treatment i is given by:

$$Σ_{ij} = \mathrm{Var}[y_{ij}] = J σ_{b}^{2} + I σ_{w}^{2}, \tag{3}$$

where J is a K_ij × K_ij matrix of ones and I is the identity matrix. Note that every entry on the main diagonal of Σ_ij is the total variance $σ^{2} = σ_{b}^{2} + σ_{w}^{2}$. Factoring out σ² from Σ_ij gives:

$$V_{ij} = \frac{Σ_{ij}}{σ^{2}} = J ρ + I (1 - ρ), \tag{4}$$

where $ρ = σ_{b}^{2} / σ^{2}$ is the intraclass correlation coefficient (ICC) of school membership. Conditioning y_ij on V_ij gives:

$$y_{ij} \mid V_{ij} = X_{ij} β + ε_{ij}, \qquad ε_{ij} \mid V_{ij} \sim N(0, V_{ij} σ^{2}), \tag{5}$$

where the random effect of school membership has induced a correlation in the error term ε, which is represented by the covariance matrix V_ij σ².

We simplify the notation by aggregating the (i, j) groups together:

$$Y = \begin{bmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{21} \\ y_{22} \\ \vdots \end{bmatrix}, \qquad X = \begin{bmatrix} X_{11} \\ X_{12} \\ \vdots \\ X_{21} \\ X_{22} \\ \vdots \end{bmatrix}, \qquad V = \begin{bmatrix} V_{11} & & & & & \\ & V_{12} & & & & \\ & & \ddots & & & \\ & & & V_{21} & & \\ & & & & V_{22} & \\ & & & & & \ddots \end{bmatrix} . \tag{6}$$

This allows us to express Equation 2, or any other mixed model with homogeneous variance, as:

$$Y \mid V = X β + ε, \qquad ε \mid V \sim N(0, V σ^{2}), \tag{7}$$

without explicitly tracking the (i, j) block structure.

The model studied by Hedges (2007) is a special case of the model in Equation 7 where (a) the only fixed effect included in the model is the fixed effect of treatment and (b) the ICC of school membership is assumed known. When the value of the ICC is known, the optimal approach for parameter estimation and hypothesis testing is generalized least squares (GLS; Graybill, 1976). In the special case when the ICC is known to be zero, the ordinary least squares (OLS) estimates and tests are identical to their GLS counterparts. In the general case, however, where the ICC is not zero, OLS estimates have higher mean squared errors and OLS test statistics are anticonservative.

Hedges (2007) considered the consequences of calculating a t-test for the hypothesis of no treatment effect using OLS when the ICC is nonzero, that is, an independent two-sample t-test. For this case, Hedges derived the distribution of the OLS t-test when the data were generated with a given, known ICC denoted by ρ₀. He then used that distribution to derive a constant factor such that the constant factor times the OLS t-statistic is approximately t distributed. In the case of balanced data, where N is the total number of observations and K is the number of students in every school, the correction is given by:

$$c_{H} = \sqrt{\frac{(N - 2) - 2 (K - 1) ρ_{0}}{(N - 2) [1 + (K - 1) ρ_{0}]}}, \tag{8}$$

such that if we denote t_OLS as the OLS t-test, then c_H ⋅ t_OLS is approximately t distributed with:

$$h = \frac{[(N - 2) - 2 (K - 1) ρ_{0}]^{2}}{(N - 2) (1 - ρ_{0})^{2} + K (N - 2 K) ρ_{0}^{2} + 2 (N - 2 K) ρ_{0} (1 - ρ_{0})}, \tag{9}$$

degrees of freedom. Note that h can be much less than the N − 2 degrees of freedom that would normally be assumed. For example, when N ≫ (Kρ₀)², h is approximately $N / (1 + (K - 1) ρ_{0}^{2})$.
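As a numerical illustration of Equations 8 and 9, the following Python sketch evaluates the balanced-data correction and its approximate degrees of freedom. The function names and the example design (N = 400, K = 20, ρ₀ = .20) are hypothetical choices made for illustration; they do not come from Hedges (2007).

```python
import math

def hedges_c_balanced(N, K, rho0):
    """Balanced-data Hedges correction (Equation 8)."""
    num = (N - 2) - 2 * (K - 1) * rho0
    den = (N - 2) * (1 + (K - 1) * rho0)
    return math.sqrt(num / den)

def hedges_df_balanced(N, K, rho0):
    """Approximate degrees of freedom h (Equation 9)."""
    num = ((N - 2) - 2 * (K - 1) * rho0) ** 2
    den = ((N - 2) * (1 - rho0) ** 2
           + K * (N - 2 * K) * rho0 ** 2
           + 2 * (N - 2 * K) * rho0 * (1 - rho0))
    return num / den

# Hypothetical design: 20 schools of 20 students each, assumed ICC of .20.
N, K, rho0 = 400, 20, 0.20
c_h = hedges_c_balanced(N, K, rho0)
h = hedges_df_balanced(N, K, rho0)
print(f"c_H = {c_h:.3f}, h = {h:.1f}, naive df = {N - 2}")
```

Setting ρ₀ = 0 in either function returns a factor of 1 and h = N − 2, which recovers the ordinary two-sample t-test.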

In the case of unbalanced data, the Hedges correction has a similar form. Let K_ij be the number of students in school (i, j), as in Equation 2. Let the number of students in treatment condition i be n_i = ∑_j K_ij and define two measures of a typical school size as:

$$\begin{aligned} {\bar{K}}_{H} &= \frac{1}{2} \sum_{i, j} \frac{K_{ij}^{2}}{n_{i}}, \\ {\tilde{K}}_{H} &= \frac{n_{2} \sum_{j} K_{1j}^{2}}{n_{1} (n_{1} + n_{2})} + \frac{n_{1} \sum_{j} K_{2j}^{2}}{n_{2} (n_{1} + n_{2})}, \end{aligned}$$

such that both ${\bar{K}}_{H}$ and ${\tilde{K}}_{H}$ are equal to K in the special case of balanced data. Hedges then gives the unbalanced form of the correction as:

$$c_{H} = \sqrt{\frac{(N - 2) - 2 ({\bar{K}}_{H} - 1) ρ_{0}}{(N - 2) [1 + ({\tilde{K}}_{H} - 1) ρ_{0}]}}, \tag{10}$$

such that c_H ⋅ t_OLS is approximately t distributed with often far fewer than N − 2 degrees of freedom (Hedges, 2007, equation 16, p. 166).
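The unbalanced form can be sketched in the same way. The Python example below computes the two typical school sizes and the correction of Equation 10 from a complete list of school sizes in each treatment condition. The function names and the school sizes are hypothetical and serve only to show the arithmetic.

```python
import math

def typical_sizes(sizes_1, sizes_2):
    """Return (K_bar_H, K_tilde_H) from the school sizes in each condition."""
    n1, n2 = sum(sizes_1), sum(sizes_2)
    ss1 = sum(k * k for k in sizes_1)           # sum of squared sizes, condition 1
    ss2 = sum(k * k for k in sizes_2)           # sum of squared sizes, condition 2
    k_bar = 0.5 * (ss1 / n1 + ss2 / n2)
    k_tilde = n2 * ss1 / (n1 * (n1 + n2)) + n1 * ss2 / (n2 * (n1 + n2))
    return k_bar, k_tilde

def hedges_c_unbalanced(sizes_1, sizes_2, rho0):
    """Unbalanced-data Hedges correction (Equation 10)."""
    N = sum(sizes_1) + sum(sizes_2)
    k_bar, k_tilde = typical_sizes(sizes_1, sizes_2)
    num = (N - 2) - 2 * (k_bar - 1) * rho0
    den = (N - 2) * (1 + (k_tilde - 1) * rho0)
    return math.sqrt(num / den)

# Hypothetical school sizes for the two treatment conditions.
arm_1 = [12, 18, 25, 30]
arm_2 = [10, 20, 22, 28]
print(typical_sizes(arm_1, arm_2))
print(hedges_c_unbalanced(arm_1, arm_2, rho0=0.20))
```

When every school has the same size K, both typical sizes reduce to K and the function returns the balanced-data value of Equation 8.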

The Hedges correction, however, is somewhat impractical to use for unbalanced data because it requires that the size of every group be known, and a complete enumeration of group sizes is rarely present in study reports. Due to this complexity, the WWC uses a simplified version of c_H where ${\bar{K}}_{H}$ and ${\tilde{K}}_{H}$ are both approximated with the simple average group size $\bar{K} = N / J$, where J is the total number of groups. The form of the correction is then:

$$c_{W} = \sqrt{\frac{(N - 2) - 2 (\bar{K} - 1) ρ_{0}}{(N - 2) [1 + (\bar{K} - 1) ρ_{0}]}},$$

such that a corrected t-test is simply c_W ⋅ t_OLS. The WWC assumes this corrected test has the same degrees of freedom as in the balanced case of Equation 9 except that K is replaced with $\bar{K}$. Additionally, the WWC assumes that ρ₀ = .20 for significance tests of academic outcomes and ρ₀ = .10 for behavioral outcomes (WWC, 2014).

To illustrate the potential consequences of this type of misspecification in the context of a group randomized experiment, we present an educational intervention that was analyzed incorrectly by the original authors and corrected by the WWC. In this case, the simplified Hedges correction used by the WWC is a poor approximation to a reanalysis of the data for two reasons: First, the WWC assumed an ICC twice as large as the ICC that would have been estimated from the original data, and second, the originally reported test statistic accounted for fixed teacher effects, whereas the Hedges correction does not account for fixed effects beyond that of treatment.

The specific experiment we consider was conducted by Carnegie Learning, Inc. in Moore, Oklahoma, to study the effectiveness of their Cognitive Tutor Algebra I curriculum during the 2000–2001 school year. For this motivating example, we focus on the main academic outcome measured in the study, the end of year student scores on the Educational Testing Service (ETS) Algebra I exam. The study included 255 students taught in 16 classrooms by six teachers in three schools. Further details on the experiment can be found in the initial report by Morgan and Ritter (2002). We thank Steve Ritter of Carnegie Learning, Inc. for providing us with a copy of the original data.

Notably, the experiment used a within-teacher design. Each class period that a participating teacher taught was randomly assigned to either a Cognitive Tutor condition or a control condition. Once the teacher-to-period assignments were made, the schools used their “standard procedures” to enroll students in periods where Algebra was offered. Empirically, this kind of registrar-based assignment is equivalent to group randomized assignment, as the students who are assigned to the same class period by the registrar may share characteristics (Slavin, 2008). This clustering of students within classrooms may lead to correlated outcomes, which violate the assumptions of commonly used analysis of variance and analysis of covariance models. Mixed effects models are a class of models that accounts for clustering of students within classrooms, specifically the two-level hierarchical model with a nonzero ICC for the effect of classroom membership in Equation 2.

We consider two different approaches to analyzing the Moore experiment: Approach A, a marginal analysis that considers only treatment assignment and ignores the within-teacher nature of the experimental design; and Approach B, a within-teacher analysis that models teacher effects as fixed. Within each approach, we consider two ways to calculate a test statistic for the hypothesis of no treatment difference: Case 1, a test statistic based on the (mistaken) assumption that the ICC (ρ) is zero; and Case 2, a test statistic that accounts for the nonzero ICC estimated from the data using a two-level hierarchical model $(\hat{ρ})$ . We calculate OLS t-tests of no treatment difference for Case 1, and restricted maximum likelihood (REML) t-tests of no treatment difference for Case 2. For Case 1, we assume the standard OLS degrees of freedom of N − p, where N is the number of observations and p is the number of columns of the design matrix, that is, 253 for Case A1 and 248 for Case B1. For Case 2, we use the typical Donner and Klar (2000, p. 115) measure of the degrees of freedom of J − p, where J is the number of groups, that is, 14 for Case A2 and 9 for Case B2.

Table 2 displays the results of Cases 1 and 2 as columns and Approaches A and B as rows. As expected, the misspecified t-statistics in the Case 1 column are much larger than the REML t-statistics in the Case 2 column. This anticonservative effect of model misspecification is most noticeable for the Approach B t-statistics in the second row. The misspecified Case 1B test statistic is significant at the .01 level, while the REML Case 2B t-statistic is only significant at the .10 level.

Table 2.

The t-Statistics Test No Treatment Difference and Are Coded Graphically for Significance Based on a Two-Sided Test

Table of t-Statistics
Approach	Case 1 (Misspecified), ρ = 0	Case 2 (REML), ρ estimated	Case 3 (WWC/Hedges), ρ₀ = .20	Case 3 (WWC/Hedges), ρ₀ = .10
A: Marginal	2.266*	1.464 ($\hat{ρ} = 0.096$)	1.121	1.426
B: Within-teacher	2.772**	1.886† ($\hat{ρ} = 0.084$)	1.372	1.745

Note. REML = restricted maximum likelihood; WWC = What Works Clearinghouse.

Significance levels: †p < .10. *p < .05. **p < .01.

Table 2 also displays the results of the WWC’s simplified Hedges correction c_W applied to the Case 1 t-statistic as Case 3 (WWC/Hedges). We report two versions of the correction to illustrate the two reasons why the WWC/Hedges correction broke down for this example: the first, where we calculate the correction with ρ₀ = .20, as the WWC did due to the ETS exam being an academic outcome; and the second, where we assume the ICC is near the estimated value of ρ₀ = .10 to isolate the differences between Approaches A and B. Note that the ρ₀ = .20 column demonstrates the sensitivity of the correction to the accuracy of the ICC; the corrected tests are very conservative relative to the REML test statistics if the ICC assumed by the WWC is twice as large as the estimated ICC.
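The Case 3 entries of Table 2 can be checked from published summary quantities alone. The sketch below applies the simplified correction c_W to the Case 1 t-statistics, using N = 255 students and taking the J = 16 classrooms as the groups, which is our reading of the example rather than a detail stated with the table. The function name is hypothetical.

```python
import math

def wwc_correction(N, J, rho0):
    """Simplified WWC correction c_W, using the average group size N / J."""
    k_bar = N / J
    num = (N - 2) - 2 * (k_bar - 1) * rho0
    den = (N - 2) * (1 + (k_bar - 1) * rho0)
    return math.sqrt(num / den)

N, J = 255, 16                                  # students and classrooms in the Moore study
t_case1 = {"A: Marginal": 2.266, "B: Within-teacher": 2.772}   # Case 1 column of Table 2

for rho0 in (0.20, 0.10):
    c_w = wwc_correction(N, J, rho0)
    for approach, t in t_case1.items():
        print(f"rho0 = {rho0:.2f}  {approach}: c_W * t = {c_w * t:.3f}")
```

Under these assumptions the printed products agree with the Case 3 columns of Table 2 to three decimal places (1.121 and 1.372 for ρ₀ = .20; 1.426 and 1.745 for ρ₀ = .10).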

If the ICC is correctly specified, however, for Case 3A, the WWC/Hedges correction is able to scale the misspecified t-statistic from Case 1A to match the REML t-statistic from Case 2A. This may be expected because Approach A represents the situation for which the WWC/Hedges correction was derived, and the assumed ICC is close to the estimated ICC. For Case 3B, however, the WWC/Hedges corrected t-statistic does not match the REML t-statistic from Case 2B, even though the assumed ICC is close to the estimated ICC. Further note that, in this example, the test statistics are also of differing significance levels, with the corrected test statistic of Case 3B not achieving statistical significance at the .10 level. These differences between the Case 2B and Case 3B test statistics are not surprising, given that the anticonservative effect of model misspecification is mediated by the presence or absence of fixed teacher effects in this example, and neither the Hedges correction nor its simplified WWC variant was derived to account for such additional fixed effects.

The results of Hedges (2007) demonstrate that, without access to the original data, a misspecified test statistic can be scaled to approximate the distribution of the correct statistical test; for example, the corrected Case 3A test statistic matches the REML Case 2A test statistic when the ICC is known. The Hedges correction, however, was derived for a very specific model: the special case of the model in Equation 7 where the only fixed effect included in the model is the fixed effect of treatment. In the next section, we consider a generalization of the Hedges correction to the full model so that more complex analyses, such as the within-teacher model of Approach B, can also be corrected.

A Generalization of the Hedges Correction

As first discussed by Hedges (2007), an approach for dealing with model misspecification in the context of meta-analysis is to develop a scaling factor such that the misspecified t-statistic is scaled to control its Type I error. For our extension of the Hedges correction, we consider two practical features of a scaling factor approach: (a) that the distribution of the corrected t-statistic approximates the distribution of the t-statistic that would have been used if the model were not misspecified and (b) that the calculation of the correction would not require access to individual-level data, Y and X, since it is unlikely they would be available for a meta-analyst to use.

Formally, we assume that a given experiment was conducted such that the data could reasonably be modeled as being produced by a data generating model $(M_{DG})$ of the form:

$$M_{DG} : Y \sim N(X β, σ^{2} \mathcal{V}),$$

where 𝒱 is a positive definite matrix that may or may not be precisely known. We further assume that the original authors did not realize that $M_{DG}$ was the most appropriate model for the experiment, and they instead used an “originally reported model” $(M_{O})$ of the form:

$$M_{O} : Y \sim N(X β, σ^{2} I),$$

where I is the identity matrix, for both their initial power analysis and their originally reported analysis. Our goal is to update the test statistics derived under the originally reported model to match the data generating model by using an updated model $(M_{U})$ of the form:

$$M_{U} : Y \sim N(X β, σ^{2} V_{U}),$$

where V_U is chosen to be equal to 𝒱 when 𝒱 is known and to approximate 𝒱 when it is not known.

We consider correcting hypothesis tests that can be expressed as a linear combination of the fixed effects being equal to zero,

$$H_{0} : l^{⊤} β = 0, \tag{12}$$

with either one-sided or two-sided alternatives, and we restrict our attention to studying the differences between the OLS and GLS estimators and test statistics when $\mathcal{V} = V_{U}$ is known a priori.

Under $M_{U}$, the uniform minimum variance unbiased estimators (UMVUE) of β and σ² are the familiar GLS estimators:

$$\begin{aligned} {\hat{β}}_{U} &= (X^{⊤} V_{U}^{-1} X)^{-1} X^{⊤} V_{U}^{-1} Y, \\ \hat{σ_{U}^{2}} &= \frac{1}{N - p} Y^{⊤} \left( V_{U}^{-1} - V_{U}^{-1} X (X^{⊤} V_{U}^{-1} X)^{-1} X^{⊤} V_{U}^{-1} \right) Y . \end{aligned}$$

A commonly used test for the hypothesis in Equation 12 under $M_{U}$ is the GLS t-test:

$$t_{U} = \frac{l^{⊤} {\hat{β}}_{U} - 0}{\sqrt{\hat{σ_{U}^{2}} / {\tilde{N}}_{U}}},$$

where ${\tilde{N}}_{U}$ is the effective sample size of the estimator $l^{⊤} {\hat{β}}_{U}$ given $M_{U}$. Its form is:

$${\tilde{N}}_{U} = \frac{1}{l^{⊤} (X^{⊤} V_{U}^{-1} X)^{-1} l} .$$

Under $M_{U}$, t_U has a central t distribution with N − p degrees of freedom under the null hypothesis and a noncentral t distribution under the alternative hypothesis. If we define the effect size as:

$$δ = \frac{l^{⊤} β}{σ},$$

then the distribution of t_U under $M_{U}$ under the alternative hypothesis is:

$$t_{U} \sim t(δ \cdot \sqrt{{\tilde{N}}_{U}}, N - p),$$

where $δ \cdot \sqrt{{\tilde{N}}_{U}}$ is the noncentrality parameter and N − p is the degrees of freedom. As N goes to infinity, the distribution of t_U under $M_{U}$ is well approximated by a normal distribution (Johnson & Welch, 1940):

$$t_{U} \overset{\cdot}{\sim} N(δ \cdot \sqrt{{\tilde{N}}_{U}}, 1) . \tag{14}$$

We note that as the effective sample size ${\tilde{N}}_{U}$ increases, the power of the test in Equation 12 increases.
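The GLS quantities defined above can be sketched with NumPy as follows. The helper names, the balanced design (four schools of 10 students per condition), the ICC of .20, and the simulated treatment effect of 0.5 are all hypothetical; the data are simulated only so that the function has something to run on.

```python
import numpy as np

def block_icc_cov(sizes, rho):
    """Block-diagonal V with one block J*rho + I*(1 - rho) per group."""
    N = sum(sizes)
    V = np.zeros((N, N))
    start = 0
    for k in sizes:
        V[start:start + k, start:start + k] = rho
        start += k
    np.fill_diagonal(V, 1.0)
    return V

def gls_t_test(Y, X, V, l):
    """Return (t_U, effective sample size) for H0: l' beta = 0 under M_U."""
    N, p = X.shape
    Vinv = np.linalg.inv(V)
    A = np.linalg.inv(X.T @ Vinv @ X)           # (X' V^-1 X)^-1
    beta = A @ X.T @ Vinv @ Y                   # GLS estimate of beta
    resid = Y - X @ beta
    sigma2 = resid @ Vinv @ resid / (N - p)     # GLS estimate of sigma^2
    n_eff = 1.0 / (l @ A @ l)                   # effective sample size
    t_u = (l @ beta) / np.sqrt(sigma2 / n_eff)
    return t_u, n_eff

# Hypothetical balanced design: 4 schools of 10 students per condition.
K, J, rho = 10, 8, 0.20
N = K * J
V = block_icc_cov([K] * J, rho)
treat = np.repeat([0.0, 1.0], N // 2)
X = np.column_stack([np.ones(N), treat])
l = np.array([0.0, 1.0])                        # contrast selecting the treatment effect

# Simulate one data set from M_U with sigma = 1 and an illustrative effect of 0.5.
rng = np.random.default_rng(0)
Y = X @ np.array([0.0, 0.5]) + np.linalg.cholesky(V) @ rng.standard_normal(N)

t_u, n_eff = gls_t_test(Y, X, V, l)
print(f"t_U = {t_u:.3f}, effective sample size = {n_eff:.3f}")
```

For this balanced, treatment-only design the effective sample size works out to (N/4) / [1 + (K − 1)ρ], that is, the independent-data value N/4 deflated by the design effect, which makes the loss of power under clustering concrete.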

The originally reported model $M_{O}$ is a special case of $M_{U}$ where V_U = I. We define:

$$\begin{aligned} {\hat{β}}_{O} &= (X^{⊤} X)^{-1} X^{⊤} Y, & \hat{σ_{O}^{2}} &= \frac{1}{N - p} Y^{⊤} \left( I - X (X^{⊤} X)^{-1} X^{⊤} \right) Y, \\ {\tilde{N}}_{O} &= \frac{1}{l^{⊤} (X^{⊤} X)^{-1} l}, & t_{O} &= \frac{l^{⊤} {\hat{β}}_{O} - 0}{\sqrt{\hat{σ_{O}^{2}} / {\tilde{N}}_{O}}}, \end{aligned}$$

so that under $M_{O}$, ${\hat{β}}_{O}$ and $\hat{σ_{O}^{2}}$ are UMVUE, t_O has a central t distribution with N − p degrees of freedom under the null hypothesis, and t_O has a noncentral t distribution under the alternative hypothesis:

$$t_{O} \sim t(δ \cdot \sqrt{{\tilde{N}}_{O}}, N - p),$$

where $δ \cdot \sqrt{{\tilde{N}}_{O}}$ is the noncentrality parameter and N − p is the degrees of freedom.

Under $M_{U}$, however, t_O is no longer t distributed, so performing hypothesis tests with t_O and its original rejection region will no longer have the same operating characteristics as under $M_{O}$. We derive a new test statistic that is based on t_O.

First, we note that, under $M_{O}$, the test statistic t_O can be derived as a Wald test. For example, the variance of $l^{⊤} {\hat{β}}_{O}$ under $M_{O}$ is given by:

$$\begin{aligned} \mathrm{Var}[l^{⊤} {\hat{β}}_{O} \mid M_{O}] &= l^{⊤} (X^{⊤} X)^{-1} X^{⊤} \cdot \mathrm{Var}[Y \mid M_{O}] \cdot X (X^{⊤} X)^{-1} l \\ &= l^{⊤} (X^{⊤} X)^{-1} X^{⊤} \cdot I σ^{2} \cdot X (X^{⊤} X)^{-1} l \\ &= l^{⊤} (X^{⊤} X)^{-1} l \cdot σ^{2} \\ &= \frac{σ^{2}}{{\tilde{N}}_{O}} . \end{aligned}$$

If we choose to replace σ² with its UMVUE under $M_{O}$, then a Wald test of $l^{⊤} {\hat{β}}_{O}$ under $M_{O}$ is the OLS t-test:

$$W_{O} = \frac{l^{⊤} {\hat{β}}_{O} - 0}{\sqrt{\mathrm{Var}[l^{⊤} {\hat{β}}_{O} \mid M_{O}, σ^{2} = \hat{σ_{O}^{2}}]}} = t_{O} .$$

Second, we construct a Wald-like test for $l^{⊤} {\hat{β}}_{O}$ under $M_{U}$. The estimator $l^{⊤} {\hat{β}}_{O}$ is unbiased under $M_{U}$ and its variance is:

$$\begin{aligned} \mathrm{Var}[l^{⊤} {\hat{β}}_{O} \mid M_{U}] &= l^{⊤} (X^{⊤} X)^{-1} X^{⊤} \cdot \mathrm{Var}[Y \mid M_{U}] \cdot X (X^{⊤} X)^{-1} l \\ &= l^{⊤} (X^{⊤} X)^{-1} X^{⊤} \cdot V_{U} σ^{2} \cdot X (X^{⊤} X)^{-1} l \\ &= \frac{σ^{2}}{{\tilde{N}}_{O|U}}, \end{aligned}$$

where we have defined the effective sample size of $l^{⊤} {\hat{β}}_{O}$ under $M_{U}$ as:

$${\tilde{N}}_{O|U} = \frac{1}{l^{⊤} (X^{⊤} X)^{-1} X^{⊤} \cdot V_{U} \cdot X (X^{⊤} X)^{-1} l} .$$

Since $\hat{σ_{O}^{2}}$ under $M_{U}$ is a biased estimator of σ², we construct an unbiased and consistent estimator of σ² under $M_{U}$ for use in the Wald-like test statistic. We define:

$$\hat{σ_{O|U}^{2}} = \frac{N - p}{\mathrm{trace}[V_{U} (I - X (X^{⊤} X)^{-1} X^{⊤})]} \cdot \hat{σ_{O}^{2}},$$

as the bias-corrected variance estimator. Theorem 3 in Appendix A, available in the online version of the journal, gives that $\hat{σ_{O|U}^{2}}$ is an unbiased estimator of σ² under $M_{U}$ and, under mild regularity conditions, $\hat{σ_{O|U}^{2}}$ is a consistent estimator of σ² as the number of groups increases. A Wald-like test for $l^{⊤} {\hat{β}}_{O}$ under $M_{U}$ is then:

$$W_{O|U} = \frac{l^{⊤} {\hat{β}}_{O} - 0}{\sqrt{\mathrm{Var}[l^{⊤} {\hat{β}}_{O} \mid M_{U}, σ^{2} = \hat{σ_{O|U}^{2}}]}} .$$

We express W_O|U in terms of t_O by multiplying the numerator and denominator of W_O|U by the denominator of t_O so that:

$$W_{O|U} = \frac{l^{⊤} {\hat{β}}_{O} - 0}{\sqrt{\mathrm{Var}[l^{⊤} {\hat{β}}_{O} \mid M_{U}, σ^{2} = \hat{σ_{O|U}^{2}}]}} \cdot \sqrt{\frac{\mathrm{Var}[l^{⊤} {\hat{β}}_{O} \mid M_{O}, σ^{2} = \hat{σ_{O}^{2}}]}{\mathrm{Var}[l^{⊤} {\hat{β}}_{O} \mid M_{O}, σ^{2} = \hat{σ_{O}^{2}}]}} = c \cdot t_{O},$$

where we have defined

$$c = \sqrt{\frac{\mathrm{Var}[l^{⊤} {\hat{β}}_{O} \mid M_{O}, σ^{2} = \hat{σ_{O}^{2}}]}{\mathrm{Var}[l^{⊤} {\hat{β}}_{O} \mid M_{U}, σ^{2} = \hat{σ_{O|U}^{2}}]}} = \sqrt{\frac{{\tilde{N}}_{O|U}}{{\tilde{N}}_{O}} \cdot \frac{\mathrm{trace}[V_{U} (I - X (X^{⊤} X)^{-1} X^{⊤})]}{N - p}} .$$

We can simplify the form of c by defining $P_{X} = X {(X^{⊤} X)}^{- 1} X^{⊤}$ as the projection matrix onto the column space of X and noting that the effective sample size term is the inverse of a Rayleigh quotient. Specifically, for a vector z and a symmetric real matrix M, the Rayleigh quotient is defined as (Rao, 1973):

$$R[z^{⊤}, M] = \frac{z^{⊤} M z}{z^{⊤} z},$$

so that

$$\frac{{\tilde{N}}_{O}}{{\tilde{N}}_{O|U}} = \frac{l^{⊤} (X^{⊤} X)^{-1} X^{⊤} \cdot V_{U} \cdot X (X^{⊤} X)^{-1} l}{l^{⊤} (X^{⊤} X)^{-1} l} = R[l^{⊤} (X^{⊤} X)^{-1} X^{⊤}, V_{U}],$$

and c can be expressed as:

$$c = \sqrt{\frac{1}{R[l^{⊤} (X^{⊤} X)^{-1} X^{⊤}, V_{U}]} \cdot \frac{\mathrm{trace}[V_{U} (I - P_{X})]}{N - p}} . \tag{17}$$

One interpretation of c is as a scaling factor to update the Wald test under $M_{O}$ (t_O) to a Wald-like test under $M_{U}$ (c ⋅ t_O). This interpretation suggests c ⋅ t_O will be asymptotically normally distributed and well approximated by a t distribution in small samples. Theorem 4 in Appendix A, available in the online version of the journal, gives that c ⋅ t_O is normally distributed in large samples under mild regularity conditions. Specifically, if (1) X is full rank, (2) V_U is block diagonal, and (3) the ratio of the largest and smallest eigenvalues of V_U remains bounded as the number of groups increases, then for large numbers of groups, the distribution of c ⋅ t_O under $M_{U}$ is approximately:

$$c \cdot t_{O} \overset{\cdot}{\sim} N(δ \cdot \sqrt{{\tilde{N}}_{O|U}}, 1) . \tag{18}$$

We note that under $M_{U}$, the asymptotic distribution of c ⋅ t_O given in Equation 18 only differs from the asymptotic distribution of t_U given in Equation 14 by the effective sample size terms. This implies that the level α rejection region for t_U is also the level α rejection region for c ⋅ t_O in large samples.

One possible small sample approximation of the distribution of c ⋅ t_O is:

$$c \cdot t_{O} \overset{\cdot}{\sim} t(δ \cdot \sqrt{{\tilde{N}}_{O|U}}, h), \qquad h = \frac{(\mathrm{trace}[V_{U} (I - P_{X})])^{2}}{\mathrm{trace}[(V_{U} (I - P_{X}))^{2}]}, \tag{19}$$

which is a Satterthwaite approximation, where the derivation of the degrees of freedom h follows directly from the cumulants of $\hat{σ_{O|U}^{2}}$ given in Theorem 3 of Appendix A, available in the online version of the journal, and Theorem 3.1 of Box (1954). Note that h can be used to determine whether the asymptotic limit of Equation 18 applies.
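When X and V_U are available, Equations 17 and 19 are direct matrix computations. The NumPy sketch below evaluates both for a hypothetical balanced, treatment-only design (eight schools of 10 students, ρ₀ = .20) and compares them with the closed forms of Equations 8 and 9. The function names and the design are assumptions made for illustration.

```python
import numpy as np

def block_icc_cov(sizes, rho):
    """Block-diagonal V_U with one block J*rho + I*(1 - rho) per group."""
    N = sum(sizes)
    V = np.zeros((N, N))
    start = 0
    for k in sizes:
        V[start:start + k, start:start + k] = rho
        start += k
    np.fill_diagonal(V, 1.0)
    return V

def full_information_correction(X, V, l):
    """Return (c, h): Equation 17 and the Satterthwaite df of Equation 19."""
    N, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    P = X @ XtX_inv @ X.T                       # projection onto the column space of X
    z = X @ XtX_inv @ l                         # l' beta_hat_O equals z' Y
    rayleigh = (z @ V @ z) / (z @ z)            # N_O / N_{O|U}
    A = V @ (np.eye(N) - P)
    c = np.sqrt(np.trace(A) / ((N - p) * rayleigh))
    h = np.trace(A) ** 2 / np.trace(A @ A)
    return c, h

# Hypothetical balanced design with treatment as the only fixed effect.
K, J, rho0 = 10, 8, 0.20
N = K * J
V = block_icc_cov([K] * J, rho0)
treat = np.repeat([0.0, 1.0], N // 2)
X = np.column_stack([np.ones(N), treat])
c, h = full_information_correction(X, V, np.array([0.0, 1.0]))

# Closed forms for the balanced case (Equations 8 and 9).
c_eq8 = np.sqrt(((N - 2) - 2 * (K - 1) * rho0) / ((N - 2) * (1 + (K - 1) * rho0)))
h_eq9 = (((N - 2) - 2 * (K - 1) * rho0) ** 2
         / ((N - 2) * (1 - rho0) ** 2 + K * (N - 2 * K) * rho0 ** 2
            + 2 * (N - 2 * K) * rho0 * (1 - rho0)))
print(c, c_eq8)
print(h, h_eq9)
```

For this design the matrix expressions reproduce the Hedges values, which is a useful sanity check before adding further columns to X.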

The results of Equations 18 and 19, however, cannot be used in practical situations by a meta-analyst. For these results to be used, either X must be known or sufficiently detailed information about group means and variances must be known, so that c can be calculated using Equation 17, and such knowledge of X is unlikely since neither X nor the necessary group-level information is likely to be given in published reports.

We propose a feasible correction based on a lower bound of c that is (a) equal to c in special cases and (b) can feasibly be calculated by a meta-analyst. These new results rely on a second interpretation of c; specifically, that it is the square root of the bias under $M_{U}$ of the observed Fisher information (Efron & Hinkley, 1978) of the maximum likelihood estimator of a linear combination of regression coefficients that were calculated under $M_{O}$ .

Under this second interpretation, c is well studied. For example, Swindel (1968) gives achievable bounds on c that depend on the eigenvalues of V_U and hold for all full rank X matrices and positive definite V_U matrices. Swindel’s results are based on the ordered eigenvalues of the known matrix V_U. Since V_U is assumed to be an N × N positive definite matrix, it has N eigenvalues that are greater than zero, which we denote as:

$$λ_{1} \geq λ_{2} \geq \dots \geq λ_{N} > 0.$$

Following Swindel (1968), we factor c from Equation 17 into two terms and bound each. First, Swindel quotes the min–max theorem (Rao, 1973) to bound the Rayleigh quotient by the largest and smallest eigenvalues of V_U:

$$λ_{N} \leq R[l^{⊤} (X^{⊤} X)^{-1} X^{⊤}, V_{U}] \leq λ_{1} .$$

Second, Swindel proves that the term arising from the bias of $\hat{σ_{O}^{2}}$ under $M_{U}$ is bounded below by the sum of the N – p smallest eigenvalues and above by the sum of the N – p largest eigenvalues, each divided by the quantity N – p:

$$\frac{1}{N - p} \cdot \sum_{i = p + 1}^{N} λ_{i} \leq \frac{\mathrm{trace}[V_{U} (I - P_{X})]}{N - p} \leq \frac{1}{N - p} \cdot \sum_{i = 1}^{N - p} λ_{i} .$$

Using the identity $trace [V_{U}] = \sum_{1}^{N} λ_{i}$, we can express the bounds of c as:

$$c_{L} = \sqrt{\frac{1}{λ_{1}} \cdot \frac{\mathrm{trace}[V_{U}] - \sum_{i = 1}^{p} λ_{i}}{N - p}} \leq c \leq c_{U} = \sqrt{\frac{1}{λ_{N}} \cdot \frac{\mathrm{trace}[V_{U}] - \sum_{i = N - p + 1}^{N} λ_{i}}{N - p}}, \tag{22}$$

where N and p are the dimensions of X. The degrees of freedom h can be similarly bounded. Note that Equation 47 in the proof of Theorem 3 in Appendix A, available in the online version of the journal, gives an upper bound of 1/h in the special case of s = 2. Its reciprocal, therefore, gives the lower bound:

$$h_{L} = \frac{(N - p)^{2}}{N} \cdot \left( \frac{λ_{N}}{λ_{1}} \right)^{2} \leq h,$$

which can be used to judge whether or not the large sample normal approximation applies.

The lower bound of the corrected test is then:

$$c_{L} \cdot t_{O} \leq c \cdot t_{O} .$$

We note that c_L ⋅ t_O is calculated without access to the value of X; it depends only on the known matrix V_U.
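The bounds can be sketched without any matrix algebra if V_U is assumed to have the compound-symmetric blocks of Equation 4 with a common ICC ρ₀ ≥ 0. In that case each group of size K contributes one eigenvalue 1 + (K − 1)ρ₀ and K − 1 eigenvalues equal to 1 − ρ₀, a standard property of such matrices. The function names, the group sizes, and the choice p = 3 below are hypothetical.

```python
import math

def icc_eigenvalues(sizes, rho0):
    """Eigenvalues of block-diagonal V_U with blocks J*rho0 + I*(1 - rho0), descending."""
    eig = []
    for k in sizes:
        eig.append(1 + (k - 1) * rho0)          # one per group
        eig.extend([1 - rho0] * (k - 1))        # k - 1 per group
    return sorted(eig, reverse=True)

def limited_information_bounds(sizes, rho0, p):
    """Return (c_L, c_U, h_L): the bounds of Equation 22 and the lower bound on h."""
    lam = icc_eigenvalues(sizes, rho0)
    N = len(lam)
    total = sum(lam)                            # trace of V_U
    c_l = math.sqrt((total - sum(lam[:p])) / ((N - p) * lam[0]))
    c_u = math.sqrt((total - sum(lam[N - p:])) / ((N - p) * lam[-1]))
    h_l = (N - p) ** 2 / N * (lam[-1] / lam[0]) ** 2
    return c_l, c_u, h_l

# Hypothetical unbalanced design with p = 3 columns in X.
sizes = [12, 18, 25, 30, 10, 20, 22, 28]
c_l, c_u, h_l = limited_information_bounds(sizes, rho0=0.20, p=3)
print(f"c_L = {c_l:.3f}, c_U = {c_u:.3f}, h_L = {h_l:.1f}")
```

Under this assumed structure, c_L depends on the data only through N, p, ρ₀, and the sizes of the p largest groups, because the trace of V_U equals N and the p largest eigenvalues come from the p largest groups.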

We refer to c ⋅ t_O as the full-information corrected test statistic and c_L ⋅ t_O as the limited-information corrected test statistic. The limited-information corrected test has the property of always controlling its Type I error relative to c ⋅ t_O for hypothesis tests performed with the rejection region of c ⋅ t_O. To see this, note that under the null hypothesis, Equation 18 implies that:

$$c_{L} \cdot t_{O} = \frac{c_{L}}{c} \cdot c \cdot t_{O} \overset{\cdot}{\sim} N\left(0, \left[ \frac{c_{L}}{c} \right]^{2}\right) .$$

Since Equation 22 implies that c_L/c ≤ 1, the mass of the distribution of c_L ⋅ t_O is more concentrated around zero than that of c ⋅ t_O. The tail area of the rejection region, therefore, is always less than or equal to that of c ⋅ t_O, that is, c_L ⋅ t_O has a Type I error less than or equal to that of c ⋅ t_O.
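The size of this effect can be illustrated under the large-sample normal approximation. The sketch below computes the two-sided tail area beyond the usual .05 critical value when the test statistic is N(0, [c_L/c]²). The ratios shown are hypothetical values chosen only to illustrate the direction and rough magnitude of the conservatism.

```python
import math

def two_sided_tail(z_crit, sd):
    """P(|Z| > z_crit) for Z ~ N(0, sd^2)."""
    return math.erfc(z_crit / (sd * math.sqrt(2)))

z_crit = 1.959964                               # two-sided .05 critical value of N(0, 1)
for ratio in (1.0, 0.9, 0.75, 0.5):             # hypothetical values of c_L / c
    alpha = two_sided_tail(z_crit, ratio)
    print(f"c_L/c = {ratio:.2f}: Type I error = {alpha:.4f}")
```

A ratio of 1 returns the nominal level, and smaller ratios give progressively smaller Type I error rates, which is the source of the reduced power of the limited-information test.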

Given these two interpretations of c, the generalization of the Hedges correction to mixed effects models has two parts: (a) a full-information correction given in Equation 17, which can be used in situations where either X or the appropriate functions of X have been given in the misspecified study report and (b) a limited-information correction given in Equation 22, which can be used when less information is available, that is, only the information the meta-analyst used to construct V_U is available. In the next section, we consider the situations when these corrections coincide with the Hedges correction and the WWC correction and when they do not.

Relationships Between the WWC, Hedges, and Generalized Corrections

To examine the relationships between the various corrections, we partition the model of Equation 7 into four cases of relevance to estimating treatment effects in group randomized studies. Case I considers a model with only treatment fixed effects and balanced group sizes, Case II a model with treatment fixed effects and unbalanced group sizes, Case III a model with an additional fixed effect and unbalanced group sizes, and Case IV a model with multiple additional fixed effects and unbalanced group sizes.

In Case I, all four corrections, the WWC, Hedges, generalized, and limited-information corrections, are equal to Equation 8. They are, consequently, feasible to calculate, as N and K are likely to be given in the original report and ρ₀ is assumed to be known. Furthermore, we show that the corrected test controls its Type I error.

In Case II, the Hedges and generalized corrections are equal to Equation 10 while the WWC and limited information each have their own forms. The Hedges and generalized corrections cannot be calculated because they require knowledge of every group size, which is unlikely to be given in the original report. Although the WWC and limited-information corrections are feasible to calculate in this case, we show that the WWC correction fails to control the Type I error of the corrected test, while the limited-information corrected test controls its Type I error.

In Cases III and IV, which are the most common cases encountered in practical settings, the form of each correction is unique. We show that only the generalized and limited-information corrected tests control their Type I errors. We also show that the generalized correction is infeasible to calculate because it relies on knowing the degree to which randomization did or did not balance covariates between groups. Additionally, the limited-information correction can be difficult to calculate because it requires knowledge of the sizes of the p largest groups, where p is three for Case III and can be much larger for Case IV. For these situations, we suggest a modified form of the limited-information correction that always controls its Type I error, only requires knowledge of the size of the largest group, and in large samples is equal to c _L.

To derive these results, we consider design matrices that are extensions of the design matrix studied by Hedges (2007) to derive c _H in Equation 10:

\begin{array}{l} X_{H} = [\begin{matrix} 1_{1} & 0_{1} \\ 0_{2} & 1_{2} \end{matrix}] \end{array},

where 1 _i is an n_i × 1 vector of ones and 0 _i is an n_i × 1 vector of zeros. The total dimension of X _H is N × p where N = n ₁ + n ₂ and p = 2. We consider the same variance–covariance structure studied by Hedges (2007), specifically that of V from Equation 7 with ρ = ρ₀. We can express V = V _U compactly with the matrix direct sum ⊕:

\begin{array}{l} V_{U} = (1 - ρ_{0}) I + ρ_{0} (\underset{i j}{\oplus} J_{i j}) \end{array},

where I is an identity matrix of order N and the J _ij are K_ij × K_ij matrices of ones.

Note that the regularity conditions from the convergence result of Equation 18 are met. Conditions 1 and 2 follow directly. To show Condition 3, that the ratio of eigenvalues is bounded, we give the ordered eigenvalues of V _U. Note that the eigenvalues of V _U are the eigenvalues of the (i, j) blocks. Let J = J ₁ + J ₂ be the total number of groups and let $K_{(1)} \geq K_{(2)} \geq \dots \geq K_{(J)}$ be the order statistics of the K_ij group sizes. The ordered eigenvalues of V _U are then $λ_{m} = 1 + (K_{(m)} - 1) ρ_{0}$ for integers m ≤ J and $λ_{J + 1} = λ_{J + 2} = \dots = λ_{N} = 1 - ρ_{0}$ otherwise. Similarly, the ordered eigenvalues of $(\oplus_{i j} J_{i j})$ are ξ_m = K_(m) for integers m ≤ J and zero otherwise. Condition 3, therefore, is met if both the ICC remains fixed, and the largest group size K ₍₁₎ is bounded as the number of groups increases. In education research, this is generally met: We often model the ICC as fixed and consider making experiments larger by adding additional schools to an experiment, not by adding additional students to the schools in the experiment.
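The eigenvalue structure described above can be checked numerically. The following Python sketch builds V _U for a small set of group sizes and compares the numerically computed eigenvalues with the closed-form values. The group sizes, the ICC value, and the helper name `build_v_u` are hypothetical choices made only for illustration.

```python
import numpy as np


def build_v_u(group_sizes, rho0):
    """V_U = (1 - rho0) * I + rho0 * (direct sum of all-ones blocks)."""
    n_total = sum(group_sizes)
    block_sum = np.zeros((n_total, n_total))
    start = 0
    for k in group_sizes:
        block_sum[start:start + k, start:start + k] = 1.0  # K x K block of ones
        start += k
    return (1.0 - rho0) * np.eye(n_total) + rho0 * block_sum


group_sizes = [5, 3, 4, 2]  # hypothetical K_ij values
rho0 = 0.2                  # hypothetical ICC

v_u = build_v_u(group_sizes, rho0)
numeric = np.sort(np.linalg.eigvalsh(v_u))[::-1]  # eigenvalues, largest first

# Closed form: 1 + (K_(m) - 1) * rho0 for the J groups, then 1 - rho0 repeated N - J times
ordered_sizes = sorted(group_sizes, reverse=True)
n_total, n_groups = sum(group_sizes), len(group_sizes)
closed_form = np.array(
    [1 + (k - 1) * rho0 for k in ordered_sizes] + [1 - rho0] * (n_total - n_groups)
)

eigenvalues_match = np.allclose(numeric, closed_form)
```

Because the largest eigenvalue depends only on K ₍₁₎ and ρ₀, the eigenvalue ratio in Condition 3 stays bounded whenever those two quantities do.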

In all cases, we consider the null hypothesis of interest as that of a zero treatment effect, that is, $l^{⊤} = [1 \;\; -1]$ for models with X = X _H and $l^{⊤} = [1 \;\; -1 \;\; 0_{p - 2}^{⊤}]$ , where $0_{p - 2}^{⊤}$ is a 1 × (p − 2) vector of zeros, for models with $X = [X_{H} \;\; M_{p - 2}]$ for some N × (p − 2) matrix $M_{p - 2}$ .

Case I: Treatment Fixed Effects and Balanced Groups

Case I was studied by Hedges (2007) to derive the balanced Hedges correction of Equation 8. When the data are balanced, X = X _H is composed of vectors of length n where 1 ₁ = 1 ₂ and 0 ₁ = 0 ₂. Note that X _H has N = n + n rows and p = 2 columns.

This balance gives the matrices X and V _U a special structure: The columns of X are eigenvectors of V _U, each column of X has the same eigenvalue, and this eigenvalue is the largest eigenvalue of V _U. Specifically, $X^{⊤} \cdot V_{U} = λ_{1} X^{⊤}$ and, equivalently, $V_{U} \cdot X = λ_{1} X$ .
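As a numerical illustration of this structure, the following Python sketch constructs a small balanced design and checks that each column of X _H is an eigenvector of V _U with eigenvalue λ₁ = 1 + (K − 1)ρ₀. The number of groups, the group size, and the ICC are hypothetical values chosen only to keep the matrices small.

```python
import numpy as np

groups_per_condition = 3  # hypothetical J_1 = J_2
group_size = 4            # hypothetical common group size K
rho0 = 0.2                # hypothetical ICC

n = groups_per_condition * group_size  # rows per condition
n_total = 2 * n

# X_H: the first column flags treatment rows, the second flags control rows
x_h = np.kron(np.eye(2), np.ones((n, 1)))

# V_U = (1 - rho0) * I + rho0 * (direct sum of K x K blocks of ones)
block_sum = np.kron(
    np.eye(2 * groups_per_condition), np.ones((group_size, group_size))
)
v_u = (1 - rho0) * np.eye(n_total) + rho0 * block_sum

lambda_1 = 1 + (group_size - 1) * rho0
largest_eigenvalue = np.linalg.eigvalsh(v_u).max()

# Each column of X_H should satisfy V_U x = lambda_1 x
columns_are_eigenvectors = np.allclose(v_u @ x_h, lambda_1 * x_h)
```

The check relies on every group having the same size; with unequal group sizes the columns of X _H are generally no longer eigenvectors of V _U.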

Under these conditions, all four corrections are equal. First, we show that c = c _H. Note the Rayleigh quotient term of c is equal to λ₁:

R [l^{⊤} {(X^{⊤} X)}^{- 1} X^{⊤}, V_{U}] = \frac{l^{⊤} {(X^{⊤} X)}^{- 1} X^{⊤} \cdot V_{U} \cdot X {(X^{⊤} X)}^{- 1} l}{l^{⊤} {(X^{⊤} X)}^{- 1} l} = λ_{1} .

Further note that (a) $V_{U} \cdot P_{X} = λ_{1} P_{X}$ , (b) trace [V _U] = N, and (c) trace[P _X] = p = 2. This implies that:

trace [V_{U} (I - P_{X})] = trace [V_{U}] - trace [V_{U} P_{X}] = N - 2 λ_{1} .

Therefore, the full-information correction can be expressed as:

c = \sqrt{\frac{1}{R [l^{⊤} {(X^{⊤} X)}^{- 1} X^{⊤}, V_{U}]} \cdot \frac{trace [V_{U} (I - P_{X})]}{N - p}},

= \sqrt{\frac{1}{λ_{1}} \cdot \frac{N - 2 λ_{1}}{N - 2}},

= \sqrt{\frac{(N - 2) - 2 (K - 1) ρ_{0}}{(N - 2) (1 + (K - 1) ρ_{0})}},

which is equal to the Hedges correction of Equation 8. Second, c is equivalent to the WWC correction of Equation 11, because $\bar{K} = K$ implies that c _W = c _H. Finally, c is equivalent to the limited-information correction c _L because λ₁ = λ₂:

c_{L} = \sqrt{\frac{1}{λ_{1}} \cdot \frac{trace [V_{U}] - \sum_{i = 1}^{p} λ_{i}}{N - p}} = \sqrt{\frac{1}{λ_{1}} \cdot \frac{N - 2 λ_{1}}{N - 2}} .

For Case I, therefore, all four corrections are equal: c = c _H = c _L = c _W. Furthermore, all the corrections can be calculated in practice by a meta-analyst: (a) it is likely that K and N are given in the original research report and (b) the value of ρ₀ is assumed to be known a priori.

Case II: Treatment Fixed Effects and Unbalanced Groups

Case II was studied by Hedges (2007) to derive the Hedges correction of Equation 10. The design matrix is X = X _H. Note that X _H has N = n ₁ + n ₂ rows and p = 2 columns.

For Case II, the full-information correction c is equal to the Hedges correction of Equation 10. The Rayleigh quotient term of c can be expressed in a functional form that is similar to an eigenvalue of V _U by (a) noting that $R [z^{⊤}, M + N] = R [z^{⊤}, M] + R [z^{⊤}, N]$ for any symmetric real matrices M and N and (b) by substituting in the definition of V _U:

\begin{array}{l} R [l^{⊤} {(X^{⊤} X)}^{- 1} X^{⊤}, V_{U}] = (1 - ρ_{0}) \cdot R [l^{⊤} {(X^{⊤} X)}^{- 1} X^{⊤}, I] + ρ_{0} \cdot R [l^{⊤} {(X^{⊤} X)}^{- 1} X^{⊤}, (\underset{i j}{\oplus} J_{i j})] \\ = 1 + ({\tilde{K}}_{H} - 1) ρ_{0}, \end{array}

where we have defined:

{\tilde{K}}_{H} = R [l^{⊤} {(X^{⊤} X)}^{- 1} X^{⊤}, (\underset{i j}{\oplus} J_{i j})] = \frac{\frac{1}{n_{1}^{2}} \sum_{j} K_{1 j}^{2} + \frac{1}{n_{2}^{2}} \sum_{j} K_{2 j}^{2}}{\frac{1}{n_{1}} + \frac{1}{n_{2}}},

as the effective group size, which is itself a Rayleigh quotient. Note that we chose the notation ${\tilde{K}}_{H}$ because this factor is algebraically equivalent to the factor used in Equation 10.

The trace term of c can be expressed in terms of Rayleigh quotients by noting that (a) the projection matrix can be split into two rank one projections and (b) the cyclic permutation property of the trace operator allows the rank one projections to be permuted into Rayleigh quotients. Specifically, if we define $X_{1}^{⊤} = [1_{1}^{⊤} \;\; 0_{2}^{⊤}]$ and $X_{2}^{⊤} = [0_{1}^{⊤} \;\; 1_{2}^{⊤}]$ as the transposed columns of X _H, then $X = [X_{1} \;\; X_{2}]$ . We define $P_{i} = X_{i} {(X_{i}^{⊤} X_{i})}^{- 1} X_{i}^{⊤}$ for i = 1, 2 and note that P _X = P ₁ + P ₂ because $X_{1}^{⊤} X_{2} = 0$ . Then by noting that ${(X_{i}^{⊤} X_{i})}^{- 1} = 1 / X_{i}^{⊤} X_{i}$ because it is a scalar and using the distributive and cyclic properties of the trace operator, we have:

trace [V_{U} P_{i}] = \frac{trace [V_{U} X_{i} X_{i}^{⊤}]}{X_{i}^{⊤} X_{i}} = \frac{trace [X_{i}^{⊤} V_{U} X_{i}]}{X_{i}^{⊤} X_{i}} = R [X_{i}^{⊤}, V_{U}],

such that

trace [V_{U} (I - P_{X})] = trace [V_{U}] - R [X_{1}, V_{U}] - R [X_{2}, V_{U}] .

We can express this in the form given by Hedges by noting that:

R [X_{i}, V_{U}] = 1 + (R [X_{i}, (\underset{i j}{\oplus} J_{i j})] - 1) ρ_{0},

and defining

{\bar{K}}_{i} = R [X_{i}, (\underset{i j}{\oplus} J_{i j})] = \frac{1}{n_{i}} \sum_{j} K_{i j} \cdot K_{i j},

where ${\bar{K}}_{i}$ is a weighted average of group sizes such that the K_ij are both the weights and the objects being averaged. We further define:

{\bar{K}}_{H} = \frac{{\bar{K}}_{1} + {\bar{K}}_{2}}{2},

which is identical to the ${\bar{K}}_{H}$ factor from the Hedges correction of Equation 10, so that:

\begin{array}{l} trace [V_{U} (I - P_{X})] = N - (1 + ({\bar{K}}_{1} - 1) ρ_{0}) - (1 + ({\bar{K}}_{2} - 1) ρ_{0}) \\ = (N - 2) - 2 ({\bar{K}}_{H} - 1) ρ_{0} . \end{array}

The full-information correction is then

\begin{matrix} c = \sqrt{\frac{1}{R [l^{⊤} {(X^{⊤} X)}^{- 1} X^{⊤}, V_{U}]} \cdot \frac{trace [V_{U} (I - P_{X})]}{N - p}} \\ = \sqrt{\frac{(N - 2) - 2 ({\bar{K}}_{H} - 1) ρ_{0}}{(N - 2) (1 + ({\tilde{K}}_{H} - 1) ρ_{0})}}, \end{matrix}

which is equal to the Hedges correction from Equation 10.
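The equality between the matrix form of c and the closed form in terms of ${\bar{K}}_{H}$ and ${\tilde{K}}_{H}$ can be verified numerically. The following Python sketch does so for one unbalanced design. The group sizes, the ICC, and all function names are hypothetical and are used only for illustration; the sketch is a numerical check under these assumptions, not a proof.

```python
import numpy as np


def direct_sum_of_ones(group_sizes):
    """Block-diagonal matrix with a K x K block of ones for each group size K."""
    n_total = sum(group_sizes)
    out = np.zeros((n_total, n_total))
    start = 0
    for k in group_sizes:
        out[start:start + k, start:start + k] = 1.0
        start += k
    return out


def matrix_form_c(x, v_u, l):
    """c = sqrt(trace[V_U (I - P_X)] / (R * (N - p))), R being the Rayleigh quotient term."""
    n_total, p = x.shape
    xtx_inv = np.linalg.inv(x.T @ x)
    z = l @ xtx_inv @ x.T                       # the row vector l'(X'X)^{-1}X'
    rayleigh = (z @ v_u @ z) / (z @ z)
    p_x = x @ xtx_inv @ x.T                     # projection onto the columns of X
    trace_term = np.trace(v_u @ (np.eye(n_total) - p_x))
    return float(np.sqrt(trace_term / (rayleigh * (n_total - p))))


def closed_form_c(treatment_sizes, control_sizes, rho0):
    """Closed form in terms of K-bar_H and K-tilde_H for the Case II design."""
    n1, n2 = sum(treatment_sizes), sum(control_sizes)
    sq1 = sum(k * k for k in treatment_sizes)
    sq2 = sum(k * k for k in control_sizes)
    k_bar_h = (sq1 / n1 + sq2 / n2) / 2
    k_tilde_h = (sq1 / n1**2 + sq2 / n2**2) / (1 / n1 + 1 / n2)
    n_total = n1 + n2
    numerator = (n_total - 2) - 2 * (k_bar_h - 1) * rho0
    denominator = (n_total - 2) * (1 + (k_tilde_h - 1) * rho0)
    return (numerator / denominator) ** 0.5


treatment_sizes = [6, 3, 5]  # hypothetical K_1j
control_sizes = [4, 7]       # hypothetical K_2j
rho0 = 0.15                  # hypothetical ICC

n1, n2 = sum(treatment_sizes), sum(control_sizes)
x_h = np.zeros((n1 + n2, 2))
x_h[:n1, 0] = 1.0            # treatment indicator column
x_h[n1:, 1] = 1.0            # control indicator column
v_u = (1 - rho0) * np.eye(n1 + n2) + rho0 * direct_sum_of_ones(
    treatment_sizes + control_sizes
)
l = np.array([1.0, -1.0])    # contrast for a zero treatment effect

c_matrix = matrix_form_c(x_h, v_u, l)
c_closed = closed_form_c(treatment_sizes, control_sizes, rho0)
```

The matrix form needs the full design matrix, whereas the closed form needs only the group sizes by condition, which is why the closed form is the practical route when every group size has been reported.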

The limited-information correction differs from the full-information and Hedges corrections:

c_{L} = \sqrt{\frac{1}{λ_{1}} \cdot \frac{trace [V_{U}] - \sum_{i = 1}^{p} λ_{i}}{N - p}} = \sqrt{\frac{(N - 2) - 2 (\frac{K_{(1)} + K_{(2)}}{2} - 1) ρ_{0}}{(N - 2) (1 + (K_{(1)} - 1) ρ_{0})}},

because it depends only on the largest and second largest group sizes. In general, c will differ from c _L because $\frac{K_{(1)} + K_{(2)}}{2} \neq {\bar{K}}_{H}$ and $K_{(1)} \neq {\tilde{K}}_{H}$ .

The WWC correction of Equation 11 also differs from the other corrections for much the same reason: The simple average $\bar{K}$ is not generally equal to other K values.

The relationships between the K values imply that c _L should be preferred over c _W. Since $\frac{K_{(1)} + K_{(2)}}{2} \geq {\bar{K}}_{H}$ and $K_{(1)} \geq {\tilde{K}}_{H}$ it follows that c _L ≤ c = c _H, as Swindel’s bounds would suggest. This implies that c _L ⋅ t _O will control its Type I error. In contrast, there is no consistent relationship between $\bar{K}$ and ${\bar{K}}_{H}$ or between $\bar{K}$ and ${\tilde{K}}_{H}$ : depending on how the group sizes are distributed among treatment and control conditions, $\bar{K}$ can be less than both ${\bar{K}}_{H}$ and ${\tilde{K}}_{H}$ or it can be greater than or equal to either of them. For example, if there are four treatment groups of size 4 and two control groups of size 10, then a direct argument gives:

\bar{K} = \frac{4 \times 4 + 2 \times 10}{2 + 4} = 6, \qquad {\bar{K}}_{H} = \frac{4 + 10}{2} = 7, \qquad {\tilde{K}}_{H} = \frac{1 / 4 + 1 / 2}{1 / 16 + 1 / 20} \approx 6.667,

so that $\bar{K}$ is less than both ${\tilde{K}}_{H}$ and ${\bar{K}}_{H}$ . This implies that c _W > c, which further implies that the WWC test c _W ⋅ t _O will not control its Type I error.
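The three group size summaries in this example can be reproduced with a few lines of Python. The sketch below uses only the definitions of $\bar{K}$ , ${\bar{K}}_{H}$ , and ${\tilde{K}}_{H}$ given in the text; the function name is hypothetical.

```python
def case_two_group_size_summaries(treatment_sizes, control_sizes):
    """Return (K-bar, K-bar_H, K-tilde_H) for lists of group sizes by condition."""
    n1, n2 = sum(treatment_sizes), sum(control_sizes)
    sq1 = sum(k * k for k in treatment_sizes)  # sum of squared treatment group sizes
    sq2 = sum(k * k for k in control_sizes)    # sum of squared control group sizes

    # Simple average group size over all groups
    k_simple = (n1 + n2) / (len(treatment_sizes) + len(control_sizes))
    # Average of the two size-weighted condition averages
    k_bar_h = (sq1 / n1 + sq2 / n2) / 2
    # Effective group size from the Rayleigh quotient term
    k_tilde_h = (sq1 / n1**2 + sq2 / n2**2) / (1 / n1 + 1 / n2)
    return k_simple, k_bar_h, k_tilde_h


# Four treatment groups of size 4 and two control groups of size 10
k_simple, k_bar_h, k_tilde_h = case_two_group_size_summaries([4] * 4, [10] * 2)
```

Moving the large groups into the condition with more groups changes the ordering of the three summaries, which is one way to see why the simple average has no consistent relationship with the other two.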

For Case II, therefore, the corrections differ. If all of the group sizes have been reported in the original report, c = c _H can be calculated. If the largest two group sizes are available (or reasonable values for them can be assumed), then c _L can be calculated. The WWC correction, however, should be avoided, as it does not control its Type I error in all cases with unbalanced data.

Case III: An Additional Fixed Effect and Unbalanced Group Sizes

The Case III model considers a more general group randomized experiment than Case II in that a covariate has been included in the design matrix. Specifically,

\begin{array}{l} X = [\begin{matrix} 1_{1} & 0_{1} & A_{1} \\ 0_{2} & 1_{2} & A_{2} \end{matrix}] = [\begin{matrix} X_{H} & A \end{matrix}] \end{array},

where A _i are n_i × 1 vectors of the covariate values in condition i, X _H is the design matrix of Case II, and A is the entire N × 1 vector of covariate values. We will denote A_ijk as the value of the covariate for student k in school j in condition i, ${\bar{A}}_{i j .} = \frac{1}{K_{i j}} \sum_{k} A_{i j k}$ as the average of A for group (i, j), and ${\bar{A}}_{i ..} = \frac{1}{J_{i}} \sum_{j} {\bar{A}}_{i j .}$ as the average of A for condition i. Furthermore, we define the effect size of treatment assignment on A as:

d_{A} = \frac{{\bar{A}}_{1..} - {\bar{A}}_{2..}}{\sqrt{S_{A}^{2}}}, \quad \text{where} \quad S_{A}^{2} = \frac{1}{n_{1} + n_{2} - 2} \cdot {\sum_{i j k} {(A_{i j k} - {\bar{A}}_{i ..})}^{2}}

is the pooled variance estimator of A conditional on treatment assignment. Note that in randomized experiments, we expect d _A to be small: On average, randomization balances the distributions of both observed and unobserved covariates between treatment and control conditions.

For Case III, the relationship between c, c _H, c _L, and c _W depends on d _A and the variation in A between groups. In some cases, c ≈ c _H ≈ c _W and in others c ≈ c _L or c ≈ c _U. In Appendix B, available in the online version of the journal, we show that,

c = \sqrt{\frac{1}{1 + ({\tilde{K}}_{A} - 1) ρ_{0}} \cdot \frac{(N - 3) - 3 (\frac{2 {\bar{K}}_{H} + {\bar{K}}_{3}}{3} - 1) ρ_{0}}{N - 3}},

where

\begin{array}{l} {\bar{K}}_{3} = \frac{\sum_{i j} K_{i j}^{2} {({\bar{A}}_{i j .} - {\bar{A}}_{i ..})}^{2}}{S_{A}^{2} (n_{1} + n_{2} - 2)} \\ {\tilde{K}}_{A} = \frac{(\frac{1}{n_{1}} + \frac{1}{n_{2}}) \cdot {\tilde{K}}_{H} + (\frac{d_{A}^{2}}{n_{1} + n_{2} - 2}) \cdot {\bar{K}}_{3} - 2 (\frac{({\bar{a}}_{1} - {\bar{a}}_{2}) d_{A}}{(n_{1} + n_{2} - 2) \sqrt{S_{A}^{2}}})}{\frac{1}{n_{1}} + \frac{1}{n_{2}} + \frac{d_{A}^{2}}{n_{1} + n_{2} - 2}} \\ {\bar{a}}_{i} = \frac{1}{n_{i}} \sum_{j} K_{i j}^{2} ({\bar{A}}_{i j .} - {\bar{A}}_{i ..}), \end{array}

such that the cross terms ${\bar{a}}_{1}$ and ${\bar{a}}_{2}$ are zero in the special case where K_ij = K for all (i, j).

To facilitate comparisons between the various corrections, we consider their large sample forms. In large samples,

c = \sqrt{\frac{1}{1 + ({\tilde{K}}_{A} - 1) ρ_{0}} \cdot (1 - \frac{(2 {\bar{K}}_{H} + {\bar{K}}_{3} - 3) ρ_{0}}{N - 3})} \approx \sqrt{\frac{1}{1 + ({\tilde{K}}_{A} - 1) ρ_{0}}},

where large is defined as N being much larger than 3K ₍₁₎ρ₀. The large sample forms of c _W, c _H, and c _L are similar:

c_{W} \approx \sqrt{\frac{1}{1 + (\bar{K} - 1) ρ_{0}}}, \qquad c_{H} \approx \sqrt{\frac{1}{1 + ({\tilde{K}}_{H} - 1) ρ_{0}}}, \qquad c_{L} \approx \sqrt{\frac{1}{1 + (K_{(1)} - 1) ρ_{0}}} .

Although the large sample form of the Case III expression for c is simple, the expression for ${\tilde{K}}_{A}$ is quite complex. To gain insight, we consider a special case with simpler expressions for ${\bar{K}}_{3}$ and ${\tilde{K}}_{A}$ : a large experiment with many, equally sized small groups. Let J ₁ = J ₂ and K_ij = K for all (i, j), so that J = 2J ₁, and N = JK. The form of ${\bar{K}}_{3}$ is:

{\bar{K}}_{3} = \frac{\sum_{i j} K^{2} {({\bar{A}}_{i j .} - {\bar{A}}_{i ..})}^{2}}{S_{A}^{2} (N - 2)} \approx K \cdot \frac{\frac{1}{J} \sum_{i j} {({\bar{A}}_{i j .} - {\bar{A}}_{i ..})}^{2}}{S_{A}^{2}} = K \cdot \hat{ρ_{A}},

where we have approximated N − 2 ≈ N and defined:

\hat{ρ_{A}} = \frac{\frac{1}{J} \sum_{i j} {({\bar{A}}_{i j .} - {\bar{A}}_{i ..})}^{2}}{S_{A}^{2}},

as an ICC-like quantity for the covariate A: It is the ratio of an estimate of the between-group variation of the covariate A to an estimate of the total variation of the covariate A. This suggests an interpretation of ${\bar{K}}_{3}$ for the more general case of unbalanced data: It is a group size–weighted average of group variation divided by an estimate of the total variation. Since ${\bar{K}}_{3}$ is a Rayleigh quotient of $(\oplus_{i j} J_{i j})$ , the min–max theorem implies it is bounded by the largest and smallest eigenvalues of $(\oplus_{i j} J_{i j})$ , that is, K ₍₁₎ and 0. This implies that as between-group variation increases, ${\bar{K}}_{3}$ will approach its upper bound of K ₍₁₎, and if there is very little between-group variation, ${\bar{K}}_{3}$ will approach 0.

The form of ${\tilde{K}}_{A}$ also simplifies. Note that ${\tilde{K}}_{H} = {\bar{K}}_{H} = K$ , $\frac{1}{n_{1}} + \frac{1}{n_{2}} = \frac{4}{J K}$ , $n_{1} + n_{2} - 2 = J K - 2$ , and ${\bar{a}}_{1} = {\bar{a}}_{2} = 0$ because K_ij = K, therefore:

{\tilde{K}}_{A} = \frac{(\frac{4}{J K}) \cdot K + (\frac{d_{A}^{2}}{J K - 2}) \cdot {\bar{K}}_{3}}{\frac{4}{J K} + \frac{d_{A}^{2}}{J K - 2}} \approx \frac{4 K + d_{A}^{2} \cdot {\bar{K}}_{3}}{4 + d_{A}^{2}} = K \cdot \frac{4 + d_{A}^{2} \cdot \hat{ρ_{A}}}{4 + d_{A}^{2}},

where we have approximated JK − 2 ≈ JK.

If the between-group variation in A is large, $\hat{ρ_{A}}$ will be near 1, and we would expect ${\tilde{K}}_{A}$ to be close to K regardless of the effect size of A. If the between-group variation is small, then ${\tilde{K}}_{A}$ is near K if the effect size is much smaller than 2, and ${\tilde{K}}_{A}$ is near 0 if the effect size is much greater than K. For the unbalanced case, the relationship is similar: ${\tilde{K}}_{A}$ will approach K ₍₁₎ (and c will approach c _L) when the effect size of A is small or the variation in A between groups is large, and ${\tilde{K}}_{A}$ will approach 0 when the variation in A between groups is small and the effect size of A is large. Note that if ${\tilde{K}}_{A}$ approaches zero, then c approaches its upper bound of c _U.
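The behavior of ${\tilde{K}}_{A}$ in the balanced special case can be explored with a small Python sketch of the large-sample approximation ${\tilde{K}}_{A} \approx K (4 + d_{A}^{2} \hat{ρ_{A}}) / (4 + d_{A}^{2})$ and the corresponding large-sample form of c. The function names and the input values are hypothetical, and the approximation JK − 2 ≈ JK is assumed to hold.

```python
import math


def k_tilde_a_balanced(group_size, d_a, rho_a_hat):
    """Large-sample K-tilde_A for equal group sizes: K * (4 + d^2 * rho) / (4 + d^2)."""
    d_sq = d_a ** 2
    return group_size * (4.0 + d_sq * rho_a_hat) / (4.0 + d_sq)


def large_sample_c(k_tilde_a, rho0):
    """Large-sample full-information correction: 1 / sqrt(1 + (K-tilde_A - 1) * rho0)."""
    return 1.0 / math.sqrt(1.0 + (k_tilde_a - 1.0) * rho0)


K = 16  # hypothetical common group size

balanced_covariate = k_tilde_a_balanced(K, d_a=0.0, rho_a_hat=0.1)    # d_A = 0
high_between_group = k_tilde_a_balanced(K, d_a=3.0, rho_a_hat=1.0)    # rho-hat_A = 1
imbalanced_covariate = k_tilde_a_balanced(K, d_a=10.0, rho_a_hat=0.0) # large d_A, rho-hat_A = 0

c_when_balanced = large_sample_c(balanced_covariate, rho0=0.2)
c_when_imbalanced = large_sample_c(imbalanced_covariate, rho0=0.2)
```

In this sketch a covariate that is well balanced across conditions leaves ${\tilde{K}}_{A}$ at K, while a badly imbalanced covariate with little between-group variation drives ${\tilde{K}}_{A}$ toward zero and the correction c upward.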

In Case III, the WWC-corrected test will not control its Type I error relative to c ⋅ t _O for much the same reason as in Case II: There is no consistent relationship between $\bar{K}$ and ${\tilde{K}}_{A}$ . Similarly, the Hedges corrected test c _H ⋅ t _O does not control its Type I error either. The relationship between ${\tilde{K}}_{A}$ and ${\tilde{K}}_{H}$ depends on d _A and ${\bar{K}}_{3}$ . If d _A ≈ 0, then ${\tilde{K}}_{A} \approx {\tilde{K}}_{H}$ , the large sample corrections are similar, and c _H ⋅ t _O will approximately control its Type I error. If, however, the two averages differ substantially, as may happen if the randomization of the experiment failed to balance covariate values for this particular realization, then ${\tilde{K}}_{H}$ can be less than ${\tilde{K}}_{A}$ , and the Hedges correction will not control its Type I error.

In Case III, therefore, neither the WWC-corrected test nor the Hedges-corrected test will control its Type I error; only the full-information and limited-information corrected tests will control their Type I errors. Furthermore, the full-information correction is unlikely to be calculable because it is likely that neither the group sizes necessary to calculate ${\bar{K}}_{H}$ nor the covariate-level information necessary to calculate d _A, ${\bar{K}}_{3}$ , or ${\tilde{K}}_{A}$ will be available in the original study report. This implies that c _L is the only correction that controls its Type I error that is feasible to calculate.

Case IV: Additional Fixed Effects

The Case IV model considers a more general group randomized experiment than Case III in that additional covariates can be included in the design matrix. Specifically,

\begin{array}{l} X = [\begin{matrix} X_{H} & A & B & C & \dots \end{matrix}] \end{array},

where B, C, and so on, are additional covariate vectors.

Although the extension of Case III to Case IV is conceptually straightforward, it is notationally burdensome and the outcome is the same as in Case III: c ⋅ t _O and c _L ⋅ t _O are the only corrected tests to control their Type I error and, of these, only c _L ⋅ t _O has a reasonable chance of being feasible to calculate.

Note that if the number of covariates included in the model is too large, then c _L may be difficult to calculate directly because the size of the p largest groups may not have been given in the original report. In this case, we suggest a lower bound of c _L, which we define by replacing every eigenvalue of V _U with the largest eigenvalue λ₁ such that,

c_{L}^{⋆} = \sqrt{\frac{1}{λ_{1}} \cdot \frac{trace [V_{U}] - \sum_{i = 1}^{p} λ_{1}}{N - p}} = \sqrt{\frac{1}{1 + (K_{(1)} - 1) ρ_{0}} \cdot (1 - \frac{p \cdot (K_{(1)} - 1)}{N - p} \cdot ρ_{0})} .

Note that since $c_{L}^{⋆} \leq c_{L}$ , the corrected test $c_{L}^{⋆} \cdot t_{O}$ will control its Type I error relative to c _L ⋅ t _O and c ⋅ t _O. Furthermore, $c_{L}^{⋆}$ and c _L approach the same limit in large samples.

For Case IV, therefore, we recommend c _L ⋅ t _O if the p largest group sizes are available and $c_{L}^{⋆} \cdot t_{O}$ if only the largest group size is available.
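Both limited-information corrections can be computed directly from the eigenvalue forms given above. The following Python sketch assumes that the number of fixed effects p does not exceed the number of groups, so that the p largest eigenvalues of V _U all take the form 1 + (K ₍ₘ₎ − 1)ρ₀, and that trace[V _U] = N. The function name and the group sizes are hypothetical.

```python
import math


def limited_information_corrections(group_sizes, p, rho0):
    """Return (c_L, c_L_star) computed from the eigenvalues of V_U.

    c_L uses the p largest eigenvalues; c_L_star replaces each of them
    with the largest eigenvalue, so it needs only the largest group size.
    """
    ordered = sorted(group_sizes, reverse=True)
    if p > len(ordered):
        raise ValueError("this sketch assumes p does not exceed the number of groups")
    n_total = sum(group_sizes)                        # trace[V_U] = N
    top_eigenvalues = [1 + (k - 1) * rho0 for k in ordered[:p]]
    lambda_1 = top_eigenvalues[0]
    if n_total - p * lambda_1 <= 0:
        raise ValueError("N is too small relative to p * lambda_1 for c_L_star")
    c_l = math.sqrt((n_total - sum(top_eigenvalues)) / (lambda_1 * (n_total - p)))
    c_l_star = math.sqrt((n_total - p * lambda_1) / (lambda_1 * (n_total - p)))
    return c_l, c_l_star


# Hypothetical study: eight groups, treatment effects plus two covariates (p = 4)
sizes = [26, 20, 18, 16, 16, 15, 14, 12]
c_l, c_l_star = limited_information_corrections(sizes, p=4, rho0=0.2)

# Case II design from the text: four groups of size 4 and two of size 10 (p = 2)
c_l_case_two, _ = limited_information_corrections([4] * 4 + [10] * 2, p=2, rho0=0.2)
```

The second value never exceeds the first, and the gap between them shrinks as N grows relative to p times the largest eigenvalue.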

The Power of Corrected Tests

In this section, we show that the power of corrected tests is necessarily low. There are two reasons for this: First, the power of a reanalysis of the data is necessarily lower than that of the original study, and second, for practical situations, controlling the Type I error of the corrected test requires the test to be more conservative than a reanalysis, implying the power is less than that of a reanalysis. We will show, for example, that in the context of education research, the power of a corrected test is often less than one third and can approach zero in some cases.

To simplify our discussion, we assume that the original authors planned the study for a one-sided test, and the resulting sample sizes are sufficiently large that all the test statistics we consider are approximately normally distributed. Specifically, we assume the original analysts planned their study to detect a minimally detectable effect size (MDES), say MDES_O, at a Type I error of α and a power of 1 − β. Let Φ[] be the cumulative distribution function of a standard normal, and let z _α and z _β be critical values such that Φ[−z _α] = α and Φ[−z _β] = β. The original authors would have chosen sample sizes and covariates such that,

{\tilde{N}}_{O} \geq {(\frac{z_{α} + z_{β}}{{MDES}_{O}})}^{2},

because Equation 15 implies that for this choice of ${\tilde{N}}_{O}$ , the power of t _O under $M_{O}$ is at least 1 − β, that is,

Pr (t_{O} \geq z_{α} | M_{O} and δ = {MDES}_{O}) = 1 - Φ [z_{α} - {MDES}_{O} \cdot \sqrt{{\tilde{N}}_{O}}] \geq 1 - β .

The power of a reanalysis of the data, that is, the power of t _U under $M_{U}$ , is necessarily lower than 1 − β. For example, consider Case I from the previous section, where the only fixed effects in the model are those of treatment such that X = X _H, and the data are balanced such that K_ij = K for all (i, j) pairs and J ₁ = J ₂. In this case, a direct argument gives that ${\tilde{N}}_{U}$ of Equation 13 is:

\begin{array}{l} {\tilde{N}}_{U} = \frac{{\tilde{N}}_{O}}{1 + (K - 1) ρ_{0}} \end{array},

so that for the same Type I error and MDES, Equation 14 implies the power of t _U under $M_{U}$ is:

Pr (t_{U} \geq z_{α} | M_{U} and δ = {MDES}_{O}) = 1 - Φ [z_{α} - {MDES}_{O} \cdot \sqrt{{\tilde{N}}_{U}}],

= 1 - Φ [z_{α} - {MDES}_{O} \cdot \sqrt{\frac{{\tilde{N}}_{O}}{1 + (K - 1) ρ_{0}}}],

\geq 1 - Φ [z_{α} - \frac{| z_{α} + z_{β} |}{\sqrt{(1 + (K - 1) ρ_{0})}}],

where the tightness of the bound depends on the tightness of Equation 35, that is, whether or not the study was overpowered to detect MDES_O. Note that if the original study was planned to be exactly powered to detect MDES_O, then as K grows, the power of the test approaches $1 - Φ [z_{α}] = 1 - (1 - Φ [- z_{α}]) = α$ , the Type I error.

To put this in context, we consider how low the power can be for a study modeled after the Cognitive Tutor study. If we assume the experiment was planned under $M_{O}$ with a Type I error of 5% and an $\tilde{N_{O}}$ chosen to exactly power the study at 80%, then the bound of Equation 39 is tight. If the group size was 16 for every group, and the data were generated under $M_{U}$ such that the true ICC is 0.20, then plugging in the relevant values to Equation 39 gives the power of a reanalysis of the data as 34.4% instead of 80%. This performance is quite poor: In nearly two of every three replications, a reanalysis of the data will fail to detect the treatment effect of a truly effective intervention.

The performance of a corrected test is necessarily worse than this. Recall that, in practical situations, the only corrected test that both controls its Type I error and is feasible to calculate by a meta-analyst is c _L ⋅ t _O. We show in Appendix C, available in the online version of the journal, that the large sample power of c _L ⋅ t _O with a one-sided alternative under $M_{U}$ is bounded by:

π_{L} \leq Pr (c_{L} \cdot t_{O} \geq z_{α} | M_{U} and δ = {MDES}_{O}) \leq π_{U},

where

\begin{array}{l} π_{L} = 1 - Φ [z_{α} \cdot \sqrt{\frac{1 + (K_{(1)} - 1) ρ_{0}}{1 - ρ_{0}}} - \frac{{MDES}_{O}}{\sqrt{1 - ρ_{0}}} \cdot \sqrt{{\tilde{N}}_{O}}] \\ π_{U} = 1 - Φ [z_{α} - \frac{{MDES}_{O} \sqrt{{\tilde{N}}_{O}}}{\sqrt{1 + (K_{(1)} - 1) ρ_{0}}}] . \end{array}

Note that the upper bound on the power of the corrected test is the power of a reanalysis of the data. For example, in the special case of a study designed to achieve a power of exactly 1 − β with balanced group sizes, π_U is equal to Equation 39.

To put these bounds in context, we continue the Cognitive Tutor example. Morgan and Ritter (2002) report the size of the largest group as 26 students. Under the same assumptions as before, the formulas for π_L and π_U imply that the power of c _L ⋅ t _O under $M_{U}$ can range from an upper bound of 26.4% to a lower bound of 4.2%, which are both much less than 80%.
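These two bounds can be reproduced with the following Python sketch of the formulas for π_L and π_U. It assumes, as in the example, that the original study was exactly powered, so that MDES_O times the square root of ${\tilde{N}}_{O}$ equals z _α + z _β; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def corrected_test_power_bounds(alpha, planned_power, largest_group, rho0):
    """Return (pi_L, pi_U), bounds on the large-sample power of c_L * t_O."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)
    z_beta = std_normal.inv_cdf(planned_power)
    noncentrality = z_alpha + z_beta  # MDES_O * sqrt(N-tilde_O) for an exactly powered study
    lambda_1 = 1.0 + (largest_group - 1) * rho0

    pi_lower = 1.0 - std_normal.cdf(
        z_alpha * math.sqrt(lambda_1 / (1.0 - rho0))
        - noncentrality / math.sqrt(1.0 - rho0)
    )
    pi_upper = 1.0 - std_normal.cdf(z_alpha - noncentrality / math.sqrt(lambda_1))
    return pi_lower, pi_upper


# Values from the example: 5% level, 80% planned power, K_(1) = 26, ICC = 0.20
pi_lower, pi_upper = corrected_test_power_bounds(0.05, 0.80, largest_group=26, rho0=0.20)
# pi_lower is approximately 0.042 and pi_upper is approximately 0.264
```

Evaluating the function over a grid of largest group sizes for an ICC of 0.10 and of 0.20 gives curves of the kind the power comparison describes.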

Figure 1 extends the Cognitive Tutor example by comparing the nominal power of t _O under $M_{O}$ with both the upper and lower bounds on the power of c _L ⋅ t _O for two ICC values and various group sizes. We chose to display an ICC of 0.20 and an ICC of 0.10 both because the WWC uses these specific values when correcting studies and because empirical research on typical ICC values (Hedberg & Hedges, 2014; Hedges & Hedberg, 2007) suggests that these are reasonable bounds on the ICC. In both panels of the figure, the nominal power of t _O is a horizontal line at 80%. The other curves, the upper and lower bounds on the power of c _L ⋅ t _O when the ICC is 0.10 and when the ICC is 0.20, are calculated assuming that the $\tilde{N_{O}}$ of the original study was chosen to precisely achieve a power of 80% at the 5% level, that is, Equation 35 is tight with z _α = 1.64 and z _β = 0.84. Additionally, we have shaded the regions of the figure where the power is less than or equal to one third and have noted the group size for which the corrected tests cross this threshold.

Figure 1.

Comparison of upper (π _U) and lower (π _L) bounds on the power of c_L ⋅ t_O under $M_{U}$ assuming the original experiment was designed for a one-sided hypothesis test at the 5% level to be exactly powered at 80%.

Figure 1 implies that for group sizes and ICC values that are common in education research, the power of a corrected test is often less than one third and can approach zero in some cases. For example, if the ICC is 0.20, the power of c _L ⋅ t _O will be less than one third if the size of the largest group is greater than 18 students. The power can even approach zero: The lower bound of the power of c _L ⋅ t _O quickly approaches zero, implying that, in some cases, the power of c _L ⋅ t _O will also approach zero because the bound is tight. If the ICC is 0.10, the power of c _L ⋅ t _O improves, but the main message is the same: The power of a corrected test is low and can approach zero in some cases.

Implications for Education Policy

This research implies that the statistical methods the WWC uses for meta-analysis prevent it from finding effective interventions. In this section, we will show that adding any low power study, for example, either a corrected test statistic or a test statistic based on a reanalysis of the data, to an intervention report will lower the overall power of the intervention report relative to simply excluding the study. Since at least two out of five intervention reports contain corrected, that is, necessarily low power studies, the current WWC meta-analysis methods are inappropriate.

The WWC currently uses a variant of vote-counting meta-analysis to classify an intervention into one of six categories, ranging from “positive effects” to “negative effects” (WWC, 2014, pp. 27–28, table IV.3). To simplify our discussion, we consider a vote-counting meta-analysis with only two categories: positive effects or not positive effects. Let there be m identical studies that were designed to detect a nonzero treatment effect with a power of π for a one-sided test. Each of the m studies is sorted into one of two bins according to the outcome of the significance test: a bin for tests that are statistically significant and a bin for tests that fail to reject the null hypothesis. The rejection rule for the vote-counting procedure is to reject the overall null hypothesis of no treatment effect if there are more of the m tests in the statistically significant bin than otherwise. Formally, if we let U be the number of studies in the statistically significant bin, then we reject the overall null hypothesis if U/m > 1/2.

Hedges and Olkin (1980) demonstrated that for such a vote-counting meta-analysis, the power of the meta-analysis can go to zero if the power of each of the m studies is less than the threshold given in the rejection rule. In this case, if π < 1/2, then the power of the meta-analysis decreases as m increases. Briefly, their argument is that U has a binomial distribution, so the central limit theorem gives the large sample distribution of U/m as:

\frac{U}{m} ∻ N (π, \frac{π (1 - π)}{m}) .

This implies that as m increases, U/m will become concentrated around π. Since the vote-counting meta-analysis rejects the null when U/m > 1/2, the power goes to zero in the case where π < 1/2.
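The same conclusion can be seen without the normal approximation. The following Python sketch computes the exact binomial probability that U/m > 1/2 for the simplified two-category procedure; it uses the exact binomial distribution rather than the central limit theorem argument above, and the function name and the study counts are hypothetical.

```python
from math import comb


def vote_count_power(m, study_power):
    """Exact Pr(U / m > 1/2) when U ~ Binomial(m, study_power)."""
    threshold = m // 2 + 1  # smallest count strictly greater than m / 2
    return sum(
        comb(m, u) * study_power**u * (1.0 - study_power) ** (m - u)
        for u in range(threshold, m + 1)
    )


# Low power studies: adding more of them lowers the power of the meta-analysis
low_power_few = vote_count_power(5, 1 / 3)
low_power_many = vote_count_power(25, 1 / 3)

# Well powered studies: adding more of them raises the power of the meta-analysis
high_power_few = vote_count_power(5, 0.8)
high_power_many = vote_count_power(25, 0.8)
```

Odd values of m are used so that ties at exactly one half cannot occur, which keeps the comparison across different numbers of studies straightforward.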

Although the vote-counting procedure used by the WWC is more complex than the simple example given here, it shares a common property: If the power of a test statistic is less than the threshold used to reject the null hypothesis, then including this test statistic in the meta-analysis will lower the power of the meta-analysis. The WWC threshold is one third: Their recommendation of an intervention depends on whether or not the majority of studies included in the meta-analysis found a statistically significant and positive effect, statistically significant and negative effect, or did not achieve statistical significance. Since the power of both corrected and reanalyzed significance tests is commonly less than one third, including either in an intervention report will lower the power of the intervention report relative to simply excluding that study. We recommend, therefore, that the WWC change from vote-counting meta-analysis to a method that is able to synthesize the results of low power studies such as fixed effects meta-analysis.

Discussion

The aim of the evidence-based education movement is twofold: (a) to determine the best practices from scientifically rigorous studies and (b) to apply those best practices to educational decision-making (Shavelson & Towne, 2002). Throughout its history, however, the evidence-based education movement has struggled with the low quality of education research (Lagemann, 2000). For example, a common error is that an experiment will be designed to randomize entire schools to treatment and control conditions but then is analyzed ignoring the grouped nature of the randomization (Song & Herman, 2010). This error is well known to lead to invalid conclusions because it overstates the statistical significance of the treatment effect.

This error also leads to low power. If the original, incorrectly calculated test statistic is corrected, then the corrected test will control its Type I error, but as we have shown, it will necessarily have low power. The prevalence of incorrectly analyzed studies in education research, therefore, requires the WWC to use methods capable of combining the results of low power studies. Their current method cannot: adding any low power test (corrected or reanalyzed) to an intervention report will lower its overall power, that is, its ability to detect what works. The WWC, therefore, should cease attempting to correct significance tests and instead adopt a method of meta-analysis that, at a minimum, can effectively combine results from low power studies, such as reanalyzed studies.

More generally, the low power of a reanalysis makes it difficult to interpret the results of a corrected test statistic in the context of a single study. Since the power of a reanalysis is often approximately one third, at least two thirds of the time a corrected test will fail to reject the null even though the intervention has a positive effect. This implies that the majority of the time a corrected test statistic will fail to reject the null (95% if the null is true, at least 66% if the null is false) regardless of the effectiveness of the intervention. If the most likely outcome is a null result, what guidance about the effectiveness of an intervention can be given to policy makers, especially if a null result is observed?

Given these operating characteristics, it is difficult to recommend correcting significance tests for general use. Instead, we recommend simply discarding studies that would require correction until either a reanalysis using individual-level data and models that appropriately account for the design is available or new methods for the meta-analysis of incorrectly analyzed studies are developed. Given the prevalence of incorrectly analyzed studies, such methods are an important area for future research.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. The views and conclusions contained in this document are solely those of the individual creator(s) and should not be interpreted as representing official policies, either expressed or implied, of the Software Engineering Institute, Carnegie Mellon University, the U.S. Air Force, the U.S. Department of Defense, or the U.S. Government.

Funding

This work was supported in part by Institute of Education Sciences training grants to Carnegie Mellon University (#R305B090023) and Northwestern University (#R305B100027).

References

Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems I: Effect of inequality of variance in the one-way classification. The Annals of Mathematical Statistics, 25, 290–302.

Donner & Klar (2000). Design and analysis of cluster randomization trials in health research. London, England: Arnold.

Efron & Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher Information. Biometrika, 65, 457–482.

Graybill, F. A. (1976). Theory and application of the linear model. North Scituate, MA: Duxbury Press.

Hedberg, E. C., & Hedges, L. V. (2014). Reference values of within-district intraclass correlations of academic achievement by district characteristics: Results from a meta-analysis of district-specific values. Evaluation Review, 38, 546–582.

Hedges, L. V. (2007). Correcting a significance test for clustering. Journal of Educational and Behavioral Statistics, 32, 151–179.

Hedges, L. V. (2009). Adjusting a significance test for clustering in designs with two levels of nesting. Journal of Educational and Behavioral Statistics, 34, 464–490.

Hedges, L. V., & Hedberg, E. C. (2007). Intraclass correlation values for planning group-randomized trials in education. Educational Evaluation and Policy Analysis, 29, 60–87.

Hedges, L. V., & Olkin (1980). Vote-counting method in research synthesis. Psychological Bulletin, 88, 359–369.

Hedges, L. V., & Rhoads, C. H. (2011). Correcting an analysis of variance for clustering. The British Journal of Mathematical and Statistical Psychology, 64, 20–37.

Johnson, N. L., & Welch, B. L. (1940). Applications of the non-central t-distribution. Biometrika, 31, 362–389.

Lagemann, E. C. (2000). An elusive science: The troubling history of education research. Chicago, IL: University of Chicago Press.

Morgan & Ritter (2002). An experimental study of the effects of cognitive tutor algebra I on student knowledge and attitude. Pittsburgh, PA: Carnegie Learning.

Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York, NY: John Wiley.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.

Seber, G. A. F., & Lee, A. J. (2003). Linear regression analysis (2nd ed.). Hoboken, NJ: John Wiley.

Shavelson, R. J., & Towne (Eds.). (2002). Scientific research in education. Washington, DC: National Academies Press.

Slavin, R. E. (2008). Perspectives on evidence-based research in education-what works? Issues in synthesizing educational program evaluations. Educational Researcher, 37, 5–14.

Song & Herman (2010). Critical issues and common pitfalls in designing and conducting impact studies in education: Lessons learned from the What Works Clearinghouse (Phase I). Educational Evaluation and Policy Analysis, 32, 351–371.

Swindel, B. F. (1968). On the bias of some least-squares estimators of variance in a general linear model. Biometrika, 55, 313–316.

What Works Clearinghouse. (2014). Procedures and standards handbook (Version 3.0). Retrieved from http://ies.ed.gov/ncee/wwc/pdf/reference_resources/wwc_procedures_v3_0_standards_handbook.pdf
