Undesirable optimality results in multiple testing?

Abstract

A number of authors have considered the problem of making multiple comparisons among level-one parameters in multilevel models. This is a setting in which Bayesian procedures have a natural sampling theory interpretation, and where a natural justification for methods that control a directional version of the false discovery rate may be found. However, a basic desirable characteristic of multiple comparison procedures, namely that they should be more conservative than corresponding ‘per-comparison’ procedures, appears to be violated by some optimal procedures that have been developed in a multilevel setting. This concern is illustrated in the context of a very simple multilevel model, namely one-way, random-effects analysis of variance.

Keywords

Analysis of variance; Bayesian decision theory; false discovery rate; multilevel models; multiple comparisons; random effects

1 Intuitions about multiple testing

Since the earliest days of their study of the testing of multiple hypotheses, many statisticians have had the strong feeling that multiple tests need to be more conservative than individual tests, in the sense of identifying fewer null hypotheses as false. Phrased in terms of error rates, the feeling has been that controlling a per-comparison Type I error rate for a set of comparisons is not sufficient, and that it is more appropriate to control a familywise or a per-family error rate (see, for instance, Hochberg and Tamhane, 1987). More recently, interest has focused on controlling the false discovery rate (FDR; see Benjamini and Hochberg, 1995) for a set of comparisons. Procedures that control the FDR are more conservative than those that control per-comparison rates, but less conservative than ones that control per-family rates.

While most of the work on multiple comparisons has been restricted to fixed-effect models, a number of researchers have considered the problem of making multiple comparisons for multilevel (random effects) models. Duncan (1965) and Waller and Duncan (1969) adopted a Bayesian decision theory framework in their early work on this problem. Gelman and Tuerlinckx (2000), Lewis and Thayer (2004) and Shaffer (1999) compared Bayesian and sampling theory approaches to comparing multiple random effects. The latter two papers discussed the FDR in this setting. More recently, Sarkar and Zhou (2008) also considered a Bayesian version of the FDR for comparing random effects. In this paper, we wish to revisit some of this work and highlight optimality results that appear to be undesirable from a traditional perspective. To keep the discussion as simple as possible, we will restrict our attention to a one-way random-effects analysis of variance model with both variance components assumed to be known. However, the issues we raise apply equally to general multilevel models, where the goal is to make multiple comparisons among lower level parameters (i.e., random effects). All of the results given in this paper may also be found in the earlier papers that have been cited, usually for more general models than the one considered here. The main point of this paper is to review and critically evaluate these multilevel results in the context of traditional multiple comparisons thinking.

2 One-way random-effects ANOVA setup

Suppose we have m groups, with population means $μ^{'} = (μ_{1}, \ldots, μ_{m})$ , and sample means (each based on a sample of n observations) denoted by ${\bar{y}}^{'} = ({\bar{y}}_{1}, \ldots, {\bar{y}}_{m})$ . The first level model is given by

{\bar{y}}_{j} | μ_{j} ~N (μ_{j}, σ^{2} / n)

(2.1)

and the second level model (or prior) by

μ_{j} ~N (θ, τ^{2}) .

(2.2)

We will treat $θ$ , $σ^{2}$ and $τ^{2}$ as known, so that we may focus attention on inference for the $μ_{j}$ . This simplification also avoids a discussion of the differences between sampling theory and Bayesian approaches to the estimation of these other quantities.

For each group j, we have two standard results. The marginal distribution of the sample mean is

{\bar{y}}_{j} ~N (θ, τ^{2} + σ^{2} / n)

(2.3)

The conditional (posterior) distribution of the population mean, given the sample mean, is

μ_{j} | {\bar{y}}_{j} ~N ({\hat{μ}}_{j}, v)

(2.4)

with

{\hat{μ}}_{j} = (n τ^{2} {\bar{y}}_{j} + σ^{2} θ) / (n τ^{2} + σ^{2})

(2.5)

and

v = τ^{2} σ^{2} / (n τ^{2} + σ^{2})

(2.6)
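As a minimal numerical sketch of (2.5) and (2.6), the following Python function computes the posterior mean and variance for a single group. The function name and the illustrative parameter values are assumptions introduced here and are not part of the original development.

```python
def posterior_mean_var(ybar, n, theta, sigma2, tau2):
    """Posterior mean (2.5) and variance (2.6) of mu_j given ybar_j."""
    denom = n * tau2 + sigma2
    mu_hat = (n * tau2 * ybar + sigma2 * theta) / denom
    v = tau2 * sigma2 / denom
    return mu_hat, v


# Illustrative (assumed) values: n = 10, theta = 0, sigma^2 = 4, tau^2 = 1.
mu_hat, v = posterior_mean_var(ybar=2.0, n=10, theta=0.0, sigma2=4.0, tau2=1.0)
print(mu_hat, v)  # the sample mean 2.0 is shrunk toward theta = 0
```

The posterior mean is a weighted average of the sample mean and θ, so it always lies between the two.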

Now consider all pairwise comparisons between group means, taking the form

ψ_{i} = μ_{j} - μ_{j^{'}}

(2.7)

Marginal and conditional distributions for these comparisons are

ψ_{i} ~N (0, 2 τ^{2})

(2.8)

{\bar{y}}_{j} - {\bar{y}}_{j^{'}} | ψ_{i} ~ N (ψ_{i}, 2 σ^{2} / n)

(2.9)

{\bar{y}}_{j} - {\bar{y}}_{j^{'}} ~N (0, 2 τ^{2} + 2 σ^{2} / n)

(2.10)

and

ψ_{i} | {\bar{y}}_{j} - {\bar{y}}_{j^{'}} ~ N ({\hat{ψ}}_{i}, 2 v)

(2.11)

with

{\hat{ψ}}_{i} = {\hat{μ}}_{j} - {\hat{μ}}_{j^{'}} = n τ^{2} ({\bar{y}}_{j} - {\bar{y}}_{j^{'}}) / (n τ^{2} + σ^{2})

(2.12)

for $i = 1, \ldots, m^{*}$ , where $m^{*} = m (m - 1) / 2$ .
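The pairwise quantities in (2.11) and (2.12) can be sketched in Python as follows. The helper name `pairwise_posteriors` and the sample means used are hypothetical, chosen only to illustrate that m groups yield m(m - 1)/2 comparisons sharing the common posterior variance 2v.

```python
from itertools import combinations


def pairwise_posteriors(ybars, n, sigma2, tau2):
    """Return [((j, j'), psi_hat)] from (2.12) and the common variance 2v."""
    denom = n * tau2 + sigma2
    shrink = n * tau2 / denom
    two_v = 2.0 * tau2 * sigma2 / denom
    pairs = [((j, jp), shrink * (ybars[j] - ybars[jp]))
             for j, jp in combinations(range(len(ybars)), 2)]
    return pairs, two_v


# Illustrative (assumed) sample means for m = 4 groups.
pairs, two_v = pairwise_posteriors([0.0, 1.0, 2.5, 4.0], n=10, sigma2=4.0, tau2=1.0)
print(len(pairs), two_v)
```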

3 Decision theory framework for a single comparison

In this section, a decision theoretic approach to multiple comparisons very similar to that proposed by Lehmann (1950, 1957a, 1957b) is adopted. It is important to note that the focus of this approach is on correctly identifying the sign of each comparison, rather than determining if the comparison is zero. This modification to traditional hypothesis testing is discussed by many authors, including Jones and Tukey (2000) and, in the context of multiple comparisons, by Shaffer (2002) and Williams et al. (1999).

First, consider a single pairwise comparison. For this pairwise comparison $ψ_{i}$ , take action $a_{i}$ :

a_{i} = \begin{cases} + 1 & \text{if } ψ_{i} \text{ is declared to be positive} \\ - 1 & \text{if } ψ_{i} \text{ is declared to be negative} \\ 0 & \text{if the sign of } ψ_{i} \text{ is not determined} \end{cases}

(3.1)

Next define two components that will be used to construct loss functions:

L_{1} (ψ_{i}, a_{i}) = \begin{cases} 1 & \text{if the signs of } ψ_{i} \text{ and } a_{i} \text{ disagree} \\ 0 & \text{otherwise} \end{cases}

(3.2)

(used to indicate wrong sign declarations) and

L_{2} (ψ_{i}, a_{i}) = \begin{cases} 1 & \text{if } a_{i} = 0 \\ 0 & \text{otherwise} \end{cases}

(3.3)

(used to indicate signs not determined).

Consider a loss function for declaring the sign of $ψ_{i}$ given by the following combination of these two components:

L_{P C i} (ψ_{i}, a_{i}) = L_{1} (ψ_{i}, a_{i}) + (α / 2) L_{2} (ψ_{i}, a_{i})

(3.4)

Here the subscript ‘PC’ refers to ‘per-comparison’. $L_{P C i}$ treats a wrong-sign declaration to be $2 / α$ times as serious as being unable to determine the sign by giving $L_{1}$ a weight of 1.0 and $L_{2}$ a weight of $α / 2$ . Traditional choices for $α$ would be 0.05 or 0.01, so $2 / α$ would be 40 or 200.
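A small Python sketch of the loss (3.4) may help fix ideas. It assumes that a wrong-sign declaration means a nonzero action whose sign is opposite to that of ψ_i; the function name is hypothetical.

```python
def loss_pc(psi, a, alpha):
    """Per-comparison loss (3.4) for true psi and action a in {-1, 0, +1}."""
    l1 = 1.0 if a * psi < 0 else 0.0   # wrong sign declared
    l2 = 1.0 if a == 0 else 0.0        # sign not determined
    return l1 + (alpha / 2.0) * l2


alpha = 0.05
print(loss_pc(1.5, -1, alpha))  # wrong sign
print(loss_pc(1.5, 0, alpha))   # sign not determined
print(loss_pc(1.5, +1, alpha))  # correct sign
```

With α = 0.05, a wrong sign costs 1 while an undetermined sign costs α/2, which reproduces the 40-to-1 weighting described above.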

Bayesian decision theory identifies the optimal decision rule for this loss function, $δ_{P C i} ({\hat{ψ}}_{i})$ , as the rule $δ ({\hat{ψ}}_{i})$ that minimizes the posterior expected loss $E_{ψ_{i} | {\hat{ψ}}_{i}} [L_{P C i} (ψ_{i}, δ ({\hat{ψ}}_{i})) | {\hat{ψ}}_{i}]$ for each ${\hat{ψ}}_{i}$ . To find the posterior expected loss, it will be helpful to introduce some notation. If $\Pr (ψ_{i} > 0 | {\hat{ψ}}_{i}) > 0.5$ , define

a_{i}^{*} = + 1

p_{i} = \Pr (ψ_{i} < 0 | {\hat{ψ}}_{i})

(3.5)

if $\Pr (ψ_{i} > 0 | {\hat{ψ}}_{i}) \leq 0.5$ , define

a_{i}^{*} = - 1

p_{i} = \Pr (ψ_{i} > 0 | {\hat{ψ}}_{i})

(3.6)

It then follows that

E_{ψ_{i} | {\hat{ψ}}_{i}} [L_{P C i} (ψ_{i}, a_{i}^{*}) | {\hat{ψ}}_{i}] = E_{ψ_{i} | {\hat{ψ}}_{i}} [L_{1} (ψ_{i}, a_{i}^{*}) | {\hat{ψ}}_{i}] = p_{i} .

(3.7)

If $a_{i} = 0,$ we have

L_{P C i} (ψ_{i}, 0) = L_{1} (ψ_{i}, 0) + (α / 2) L_{2} (ψ_{i}, 0) = α / 2 .

(3.8)

Therefore, the Bayes rule declares the sign of $ψ_{i}$ only when $p_{i} < α / 2$ , namely

δ_{P C i} ({\hat{ψ}}_{i}) = \begin{cases} a_{i}^{*} & \text{if } p_{i} < α / 2 \\ 0 & \text{otherwise} \end{cases}

(3.9)

Since the posterior expected loss for $δ_{P C i}$ is always less than or equal to $α / 2$ , it follows that the Bayes risk for $δ_{P C i}$ is also less than or equal to $α / 2$ :

E_{ψ_{i}, {\hat{ψ}}_{i}} [L_{P C i} (ψ_{i}, δ_{P C i} ({\hat{ψ}}_{i}))] \leq α / 2 .

(3.10)

Consequently,

E_{ψ_{i}, {\hat{ψ}}_{i}} [L_{1} (ψ_{i}, δ_{P C i} ({\hat{ψ}}_{i}))] \leq α / 2 .

(3.11)

This last expectation is the (random-effects) probability of incorrectly declaring the sign of $ψ_{i}$ using the decision rule $δ_{P C i}$ . In other words, the Bayes rule $δ_{P C i}$ controls the random-effects per-comparison wrong sign rate for $ψ_{i}$ at $α / 2$ .
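The bound in (3.11) can be checked informally by simulation. The sketch below draws ψ_i from (2.8) and the difference in sample means from (2.9), applies the Bayes rule using the posterior (2.11), and counts wrong-sign declarations. The function name, seed, number of replications and parameter values are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_wrong_sign_rate(n, sigma2, tau2, alpha, reps=20000, seed=1):
    """Monte Carlo estimate of the random-effects wrong sign rate in (3.11)."""
    rng = random.Random(seed)
    shrink = n * tau2 / (n * tau2 + sigma2)
    two_v = 2.0 * tau2 * sigma2 / (n * tau2 + sigma2)
    std_normal = NormalDist()
    wrong = 0
    for _ in range(reps):
        psi = rng.gauss(0.0, sqrt(2.0 * tau2))            # draw from (2.8)
        diff = rng.gauss(psi, sqrt(2.0 * sigma2 / n))     # draw from (2.9)
        psi_hat = shrink * diff                           # posterior mean (2.12)
        p = std_normal.cdf(-abs(psi_hat) / sqrt(two_v))   # posterior wrong-sign probability
        if p < alpha / 2.0:                               # a sign is declared
            a = 1 if psi_hat > 0 else -1
            if a * psi < 0:
                wrong += 1
    return wrong / reps


rate = simulate_wrong_sign_rate(n=10, sigma2=4.0, tau2=1.0, alpha=0.05)
print(rate)  # estimate of the wrong sign rate, to be compared with alpha / 2
```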

In what follows, it will be useful to have an explicit expression for $p_{i}$ :

\begin{array}{l} p_{i} = \min \{ \Pr (ψ_{i} > 0 | {\hat{ψ}}_{i}), \Pr (ψ_{i} < 0 | {\hat{ψ}}_{i}) \} \\ = Φ [\frac{- | {\hat{ψ}}_{i} |}{\sqrt{2 v}}] \\ = Φ [\frac{- | {\bar{y}}_{j} - {\bar{y}}_{j^{'}} |}{\sqrt{2 σ^{2} / n}} \sqrt{\frac{n τ^{2}}{n τ^{2} + σ^{2}}}] . \end{array}

(3.12)
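A minimal Python sketch of (3.12), together with the Bayes rule (3.9), is given below. It uses only the standard library; the function names and numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def p_random(diff, n, sigma2, tau2):
    """Posterior wrong-sign probability p_i from the last line of (3.12)."""
    z = abs(diff) / sqrt(2.0 * sigma2 / n)
    shrink = n * tau2 / (n * tau2 + sigma2)
    return NormalDist().cdf(-z * sqrt(shrink))


def bayes_rule_pc(diff, n, sigma2, tau2, alpha):
    """Bayes rule (3.9): declare the sign a_i^* only if p_i < alpha / 2."""
    if p_random(diff, n, sigma2, tau2) < alpha / 2.0:
        return 1 if diff > 0 else -1
    return 0


# Illustrative (assumed) values: n = 10, sigma^2 = 4, tau^2 = 1, alpha = 0.05.
print(bayes_rule_pc(2.0, n=10, sigma2=4.0, tau2=1.0, alpha=0.05))   # no sign declared
print(bayes_rule_pc(-3.0, n=10, sigma2=4.0, tau2=1.0, alpha=0.05))  # negative sign declared
```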

For the usual fixed-effects per-comparison test for declaring the sign of $ψ_{i}$ , the ‘p-value’ is given by

p_{f i} = Φ [\frac{- | {\bar{y}}_{j} - {\bar{y}}_{j^{'}} |}{\sqrt{2 σ^{2} / n}}]

(3.13)

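so, because $n τ^{2} / (n τ^{2} + σ^{2}) < 1$ , we have $p_{i} \geq p_{f i}$ , with strict inequality whenever ${\bar{y}}_{j} \neq {\bar{y}}_{j^{'}}$ . Consider a fixed-effects decision rule defined by

δ_{f i} ({\hat{ψ}}_{i}) = \begin{cases} a_{i}^{*} & \text{if } p_{f i} < α / 2 \\ 0 & \text{otherwise} \end{cases}

(3.14)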

Since $p_{f i}$ is based on the distribution of ${\hat{ψ}}_{i} | ψ_{i}$ given in (2.9), it follows that

E_{{\hat{ψ}}_{i} | ψ_{i}} [L_{1} (ψ_{i}, δ_{f i} ({\hat{ψ}}_{i})) | ψ_{i}] \leq p_{f i} \leq α / 2

(3.15)

for all $ψ_{i}$ , and so

E_{{\hat{ψ}}_{i}, ψ_{i}} [L_{1} (ψ_{i}, δ_{f i} ({\hat{ψ}}_{i}))] \leq α / 2

(3.16)

The conclusion of the analysis so far is that the random-effects per-comparison rule and the fixed-effects rule both control the random-effects per-comparison wrong sign declaration rate for $ψ_{i}$ at $α / 2$ , but that the random-effects (Bayes) rule is more conservative than the fixed-effects (traditional) rule.
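The difference between the two rules can be seen in a single numerical case. In the sketch below (function name and inputs are assumptions for illustration), the same observed difference leads the fixed-effects rule to declare a sign while the random-effects Bayes rule does not.

```python
from math import sqrt
from statistics import NormalDist


def p_values(diff, n, sigma2, tau2):
    """Return (p_f, p_i): fixed-effects (3.13) and random-effects (3.12) values."""
    z = abs(diff) / sqrt(2.0 * sigma2 / n)
    shrink = n * tau2 / (n * tau2 + sigma2)
    std_normal = NormalDist()
    return std_normal.cdf(-z), std_normal.cdf(-z * sqrt(shrink))


alpha = 0.05
p_f, p_i = p_values(diff=2.0, n=10, sigma2=4.0, tau2=1.0)
print(p_f < alpha / 2, p_i < alpha / 2)  # fixed-effects rule declares; Bayes rule does not
```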

4 Decision theory for multiple comparisons

The definition of the per-comparison loss function may be extended to the complete set of comparisons:

L_{P C} (ψ, a) = \frac{\sum_{i = 1}^{m^{*}} L_{1} (ψ_{i}, a_{i})}{m^{*}} + (\frac{α}{2}) \frac{\sum_{i = 1}^{m^{*}} L_{2} (ψ_{i}, a_{i})}{m^{*}} = \frac{\sum_{i = 1}^{m^{*}} L_{P C i} (ψ_{i}, a_{i})}{m^{*}}

(4.1)

with $ψ^{'} = (ψ_{1}, \ldots, ψ_{m^{*}})$ and $a^{'} = (a_{1}, \ldots, a_{m^{*}})$ .

This new loss function equals the proportion of comparisons whose signs are incorrectly declared using a, plus $α / 2$ times the proportion of comparisons whose signs are not determined using a.

To continue, it will be useful to introduce a family of optimal action vectors $a^{(k)}$ . First, order the $p_{i}$ so that

p_{(1)} \leq \ldots \leq p_{(m^{*})}

(4.2)

Define $a^{(k)}$ for $k = 1, \ldots, m^{*}$ as

a_{(i)}^{(k)} = a_{(i)}^{*}, \text{ for } i = 1, \ldots, k

a_{(i)}^{(k)} = 0, \text{ for } i = k + 1, \ldots, m^{*} .

(4.3)

Define $a^{(0)}$ by

a_{(i)}^{(0)} = 0, \text{ for } i = 1, \ldots, m^{*} .

(4.4)

Using this notation, it is straightforward to show that the Bayes rule for the loss function $L_{P C}$ is given by

δ_{P C} (\hat{ψ}) = a^{(k_{P C})}

(4.5)

where $k_{P C}$ is the largest value of k such that

p_{(k)} < α / 2

(4.6)

with $k_{P C} = 0$ if $p_{(1)} \geq α / 2$ .
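The rule (4.5) can be sketched in a few lines of Python. Because the $p_{(i)}$ are ordered, the largest k with $p_{(k)} < α / 2$ is simply the number of comparisons with $p_{i} < α / 2$, so no explicit sort is needed. The function name and the example inputs are hypothetical.

```python
def delta_pc(p, a_star, alpha):
    """Multiple-comparison Bayes rule (4.5): returns (actions, k_PC)."""
    actions = [a if p_i < alpha / 2.0 else 0 for p_i, a in zip(p, a_star)]
    k_pc = sum(1 for a in actions if a != 0)
    return actions, k_pc


# Illustrative (assumed) posterior wrong-sign probabilities and provisional signs.
p = [0.2, 0.001, 0.03, 0.02]
a_star = [1, -1, 1, -1]
actions, k_pc = delta_pc(p, a_star, alpha=0.05)
print(actions, k_pc)
```

Each comparison is handled exactly as in the single-comparison rule (3.9); the multiplicity of comparisons plays no role in the decision.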

The posterior expected loss for $δ_{P C}$ is

E_{ψ | \hat{ψ}} [L_{P C} (ψ, δ_{P C} (\hat{ψ})) | \hat{ψ}] = \begin{cases} \frac{\sum_{i = 1}^{k_{P C}} p_{(i)}}{m^{*}} + (\frac{α}{2}) (\frac{m^{*} - k_{P C}}{m^{*}}) & \text{if } k_{P C} \geq 1 \\ α / 2 & \text{if } k_{P C} = 0 \end{cases}

(4.7)

Since

L_{P C} (ψ, a^{(0)}) = α / 2

(4.8)

the posterior expected loss for the Bayesian decision function $δ_{P C}$ must be less than or equal to $α / 2$ , and so the Bayes risk for $δ_{P C}$ must also be less than or equal to $α / 2$ :

E_{ψ, \hat{ψ}} [L_{P C} (ψ, δ_{P C} (\hat{ψ}))] \leq α / 2

(4.9)

Consequently,

E_{ψ, \hat{ψ}} [(\frac{1}{m^{*}}) \sum_{i = 1}^{m^{*}} L_{1} (ψ_{i}, δ_{P C i} ({\hat{ψ}}_{i}))] \leq \frac{α}{2}

(4.10)

This last expectation is the random-effects per-comparison wrong sign rate for $ψ$ using the Bayes rule $δ_{P C}$ . Note that the Bayes rule is chosen to minimize the posterior expected loss using $L_{P C}$ . However, it also has the property of controlling a wrong sign rate.

Rewriting the bound on the posterior expected loss for $δ_{P C}$ given $\hat{ψ}$ , we have

\frac{\sum_{i = 1}^{k_{P C}} p_{(i)}}{m^{*}} + (\frac{α}{2}) (\frac{m^{*} - k_{P C}}{m^{*}}) \leq \frac{α}{2}

(4.11)

where, for simplicity of notation, we define $\sum_{i = 1}^{0} p_{(i)} = 0$ .

Consequently, we may write

\frac{\sum_{i = 1}^{k_{P C}} p_{(i)}}{m^{*}} \leq (\frac{α}{2}) (\frac{k_{P C}}{m^{*}})

(4.12)

and hence, dividing both sides by $k_{P C}$ when $k_{P C} \geq 1$ ,

\frac{\sum_{i = 1}^{k_{P C}} p_{(i)}}{\max \{ 1, k_{P C} \}} \leq \frac{α}{2}

(4.13)

This last inequality in turn implies that the joint expectation satisfies

E_{ψ, \hat{ψ}} [\frac{\sum_{i = 1}^{m^{*}} L_{1} (ψ_{i}, δ_{P C i} ({\hat{ψ}}_{i}))}{\max \{ 1, k_{P C} \}}] \leq \frac{α}{2}

(4.14)

This quantity (evaluated for any decision rule $δ$ ) is referred to by Sarkar and Zhou (2008) as the Bayesian directional false discovery rate, or BDFDR, for $δ$ . It may also be thought of as the random-effects version of the fixed-effects directional FDR introduced by Williams et al. (1999).
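As an informal check of the posterior bound behind this result, the sketch below computes, for the per-comparison Bayes rule, the posterior expected number of wrong-sign declarations divided by the number of declared signs (with the denominator set to 1 when no sign is declared). The function name and the example probabilities are assumptions made for illustration.

```python
def posterior_dfdr_pc(p, alpha):
    """Posterior expected proportion of declared signs that are wrong under delta_PC."""
    declared = [p_i for p_i in p if p_i < alpha / 2.0]
    return sum(declared) / max(1, len(declared))


# Illustrative (assumed) posterior wrong-sign probabilities.
p = [0.2, 0.001, 0.03, 0.02]
value = posterior_dfdr_pc(p, alpha=0.05)
print(value, value <= 0.05 / 2)
```

Since every declared comparison has $p_{i} < α / 2$, the average of the declared $p_{i}$ cannot exceed α/2, which is why the bound holds for any data.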

The above result, that $δ_{P C}$ controls the BDFDR, was given by Lewis and Thayer (2004). It was not emphasized in that paper that having a per-comparison rule control a version of the FDR is counterintuitive, at least in a traditional fixed effects multiple comparisons setting, where controlling the FDR is more conservative than controlling a per-comparison error rate (but less conservative than controlling a per-family error rate). However, the fact that the random-effects per-comparison rule is more conservative than the traditional fixed-effects per-comparison rule, as noted earlier, might address this concern to a degree. The property that makes the random-effects rule conservative, namely the degree to which the ratio $n τ^{2} / (n τ^{2} + σ^{2})$ is less than unity, is consistent in spirit with a procedure that controls the fixed-effects version of the FDR, a fact noted by Shaffer (1999). Specifically, as the variance component $τ^{2}$ increases relative to $σ^{2} / n$ , this ratio approaches unity and the random-effects rule approaches the fixed-effects rule.

Sarkar and Zhou (2008) propose another decision rule (here labelled $δ_{S Z}$ ) that is designed to control the BDFDR and has the property among such rules of maximizing a posterior ‘per-comparison power rate’. Specifically, $δ_{S Z} (\hat{ψ}) = a^{(k_{S Z})}$ , where $k_{S Z}$ is the largest value of k such that

\frac{\sum_{i = 1}^{k} p_{(i)}}{k} \leq \frac{α}{2}

(4.15)

with $k_{S Z} = 0$ if $p_{(1)} \geq α / 2$ .
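The cutoff in (4.15) can be sketched in Python as follows. The function name and the example probabilities are hypothetical; the per-comparison count is computed alongside only for contrast.

```python
def k_sz(p, alpha):
    """Largest k whose running mean of the ordered p_(i) is at most alpha / 2."""
    k_best, running = 0, 0.0
    for k, p_k in enumerate(sorted(p), start=1):
        running += p_k
        if running / k <= alpha / 2.0:
            k_best = k
    return k_best


# Illustrative (assumed) posterior wrong-sign probabilities.
p = [0.2, 0.001, 0.03, 0.02]
alpha = 0.05
k_pc = sum(1 for p_i in p if p_i < alpha / 2.0)
print(k_sz(p, alpha), k_pc)
```

In this example the averaging rule declares one more sign than the per-comparison rule, because the very small values among the ordered $p_{(i)}$ offset a value that exceeds α/2 on its own.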

It is straightforward to show that the decision rule $δ_{S Z}$ controls the BDFDR at $α / 2$ . Sarkar and Zhou (2008) also proved that, among (non-randomized) rules that control the BDFDR, $δ_{S Z}$ maximizes the posterior per-comparison power rate, defined using the notation of this paper as

E_{ψ | \hat{ψ}} [\sum_{i = 1}^{m^{*}} [1 - L_{1} (ψ_{i}, δ_{i} ({\hat{ψ}}_{i})) - L_{2} (ψ_{i}, δ_{i} ({\hat{ψ}}_{i}))] / m^{*} | \hat{ψ}]

(4.16)

Note that $δ_{S Z}$ is not a Bayes decision rule and does not minimize the posterior expectation of any specified loss function. Instead, it is defined in the tradition of sampling theory hypothesis testing, maximizing power while controlling a (directional) error rate.

Not only does $δ_{S Z}$ have more power than the (Bayesian) random-effects per-comparison rule $δ_{P C}$ , it may also have more power than the (sampling theory) fixed-effects per-comparison rule $δ_{f}$ . In other words, $δ_{S Z}$ will sometimes declare a sign for $ψ_{i}$ even when $p_{f i} > α / 2$ .
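A constructed example may make this concrete. The sketch below uses three hypothetical groups with assumed values n = 2, σ² = 1 and τ² = 50 (so that the random-effects and fixed-effects p-values are close), computes both sets of p-values for all pairs, and applies the averaging cutoff of $δ_{S Z}$. All names and numbers are illustrative assumptions, not results from the cited papers.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def sz_vs_fixed(ybars, n, sigma2, tau2, alpha):
    """Return (k_SZ, rows) with rows = (pair, p_f, p_i), sorted by p_i."""
    shrink = n * tau2 / (n * tau2 + sigma2)
    scale = sqrt(2.0 * sigma2 / n)
    std_normal = NormalDist()
    rows = []
    for j, jp in combinations(range(len(ybars)), 2):
        z = abs(ybars[j] - ybars[jp]) / scale
        rows.append(((j, jp), std_normal.cdf(-z), std_normal.cdf(-z * sqrt(shrink))))
    rows.sort(key=lambda r: r[2])
    k_sz, running = 0, 0.0
    for k, (_, _, p_i) in enumerate(rows, start=1):
        running += p_i
        if running / k <= alpha / 2.0:
            k_sz = k
    return k_sz, rows


alpha = 0.05
k_sz, rows = sz_vs_fixed([0.0, 1.9, 6.0], n=2, sigma2=1.0, tau2=50.0, alpha=alpha)
for pair, p_f, p_i in rows[:k_sz]:
    # The last column flags declared signs that the fixed-effects rule would not declare.
    print(pair, round(p_f, 4), round(p_i, 4), p_f > alpha / 2)
```

Here all three signs are declared, including the comparison between the first two groups, whose fixed-effects p-value exceeds α/2.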

5 Too much power?

To summarize, in a multilevel model like random-effects ANOVA, Bayesian ideas have sampling interpretations. In particular, we may define a Bayesian (or random-effects) version of the FDR: The expected (over both levels) proportion of declared signs for a set of comparisons that are incorrectly declared. However, a Bayesian, random-effects per-comparison decision rule turns out to provide control of this FDR, even though it was only designed to minimize the Bayes risk for a per-comparison loss function. This result does not parallel the relationship between controlling the FDR and the per-comparison error rate in a fixed effects setting and, as such, might be considered counterintuitive.

A rule designed to control this FDR may have more power than a fixed-effects per-comparison rule. Again, this result does not parallel the FDR/per-comparison relationship for fixed-effects designs, and might also be considered counterintuitive. As interesting as the problem of multiple comparisons for random effects may be, it is clear that some additional thought is required before we proceed to the routine application of Bayesian and semi-Bayesian ideas in such settings.

There may be some analogies between these results and those discussed in Perlman and Wu (1999) who give another example of undesirable optimality. In addition, some perspective on what has been found here may be provided by Sun and Cai (2007) who build on the results of Robbins (1951), highlighting the complexity of compound decision rules.

References

Benjamini and Hochberg (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289–300.

Duncan (1965) A Bayesian approach to multiple comparisons. Technometrics, 7, 171–222.

Gelman and Tuerlinckx (2000) Type S error rates for classical and Bayesian single and multiple comparison procedures. Computational Statistics, 15, 373–90.

Hochberg and Tamhane (1987) Multiple comparison procedures. New York: John Wiley.

Jones and Tukey (2000) A sensible formulation of the significance test. Psychological Methods, 5, 411–14.

Lehmann (1950) Some principles of the theory of testing hypotheses. The Annals of Mathematical Statistics, 21, 1–26.

Lehmann (1957a) A theory of some multiple decision problems. I. The Annals of Mathematical Statistics, 28, 1–25.

Lehmann (1957b) A theory of some multiple decision problems. II. The Annals of Mathematical Statistics, 28, 547–72.

Lewis and Thayer (2004) A loss function related to the FDR for random effects multiple comparisons. Journal of Statistical Planning and Inference, 125, 49–58.

Perlman and Wu (1999) The emperor’s new tests. Statistical Science, 14, 355–69.

Robbins (1951) Asymptotically subminimax solutions of compound statistical decision problems. Proceedings of Second Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press, pp. 131–48.

Sarkar and Zhou (2008) Controlling directional Bayesian false discovery rate in random effects model. Journal of Statistical Planning and Inference, 138, 682–93.

Shaffer (1999) A semi-Bayesian study of Duncan’s Bayesian multiple comparison procedure. Journal of Statistical Planning and Inference, 82, 197–213.

Shaffer (2002) Multiplicity, directional (Type III) errors and the null hypothesis. Psychological Methods, 7, 356–69.

Sun and Cai (2007) Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association, 102, 901–12.

Waller and Duncan (1969) A Bayes rule for symmetric multiple comparisons problems. Journal of the American Statistical Association, 64, 1484–1503.

Williams, Jones and Tukey (1999) Controlling error in multiple comparisons, with examples from state-to-state differences in educational achievement. Journal of Educational and Behavioral Statistics, 24, 42–69.