Abstract
A number of authors have considered the problem of making multiple comparisons among level-one parameters in multilevel models. This is a setting in which Bayesian procedures have a natural sampling theory interpretation, and where a natural justification for methods that control a directional version of the false discovery rate may be found. However, a basic desirable characteristic of multiple comparison procedures, namely that they should be more conservative than corresponding ‘per-comparison’ procedures, appears to be violated by some optimal procedures that have been developed in a multilevel setting. This concern is illustrated in the context of a very simple multilevel model, namely one-way, random-effects analysis of variance.
1 Intuitions about multiple testing
Since the earliest days of their study of the testing of multiple hypotheses, many statisticians have had the strong feeling that multiple tests need to be more conservative than individual tests, in the sense of identifying fewer null hypotheses as false. Phrased in terms of error rates, the feeling has been that controlling a per-comparison Type I error rate for a set of comparisons is not sufficient, and that it is more appropriate to control a familywise or a per-family error rate (see, for instance, Hochberg and Tamhane, 1987). More recently, interest has focused on controlling the false discovery rate (FDR; see Benjamini and Hochberg, 1995) for a set of comparisons. Procedures that control the FDR are more conservative than those that control per-comparison rates, but less conservative than ones that control per-family rates.
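The ordering of conservativeness described above can be illustrated with a small sketch. The p-values below are made up for illustration only, and the function names are hypothetical. The Bonferroni rule stands in for familywise control and the Benjamini-Hochberg step-up rule for FDR control; both are standard procedures, but they are not the procedures studied in this paper.

```python
def per_comparison(pvals, alpha):
    """Reject every hypothesis whose p-value is at most alpha (no adjustment)."""
    return [p <= alpha for p in pvals]


def bonferroni(pvals, alpha):
    """Familywise control: compare each p-value with alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha):
    """FDR control: step-up rule of Benjamini and Hochberg (1995)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose ordered p-value is at most alpha * rank / m
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


# Illustrative (made-up) p-values for eight comparisons.
pvals = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]
alpha = 0.05

counts = {
    "per-comparison": sum(per_comparison(pvals, alpha)),
    "FDR (BH)": sum(benjamini_hochberg(pvals, alpha)),
    "familywise (Bonferroni)": sum(bonferroni(pvals, alpha)),
}
print(counts)  # {'per-comparison': 5, 'FDR (BH)': 3, 'familywise (Bonferroni)': 1}
```

For these values the unadjusted rule rejects the most hypotheses, the Bonferroni rule the fewest, and the FDR-controlling rule falls in between, which is the traditional ordering that the remainder of the paper calls into question for multilevel models.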
While most of the work on multiple comparisons has been restricted to fixed-effect models, a number of researchers have considered the problem of making multiple comparisons for multilevel (random effects) models. Duncan (1965) and Waller and Duncan (1969) adopted a Bayesian decision theory framework in their early work on this problem. Gelman and Tuerlinckx (2000), Lewis and Thayer (2004) and Shaffer (1999) compared Bayesian and sampling theory approaches to comparing multiple random effects. The last two of these papers discussed the FDR in this setting. More recently, Sarkar and Zhou (2008) also considered a Bayesian version of the FDR for comparing random effects. In this paper, we wish to revisit some of this work and highlight optimality results that appear to be undesirable from a traditional perspective. To keep the discussion as simple as possible, we will restrict our attention to a one-way random-effects analysis of variance model with both variance components assumed to be known. However, the issues we raise apply equally to general multilevel models, where the goal is to make multiple comparisons among lower-level parameters (i.e., random effects). All of the results given in this paper may also be found in the earlier papers that have been cited, usually for more general models than the one considered here. The main point of this paper is to review and critically evaluate these multilevel results in the context of traditional multiple comparisons thinking.
2 One-way random-effects ANOVA setup
Suppose we have m groups, with population means
and the second level model (or prior) by
We will treat
For each group j, we have two standard results. The marginal distribution of the sample mean is
The conditional (posterior) distribution of the population mean, given the sample mean, is
with
and
Now consider all pairwise comparisons between group means, taking the form
Marginal and conditional distributions for these comparisons are
and
with
for
3 Decision theory framework for a single comparison
In this section, a decision theoretic approach to multiple comparisons very similar to that proposed by Lehmann (1950, 1957a, 1957b) is adopted. It is important to note that the focus of this approach is on correctly identifying the sign of each comparison, rather than determining if the comparison is zero. This modification to traditional hypothesis testing is discussed by many authors, including Jones and Tukey (2000) and, in the context of multiple comparisons, by Shaffer (2002) and Williams et al. (1999).
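The directional formulation can be made concrete with a minimal sketch of a three-decision rule: declare the comparison positive, declare it negative, or leave its sign undetermined. The function name `declare_sign`, its arguments, and the convention of using the upper alpha/2 critical value of the standard normal distribution are assumptions made for illustration (the usual known-variance, fixed-effects setting); they are not taken from the paper's own notation.

```python
from statistics import NormalDist


def declare_sign(diff, se, alpha=0.05):
    """Three-decision directional rule for a single comparison.

    diff : observed difference between two sample means
    se   : known standard error of that difference
    Returns +1 (declare positive), -1 (declare negative),
    or 0 (sign not determined).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z = diff / se
    if z >= z_crit:
        return 1
    if z <= -z_crit:
        return -1
    return 0


print(declare_sign(2.5, 1.0))   # 1
print(declare_sign(-2.5, 1.0))  # -1
print(declare_sign(1.0, 1.0))   # 0
```

Under this formulation the only error of the first kind is a wrong sign declaration; a return value of 0 corresponds to declining to declare a sign, not to accepting that the comparison is zero.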
First, consider a single pairwise comparison. For this pairwise comparison
Next define two components that will be used to construct loss functions:
(used to indicate wrong sign declarations) and
(used to indicate signs not determined).
Consider a loss function for declaring the sign of
Here the subscript ‘PC’ refers to ‘per-comparison’.
Bayesian decision theory identifies the optimal decision rule for this loss function,
if
It then follows that
If
Therefore, the Bayes rule declares the sign of
Since the posterior expected loss for
Consequently,
This last expectation is the (random-effects) probability of incorrectly declaring the sign of
In what follows, it will be useful to have an explicit expression for
For the usual fixed-effects per-comparison test for declaring the sign of
so
Since
for all
The conclusion of the analysis so far is that the random-effects per-comparison rule and the fixed-effects rule both control the random-effects per-comparison wrong sign declaration rate for
4 Decision theory for multiple comparisons
The definition of the per-comparison loss function may be extended to the complete set of comparisons:
with
This new loss function equals the proportion of comparisons whose signs are incorrectly declared using
To continue, it will be useful to introduce a family of optimal action vectors
Define
Define
Using this notation, it is straightforward to show that the Bayes rule for the loss function
where
with
The posterior expected loss for
Since
the posterior expected loss for the Bayesian decision function
Consequently,
This last expectation is the random-effects per-comparison wrong sign rate for
Rewriting the bound on the posterior expected loss for
where, for simplicity of notation, we define
Consequently, we may write
so
This last inequality in turn implies that the joint expectation
This quantity (evaluated for any decision rule
The above result, that
Sarkar and Zhou (2008) propose another decision rule (here labelled
It is straightforward to show that the decision rule
Note that
Not only does
5 Too much power?
To summarize, in a multilevel model like random-effects ANOVA, Bayesian ideas have sampling interpretations. In particular, we may define a Bayesian (or random-effects) version of the FDR: the expected (over both levels) proportion of declared signs for a set of comparisons that are incorrectly declared. However, a Bayesian, random-effects per-comparison decision rule turns out to provide control of this FDR, even though it was only designed to minimize the Bayes risk for a per-comparison loss function. This result does not parallel the relationship between controlling the FDR and the per-comparison error rate in a fixed-effects setting and, as such, might be considered counterintuitive.
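The claim about FDR control can be checked informally by simulation. The following is a minimal sketch, not the paper's own computation. It assumes the standard normal-normal form of the one-way model with known variance components (group means drawn from a normal prior, sample means normal around them), a balanced design, and a per-comparison rule that declares a sign whenever the posterior probability of the opposite sign is at most alpha/2. The function name, parameter names, default values, and the alpha/2 threshold are all illustrative assumptions.

```python
import random
from itertools import combinations
from statistics import NormalDist


def simulate_directional_fdr(m=10, n=5, sigma2=1.0, tau2=0.5, mu=0.0,
                             alpha=0.05, reps=2000, seed=0):
    """Monte Carlo estimate of the random-effects (Bayesian) directional FDR
    of a per-comparison Bayes rule, averaging over both levels of the model."""
    rng = random.Random(seed)
    se2 = sigma2 / n                  # sampling variance of a group sample mean
    rho = tau2 / (tau2 + se2)         # shrinkage factor toward the prior mean
    post_sd = (2 * rho * se2) ** 0.5  # posterior sd of a pairwise difference
    std = NormalDist()
    fdp_sum = 0.0
    for _ in range(reps):
        # Level two: population means; level one: sample means.
        theta = [rng.gauss(mu, tau2 ** 0.5) for _ in range(m)]
        ybar = [rng.gauss(t, se2 ** 0.5) for t in theta]
        declared = wrong = 0
        for j, k in combinations(range(m), 2):
            post_mean = rho * (ybar[j] - ybar[k])
            p_neg = std.cdf(-post_mean / post_sd)  # P(difference < 0 | data)
            p_wrong = min(p_neg, 1 - p_neg)        # posterior wrong-sign prob.
            if p_wrong <= alpha / 2:               # assumed declaration rule
                declared += 1
                sign = 1 if post_mean > 0 else -1
                if sign * (theta[j] - theta[k]) < 0:
                    wrong += 1
        # Proportion of declared signs that are wrong (0 if none declared).
        fdp_sum += wrong / declared if declared else 0.0
    return fdp_sum / reps


fdr_hat = simulate_directional_fdr()
print(f"estimated Bayesian directional FDR: {fdr_hat:.4f}")
```

Under these assumptions every declared comparison has a posterior wrong-sign probability of at most alpha/2, so the posterior expected proportion of wrong declarations, and hence its expectation over both levels, cannot exceed alpha/2. The estimate printed by the sketch should therefore sit at or below that level, up to Monte Carlo error, even though no multiplicity adjustment was applied.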
A rule designed to control this FDR may have more power than a fixed-effects per-comparison rule. Again, this result does not parallel the FDR/per-comparison relationship for fixed-effects designs, and might also be considered counterintuitive. As interesting as the problem of multiple comparisons for random effects may be, it is clear that some additional thought is required before we proceed to the routine application of Bayesian and semi-Bayesian ideas in such settings.
There may be some analogies between these results and those discussed in Perlman and Wu (1999), who give another example of undesirable optimality. In addition, some perspective on what has been found here may be provided by Sun and Cai (2007), who build on the results of Robbins (1951), highlighting the complexity of compound decision rules.
