Approximations of the power functions for Wald, likelihood ratio, and score tests and their applications to linear and logistic regressions

Abstract

Traditionally, asymptotic tests are studied and applied under local alternatives. There exists a widespread opinion that the Wald, likelihood ratio, and score tests are asymptotically equivalent. We dispel this myth by showing that these tests have different statistical power in the presence of nuisance parameters. The local properties of the tests are described in terms of the first and second derivatives of the power function evaluated at the null hypothesis. The comparison of the tests is illustrated with two popular regression models: linear regression with a random predictor and logistic regression with a binary covariate. We study the aberrant behavior of the tests when the distance between the null and the alternative does not vanish with the sample size. We demonstrate that these tests have different asymptotic power. In particular, the score test is generally asymptotically biased but slightly superior for linear regression in a close neighborhood of the null. The power approximations are confirmed through simulations.

Keywords

Effective sample size; GLM; linear regression; logistic regression; local alternative; sample size determination

1. Introduction

Uniformly most powerful tests exist only in rare statistical models; usually they exist for linear models with fixed/nonrandom predictors and normal distribution (Aivazian et al., 1985; Lehmann & Romano, 2005). On the other hand, asymptotic tests, such as the Wald, likelihood ratio, and score tests, can be applied to a much wider variety of statistical distributions and models when the sample size, $n$, increases to infinity. For a short review of these tests we refer the reader to a recent book by the author (Demidenko, 2020). There exists a widespread opinion that the three tests are asymptotically optimal and equivalent for large $n$, as stated by Rayner (1997) and Young and Smith (2005), among many others. Because of this opinion, little attention is paid to which test is used to determine the desired sample size.

Usually, the treatment of asymptotic tests is reduced to the analysis with local alternatives, or in the terminology of Cox and Hinkley (1974), contiguous alternatives, based on the concept of contiguity (Lehmann & Romano, 2005; Casella & Berger, 1990; Shao, 2010). It is a textbook result that the Wald and likelihood ratio tests are equivalent for alternatives at the distance of $O(n^{-1/2})$ from the null value. The following warning on page 156 from a book by Robert Serfling (1980) is the impetus to the present work: “Therefore, under appropriate regularity conditions, the statistics $\lambda_{n}$ , $W_{n}$ and $V_{n}$ are asymptotically equivalent in distribution, both under the null hypothesis and under local alternatives converging sufficiently fast. However, at fixed alternatives these equivalences are not anticipated to hold.” (Here, $\lambda_{n}$ , $W_{n}$ and $V_{n}$ are likelihood ratio, Wald and score statistics, respectively.) Moreover, we argue that the study with contiguous alternatives is not appealing from the practical point of view. For example, it’s not applicable to the sample size determination where the alternative is fixed.

The goal of the present work is to analyze the two-sided tests in the presence of nuisance parameters using the power function, with the emphasis on global (or fixed) alternatives. The local properties, in turn, are expressed in terms of the first and second derivatives of the power function evaluated at the null value of the parameter. The tests are then illustrated with linear regression with a normally distributed predictor and unknown variance, and with logistic regression with a Bernoulli covariate, for which the asymptotic power function is derived in closed form.

It is natural to expect that the rate of rejecting the null hypothesis increases as the alternative gets farther from the null. However, Hauck and Donner (1977) and Væth (1985) showed that the Wald test may be aberrant in this respect – we investigate the aberrant behavior of the likelihood ratio and score tests as well.

Much effort, with some controversy, has been devoted to the comparison of the tests using the method of contiguous alternatives. In particular, several claims have been made that Rao’s score test is superior to the Wald and likelihood ratio tests (Chandra & Joshi, 1983; Chandra & Mukerjee, 1984). We confirm this claim for a linear model with random predictors in close proximity to the null. Otherwise, these tests are different and it is impossible to claim an overall champion. Moreover, the score test is usually biased.

An important distinction between the approaches taken by previous authors and the present work is the assumption on regressors/predictors: many studies assume fixed regressors, but here the regressors are assumed random. For example, Shieh (2005), Kulinskaya et al. (2008), and Lemonte and Ferrari (2012) work under the assumption that regressors $x_{1},\ldots,x_{p}$ are fixed/nonrandom, but in the present work they are observed along with the dependent variable. There are fundamental and practical implications of the difference in these assumptions. Fundamentally, asymptotic properties of maximum likelihood are not applicable to regression models with fixed regressors because observations are not identically distributed. For example, if regressors in the linear model are fixed, the OLS estimator is normally distributed, but if regressors are random, the distribution of the OLS estimator is not normal; see details in Demidenko (2020, pp. 665–669). Practically, random regressors are more convenient to deal with when the power function is used for sample size determination because their distribution is easy to specify. Many authors fail to recognize the difference between the two schemes. Although for a large sample size and a close alternative the difference may not be visible, the power functions and the respective conclusions on the comparison of the tests may not be the same for specific members of the GLM family. More relevant discussion is found in the last section.

The organization of the paper is as follows. After introducing notation and definitions, we derive the power function for the Wald, likelihood ratio, and score tests in each of the following sections and illustrate it with linear and logistic regression models. The three tests are compared in terms of power.

2. Notation and definitions

Throughout the paper, we deal with iid observations $\{\bm{z}_{i},i=1,\ldots,n\}$ having common density $f=f(\bm{z};\bm{\theta})$ dependent on a vector parameter $\bm{\theta}=(\beta,\bm{\gamma})$ ; formally, $\theta_{1}=\beta$ . The first component, $\beta$ , is treated as the parameter of interest and $\bm{\gamma}$ is treated as a $p$ -dimensional vector of nuisance parameters (Pawitan, 2001). The null hypothesis is composite, $H_{0}:\beta=\beta_{0}$ with the two-sided alternative, $H_{A}:\beta\neq\beta_{0}$ . For expository purposes, we shall assume that $\beta_{0}=0$ , so that the null hypothesis takes the form

$\displaystyle H_{0}:\beta=0.$ (1)

If $T=T(\bm{z}_{1},\ldots,\bm{z}_{n})$ is a test statistic such that the null is rejected when $T>c$ , the power function for Eq. (1) can be expressed as a function of the alternative,

$\displaystyle P(\beta;\bm{\gamma})=\Pr(T>c|\bm{\theta}).$ (2)

Here, $P(0;\bm{\gamma})=\alpha$ is the significance level or the size of the test (typically, $\alpha=0.05$) and $|\bm{\theta}$ means that the probability Eq. (2) is computed under the assumption that the true parameter is $\bm{\theta}$. We vary $\beta$ but fix $\bm{\gamma}$, so we treat Eq. (2) as a function of $\beta$. In this paper, the power function Eq. (2) is computed, or more precisely, approximated, for large $n$. This, fortunately, implies that $P(0)$ does not depend on nuisance parameters. We assume that the necessary regularity conditions for asymptotic maximum likelihood estimation and hypothesis testing are fulfilled: for example, the support of $f$ does not depend on the parameter, differentiation under the expectation is valid, and the power function is twice differentiable (Casella & Berger, 1990; Schervish, 1995; Bickel & Doksum, 2001).

The log-likelihood function is

$\displaystyle l(\beta,\bm{\gamma})=\sum_{i=1}^{n}\ln f(\bm{z}_{i};\beta,\bm{\gamma}).$ (3)

The maximum likelihood estimator (MLE), $(\hat{\beta}_{\textit{ML}},\hat{\bm{\gamma}}_{\textit{ML}})$ , is the solution of $1+p$ score equations

$\displaystyle\frac{\partial l(\beta,\bm{\gamma})}{\partial\beta}=0,\quad\frac{\partial l(\beta,\bm{\gamma})}{\partial\bm{\gamma}}=\bm{0}.$ (4)

The Fisher expected information matrix is defined as

$\displaystyle\left[\begin{array}{cc}I_{11}&\bm{I}_{12}^{\prime}\\ \bm{I}_{12}&\bm{I}_{22}\end{array}\right]=-E\left[\begin{array}{cc}{\displaystyle\frac{\partial^{2}\ln f}{\partial\beta^{2}}}&{\displaystyle\frac{\partial^{2}\ln f}{\partial\beta\partial\bm{\gamma}}}\\ {\displaystyle\frac{\partial^{2}\ln f}{\partial\bm{\gamma}\partial\beta}}&{\displaystyle\frac{\partial^{2}\ln f}{\partial\bm{\gamma}^{2}}}\end{array}\right].$ (5)

(As a part of regularity conditions we assume that the information matrix is nonsingular.) Asymptotically, the variance of $\hat{\beta}_{\textit{ML}}$ can be derived from the $(1,1)$ th element of the inverse matrix,

$\displaystyle\text{var}(\hat{\beta}_{\textit{ML}})=\frac{1}{n}V(\beta),$ (6)

where

$\displaystyle V(\beta)=\frac{1}{I_{11}-\bm{I}_{12}^{\prime}\bm{I}_{22}^{-1}\bm{I}_{12}}$ (7)

is the variance function evaluated at the alternative, $\beta$ . In this formula, $V$ is evaluated at the true value of the nuisance parameter, $\bm{\gamma}$ . When studying the likelihood ratio and score tests, we shall evaluate $V$ at other values which will be indicated as $V(\beta|\bm{\gamma}_{0})$ . Sometimes, it is more convenient (especially for the Wald test) to deal with the standard deviation (SD) function, $\sigma_{\beta}=\sqrt{V(\beta)}$ .

Now we describe the local properties of the tests in terms of the first and second derivatives of the power function evaluated at the null, $\beta=0;$ see Rao (1973, p. 454). First, since we require the tests to have size $\alpha$ , we assume

$\displaystyle P(\beta)|_{\beta=0}=\alpha.$ (8)

Second, we want the test to be asymptotically (locally) unbiased, which means that the first derivative at the null is zero,

$\displaystyle P^{\prime}(0)=\frac{dP}{d\beta}\bigg{|}_{\beta=0}=0.$ (9)

Third, the local power for an unbiased test is determined by the second derivative,

$\displaystyle P^{\prime\prime}(0)=\frac{d^{2}P}{d\beta^{2}}\bigg{|}_{\beta=0}.$ (10)

The higher the value of the second derivative, the more powerful the test is for local alternatives. In fact, we can approximate the power as $P(\beta)=\alpha+\frac{1}{2}P^{\prime\prime}(0)\beta^{2}$ in the neighborhood of the null. Finally, the test is consistent if $\lim_{n\rightarrow\infty}P(\beta)=1$ for all $\beta\neq 0$ .

Now we derive the power functions for the three tests and illustrate them with popular statistical models: linear and logistic regressions.

3. Wald test

The Wald test is based on the fact that under the null and large $n$ , the ratio of the MLE to its standard error, sometimes called the $Z$ -score, is normally distributed,

$\displaystyle\frac{\hat{\beta}_{\textit{ML}}}{\textit{SE}(\hat{\beta}_{\textit{ML}})}\sim\mathcal{N}(0,1)$

for large $n$. The test statistic is the absolute value of the $Z$-score with $c=Z_{1-\alpha/2}$, the $(1-\alpha/2)$ quantile of the standard normal distribution, namely, $Z_{1-\alpha/2}=\Phi^{-1}(1-\alpha/2)$, where $\Phi$ is the standard normal cumulative distribution function. The power function of the Wald test in large samples can be approximated as follows:

$\displaystyle P_{W}(\beta)\simeq\Phi\left(-Z_{1-\alpha/2}+\frac{\beta}{\sigma_{\beta}}\sqrt{n}\right)+\Phi\left(-Z_{1-\alpha/2}-\frac{\beta}{\sigma_{\beta}}\sqrt{n}\right),$ (11)

where the sign $\simeq$ denotes approximation for large $n$. The validity of the approximation follows from Slutsky’s theorem, and uniform convergence follows from continuity of derivatives (see details in Bickel & Doksum, 2001; Demidenko, 2004, p. 644). Naturally, the larger the $n$, the better the approximation.
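As an illustration, approximation Eq. (11) can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `wald_power` and its argument names are our own choices, and the SD function value $\sigma_{\beta}$ is assumed to be supplied by the user.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # cdf is Phi, inv_cdf is the quantile function


def wald_power(beta, sigma_beta, n, alpha=0.05):
    """Large-sample Wald power, Eq. (11).

    sigma_beta is the SD function sqrt(V(beta)) evaluated at the
    alternative beta; computing it is left to the caller.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)   # Z_{1-alpha/2}
    shift = beta / sigma_beta * sqrt(n)          # (beta / sigma_beta) * sqrt(n)
    return STD_NORMAL.cdf(-z_crit + shift) + STD_NORMAL.cdf(-z_crit - shift)
```

At $\beta=0$ the two terms each equal $\alpha/2$, so the function returns the size of the test, in agreement with Eq. (8).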

Property Eq. (8) holds for the Wald power. It is easy to see that Wald power is symmetric about the null, $\beta=0$ . Consequently, the Wald power is asymptotically unbiased, which can be verified directly by evaluating the derivative at zero,

$\displaystyle\frac{dP_{W}}{d\beta}\bigg{|}_{\beta=0}=\left[\phi(-Z_{1-\alpha/2})\frac{\sqrt{n}}{\sigma_{0}}-\phi(-Z_{1-\alpha/2})\frac{\sqrt{n}}{\sigma_{0}}\right]\times\frac{dV}{d\beta}\bigg{|}_{\beta=0}=0,$

where $\phi=\Phi^{\prime}$ , the standard normal density. The test is consistent (or better to say asymptotically consistent) because $\lim_{n\rightarrow\infty}P_{W}(\beta)=1$ if $\beta\neq 0$ .

The second derivative of the power function Eq. (11) at the null takes the form

$\displaystyle\frac{d^{2}P}{d\beta^{2}}\bigg{|}_{\beta=0}=\frac{2n}{V(0)}\phi(Z_{1-\alpha/2})Z_{1-\alpha/2},$ (12)

so approximately in the neighborhood of zero $P_{W}(\beta)\simeq\alpha+n\phi(Z_{1-\alpha/2})Z_{1-\alpha/2}\beta^{2}/V(0)$ .
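Formula Eq. (12) can be checked numerically against a finite-difference second derivative of Eq. (11). The sketch below is only a sanity check under a simplifying assumption: the variance function is held constant at $V(0)$, which does not change the second derivative at the null because the first derivative of the power vanishes there. All function names are ours.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def wald_power_const_var(beta, v0, n, alpha=0.05):
    """Eq. (11) with the variance function fixed at V(0) = v0 (assumption)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = beta / sqrt(v0) * sqrt(n)
    return STD_NORMAL.cdf(-z_crit + shift) + STD_NORMAL.cdf(-z_crit - shift)


def second_derivative_formula(v0, n, alpha=0.05):
    """Right-hand side of Eq. (12)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2.0 * n / v0 * STD_NORMAL.pdf(z_crit) * z_crit


def second_derivative_numeric(v0, n, alpha=0.05, h=1e-3):
    """Central second difference of the power at beta = 0."""
    def power(b):
        return wald_power_const_var(b, v0, n, alpha)
    return (power(h) - 2.0 * power(0.0) + power(-h)) / h ** 2
```

For any admissible `v0` and `n`, the two derivative functions should agree up to the discretization error of the finite difference.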

As follows from Eq. (11), the Wald power simplifies under $z$ -parametrization,

$\displaystyle P_{W}(z)=\Phi(-Z_{1-\alpha/2}+z\sqrt{n})+\Phi(-Z_{1-\alpha/2}-z\sqrt{n}),$ (13)

where

$\displaystyle z=\frac{\beta}{\sigma_{\beta}}$ (14)

can be treated as a theoretical counterpart of the $Z$-score statistic. We refer to Eq. (14) as the $z$-ratio at the alternative, $\beta$. In applications, the $z$-ratio is called the effect size (Cohen, 1994). It is easy to show that the Wald power is an increasing function of $z$ for $\beta>0$ and a decreasing function of $z$ for $\beta<0$. In other words, the larger the absolute value of the $z$-ratio, the greater the power. The $z$-parametrization reappears in the power functions of the likelihood ratio and score tests for a linear model.

3.1 Linear regression

We illustrate the calculation of the Wald power with linear regression under normal distribution. Note that typically, the power analysis is conducted for the case when predictor $x$ is fixed/nonrandom. We, however, develop the power analysis for a more complicated model in which $x$ is random, or more precisely, normally distributed. Thus, vector $\bm{z}$ has a multivariate normal distribution; the first component is the dependent variable, $y$; the second component is the variable of interest, $x$; the rest are covariates combined in vector $\bm{u}$. We test the significance of the coefficient, $\beta$, at the variable of interest, $x$. Without loss of generality, we can assume that the means of all regression variables are zero and the marginal variances and covariances are known:

$\displaystyle\left[\begin{array}{c}x\\ \bm{u}\end{array}\right]\sim\mathcal{N}\left(\left[\begin{array}{c}0\\ \bm{0}\end{array}\right],\left[\begin{array}{cc}\sigma_{x}^{2}&\bm{\sigma}_{xu}^{\prime}\\ \bm{\sigma}_{xu}&\bm{\Sigma}_{\rm u}\end{array}\right]\right).$

The linear regression is specified through conditional mean as

$\displaystyle E(y|x,\bm{u})=\beta x+\bm{\tau}^{\prime}\bm{u},$ (15)

where $\text{var}(y|x,\bm{u})=\sigma^{2}$ . The conditional (error) variance, $\sigma^{2}$ , is unknown and subject to estimation along with the vector of regression coefficients. Linear regression is used as an example of hypothesis testing in many texts, but $\sigma^{2}$ is assumed known (Cox & Hinkley, 1974). A more realistic set up with unknown variance brings up several surprises, as we shall see later.

Since the distribution is normal and marginal variance-covariance parameters are known, the log density of $\bm{z}$ , up to a constant, is given by

$\displaystyle\ln f(\bm{z};\beta,\bm{\gamma})=-\frac{1}{2\sigma^{2}}(y-\beta x-\bm{\tau}^{\prime}\bm{u})^{2}-\frac{1}{2}\ln\sigma^{2},$ (16)

where $\bm{\gamma}=(\bm{\tau},\sigma^{2})$. The information matrix takes the form

$\displaystyle\frac{1}{\sigma^{2}}\left[\begin{array}{ccc}\sigma_{x}^{2}&\bm{\sigma}_{xu}^{\prime}&0\\ \bm{\sigma}_{xu}&\bm{\Sigma}_{\rm u}&0\\ 0&0&\frac{1}{2}\sigma^{-2}\end{array}\right],$

so that the Wald power is given by Eq. (11) with

$\displaystyle\sigma_{\beta}=\frac{\sigma}{\sqrt{\sigma_{x}^{2}-\bm{\sigma}_{xu}^{\prime}\bm{\Sigma}_{\rm u}^{-1}\bm{\sigma}_{xu}}}.$ (17)

The Wald power function has a familiar V-shape, and it monotonically increases with $|\beta|$ and approaches 1 – no surprises so far.
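To make Eq. (17) concrete, the sketch below evaluates $\sigma_{\beta}$ in the special case of a single covariate $u$, so that $\bm{\sigma}_{xu}$ and $\bm{\Sigma}_{\rm u}$ reduce to scalars. The scalar restriction and the function name are our assumptions, made only to keep the example free of matrix libraries.

```python
from math import sqrt


def sigma_beta_linear(sigma, var_x, cov_xu, var_u):
    """SD function of Eq. (17) for one scalar covariate u (assumption).

    sigma   : conditional SD of y given (x, u)
    var_x   : marginal variance of x
    cov_xu  : covariance between x and u
    var_u   : marginal variance of u
    """
    # Variance of x left unexplained by u: sigma_x^2 - sigma_xu^2 / sigma_u^2
    residual_var_x = var_x - cov_xu ** 2 / var_u
    if residual_var_x <= 0:
        raise ValueError("x must not be a linear function of u")
    return sigma / sqrt(residual_var_x)
```

The value returned can be passed as $\sigma_{\beta}$ to Eq. (11). Correlation between $x$ and $u$ reduces the residual variance of $x$ and therefore inflates $\sigma_{\beta}$, which lowers the Wald power for the same $\beta$ and $n$.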

3.2 Logistic regression with binary covariate

In general regression problems, the distribution of $\bm{z}$ is defined through conditional distribution as $f(\bm{z};\bm{\theta})=f(y|\bm{x};\bm{\theta})f(\bm{x};\bm{\theta})$ , where $y$ is treated as the dependent variable and $\bm{x}$ as the vector of covariates. As in the previous example, statistical hypothesis is concerned with testing of the slope coefficient, $\beta$ .

Here, we illustrate the Wald test with logistic regression and binary covariate. This means that both $y$ and $x$ are binary. The probability of the dependent variable is defined through conditional probability as

$\displaystyle\Pr(y=1|x)=\frac{e^{\gamma+\beta x}}{1+e^{\gamma+\beta x}}.$

The marginal probability of $x$ is specified as $\Pr(x=1)=p_{x}$ , which can be assumed known without loss of generality. Logistic regression is a member of the family of generalized linear models (GLM, Bickel & Doksum, 2001). After elementary calculations, the information matrix takes the form

$\displaystyle\left[\begin{array}{cc}{\displaystyle\frac{GBp_{x}}{(1+GB)^{2}}}+{\displaystyle\frac{G(1-p_{x})}{(1+G)^{2}}}&{\displaystyle\frac{GBp_{x}}{(1+GB)^{2}}}\\ {\displaystyle\frac{GBp_{x}}{(1+GB)^{2}}}&{\displaystyle\frac{GBp_{x}}{(1+GB)^{2}}}\end{array}\right],$

where to shorten the notation, we let $G=e^{\gamma}$ and $B=e^{\beta}$ . Applying formula Eq. (7), we derive the variance function:

$\displaystyle V(\beta)=\frac{p_{x}(1+G)^{2}B+(1-p_{x})(1+GB)^{2}}{p_{x}(1-p_{x})GB}.$ (18)
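The closed form Eq. (18) can be cross-checked against Eq. (7) applied to the information entries. In the sketch below the entries are indexed as in Eq. (5), with index 1 for $\beta$ and index 2 for $\gamma$; the function names are ours.

```python
from math import exp


def logistic_information(beta, gamma, p_x):
    """Per-observation information entries (I11, I12, I22).

    Index 1 refers to beta and index 2 to gamma, following Eq. (5).
    """
    G, B = exp(gamma), exp(beta)
    a = G * B * p_x / (1 + G * B) ** 2      # contribution of x = 1
    b = G * (1 - p_x) / (1 + G) ** 2        # contribution of x = 0
    return a, a, a + b


def variance_from_information(i11, i12, i22):
    """Eq. (7) for a scalar nuisance parameter."""
    return 1.0 / (i11 - i12 ** 2 / i22)


def logistic_variance(beta, gamma, p_x):
    """Closed-form variance function, Eq. (18)."""
    G, B = exp(gamma), exp(beta)
    numerator = p_x * (1 + G) ** 2 * B + (1 - p_x) * (1 + G * B) ** 2
    return numerator / (p_x * (1 - p_x) * G * B)
```

Both routes should return the same value of $V(\beta)$ for any $\beta$, $\gamma$, and $0<p_{x}<1$.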

For a small alternative, the Wald power can be approximated quadratically as

$\displaystyle P_{W}(\beta)\simeq\alpha+\frac{p_{x}(1-p_{x})G}{(1+G)^{2}}n\phi(Z_{1-\alpha/2})Z_{1-\alpha/2}\beta^{2}.$ (19)

As follows from this formula, maximum power for small alternatives occurs for a symmetrically distributed covariate, $p_{x}=1/2$. Using elementary calculus, one can show that a symmetric distribution of the dependent variable also leads to maximum power, $G=1$ and $P(y=1)=1/2$. Combining these facts, we obtain an upper bound for the Wald power, $P_{W}(\beta)\leqslant\alpha+(n/16)\phi(Z_{1-\alpha/2})Z_{1-\alpha/2}\beta^{2}$. With a popular choice, $\alpha=0.05$, we have $\phi(Z_{1-\alpha/2})Z_{1-\alpha/2}\approx 0.1146$, so that $P_{W}(\beta)\leqslant 0.05+0.0072n\beta^{2}$. For example, to detect $\beta=1$ with power $0.8$, one needs at least $n=105$ observations.

The trouble with the Wald power Eq. (11) begins when it is computed for large alternatives, because as follows from Eq. (18), the $z$-ratio Eq. (14) vanishes at infinity, $\lim_{\beta\rightarrow\pm\infty}z=0$. The fact that the Wald power for logistic regression may fall for large alternatives was noticed by Hauck and Donner (1977); they termed it aberrant behavior. This negative property of the Wald test was later studied in a broader context of a family of exponential distributions by Væth (1985), and for a linear model by Fears et al. (1996). For large alternatives the Wald power decreases to the size of the test; see Fig. 1 for a geometrical illustration. The power function is computed for $\alpha=5\%$, $\Pr(y=1|\beta=0)=G/(1+G)=0.25$, and $p_{x}=\Pr(x=1)=0.2$ with $n=100$. The quadratic approximation Eq. (19) is valid in a close neighborhood of the origin. The Wald power falls at $\beta=-2.4$ and beyond this point decreases back to $\alpha$ when $\beta\rightarrow-\infty$. We show four simulation results for alternatives $\beta=-3$, $-2$, $1/2$, and $1$ (the number of experiments equals 5,000). For large negative values of $\beta$, the power approximation Eq. (11) is not valid because $n=100$ is too small. To account for the change in the variance function at the alternative versus the variance at the null, we introduce the effective sample size,

$\displaystyle n_{\rm e}=\frac{V(0)}{V(\beta)}n.$ (20)

The effective sample size can be interpreted as the sample size modification for detection of a large alternative versus a local alternative due to the change of variance. For example, the effective sample size to detect alternative $\beta=-2$ with $n=100$ , as follows from Eq. (18), is

$\displaystyle n_{\rm e}=\frac{B(1+G)^{2}}{p_{x}(1+G)^{2}B+(1-p_{x})(1+GB)^{2}}n=32.$

The effective sample size, one third the size of the original, explains why simulations at $\beta=-2$ in Fig. 1 do not quite match the power: $n_{\rm e}=32$ is simply not large enough for the Central Limit Theorem to give a good approximation.

Figure 1.

The Wald power and its quadratic approximation for logistic regression with a binary covariate, $\alpha=0.05$, $\Pr(y=1|\beta=0)=0.25$, $\Pr(x=1)=0.2$, and $n=100$. The Wald power falls at $\beta=-2.4$. Simulations ($\blacktriangle$) match the power approximation well, especially for positive alternatives ($N_{\exp}=5{,}000$).

As follows from the $z$-parametrization Eq. (13), the value of the alternative at which the monotonicity of the Wald power breaks down satisfies the equation

$\displaystyle\beta\frac{d\ln\sigma_{\beta}}{d\beta}=1,$ (21)

which will be referred to as the Wald break-down equation. In Fig. 2, we show the region within which the Wald power behaves well, meaning that it increases for $\beta>0$ and decreases for $\beta<0$ . The boundary of the shaded region is where its derivative turns zero as defined by Eq. (21). Roughly, one could say that the Wald power for logistic regression is well-behaved for alternatives in the interval $(-2,2)$ .
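For given $\gamma$ and $p_{x}$, the break-down equation Eq. (21) can be solved numerically. The following is a minimal sketch, not the procedure used to produce Fig. 2: it assumes the variance function Eq. (18), approximates the derivative of $\ln\sigma_{\beta}$ by a central difference, and applies plain bisection on a bracket supplied by the user. All names and the step size `h` are our choices.

```python
from math import exp, log, sqrt


def logistic_variance(beta, gamma, p_x):
    """Variance function of Eq. (18)."""
    G, B = exp(gamma), exp(beta)
    numerator = p_x * (1 + G) ** 2 * B + (1 - p_x) * (1 + G * B) ** 2
    return numerator / (p_x * (1 - p_x) * G * B)


def z_ratio(beta, gamma, p_x):
    """z-ratio of Eq. (14) for the logistic model."""
    return beta / sqrt(logistic_variance(beta, gamma, p_x))


def breakdown_residual(beta, gamma, p_x, h=1e-5):
    """Left-hand side of Eq. (21) minus 1 (central-difference derivative)."""
    def ln_sigma(b):
        return 0.5 * log(logistic_variance(b, gamma, p_x))
    slope = (ln_sigma(beta + h) - ln_sigma(beta - h)) / (2 * h)
    return beta * slope - 1.0


def find_breakdown(gamma, p_x, lo, hi, iterations=60):
    """Bisection for a root of Eq. (21) on [lo, hi]; needs a sign change."""
    f_lo = breakdown_residual(lo, gamma, p_x)
    f_hi = breakdown_residual(hi, gamma, p_x)
    if f_lo * f_hi > 0:
        raise ValueError("no sign change on the bracket")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = breakdown_residual(mid, gamma, p_x)
        if f_lo * f_mid <= 0:
            hi = mid                 # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid    # root lies in [mid, hi]
    return 0.5 * (lo + hi)
```

A call such as `find_breakdown(gamma, p_x, -6.0, -0.1)` searches the negative half-line; the bracket must be chosen so that the residual changes sign, and the returned root is the point where $|z|$ stops growing.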

Figure 2.

The power function is increasing for positive alternatives and decreasing for negative alternatives for $(\beta,p_{x})$ within the gray region. Approximately, the Wald power is an increasing function of the distance from the null if $|\beta|\leqslant 2$.

4. Likelihood ratio test

The test statistic of the likelihood ratio (LR) test is

$\displaystyle T=2\left[\max_{\beta,\bm{\gamma}}l(\beta,\bm{\gamma})-\max_{\bm{\gamma}}l(0,\bm{\gamma})\right],$

which under the null has $\chi^{2}$ -distribution with one degree of freedom, $\chi^{2}(1)$ . Hence, according to the likelihood ratio test, we reject the null if $T>q_{1-\alpha}$ , where $q_{1-\alpha}$ is the $(1-\alpha)$ th quantile. Under the alternative, the maximizer of $l(0,\bm{\gamma})$ , which is referred to as a profile nuisance parameter, $\hat{\bm{\gamma}}_{0}$ , yields a biased estimator of $\bm{\gamma}$ . To derive the power function, we represent the rejection probability in the following way (Self et al., 1992),

$\displaystyle\Pr(T>q_{1-\alpha})=\Pr\left\{2\left[(\max_{\beta,\bm{\gamma}}l(\beta,\bm{\gamma})-l(\beta,\bm{\gamma}))+(l(\beta,\bm{\gamma})-\max_{\bm{\gamma}}l(0,\bm{\gamma}))\right]>q_{1-\alpha}\right\}.$

Using the biased estimation equation theory, one can prove that the maximizer of $l(0,\bm{\gamma})$ converges to $\bm{\gamma}_{0}=\bm{\gamma}_{0}(\beta,\bm{\gamma})$ , which is the solution to the equation

$\displaystyle E_{(\beta,\bm{\gamma})}\left(\frac{\partial\ln f(\bm{z};0,\bm{\gamma}_{0})}{\partial\bm{\gamma}_{0}}\right)=\bm{0}.$ (22)

We refer to $\bm{\gamma}_{0}$ as the limit profile nuisance parameter. Since

$\displaystyle\lim_{n\rightarrow\infty}\frac{1}{n}\left(l(\beta,\bm{\gamma})-\max_{\bm{\gamma}}l(0,\bm{\gamma})\right)=E_{(\beta,\bm{\gamma})}(\ln f(\bm{z};\beta,\bm{\gamma})-\ln f(\bm{z};0,\bm{\gamma}_{0}))$

with probability 1, the distribution of $T$ under the alternative is $\chi^{2}(1)$ with the noncentrality parameter $2n\eta(\beta)$ , where

$\displaystyle\eta(\beta)=E_{(\beta,\bm{\gamma})}(\ln f(\bm{z};\beta,\bm{\gamma})-\ln f(\bm{z};0,\bm{\gamma}_{0}))$ (23)

will be referred to as the $\eta$ -function. Note that the $\eta$ -function is nonnegative, specifically, $\eta(\beta)>0$ for all $\bm{\gamma}\neq\bm{\gamma}_{0}$ . Notice that for a local alternative ( $\beta\rightarrow 0$ ), we have $\bm{\gamma}_{0}\rightarrow\bm{\gamma}$ and $\eta(\beta)\rightarrow 0$ . The noncentrality parameter depends on $\beta$ and $\bm{\gamma}$ , but we fix the latter and consider $\eta$ only as a function of $\beta$ .

Finally, the power function of the LR test can be expressed via $\Phi$ as

$\displaystyle P_{\rm LR}(\beta)=\Phi(-Z_{1-\alpha/2}+\sqrt{2n\eta(\beta)})+\Phi(-Z_{1-\alpha/2}-\sqrt{2n\eta(\beta)}).$ (24)

An advantage of expressing the power in terms of $\Phi$ is that one can apply it to a signed LR test, which is useful for one-sided hypotheses (Severini, 2000). Note that the power is completely specified by the function $\eta$.
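The passage from the noncentral $\chi^{2}(1)$ distribution to Eq. (24) rests on two standard facts: a noncentral $\chi^{2}(1)$ variable with noncentrality parameter $\lambda$ is distributed as $(Z+\sqrt{\lambda})^{2}$ with $Z\sim\mathcal{N}(0,1)$, and $q_{1-\alpha}=Z_{1-\alpha/2}^{2}$. A minimal sketch of Eq. (24) follows; the function name is ours, and the value of the $\eta$-function is assumed to be computed separately for the model at hand.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def lr_power(eta, n, alpha=0.05):
    """Large-sample LR power, Eq. (24), for a given value eta = eta(beta)."""
    if eta < 0:
        raise ValueError("the eta-function is nonnegative")
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    root_ncp = sqrt(2.0 * n * eta)   # square root of the noncentrality 2*n*eta
    return STD_NORMAL.cdf(-z_crit + root_ncp) + STD_NORMAL.cdf(-z_crit - root_ncp)
```

With `eta = 0` the function returns $\alpha$, and it increases with `eta`, so any model-specific $\eta$-function can be plugged in without further changes.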

The local equivalence of the Wald and LR tests has been established using the method of local alternatives, $\beta=O(n^{-1/2})$ (Serfling, 1980). We prove this by showing that the second derivatives of the power functions evaluated at $\beta=0$ are the same.

Theorem 2. The Wald and LR tests are asymptotically unbiased (the first derivative of the power function vanishes at the null) and locally equivalent,

$\displaystyle\frac{d^{2}P_{W}}{d\beta^{2}}\bigg{|}_{\beta=0}=\frac{d^{2}P_{\textit{LR}}}{d\beta^{2}}\bigg{|}_{\beta=0}.$

The proof is found in the Appendix.

Note that if the $\eta$ -function Eq. (23) is an increasing function of $\beta$ then the likelihood ratio power increases with distance from the null. Now we derive the power function for the two regressions, as we did for the Wald power.

4.1 Linear regression

We derive the likelihood ratio power for linear regression from Section 3.1. The regression is specified by Eq. (15) with the log density Eq. (16). First, we find the limit profile nuisance parameter, $\bm{\gamma}_{0}=\bm{\gamma}_{0}(\beta,\bm{\gamma})$ as the solution to Eq. (22), which for linear regression is equivalent to the pair of equations, $E[(y-\bm{\tau}_{\ast}^{\prime}\bm{u})\bm{u}]=\bm{0}$ and $\sigma_{\ast}^{2}=E(y-\bm{\tau}_{\ast}^{\prime}\bm{u})^{2}$ . In this section, we omit the subscript $(\beta,\bm{\gamma})$ on the expectation for brevity. Using some algebra, we obtain

$\displaystyle E[(y-\bm{\tau}_{\ast}^{\prime}\bm{u})\bm{u}]=E[(y-\beta x-\bm{\tau}^{\prime}\bm{u}+\beta x-(\bm{\tau}_{\ast}-\bm{\tau})^{\prime}\bm{u})\bm{u}]=E[(\varepsilon+\beta x-(\bm{\tau}_{\ast}-\bm{\tau})^{\prime}\bm{u})\bm{u}]=\beta\bm{\sigma}_{xu}-\bm{\Sigma}_{\rm u}(\bm{\tau}_{\ast}-\bm{\tau}),$

which yields the solution

$\displaystyle\bm{\tau}_{\ast}=\bm{\tau}+\beta\bm{\Sigma}_{\rm u}^{-1}\bm{\sigma}_{xu},$ (25)

$\displaystyle\sigma_{\ast}^{2}=\sigma^{2}+\beta^{2}(\sigma_{x}^{2}-\bm{\sigma}_{xu}^{\prime}\bm{\Sigma}_{\rm u}^{-1}\bm{\sigma}_{xu}).$ (26)

Now we find the noncentrality parameter. Specifically, we want to express function Eq. (23) through the unknown parameters $\beta$, $\bm{\tau}$, and $\sigma^{2}$. From Eq. (16) we obtain

$\displaystyle\eta(\beta)=-\frac{1}{2}+\frac{1}{2\sigma_{\ast}^{2}}E(y-\bm{\tau}_{\ast}^{\prime}\bm{u})^{2}+\frac{1}{2}\ln\frac{\sigma_{\ast}^{2}}{\sigma^{2}}=\frac{1}{2}\ln\left(1+\frac{\beta^{2}}{\sigma_{\beta}^{2}}\right),$

where $\sigma_{\beta}$ is given by Eq. (17). Finally, the power function of the LR test on the $z$-scale Eq. (14) can be approximated as

$\displaystyle P_{\rm LR}(z)=\Phi(-Z_{1-\alpha/2}+\sqrt{n}\sqrt{\ln(1+z^{2})})+\Phi(-Z_{1-\alpha/2}-\sqrt{n}\sqrt{\ln(1+z^{2})}).$ (27)

Clearly, the Wald and LR powers differ for linear regression, contrary to the case when $\sigma^{2}$ is known. Further test comparisons are found in Section 6.
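The difference between Eq. (13) and Eq. (27) is easy to see numerically, since both have the same form and differ only in the shift: $|z|\sqrt{n}$ for the Wald test and $\sqrt{n\ln(1+z^{2})}$ for the LR test. The sketch below evaluates both on the $z$-scale; the function names are ours.

```python
from math import log1p, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_from_shift(shift, alpha=0.05):
    """Common form Phi(-Z + shift) + Phi(-Z - shift) of Eqs. (13) and (27)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(-z_crit + shift) + STD_NORMAL.cdf(-z_crit - shift)


def wald_power_z(z, n, alpha=0.05):
    """Wald power on the z-scale, Eq. (13)."""
    return power_from_shift(abs(z) * sqrt(n), alpha)


def lr_power_linear_z(z, n, alpha=0.05):
    """LR power for linear regression on the z-scale, Eq. (27)."""
    return power_from_shift(sqrt(n * log1p(z * z)), alpha)
```

Because $\ln(1+z^{2})\leqslant z^{2}$, the LR shift never exceeds the Wald shift in these approximations, and the two nearly coincide for small $|z|$, where $\ln(1+z^{2})\simeq z^{2}$.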

4.2 Logistic regression with binary covariate

Now we apply the LR test to the logistic regression with a binary covariate from Section 3.2. For this statistical model, we have

$\displaystyle\ln f(y,x;\beta,\gamma)=y\gamma+\beta(xy)-\ln(1+e^{\gamma+\beta x}).$ (28)

To calculate the noncentrality parameter, we need the expectation:

$\displaystyle E[\ln f(y,x;\beta,\gamma)]=\gamma\Pr(y=1)+\beta\Pr(y=1,x=1)-p_{x}\ln(1+GB)-(1-p_{x})\ln(1+G)=\gamma\left[\frac{GB}{1+GB}p_{x}+\frac{G}{1+G}(1-p_{x})\right]+\beta\frac{GB}{1+GB}p_{x}-p_{x}\ln(1+GB)-(1-p_{x})\ln(1+G).$

To find $\gamma_{0}$ , we take the derivative of Eq. (28) at $\gamma=\gamma_{0}$ and $\beta=0,$

$\displaystyle\frac{d\ln f(y,x;0,\gamma_{0})}{d\gamma}=y-\frac{e^{\gamma_{0}}}{1+e^{\gamma_{0}}}.$

We find the limit profile nuisance parameter, $\gamma_{0}$, as the solution to the equation

$\displaystyle E_{(\gamma,\beta)}\left(\frac{d\ln f(y,x;0,\gamma_{0})}{d\gamma}\right)=E_{(\gamma,\beta)}(y)-\frac{e^{\gamma_{0}}}{1+e^{\gamma_{0}}}=0$

or equivalently

$\displaystyle\frac{GB}{1+GB}p_{x}+\frac{G}{1+G}(1-p_{x})=\frac{G_{0}}{1+G_{0}},$

where following our convention $G_{0}=e^{\gamma_{0}}$ . After some simplification, we express $G_{0}$ as a function of true $\beta$ and $\gamma$ as

$\displaystyle G_{0}=G_{0}(\gamma,\beta)=G\frac{1+GB+(B-1)p_{x}}{1+GB-(B-1)Gp_{x}}.$ (29)

Finally, the $\eta$-function takes the form

$\displaystyle\eta(\beta)=(\gamma-\gamma_{0})\left[\frac{GB}{1+GB}p_{x}+\frac{G}{1+G}(1-p_{x})\right]+\beta\frac{GB}{1+GB}p_{x}-p_{x}\ln(1+GB)-(1-p_{x})\ln(1+G)+\ln(1+G_{0}).$

Using elementary calculus, it is easy to show that the $\eta$-function has finite limits when $\beta\rightarrow\infty$ or $\beta\rightarrow-\infty$, meaning that the power does not approach $1$ when the alternative goes to infinity. However, for moderate values of $\gamma$ and $p_{x}$, the limiting power is very close to $1$ when $n$ is fairly large. More details are found in Section 6.
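The quantities of this subsection are straightforward to compute. The sketch below evaluates the limit profile odds $G_{0}$ from Eq. (29) and the $\eta$-function displayed above; the function names are ours, and the result of `eta_logistic` is meant to be substituted into Eq. (24).

```python
from math import exp, log


def marginal_prob_y(beta, gamma, p_x):
    """Pr(y = 1) under the true parameters (beta, gamma)."""
    G, B = exp(gamma), exp(beta)
    return G * B / (1 + G * B) * p_x + G / (1 + G) * (1 - p_x)


def limit_profile_odds(beta, gamma, p_x):
    """G_0 = exp(gamma_0) from Eq. (29)."""
    G, B = exp(gamma), exp(beta)
    return G * (1 + G * B + (B - 1) * p_x) / (1 + G * B - (B - 1) * G * p_x)


def eta_logistic(beta, gamma, p_x):
    """eta-function for logistic regression with a binary covariate."""
    G, B = exp(gamma), exp(beta)
    G0 = limit_profile_odds(beta, gamma, p_x)
    pi_y = marginal_prob_y(beta, gamma, p_x)
    return ((gamma - log(G0)) * pi_y
            + beta * G * B / (1 + G * B) * p_x
            - p_x * log(1 + G * B)
            - (1 - p_x) * log(1 + G)
            + log(1 + G0))
```

By construction, $G_{0}/(1+G_{0})$ reproduces the marginal probability $\Pr(y=1)$, and $\eta(0)=0$. Evaluating `eta_logistic` at large $|\beta|$ shows the leveling off of the $\eta$-function described above.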

5. The score test

The score test was originally developed by Rao (1948). This test is especially simple when the null hypothesis is simple. However, in the presence of nuisance parameters we need to derive the MLE for $\bm{\gamma}$ under the restricted model $\beta=0$ (see Pawitan (2001) for more detail). The idea of the score test is simple: if the null hypothesis is true, then the derivative of the log-likelihood function evaluated at the null should be close to zero. Specifically, let $\hat{\bm{\gamma}}_{0}$ maximize the log-likelihood function under $\beta=0$. In other words, $\hat{\bm{\gamma}}_{0}$ is the MLE of the profile likelihood. It can be shown that the distribution of the normalized derivative,

$\displaystyle S=\sqrt{\frac{V(0|\hat{\bm{\gamma}}_{0})}{n}}\frac{\partial l(0,\hat{\bm{\gamma}}_{0})}{\partial\beta}$ (30)

is asymptotically normal with zero mean and unit variance, where $V(0|\hat{\bm{\gamma}}_{0})$ is the asymptotic variance Eq. (7) evaluated at $\beta=0$ and $\bm{\gamma}=\hat{\bm{\gamma}}_{0}$ .

The derivation of the power function of the score test is similar to the derivation of the likelihood ratio test, mainly because both use the concept of the profile likelihood. Let $\bm{\gamma}_{0}=\bm{\gamma}_{0}(\beta,\bm{\gamma})$ be the solution to Eq. (22). Then the power function of the score test is derived from the following approximations somewhat similar to what we used for the Wald test:

$\displaystyle P_{\rm S}(\beta)=\Pr(|S|>Z_{1-\alpha/2})=\Pr\left\{\sqrt{\frac{V(0|\hat{\bm{\gamma}}_{0})}{n}}\bigg{|}\frac{\partial l(\beta,\bm{\gamma})}{\partial\beta}+\left(\frac{\partial l(0,\hat{\bm{\gamma}}_{0})}{\partial\beta}-\frac{\partial l(\beta,\bm{\gamma})}{\partial\beta}\right)\bigg{|}>Z_{1-\alpha/2}\right\}\simeq\Pr\left\{\bigg{|}\sqrt{\frac{V(0|\hat{\bm{\gamma}}_{0})}{V(\beta)}}Z+\sqrt{n}\sqrt{V(0|\hat{\bm{\gamma}}_{0})}\frac{1}{n}\left(\frac{\partial l(0,\hat{\bm{\gamma}}_{0})}{\partial\beta}-\frac{\partial l(\beta,\bm{\gamma})}{\partial\beta}\right)\bigg{|}>Z_{1-\alpha/2}\right\},$

where $Z\sim\mathcal{N}(0,1)$ . In terms of $\Phi$ , the power function can be approximated as

$\displaystyle P_{\rm S}(\beta)=\Phi(-r(\beta,\gamma)Z_{1-\alpha/2}-\sqrt{n}\sqrt{V(\beta)}\delta(\beta))+\Phi(-r(\beta,\gamma)Z_{1-\alpha/2}+\sqrt{n}\sqrt{V(\beta)}\delta(\beta)),$ (31)

where $r(\beta,\gamma)=\sqrt{V(\beta)/V(0|\bm{\gamma}_{0})}$ , and

$\displaystyle\delta(\beta)=E_{(\beta,\bm{\gamma})}\left(\frac{d\ln f(\bm{z};0,\bm{\gamma}_{0})}{d\beta}\right)$ (32)

will be referred to as the delta-function. When $\beta\rightarrow 0$ , we have $V(0|\bm{\gamma}_{0})\rightarrow V(0)$ and $\delta(\beta)\rightarrow 0$ , implying that $P_{\rm S}(\beta)\rightarrow 2\Phi(-Z_{1-\alpha/2})=\alpha$ ; that is, the power of the score test at the null equals the size of the test, so condition Eq. (8) holds. Regarding the unbiasedness of the score test, it is easy to see that

$\displaystyle\frac{dP_{\rm S}}{d\beta}\bigg{|}_{\beta=0}=-\frac{\phi(-Z_{1-\alpha/2})Z_{1-\alpha/2}}{\sqrt{V(\beta)V(0|\bm{\gamma}_{0})}}\times\frac{dV}{d\beta}\bigg{|}_{\beta=0},$ (33)

so that it is unbiased if and only if the derivative of the variance is zero. For linear regression $V$ does not depend on $\beta$ and therefore the score test is unbiased. However, in general the score test is asymptotically biased. We shall find out in Section 6 that this test is not equivalent to the Wald or likelihood ratio test even in the neighborhood of the null.
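The approximation Eq. (31) is simple to evaluate once the two variances and the delta-function are available. The following is a minimal Python sketch, not the author's code: the function name `score_power` and its argument names are hypothetical, and the caller is assumed to supply $V(\beta)$ , $V(0|\bm{\gamma}_{0})$ and $\delta(\beta)$ for the model at hand.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def score_power(n, v_beta, v_null, delta, alpha=0.05):
    """Approximate two-sided score-test power in the form of Eq. (31).

    n      : sample size
    v_beta : V(beta), asymptotic variance at the alternative (assumed given)
    v_null : V(0 | gamma_0), asymptotic variance at the limiting null fit
    delta  : delta(beta), the delta-function of Eq. (32)
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)   # Z_{1 - alpha/2}
    r = (v_beta / v_null) ** 0.5                      # r(beta, gamma)
    shift = (n * v_beta) ** 0.5 * delta               # sqrt(n) sqrt(V(beta)) delta(beta)
    return STD_NORMAL.cdf(-r * z_crit - shift) + STD_NORMAL.cdf(-r * z_crit + shift)


# At the null, delta = 0 and the two variances coincide, so the power equals alpha.
size_at_null = score_power(n=100, v_beta=2.0, v_null=2.0, delta=0.0)
```

Because the two $\Phi$ terms are mirror images of each other, the value depends on $\delta(\beta)$ only through its absolute value; the asymmetry of the score power around the null enters through $r(\beta,\gamma)$ and through the way $\delta$ varies with $\beta$ .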

5.1 Linear regression

We illustrate the score power with linear regression as we did earlier for the Wald and likelihood ratio tests. The needed $\bm{\tau}_{\ast}$ and $\sigma_{\ast}^{2}$ are defined by Eqs (25) and (26). Now we express function Eq. (32) through parameters $\beta$ , $\bm{\tau}$ and $\sigma^{2}$ . Assuming that the expectation is taken under the true parameters $(\beta,\bm{\tau},\sigma^{2})$ , we obtain $\delta(\beta)=-\beta/(\sigma_{\beta}^{2}+\beta^{2})$ implying that on the $z$ -scale the power takes the form

$\displaystyle P_{\rm S}(z)=\Phi\left(-\frac{Z_{1-\alpha/2}}{\sqrt{1+z^{2}}}-\frac{z}{1+z^{2}}\sqrt{n}\right)+\Phi\left(-\frac{Z_{1-\alpha/2}}{\sqrt{1+z^{2}}}+\frac{z}{1+z^{2}}\sqrt{n}\right).$ (34)

It is easy to see that $P_{\rm S}(0)=\alpha$ , so condition Eq. (8) holds. After some algebra, one may check that the score test is asymptotically unbiased, that is, condition Eq. (9) holds. We shall compare this power approximation to that of the other two tests in Section 6.
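Both properties can also be checked numerically. The sketch below evaluates Eq. (34) directly and approximates the first derivative at $z=0$ by a central difference; the function names and the choice $n=100$ are illustrative assumptions, not part of the paper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def score_power_z(z, n, alpha=0.05):
    """Score-test power for linear regression on the z-scale, Eq. (34)."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    center = -z_crit / (1.0 + z * z) ** 0.5
    shift = z / (1.0 + z * z) * n ** 0.5
    return STD_NORMAL.cdf(center - shift) + STD_NORMAL.cdf(center + shift)


def central_slope(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


n = 100  # illustrative sample size
power_at_null = score_power_z(0.0, n)                              # should equal alpha
slope_at_null = central_slope(lambda z: score_power_z(z, n), 0.0)  # should vanish
```

The slope vanishes because Eq. (34) is an even function of $z$ : changing the sign of $z$ merely swaps the two $\Phi$ terms.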

5.2 Logistic regression

As in Section 4.2, the limit profile nuisance parameter $G_{0}$ is found as the solution to Eq. (22) provided by expression Eq. (29). Thus, to find the power, it suffices to find the delta-function Eq. (32). After some algebra and using Eq. (28), we obtain

$\displaystyle\delta(\beta)=E_{(\gamma,\beta)}\frac{\partial\ln f(y,x;0,\gamma_{0})}{\partial\beta}=\frac{GB}{1+GB}-\frac{G_{0}}{1+G_{0}}p_{x}.$ (35)

The variance function, $V$ , was derived earlier and is given by Eq. (18); the variance at the null is

$\displaystyle V(0|\bm{\gamma}_{0})=\frac{(1+G_{0})^{2}}{p_{x}(1-p_{x})G_{0}}.$

Unlike for linear regression, the score test is asymptotically biased for logistic regression. Namely, the first derivative at $\beta=0$ given by Eq. (33) is not zero. We shall compare this test to the other two tests in Section 6.

6. Tests comparison

In this section, we compare the Wald, LR, and score tests using two regression models as examples: linear regression with an unknown $\sigma^{2}$ under normal distribution and logistic regression with binary covariate.

We start our analysis with linear regression. It is convenient to compare the tests on the $z$ -scale using representations Eqs (13), (27) and (34). From Eq. (14), it is easy to see that for each test the second derivatives on the original $\beta$ -scale and the $z$ -scale are related through a coefficient, which is the reciprocal of the variance at the null,

$\displaystyle\frac{d^{2}P}{d\beta^{2}}\bigg{|}_{\beta=0}=\frac{1}{V(0)}\frac{d^{2}P}{dz^{2}}\bigg{|}_{z=0}.$

Theorem 1. All three tests are asymptotically unbiased for linear regression under a normal distribution with an unknown $\sigma^{2}$ (the first derivative of the power function evaluated at the null vanishes). The power function of the Wald test is uniformly greater than that of the LR test ( $\beta\neq 0$ ). The second derivatives of the three tests at the null are as follows:

$\displaystyle\frac{d^{2}P_{W}}{dz^{2}}\bigg{|}_{z=0}=\frac{d^{2}P_{\rm LR}}{dz^{2}}\bigg{|}_{z=0}=2\phi(Z_{1-\alpha/2})Z_{1-\alpha/2}n,$ (36)

$\displaystyle\frac{d^{2}P_{\rm S}}{dz^{2}}\bigg{|}_{z=0}=2\phi(Z_{1-\alpha/2})Z_{1-\alpha/2}(n+1).$ (37)

Thus, the score test is slightly superior in a close neighborhood of the null.

Proof. The asymptotic unbiasedness of the tests (the first derivatives vanish at zero) has been noted previously. The superiority of the Wald over the LR test follows from the elementary inequality $\sqrt{\ln(1+z^{2})}<|z|$ for all $z\neq 0$ . The second derivative of the Wald test at $\beta=0$ was derived earlier, see Eq. (12). The second derivative of the LR test follows from Eq. (34). The evaluation of the second derivative of the score test Eq. (37) is somewhat tedious but straightforward. A slight superiority of the score test in the neighborhood of the null follows from the fact that the difference of the second derivatives is $2\phi(Z_{1-\alpha/2})Z_{1-\alpha/2}$ , although this difference rapidly diminishes on the relative scale as $n\rightarrow\infty$ .

Figure 3.

Power of the three tests for linear regression with unknown $\sigma^{2}$ on the $z$ -scale ( $n=100,\sigma_{x}=\sigma=1,\rho_{xu}=0.9,\tau=1$ with $\alpha=0.05$ ). All three tests have the same size ( $\alpha$ ) and the first derivative vanishes at $\beta=0$ . The Wald and likelihood ratio tests have the same second derivatives at the null, but the second derivative of the score test is higher (the difference is too small to be visible in the plot). However, globally, the Wald test is superior.

We illustrate the power approximation of the three tests in Fig. 3 with linear regression $y=\beta x+\tau u+\varepsilon$ , where all random variables have normal distributions with zero mean, $\text{var}(x)=\sigma_{x}^{2}$ , $\text{var}(u)=\sigma_{\rm u}^{2}$ , $\text{var}(\varepsilon)=\sigma^{2}$ . The power functions of the Wald, LR, and score tests are plotted as functions of the $z$ -ratio Eq. (14) with $n=100$ , $\sigma_{x}=\sigma=1$ , $\rho_{xu}=0.9$ and $\tau=1$ , assuming the significance level $\alpha=5\%$ . Notice that the power functions are very close in the interval $|z|<0.2$ , with the maximum difference around $z=0.5$ .
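The statements of Theorem 1 can be probed numerically. The sketch below is illustrative only and rests on a loudly stated assumption: Eqs (13) and (27) are not reproduced in this section, so the Wald and LR power functions are taken here to have the forms $\Phi(-Z_{1-\alpha/2}\pm\sqrt{n}z)$ and $\Phi(-Z_{1-\alpha/2}\pm\sqrt{n\ln(1+z^{2})})$ , which are consistent with the inequality used in the proof and with Eq. (36) but may differ in detail from the paper's expressions. Only the score power, Eq. (34), is taken verbatim. All function names are hypothetical, and the sketch is not claimed to reproduce Fig. 3.

```python
from math import log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def two_sided(center, shift):
    """Phi(center - shift) + Phi(center + shift)."""
    return STD_NORMAL.cdf(center - shift) + STD_NORMAL.cdf(center + shift)


def wald_power_z(z, n, alpha=0.05):
    # ASSUMED form of Eq. (13); not shown in this section.
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return two_sided(-z_crit, sqrt(n) * z)


def lr_power_z(z, n, alpha=0.05):
    # ASSUMED form of Eq. (27); not shown in this section.
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return two_sided(-z_crit, sqrt(n * log(1.0 + z * z)))


def score_power_z(z, n, alpha=0.05):
    # Eq. (34) as stated in Section 5.1.
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return two_sided(-z_crit / sqrt(1.0 + z * z), z / (1.0 + z * z) * sqrt(n))


def curvature_at_null(power, n, h=1e-3):
    """Central finite-difference approximation of the second derivative at z = 0."""
    return (power(h, n) - 2.0 * power(0.0, n) + power(-h, n)) / (h * h)


n, alpha = 100, 0.05
z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
base = 2.0 * STD_NORMAL.pdf(z_crit) * z_crit   # 2 phi(Z) Z, the factor in Eqs (36)-(37)

curv_wald = curvature_at_null(wald_power_z, n)    # compare with base * n
curv_lr = curvature_at_null(lr_power_z, n)        # compare with base * n
curv_score = curvature_at_null(score_power_z, n)  # compare with base * (n + 1)
```

Under these assumed forms, the finite-difference curvatures agree with Eqs (36) and (37) up to discretization error, and the Wald power is no smaller than the LR power at every $z$ because the shift $\sqrt{n}|z|$ exceeds $\sqrt{n\ln(1+z^{2})}$ .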

Figure 4.

Three power functions for logistic regression with a binary covariate, $p_{x}=0.2$ , $G=1/9,n=500$ with $\alpha=5\%$ . The first derivative of the Wald and likelihood ratio tests vanishes, but the derivative of the score test power is positive at $\beta=0$ . Simulations for $\beta=-1$ and $0.5$ are in good agreement with the power approximations.

Now we compare the three tests for logistic regression with a binary covariate. See Fig. 4 as an example with $\alpha=0.05$ , $p_{x}=\Pr(x=1)=0.2$ , $\Pr(y=1|x=0)=0.1$ ( $G=1/9$ ), and $n=500$ . We intentionally use an unbalanced parameter setup ( $p_{x}=0.2$ ) because at symmetry ( $p_{x}=0.5$ and $G=1$ ) the power functions are practically identical. Recall that the Wald and LR power functions have zero first derivatives and the same second derivative at $\beta=0$ . However, the first derivative of the score power is positive at $\beta=0$ , meaning that the power of the score test is slightly lower in the neighborhood to the left of the null. Recall also that the Wald power falls to $\alpha$ when $\beta\rightarrow-\infty$ or $\beta\rightarrow\infty$ , which is not true for the LR and score tests. Although the LR power does not approach 1 when $\beta\rightarrow-\infty$ , the limit is very close to 1. To test the power approximations, we conducted simulations for $\beta=-1$ and $\beta=0.5$ (the number of experiments is 5,000); the simulations and power approximations are in good agreement.
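A simulation of this kind can be sketched in a few lines of Python. The code below is not the author's simulation program; it is a minimal illustration for the Wald test only, under the following assumptions: observations are drawn independently with $x\sim\text{Bernoulli}(p_{x})$ and odds of $y=1$ equal to $G\mathrm{e}^{\beta x}$ , the MLE of $\beta$ is computed in closed form as the log odds ratio of the $2\times 2$ table, its standard error is the usual square root of the sum of reciprocal cell counts, and samples with an empty cell (where the MLE does not exist) are counted as non-rejections. The function name, the seed, and the default number of replications are hypothetical choices.

```python
import math
import random
from statistics import NormalDist


def simulate_wald_rejection_rate(beta, n=500, p_x=0.2, g=1 / 9,
                                 alpha=0.05, n_rep=2000, seed=1):
    """Empirical rejection rate of the two-sided Wald test of beta = 0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    odds = (g, g * math.exp(beta))               # odds of y = 1 for x = 0 and x = 1
    prob_y = tuple(o / (1.0 + o) for o in odds)
    rejections = 0
    for _ in range(n_rep):
        counts = [[0, 0], [0, 0]]                # counts[x][y]
        for _ in range(n):
            x = 1 if rng.random() < p_x else 0
            y = 1 if rng.random() < prob_y[x] else 0
            counts[x][y] += 1
        (n00, n01), (n10, n11) = counts
        if min(n00, n01, n10, n11) == 0:
            continue                             # MLE undefined: counted as no rejection
        beta_hat = math.log(n11 * n00 / (n10 * n01))       # log odds ratio
        se = math.sqrt(1 / n00 + 1 / n01 + 1 / n10 + 1 / n11)
        if abs(beta_hat / se) > z_crit:
            rejections += 1
    return rejections / n_rep
```

Calling `simulate_wald_rejection_rate(-1.0, n_rep=5000)` or `simulate_wald_rejection_rate(0.5, n_rep=5000)` mimics the two settings mentioned above; the resulting frequencies depend on the seed and on the handling of empty cells, so they should be read as rough checks rather than as a replication of the reported simulations.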

7. Discussion and summary points

Asymptotic tests are at the heart of statistical inference. Early authors studied which test is superior. For example, Madansky (1989) raised this question, and in an early edition of Rao’s book it was conjectured that the score test is locally more powerful than the other two. (There is no such conjecture in the latest, 1973 edition of the book!) Indeed, as follows from Theorem 1, there exists a slight superiority of the score test for linear regression in close proximity to the null. However, for other statistical models this superiority disappears; moreover, the score test is generally biased (the first derivative is not zero), which means that for a two-sided test the score power will be lower on one side of the null.

The main points of the paper are as follows:

• The Wald test may exhibit aberrant behavior for some statistical models, namely, the power increases and then drops to the size of the test as the alternative approaches negative infinity. For logistic regression with a binary covariate, the Wald test behaves well for alternatives in the interval $(-2,2)$ . The chance that the likelihood ratio or score test exhibits aberrant behavior is smaller, but still exists.

• The Wald and LR tests are asymptotically unbiased (the first derivative of their power function vanishes at the null), but the score test is generally biased. The Wald and LR tests are locally equivalent (the second derivatives are the same at the null), but the score test is not.

• It is well known that all three tests are the same for a linear model under the normal distribution when the error variance, $\sigma^{2}$ , is known. However, when $\sigma^{2}$ is unknown (nuisance parameter) the tests are different. The power of the Wald test is higher than that of the LR test for all alternatives, but locally the tests are equivalent. The score test for this model is also asymptotically unbiased and has slightly higher power in close proximity to the null.

• The behavior of the tests changes from distribution to distribution and from one statistical model to another. Even members of the same family of generalized linear models (linear and logistic) have quite different test properties. The differences are especially drastic for low probabilities and/or larger true values of nuisance parameters.

• Simulations show good agreement with the power approximations, especially for moderate parameter values.

• No test is superior to another over the whole range of statistical models and distributions. For example, the tests are close for moderate parameter values, particularly when $p_{x}$ is around 1/2. However, the tests may be very different with extreme parameter values, such as for rare events in logistic regression ( $p_{x}\simeq 0$ ). Thus, different tests may be optimal for different statistical models and, moreover, for different ranges of the alternative.

Besides theoretical interest, the power analysis has an important application for study design, particularly in epidemiology and clinical trials. Usually, one wants to determine the sample size required to achieve a specified power or minimum detectable difference (the alternative beta value). Several commercial and noncommercial software packages are available on the market. However, there is no consensus on what test to use as the basis for the sample size determination – some authors use the Wald test (Whittemore, 1981), some the likelihood ratio test (Self et al., 1992 and Shieh, 2000) and some the score test (Self & Mautitzen, 1988) mainly because of the popular opinion that asymptotic tests are all the same. True, when alternatives are close to null and nuisance parameters do not take extreme values, the powers look alike. However, in some cases, we plan studies for large alternatives such as in epidemiologic studies with gene–gene or gene–environment interaction or observational studies of rare diseases (cancer), see Duchateau (1998) and Gauderman (2002) for examples. In such studies, the choice of the test becomes crucial. The test used for the sample size determination should be the same as the test in future significance testing (Demidenko, 2007, 2008).
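As an illustration of sample size determination based on a specific test, the sketch below searches for the smallest $n$ at which the score power for linear regression, Eq. (34), reaches a target value at a given $z$ . It is a hedged example rather than a recommended procedure: the function names, the target power of 0.8, and the doubling-plus-bisection search are assumptions made here, and the search relies on the fact that Eq. (34) is nondecreasing in $n$ for fixed $z\neq 0$ .

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution N(0, 1)


def score_power_z(z, n, alpha=0.05):
    """Score-test power for linear regression on the z-scale, Eq. (34)."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    center = -z_crit / (1.0 + z * z) ** 0.5
    shift = z / (1.0 + z * z) * n ** 0.5
    return STD_NORMAL.cdf(center - shift) + STD_NORMAL.cdf(center + shift)


def required_sample_size(z, target_power=0.8, alpha=0.05, n_max=10**9):
    """Smallest integer n with score_power_z(z, n) >= target_power."""
    if z == 0.0:
        raise ValueError("the power equals alpha for every n when z = 0")
    lo, hi = 0, 1                     # invariant: power(hi) >= target once the loop ends
    while score_power_z(z, hi, alpha) < target_power:
        lo, hi = hi, hi * 2           # doubling phase
        if hi > n_max:
            raise ValueError("target power not reached below n_max")
    while hi - lo > 1:                # bisection phase on the integer grid
        mid = (lo + hi) // 2
        if score_power_z(z, mid, alpha) >= target_power:
            hi = mid
        else:
            lo = mid
    return hi


n_needed = required_sample_size(0.3)
```

Replacing `score_power_z` with the power approximation of another test would generally give a different sample size, which is the practical reason for matching the design-stage test to the test used in the final analysis.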

Several misconceptions are widespread when it comes to studying asymptotic properties of generalized linear models (GLMs), and power functions in particular. First, as I mentioned in the Introduction, many authors do not distinguish between fixed and random predictors/covariates, which explains the differences in test comparisons. For example, Lemonte and Ferrari (2012) claimed that the likelihood ratio and score tests have similar power behavior; however, we discovered differences even for linear regression. We attribute these dissimilarities to the fact that we studied power functions under the random predictor assumption, whereas they assumed fixed regressors: “The covariate values were selected as random draws from the $U(0,1)$ distribution and for fixed n those values were kept constant throughout the experiment.” Second, it is believed that since GLMs belong to the same family of distributions they may be studied under one methodological umbrella. However, as we showed in this paper, GLMs may exhibit quite different properties. For example, with positive probability logistic regression cannot be estimated (the MLE does not exist), or the power function may decrease when the alternative moves away from the null value (aberrant behavior), especially for large negative regression coefficients.

Much additional work is to be done in the following areas: (1) extend our power analysis to one-sided tests and the multivariate null hypothesis; (2) identify what asymptotic tests are optimal for specific statistical models, such as members of the generalized linear model family; and (3) improve power approximations for extreme values of alternative and nuisance parameters, particularly when the effective sample size is small.

Finally, the study of the tests in this paper, as well as in many others, relies on standard conditions on the density function, such as independence of the density support from the parameter. In a private conversation, Professor Sergei Aivazian told me that one of the problems he considered in his doctoral dissertation was superefficient estimation and the respective statistical inference when the support of the distribution depends on the unknown parameter, as in the uniform distribution with unknown upper limit.

Acknowledgments

I am grateful to the reviewer for the comments and the suggestion to compare the results of the test comparison with conclusions derived by other authors. This prompted an extended discussion of the differences between the fixed and random predictor schemes.

Appendix. Proof of Theorem 1

The fact that the first derivative of the Wald power vanishes at $\beta=0$ was proven in Section 3. For the LR power, we have

$\displaystyle\frac{dP_{\rm LR}}{d\beta}=\sqrt{\frac{n}{2}}[\phi(-Z_{1-\alpha/2}+\sqrt{2n\eta(\beta)})-\phi(Z_{1-\alpha/2}+\sqrt{2n\eta(\beta)})]\frac{d\eta/d\beta}{\sqrt{\eta(\beta)}}.$

This implies that the LR test is asymptotically unbiased, Eq. (9), because $\eta(0)=0$ (we show below that $(d\eta/d\beta)/\sqrt{\eta(\beta)}$ is a finite number at zero).

Now we evaluate the second derivative of the power function at zero. Using the property that $d\phi/dx=-x\phi(x)$ , we obtain

$\displaystyle\frac{d^{2}P_{\rm LR}}{d\beta^{2}}\bigg{|}_{\beta=0}=n\phi(Z_{1-\alpha/2})Z_{1-\alpha/2}\frac{(d\eta/d\beta)^{2}}{\eta(\beta)}\bigg{|}_{\beta=0}.$ (38)

Since the numerator and denominator with the $\eta$ -function are both zero, we need to consider the limit. By L’Hôpital’s rule,

$\displaystyle\lim_{\beta\rightarrow 0}\frac{(d\eta/d\beta)^{2}}{\eta(\beta)}=2\lim_{\beta\rightarrow 0}\frac{(d\eta/d\beta)(d^{2}\eta/d\beta^{2})}{d\eta/d\beta}=2\frac{d^{2}\eta}{d\beta^{2}}\bigg{|}_{\beta=0},$
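The same limit can be cross-checked with a second-order Taylor expansion. This is a sketch under assumptions not spelled out in the text: $\eta$ is twice continuously differentiable near zero, $\eta(0)=0$ , $d\eta/d\beta=0$ at $\beta=0$ , and $\eta^{\prime\prime}(0)=d^{2}\eta/d\beta^{2}|_{\beta=0}>0$ . Then

$\displaystyle\eta(\beta)=\tfrac{1}{2}\eta^{\prime\prime}(0)\beta^{2}+o(\beta^{2}),\quad\frac{d\eta}{d\beta}=\eta^{\prime\prime}(0)\beta+o(\beta),\quad\text{so that}\quad\frac{(d\eta/d\beta)^{2}}{\eta(\beta)}=\frac{(\eta^{\prime\prime}(0))^{2}\beta^{2}+o(\beta^{2})}{\tfrac{1}{2}\eta^{\prime\prime}(0)\beta^{2}+o(\beta^{2})}\rightarrow 2\eta^{\prime\prime}(0)\quad\text{as }\beta\rightarrow 0.$

The same expansion gives $(d\eta/d\beta)/\sqrt{\eta(\beta)}\rightarrow\pm\sqrt{2\eta^{\prime\prime}(0)}$ , with the sign of $\beta$ , which is the finiteness used for the first derivative of the LR power.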

so that Eq. (38) is rewritten as

$\displaystyle\frac{d^{2}P_{\rm LR}}{d\beta^{2}}\bigg{|}_{\beta=0}=2n\phi(Z_{1-\alpha/2})Z_{1-\alpha/2}\frac{d^{2}\eta}{d\beta^{2}}\bigg{|}_{\beta=0}.$

Thus, the evaluation of the second derivative of the power reduces to the evaluation of the second derivative of the $\eta$ -function. We aim to find the second derivative of $\eta$ at zero. Interchanging expectation and differentiation from Eq. (23), we obtain $\frac{d}{d\beta}E_{(\beta,\bm{\gamma})}(\ln f(\bm{z};\beta,\bm{\gamma}))=0$ . This implies

$\displaystyle\frac{d\eta}{d\beta}=-\frac{d}{d\beta}E_{(\beta,\bm{\gamma})}(\ln f(\bm{z};0,\bm{\gamma}_{0}))=-\left(\frac{\partial}{\partial\bm{\gamma}_{0}}E_{(\beta,\bm{\gamma})}(\ln f(\bm{z};0,\bm{\gamma}_{0}))\right)^{\prime}\left(\frac{d\bm{\gamma}_{0}}{d\beta}\right).$

From differentiation of an implicit function, we have

$\displaystyle\frac{d\bm{\gamma}_{0}}{d\beta}=-\left[\frac{\partial}{\partial\bm{\gamma}}E_{(\beta,\bm{\gamma})}\left(\frac{\partial\ln f(\bm{z};0,\bm{\gamma})}{\partial\bm{\gamma}}\right)\right]^{-1}\left[\frac{d}{d\beta}E_{(\beta,\bm{\gamma})}\left(\frac{\partial\ln f(\bm{z};0,\bm{\gamma})}{\partial\bm{\gamma}}\right)\right],$

but

$\displaystyle\frac{d}{d\beta}E_{(\beta,\bm{\gamma})}\left(\frac{\partial\ln f(\bm{z};0,\bm{\gamma})}{\partial\bm{\gamma}}\right)=\bm{I}_{12},\quad\frac{\partial}{\partial\bm{\gamma}}E_{(\beta,\bm{\gamma})}\left(\frac{\partial\ln f(\bm{z};0,\bm{\gamma})}{\partial\bm{\gamma}}\right)=\bm{I}_{22},$

which results in $d\bm{\gamma}_{0}/d\beta=-\bm{I}_{22}^{-1}\bm{I}_{12}$ . Finally, the second derivative of the $\eta$ -function at zero is

$\displaystyle\frac{d^{2}}{d\beta^{2}}E_{(\beta,\bm{\gamma})}(\ln f(\bm{z};\beta,\bm{\gamma}))-\left(\frac{d\bm{\gamma}_{0}}{d\beta}\right)^{\prime}\left(\frac{\partial^{2}}{\partial\bm{\gamma}_{0}}E_{(\beta,\bm{\gamma})}(\ln f(\bm{z};0,\bm{\gamma}_{0}))\right)\left(\frac{d\bm{\gamma}_{0}}{d\beta}\bigg{|}_{\beta=0}\right)=I_{11}-\bm{I}_{12}^{\prime}\bm{I}_{22}^{-1}\bm{I}_{22}\bm{I}_{22}^{-1}\bm{I}_{12}=I_{11}-\bm{I}_{12}^{\prime}\bm{I}_{22}^{-1}\bm{I}_{12}.$

Combining the results, we arrive at the second derivative at the null

$\displaystyle\frac{d^{2}P_{\rm LR}}{d\beta^{2}}\bigg{|}_{\beta=0}=2n\phi(Z_{1-\alpha/2})Z_{1-\alpha/2}(I_{11}-\bm{I}_{12}^{\prime}\bm{I}_{22}^{-1}\bm{I}_{12}),$

which coincides with the second derivative of the Wald test Eq. (12). This means that the Wald and LR tests are locally equivalent.

References

Aivazian, S. A., Yenyukov, I. S., & Meshalkin, L. D. (1985). Applied statistics, 2. Investigation of Dependencies (in Russian). Moscow: Finasi i Statistika.

Bickel, P. J. & Doksum, K. A. (2001). Mathematical statistics. Basic Ideas and Selected Topics. Upper Saddle River, NJ: Prentice Hall.

Casella & Berger, R. L. (1990). Statistical inference. Belmont, CA: Duxbury Press.

Chandra, T. K. & Joshi, S. N. (1983). Comparison of the likelihood ratio, Rao’s and Wald’s tests and a conjecture of C. R. Rao. Sankhyā Ser. A 45, 226-246.

Chandra, T. K. & Mukerjee (1984). On the optimality of the Rao’s statistic. Communications in Statistics. Theory and Methods 13, 1507-1515.

Cohen (1994). The earth is round (p<0.05). American psychologist 49, 997-1003.

Cox, D. R. & Hinkley, D. V. (1974). Theoretical statistics. Boca Raton, FL: Chapman & Hall.

Demidenko (2013). Mixed models: Theory and applications. 2d. ed. Hoboken, NJ: Wiley.

Demidenko (2020). Advanced Statistics with Applications in R. Hoboken, NJ: Wiley. www.eugened.org

Demidenko (2007). Sample size determination for logistic regression revisited. Statistics in Medicine 26, 3385-3394.

Demidenko (2008). Sample size and optimal design for logistic regression with binary interaction. Statistics in Medicine 27, 36-46.

Duchateau, Mc Dermott, & Rowlands, G. R. (1998). Power evaluation of small drug and vaccine experiments with binary outcomes. Statistics in Medicine 17, 111-120.

Fears, T. R., Benicou, & Gail, M. H. (1996). A reminder of the fallibility of the Wald statistics. The American Statistician 50, 226-227.

Gauderman, W. J. (2002). Sample size requirements for association studies of gene-gene interaction. American Journal of Epidemiology 155, 478-484.

Hauck, W. W. & Donner (1977). Wald’s test as applied to hypotheses in logit analysis. Journal of American Statistical Association 72, 851-853.

Kulinskaya, Morgenthaler, & Staudte, R. G. (2008). Meta analysis. A guide to calibrating and combining statistical evidence. Chichester: Wiley.

Lehmann, E. L. & Romano, J. P. (2005). Testing statistical hypotheses, 3d edition. New York: Springer.

Lemonte, A. J. & Ferrari, S. L. P. (2012). Local power and size properties of the LR, Wald, score and gradient tests in dispersion models. Statistical Methodology 9, 537-554.

Madansky (1989). A comparison of the likelihood ratio, Wald, and Rao tests. In: Contributions to Probability and Statistics, (Eds: Gleser, L. J. et al.). New York: Springer-Verlag.

Mc Cullagh (1987). Tensor Methods in Statistics. London: Chapman & Hall.

Pawitan (2001). In all likelihood. Statistical modeling and inference using likelihood. Oxford: Clarendon Press.

Rao, C. R. (1948). Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Proc. Camb. Phil. Soc. 44, 50-57.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. 2d ed. New York: Wiley.

Rayner, J. C. (1997). The asymptotically optimal tests. The Statistician 46, 337-346.

Schervish, M. J. (1995). Theory of statistics. New York: Springer-Verlag.

Self & Mautitzen, R. H. (1988). Power/sample size calculations for generalized linear models. Biometrics 44, 79-86.

Self, Mautitzen, R. H., & Ohara (1992). Power calculations for likelihood ratio tests in generalized linear models. Biometrics 48, 31-39.

Serfling, R. J. (1980). Approximation theorems of mathematical statistics. New York: Wiley.

Severeni, T. A. (2000). Likelihood methods in statistics. Oxford: Oxford University Press.

Shao (2010). Mathematical statistics. 2d ed. Springer: New York.

Shieh (2000). On power and sample size calculations for likelihood ratio tests in generalized linear models. Biometrics 56, 1192-1196.

Shieh (2005). On power and sample size calculations for Wald tests in generalized linear models. Journal of Statistical Planning and Inference 128, 43-59.

Væth, M. (1985). On the use of Wald’s test in exponential families. International Statistical Review 53, 199-214.

Whittemore, A. S. (1981). Sample size for logistic regression with small response probability. Journal of the American Statistical Association 76, 27-32.

Young, G. A. & Smith, R. L. (2005). Essentials of statistical inference. Cambridge: Cambridge University Press.