Abstract
Background:
In December 2019, in light of additional blinded data, Biogen claimed efficacy of the drug Aducanumab (ADU).
Objective:
We conducted a reanalysis of the phase III ADU summary statistics, focusing in particular on the Clinical Dementia Rating-Sum of Boxes.
Methods:
We used a Bayesian framework to mitigate the problems of the null-hypothesis significance testing framework. In particular, we used the Bayes Factor (BF) to analyze the summary statistics. The BF compares how well two hypotheses predict the data.
Results:
Our results showed that the evidence for ADU efficacy is very weak. The only data with a BF in favor of the alternative hypothesis (i.e., drug efficacy) came from the high-dose condition in the EMERGE trial. However, the obtained BF falls within the range of values considered anecdotal, meaning a low level of evidence.
Conclusion:
We provide a clearer interpretation of the results of the clinical trials based on the Bayesian framework, as this may be useful for future development and research in the field.
INTRODUCTION
In March 2019, Biogen announced the discontinuation of its phase III trials of the drug Aducanumab (ADU) due to futility. Seven months later (December 2019), in light of additional blinded data, Biogen claimed efficacy of ADU. These new findings were presented at an international meeting in San Diego, California, and the results were released online (https://investors.biogen.com/static-files/ddd45672-9c7e-4c99-8a06-3b557697c06f); for a more in-depth review of the trials and clinical considerations, see Knopman, Jones, and Greicius [1]. The claim was that there was evidence of efficacy for the high dose of ADU in the halted trial. Based on this evidence, Biogen submitted a new drug application to the Food and Drug Administration (FDA) in July 2020. However, several concerns were soon raised by different researchers and by the Office of Biostatistics within the Office of Translational Sciences of the FDA. The Office of Translational Sciences declared that substantial evidence of effectiveness had not been provided (the complete information released by the FDA is available at https://www.accessdata.fda.gov/drugsatfda_docs/nda/2021/761178_Orig1s000TOC.cfm). The FDA's assessment relied on the null-hypothesis significance testing (NHST) framework, with a p < 0.05 criterion, to evaluate “substantial evidence” of drug efficacy.
However, the NHST framework is associated with several concerns [2]. First, the p-value is susceptible to misinterpretation, leading to overestimation of the evidence against the null hypothesis. Second, the p < 0.05 criterion used as a reference for accepting or rejecting the null hypothesis leads to an “all or nothing” binary decision. Two main issues arise from this: 1) the FDA can approve a drug whose efficacy is only minimally supported by the data; 2) the evidence in favor of efficacy cannot be assessed on a graded scale. By contrast, Bayesian statistics allows a direct evaluation of the evidence for and against competing hypotheses. Such an evaluation, which is not possible in the frequentist framework, makes it possible to assess how strong the evidence for a treatment effect is.
Therefore, we reanalyzed the released data by means of a Bayesian analysis, focusing on the results for the Clinical Dementia Rating-Sum of Boxes (CDR-SB), which showed a significant p-value for the high-dose condition in the EMERGE trial.
MATERIALS AND METHODS
Data
Because the raw data were not publicly released, the only available data and results from the phase III trial were those presented in San Diego (and retrievable from https://investors.biogen.com/static-files/ddd45672-9c7e-4c99-8a06-3b557697c06f). Specifically, we analyzed the CDR-SB scores available for the EMERGE final data set at week 78 and the ENGAGE final data set. See Table 1 for a summary of the original results considered.
Each row represents the public data of the EMERGE and ENGAGE trials. Here, n is the sample size and p is the original p-value of the obtained results.
Statistical method
We used a Bayesian framework to mitigate the problems of the NHST framework mentioned in the introduction. In particular, we used the Bayes Factor (BF), which in its simplest form is also called the likelihood ratio. The BF compares how well two hypotheses predict the data: the hypothesis that better predicts the observed data is the one supported by more evidence. The equation of the BF is:
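In its standard form, the ratio can be written as follows, where D denotes the observed data, H1 the alternative hypothesis (drug efficacy), and H0 the null hypothesis (these symbols are notational assumptions made here):

```latex
\[
\mathrm{BF}_{10} = \frac{P(D \mid H_1)}{P(D \mid H_0)}
\]
```

Values of BF10 above 1 indicate that the data are better predicted by H1, and values below 1 that they are better predicted by H0.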
The BF differs from the NHST framework in several ways. First, the BF is a ratio of probabilities, and its value can range from zero to infinity. Crucially, it requires two hypotheses, making it clear that evidence against the null hypothesis can only exist relative to some alternative hypothesis. Second, the BF depends only on the probability of the data actually observed and does not take into account unobserved “long run” results, as the calculation of the p-value does. Hence, issues known to affect the computation of p-values, such as the stopping rule, do not affect the BF.
To compute the BF, a t-statistic is needed. Since it was not available in the released summary statistics, it was derived from the sample size N and the p-value of each condition by evaluating the inverse cumulative distribution function of Student’s t distribution at the probability p, with v = N-1 degrees of freedom.
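The conversion from a p-value to a t-statistic can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `t_from_p` is hypothetical, and treating the reported p-values as two-sided is an assumption, since the text does not state it.

```python
from scipy import stats


def t_from_p(p, n, two_sided=True):
    """Recover the magnitude of a t-statistic from a p-value and sample size.

    Uses v = n - 1 degrees of freedom, as stated in the text.
    two_sided=True is an assumption about how the p-values were reported.
    """
    df = n - 1
    # For a two-sided test, each tail holds half of the p-value.
    tail_prob = p / 2.0 if two_sided else p
    # isf is the inverse survival function, i.e. the inverse CDF at 1 - tail_prob.
    return stats.t.isf(tail_prob, df)


# Hypothetical inputs, not values from the trials.
t_example = t_from_p(0.05, 100)
```

The result is the absolute value of t only; the sign has to be taken from the direction of the reported effect.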
From the t-statistic, it is possible to obtain the BF given the null and the alternative hypotheses (see Rouder and colleagues [3]) as:
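A minimal sketch of this computation is given below. It assumes the one-sample JZS Bayes factor of Rouder and colleagues [3], with a Cauchy prior on effect size of scale r = 1 and v = N - 1 degrees of freedom. The function name `jzs_bf10` and the choice of r are assumptions made here, not details confirmed by the text.

```python
import math
from scipy import integrate


def jzs_bf10(t, n, r=1.0):
    """One-sample JZS Bayes factor BF10 from a t-statistic and sample size n.

    r is the scale of the Cauchy prior on effect size (r = 1 is assumed).
    """
    nu = n - 1  # degrees of freedom, v = N - 1

    # Log of the H0 term, (1 + t^2 / nu)^(-(nu + 1) / 2).
    log_null = -(nu + 1) / 2 * math.log1p(t * t / nu)

    def integrand(g):
        # g has an inverse-gamma(1/2, 1/2) prior; averaging a normal prior
        # on effect size over g yields the Cauchy prior.
        if g <= 0.0:
            return 0.0
        a = 1.0 + n * g * r * r
        log_val = (
            -0.5 * math.log(a)
            - (nu + 1) / 2 * math.log1p(t * t / (a * nu))
            - 0.5 * math.log(2 * math.pi)
            - 1.5 * math.log(g)
            - 1.0 / (2.0 * g)
        )
        # Dividing by the H0 term inside the integral (in log space)
        # returns BF10 directly and avoids underflow for moderate t.
        return math.exp(log_val - log_null)

    bf10, _abs_err = integrate.quad(integrand, 0.0, math.inf)
    return bf10
```

For example, `jzs_bf10(t_value, n)` with a t-statistic recovered from a reported p-value gives the BF10 for that condition. Working in log space keeps the integrand stable for the sample sizes typical of phase III trials.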
The strength of evidence is evaluated referring to the standard conventions for the evidentiary support of BF, such that values in the 1–3 range are classified as anecdotal; values in the 3–10 range as moderate; values in the 10–30 range as strong; values in the 30–100 range as very strong [4]. For BF values below 1, the reciprocal can be taken to obtain the strength of evidence in the opposite direction.
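These conventions can be expressed as a small lookup, sketched below. The function name `classify_bf` and the half-open treatment of the boundaries (for example, a BF of exactly 3 counted as moderate) are assumptions, since the text does not say to which band a boundary value belongs.

```python
# Upper bounds of the evidence bands listed in the text.
BANDS = [
    (3.0, "anecdotal"),
    (10.0, "moderate"),
    (30.0, "strong"),
    (100.0, "very strong"),
]


def classify_bf(bf10):
    """Return (favored hypothesis, strength label) for a Bayes factor BF10."""
    if bf10 <= 0.0:
        raise ValueError("A Bayes factor must be positive.")
    if bf10 == 1.0:
        return ("neither", "no evidence")
    # For values below 1, take the reciprocal to read the evidence for H0.
    favored = "H1" if bf10 > 1.0 else "H0"
    strength = bf10 if bf10 > 1.0 else 1.0 / bf10
    for upper, label in BANDS:
        if strength < upper:
            return (favored, label)
    # The text lists no label beyond 100.
    return (favored, "above the listed ranges")
```

With this sketch, `classify_bf(1.5)` is read as anecdotal evidence for H1, and `classify_bf(0.2)` as moderate evidence for H0, because its reciprocal is 5.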
The released results of the EMERGE and ENGAGE trials were analyzed as in a meta-analysis. The idea is that if multiple experiments exist, it seems reasonable, from the Bayesian point of view, that the posterior odds from the first experiment can serve as the prior for the second experiment, and so on. Therefore, a meta-analytic extension of the BF was used, proposed by Rouder and Morey [5] and defined as follows:
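One common way to write such a meta-analytic BF is sketched below, assuming M independent trials that share a single effect size δ. Here f_ν(t; λ) denotes the noncentral t density with ν degrees of freedom and noncentrality parameter λ, n_i and ν_i = n_i - 1 are the sample size and degrees of freedom of trial i, and π(δ) is the prior on effect size. This notation, and the identification of this form with the expression used in the analysis, are assumptions made here.

```latex
\[
\mathrm{BF}_{10}(t_1, \ldots, t_M) =
\frac{\displaystyle \int \Bigl[ \prod_{i=1}^{M} f_{\nu_i}\!\bigl(t_i;\ \delta \sqrt{n_i}\bigr) \Bigr] \, \pi(\delta) \, d\delta}
     {\displaystyle \prod_{i=1}^{M} f_{\nu_i}\!\bigl(t_i;\ 0\bigr)}
\]
```

The denominator is the joint density of the observed t-values under the null (δ = 0), and the numerator averages the same joint density over the prior on δ. With M = 1 it reduces to a single-trial BF.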
RESULTS
All the Bayesian statistical analyses of the data were performed in JASP (version 0.92, jasp-stats.org), and the meta-analysis was performed with the BayesFactor package in R (http://cran.r-project.org/web/packages/BayesFactor/BayesFactor.pdf). Table 2 and Fig. 1 show the results of the BF analysis of the two trials.

A graphical representation of the Bayes Factor for low and high dose of the EMERGE and ENGAGE trials. Note that only the high dose condition of the EMERGE trial is in the direction of the alternative hypothesis (i.e., drug efficacy) but at an anecdotal strength of evidence.
Results of the Bayesian reanalysis of the EMERGE and the ENGAGE trials. Here, t are the t-values obtained from the p-values, and BF10 is the Bayes Factor of the comparison between the alternative hypothesis and the null
The results show that the only data with a BF value in favor of the alternative hypothesis (i.e., drug efficacy) is the high-dose condition in the EMERGE trial. However, the obtained BF falls within the range of values considered anecdotal, meaning a low level of evidence. An alternative way to interpret the results is to convert the BF into a posterior probability, which allows us to directly measure the strength of the evidence provided by the data. To do so, the following formulas can be used:
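The conversion can be sketched as follows, assuming that the two hypotheses are given equal prior probability (prior odds of 1). The function names and the `prior_h1` parameter are hypothetical and are included only to make that assumption explicit.

```python
def posterior_h1(bf10, prior_h1=0.5):
    """Posterior probability of H1 given BF10 and a prior probability of H1.

    prior_h1 = 0.5 (equal prior odds) is an assumption of this sketch.
    """
    prior_odds = prior_h1 / (1.0 - prior_h1)
    # Posterior odds are the prior odds multiplied by the Bayes factor.
    posterior_odds = bf10 * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


def posterior_h0(bf10, prior_h1=0.5):
    """Posterior probability of H0; the two posteriors sum to 1."""
    return 1.0 - posterior_h1(bf10, prior_h1)
```

Under equal prior odds this reduces to P(H1 | D) = BF10 / (1 + BF10) and P(H0 | D) = 1 / (1 + BF10), so a BF10 of 1 leaves the two hypotheses at 50% each.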
The obtained posterior probabilities are shown in Table 3.
The posterior probability (expressed in percentages) of the corresponding Bayes Factor for the EMERGE and ENGAGE trials
The results show that even in the high-dose condition of the EMERGE trial, the data support the hypothesis of drug efficacy, as measured through the CDR-SB, with a posterior probability of only 60%.
Another element that is evident from the data is the large difference between the two clinical trials in the evidence supporting the hypothesis of efficacy. For this reason, we calculated the meta-BF by combining the two trials. The results are shown in Table 4 and Fig. 2. Interestingly, the evidence for the efficacy of ADU in the high-dose condition decreases dramatically, with a posterior probability of only 22% that the drug is effective given the evidence.
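As a rough consistency check on these percentages, a posterior probability can be converted back into a Bayes factor. The calculation below assumes equal prior odds for the two hypotheses and uses the rounded percentages quoted in the text, so the values are approximate:

```latex
\[
\mathrm{BF}_{10} = \frac{P(H_1 \mid D)}{1 - P(H_1 \mid D)}, \qquad
\frac{0.60}{0.40} = 1.5, \qquad
\frac{0.22}{0.78} \approx 0.28 \quad (\mathrm{BF}_{01} \approx 3.5).
\]
```

On this reading, the 60% figure corresponds to a BF10 in the anecdotal range, while the 22% figure corresponds to a BF slightly above 3 in the direction of the null hypothesis.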
The meta-BF and the corresponding posterior probability of evidence

The posterior probability of evidence of the efficacy hypothesis for the two trials, either separated (blue cross) or combined (red cross).
DISCUSSION
This study has shown that the Bayesian framework provides important and valuable information about the strength of evidence for the efficacy of the high dose of ADU as a treatment for Alzheimer’s disease. In particular, we have shown that the evidence for the efficacy of the drug is very weak both in the individual trials and when the trials are combined. These results further highlight the ability of Bayesian methods to provide clearer evidence for the hypotheses under investigation than the NHST framework. Notably, the availability of software such as JASP and of other libraries in the scientific community has allowed Bayesian methods to overcome computational difficulties and provide results quickly and easily.
These results indicate that it is possible and necessary to adopt Bayesian methods in addition to the NHST framework to render the scientific evidence more meaningful and detailed, especially in the evaluation of the results of clinical trials.
DISCLOSURE STATEMENT
Authors’ disclosures available online (https://www.j-alz.com/manuscript-disclosures/22-0132r1).
