Retrospective Bayesian Evidence of Null Effect in Two Decades of Alzheimer’s Disease Clinical Trials

Abstract

Despite intense research on Alzheimer’s disease, no validated treatment able to reverse symptomatology or stop disease progression exists. A recent systematic review by Kim and colleagues evaluated possible reasons behind the failure of the majority of the clinical trials. As the focus was on methodological factors, no statistical trends were examined in detail. Here, we aim to complete this picture by means of Bayesian analysis. In particular, we tested whether the failure of those clinical trials was essentially due to insufficient statistical power or to the lack of a true effect. The strong Bayes Factors obtained supported the latter hypothesis.

Keywords

Alzheimer’s disease, Bayesian statistics, clinical trial, drug development

INTRODUCTION

Alzheimer’s disease (AD) is a chronic, multifactorial neurodegenerative disorder and the most common cause of dementia worldwide [1], with an estimated prevalence of 30 million people [2]. AD is characterized by extensive neuronal loss that tends to induce loss of memory and a more generalized cognitive decline [3]. Other clinical features include loss of bodily functions and psychological impairments, such as depression, aggression, and sleep disturbance.

Over the last two decades, a number of biomedical efforts have suggested that multiple factors are actively involved in the disorder, including the accumulation of amyloid-β plaques, formation of tau-protein neurofibrillary tangles, low levels of acetylcholine, and mitochondrial dysfunctions [4–6]. Despite the substantial availability of scientific literature and the constant research effort, no effective and validated treatment for patients with AD can to date reverse symptomatology or stop disease progression [7]. In a very recent study, Kim et al. [8] systematically reviewed eligible clinical trials for AD from the ClinicalTrials.gov database. They identified 98 interventional phase II and phase III trials for unique compounds, carried out between 2004 and 2021, and evaluated the specific reason behind the failure of each of them. The authors [8] elegantly demonstrated that the various methodological factors contributing to these clinical failures can be categorized into 1) insufficient evidence to initiate the pivotal trials and 2) pivotal trial design shortcomings. Although Kim et al. highlighted the relevance of a complementary investigation focused on the statistical features of the failed trials, this remained unexplored in their work [8]. We therefore aimed to complete the picture by means of a direct statistical evaluation of the efficacy of those trials.

In clinical trials, the concept of efficacy is founded on the decision about the existence or non-existence of a treatment effect. In statistical terms, this is usually verified by means of null hypothesis significance testing (NHST). Therefore, an effect is considered “positive” when the drug/compound under evaluation has a statistically significant effect (e.g., p < 0.05) on the primary endpoint compared to placebo. It is important to note that NHST is a frequentist formal approach potentially associated with a number of issues. Common criticisms include its sensitivity to sample size, error rates, and statistical power [9, 10]. In particular, the failure to find a statistically significant result is often interpreted as evidence that the effect does not exist at all. However, this is a misconception, as the NHST method does not allow inference to be drawn simultaneously for competing hypotheses (i.e., ‘the treatment showed an effect’ versus ‘the treatment showed no effect’) [11]. A correct interpretation only allows the conclusion that there is no evidence of an effect given the specific statistical parameters imposed (i.e., analysed sample size, selected p-value threshold). In recent years, several authors have suggested that NHST should no longer be the default statistical approach in the biomedical field. Different inferential methods should be adopted instead to correctly address specific research questions and overcome the biomedical replication crisis [12–15]. In particular, the adoption of Bayesian models may offer several practical and inferential advantages [16–23] in the specific context of clinical trials. A relevant one is the Bayes Factor (BF) [24], an alternative hypothesis testing technique that quantifies the relative support the data provide for two competing hypotheses. Here, we therefore used the BF approach to provide a statistical evaluation of the AD interventional phase II and III trials considered by Kim et al. [8]. In particular, we evaluated whether the failures of these clinical trials are essentially due to the lack of statistical power or to a true lack of effect. This offers valuable insight into the critical issues associated with previous attempts, while suggesting an improved approach to evaluating future clinical trials.

MATERIALS AND METHODS

Data

We systematically reviewed the 98 failed AD compounds included in the study of Kim et al. [8]. To be eligible for the subsequent analyses, clinical trials were retained if:

1) associated data were accessible through peer reviewed publications;

and

2) the Clinical Dementia Rating scale – sum of boxes (CDR-SB) had been used to evaluate cognitive and functional performance.

We decided to limit the analysis to a single cognitive test in order to avoid potential confounding effects due to methodological differences. Among those listed in Kim et al. [8], the CDR-SB was selected for consistency with the recent work by Costa and Cauda [23]. Based on these selective inclusion criteria, 11 interventional phase II and III clinical trials from 2004 to 2021 were further considered [25–35]. For each of them, the mean and standard error (SE) for the placebo condition and the clinical condition, and the mean difference between the two conditions and the total standard error were obtained.
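The extraction step can be illustrated with a minimal Python sketch. It assumes two independent arms, so that the SE of the difference is the square root of the sum of the squared arm SEs. The text does not state how the total SE was derived, and trials often report it directly. The function name and the input values are hypothetical. Since higher CDR-SB scores indicate greater impairment, the difference is taken as placebo minus treatment, so that a positive value means a benefit.

```python
import math

def difference_and_se(mean_placebo, se_placebo, mean_treated, se_treated):
    """Return (mean difference, SE of the difference) for two independent arms.

    Assumption: arms are independent, so variances of the means add.
    Sign convention (assumed): placebo minus treatment, positive = benefit.
    """
    diff = mean_placebo - mean_treated
    se_diff = math.sqrt(se_placebo ** 2 + se_treated ** 2)
    return diff, se_diff

# Hypothetical CDR-SB change-from-baseline summaries, not taken from any trial.
diff, se = difference_and_se(2.0, 0.3, 1.9, 0.4)
print(f"difference = {diff:.2f}, SE = {se:.2f}")
```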

Statistical methods

The statistical technique used here was the Bayes Factor. The BF makes it possible to determine the strength of the evidence for the null hypothesis H₀ with respect to an alternative hypothesis H₁. In this specific case, the BF can be conceptualized as follows: $\mathrm{BF}_{01} = \frac{H_{0}: \text{no difference between the placebo and the medicated group}}{H_{1}: \text{some difference between the placebo and the medicated group}}$

H₀ is therefore the model of no difference, while H₁ models the existence of an effect. The value of BF₀₁ can range between 0 and infinity. When BF₀₁ < 1 the evidence favors H₁; conversely, when BF₀₁ > 1 the evidence favors H₀. Crucially, unlike NHST, this approach fully models both hypotheses. In other terms, this means describing “what the data should look like when there is an effect” [36]. Therefore, a probability distribution, with a specified shape, can be used to express the plausibility of different effect sizes. Since H₀ hypothesizes no difference between groups, this can be modelled through a point-null hypothesis, meaning that 0 is predicted as the only plausible value. With regard to H₁, clinical trials usually expect the effect to have a specific direction (e.g., improved score in a given cognitive test). Consequently, a convenient way of representing H₁ is by means of a half-normal distribution centered on zero (see Fig. 1). This implies that small effects are expected to be more plausible than large effects.
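As a sketch of how these two models translate into a computable quantity, suppose that the observed mean difference $d$ is approximately normally distributed around the true effect $\delta$ with known standard error $s$, and that $\sigma$ denotes the standard deviation of the half-normal distribution. The normal likelihood and the symbols $d$, $\delta$, and $s$ are assumptions introduced here for illustration, not a description of the exact computation used in this work. Under these assumptions:

$$\mathrm{BF}_{01} = \frac{p(d \mid H_0)}{p(d \mid H_1)} = \frac{\mathcal{N}(d \mid 0, s^2)}{\int_0^\infty \mathcal{N}(d \mid \delta, s^2)\, \frac{2}{\sigma\sqrt{2\pi}}\, e^{-\delta^2/(2\sigma^2)}\, d\delta} = \frac{\mathcal{N}(d \mid 0, s^2)}{2\, \mathcal{N}(d \mid 0, \sigma^2 + s^2)\, \Phi\!\left(\frac{d\,\sigma}{s\sqrt{\sigma^2 + s^2}}\right)}$$

where $\mathcal{N}(x \mid m, v)$ is the normal density with mean $m$ and variance $v$, and $\Phi$ is the standard normal cumulative distribution function. For example, when $d = 0$ the expression reduces to $\sqrt{(\sigma^2 + s^2)/s^2}$, so a precisely estimated null difference yields a BF₀₁ well above 1.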

Fig. 1

The two probability distributions used to model the hypotheses for the computation of the BF.

The standard deviation (σ) of such a distribution is usually set depending on the scale of the expected effect. According to the literature, a suitable choice for the standard deviation is half of the maximum measured or estimated value [36]. In this specific case, the maximum value to be considered is the one that corresponds to passing from one class to another of the CDR-SB test. As this is 4 points, σ = 2 was assumed. However, since the value assigned to σ could have an influence on the results, an additional sensitivity analysis was performed by iteratively varying the standard deviation of the effect size (from 1 to 3) and each time calculating the BF.
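The following Python sketch shows one way to compute BF₀₁ for a point null against a half-normal alternative, and to repeat the computation over a range of σ values. It assumes a normal likelihood for the observed mean difference with known SE, for which the marginal likelihood under H₁ has a closed form. The function names, the input values, and the 0.5-point grid for σ are hypothetical, and this is not necessarily the software or the exact procedure used for the analyses reported here.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bf01_half_normal(diff, se, sigma):
    """BF01 for H0: effect = 0 versus H1: effect ~ half-normal(0, sigma).

    Assumptions: diff ~ Normal(effect, se^2); positive diff = benefit.
    """
    like_h0 = normal_pdf(diff, 0.0, se)
    total_var = sigma ** 2 + se ** 2
    # Mean and SD of the (untruncated) normal obtained by combining
    # the prior kernel with the likelihood.
    post_mean = diff * sigma ** 2 / total_var
    post_sd = math.sqrt(sigma ** 2 * se ** 2 / total_var)
    # Marginal likelihood under H1: the factor 2 comes from the half-normal
    # density, the CDF term from restricting the effect to values >= 0.
    like_h1 = (2.0 * normal_pdf(diff, 0.0, math.sqrt(total_var))
               * normal_cdf(post_mean / post_sd))
    return like_h0 / like_h1

# Hypothetical mean difference and SE, not taken from any trial.
diff, se = 0.1, 0.3
for sigma in (1.0, 1.5, 2.0, 2.5, 3.0):  # sensitivity grid (assumed step)
    print(f"sigma = {sigma:.1f}  BF01 = {bf01_half_normal(diff, se, sigma):.2f}")
```

With these hypothetical inputs, BF₀₁ stays above 1 over the whole grid, which is the kind of stability the sensitivity analysis is meant to check.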

In light of the methodological differences among the 11 considered trials (e.g., group design, administered dosages, type of drug), the BF was calculated separately for each one, without merging results. However, the BF values obtained can be directly compared, providing an overview of the actual effectiveness of the treatments tested in the considered clinical trials.

RESULTS AND DISCUSSION

For each of the 17 endpoints considered, the BF₀₁ showed stronger evidence for H₀ (Table 1).

Table 1

The obtained Bayes Factor (BF) values

Authors	Year	Tested compound	BF
Doody et al. [26]	2014	Solanezumab	6
Doody et al. [26]	2014	Solanezumab	21
Siemers et al. [28]	2016	Solanezumab	20
Honig et al. [32]	2018	Solanezumab	77
Ostrowitzki et al. [30]	2017	Gantenerumab	6
Ostrowitzki et al. [30]	2017	Gantenerumab	4
Wessels et al. [35]	2020	Lanabecestat	6
Wessels et al. [35]	2020	Lanabecestat	4
Egan et al. [31]	2018	Verubecestat	12
Egan et al. [31]	2018	Verubecestat	12
Egan et al. [34]	2019	Verubecestat	150
Lawlor et al. [33]	2018	Nilvadipine	14
Doody et al. [25]	2013	Semagacestat	25
Doody et al. [25]	2013	Semagacestat	36
Decourt et al. [29]	2017	Thalidomide	8
Salloway et al. [27]	2014	Bapineuzumab	4

This means that in no case was the hypothesis of a real effect more plausible than that of no effect. The highest value, BF₀₁ = 150, was obtained for the 2019 trial on Verubecestat [34]. Of note, according to Kass and Raftery [37], a BF greater than 20 indicates strong evidence. Overall, the probability density showed that most of the values are concentrated around an average BF₀₁ of 24.26, with a standard error (SE) of 8.9 (see Fig. 2). Therefore, on average, the null hypothesis (i.e., there is no real effect) was 24 times more likely than the alternative hypothesis (i.e., a difference exists between the groups). In terms of probability, and assuming equal prior odds for the two hypotheses, this corresponds to a probability of about 96% that the treatment induced no effect, and that this is not due to design factors or sample size.
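The step from a Bayes Factor to a probability can be made explicit with a short Python sketch. It assumes equal prior odds for H₀ and H₁ by default, an assumption introduced here to reproduce the 96% figure. The function name is hypothetical.

```python
def posterior_prob_h0(bf01, prior_odds_h0=1.0):
    """Posterior probability of H0 given BF01 and the prior odds of H0 vs H1.

    Posterior odds = BF01 * prior odds; probability = odds / (1 + odds).
    """
    posterior_odds = bf01 * prior_odds_h0
    return posterior_odds / (1.0 + posterior_odds)

# Average BF01 reported in the text, with equal prior odds (assumed).
print(round(posterior_prob_h0(24.26), 2))  # 0.96
```

Different prior odds would shift this probability, while the BF itself, which depends only on the data and the two models, would stay the same.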

Fig. 2

Density distribution of the Bayes Factor based on the CDR-SB test results reported in the investigated clinical trials. Note that, despite the binning, no values fell between 0 and 1.

Interestingly, the original results of the Verubecestat trial [34] reporting the inefficacy of the compound were very robust, as shown by their very small variance. The Bayesian analysis converged, giving a high BF that confirmed the strength of the evidence. Nevertheless, the information given by these results is different and somewhat complementary. The original study, based on the frequentist approach [34], tells us that the effect of Verubecestat was negligible. The BF indicates instead that, given the data collected during the trial, the result is 150 times more plausible under the assumption that Verubecestat is indeed ineffectual than under the hypothesis that it is effective but the trial failed to detect its effect. The same logic can be extended to the other clinical trials considered. Since the Bayesian approach directly tests the effect, it is unaffected by the parameters that must be defined in the NHST framework. Therefore, while in frequentist analyses larger samples are more likely to produce significant results, and the statistical significance of the results primarily depends on the specific p-value threshold selected, these features do not affect the BF. Moreover, our analyses showed that the absence of a real effect is highly plausible across all the studies considered, despite differences in cohorts, sample size, tested compounds, and level of significance. We can therefore reasonably conclude that the failure of previous clinical trials was not due to methodological aspects, but to a real inefficacy of the treatment.

Finally, the sensitivity analysis, performed to test the possible influence of the value assigned to the standard deviation of the effect size, showed that the results remained very stable. The probability of no effect ranged between 93% and 97%. This indicates that the evidence for the null hypothesis is very strong.

Although the focus of this research is on statistical aspects, it is worth mentioning that the repeated failure of clinical trials is likely to be related to the lack of a clear etiopathological hypothesis. While the amyloid hypothesis has been followed for decades, it has been repeatedly challenged [8, 39]. Therefore, it might not be a coincidence that all the failed compounds considered here but one (Nilvadipine) have a mechanism of action ascribable to the amyloid hypothesis.

CONCLUSION

In this short communication, a Bayesian framework was adopted to analyze the results of 11 failed clinical trials which tested different treatments for AD. The results showed that the absence of a real effect (i.e., H₀) is highly plausible across all the studies considered, despite differences in cohorts, sample size, tested compounds, and level of significance. We can therefore reasonably conclude that the failure of previous clinical trials was not due to methodological aspects, but to a real inefficacy of the treatment. This evidence gives a possible negative answer to the question posed by Kim et al. [8], namely whether following more rational drug development principles could improve the success rate of clinical trials. On the other hand, a more common use of the Bayesian framework in biomedicine can improve the way we approach research hypotheses, possibly improving the insight obtained from clinical trials.

DISCLOSURE STATEMENT

Authors’ disclosures available online (https://www.j-alz.com/manuscript-disclosures/22-0942r1).

References

1. Barker, Luis, Kashuba, Luis, Harwood, Loewenstein, Waters, Jimison, Shepherd, Sevush, Graff-Radford, Newland, Todd, Miller, Gold, Heilman, Doty, Goodman, Robinson, Pearl, Dickson, Duara (2002) Relative frequencies of Alzheimer disease, Lewy body, vascular and frontotemporal dementia, and hippocampal sclerosis in the State of Florida Brain Bank. Alzheimer Dis Assoc Disord 16, 203–212.

2. Haque, Levey (2019) Alzheimer’s disease: A clinical perspective and future nonhuman primate research opportunities. Proc Natl Acad Sci U S A 116, 26224–26229.

3. Lane, Hardy, Schott (2018) Alzheimer’s disease. Eur J Neurol 25, 59–70.

4. Hardy (2017) The discovery of Alzheimer-causing mutations in the APP gene and the formulation of the “amyloid cascade hypothesis”. FEBS J 284, 1040–1044.

5. Wilkins, Swerdlow (2016) Relationships between mitochondria and neuroinflammation: Implications for Alzheimer’s disease. Curr Top Med Chem 16, 849–857.

6. Selkoe, Hardy (2016) The amyloid hypothesis of Alzheimer’s disease at 25 years. EMBO Mol Med 8, 595–608.

7. Catania, Giaccone, Salmona, Tagliavini, Di Fede (2019) Dreaming of a new world where Alzheimer’s is a treatable disorder. Front Aging Neurosci 11, 317.

8. Kim, Lee, Ong, Gold, Kalali, Sarkar (2022) Alzheimer’s disease: Key insights from two decades of clinical trial failures. J Alzheimers Dis 87, 83–100.

9. Jaynes (2003) Probability theory: The logic of science, Cambridge University Press.

10. Levine, Weber, Hullett, Park, Lindsey LLM (2008) A critical assessment of null hypothesis significance testing in quantitative communication research. Hum Commun Res 34, 171–187.

11. Goodman (2005) Introduction to Bayesian methods I: Measuring the strength of evidence. Clin Trials 2, 282–290; discussion 301–304, 364–378.

12. Szucs, Ioannidis JPA (2017) When null hypothesis significance testing is unsuitable for research: A reassessment. Front Hum Neurosci 11, 390.

13. Ioannidis (2005) Why most published research findings are false. PLoS Med 2, e124.

14. Ioannidis, Greenland, Hlatky, Khoury, Macleod, Moher, Schulz, Tibshirani (2014) Increasing value and reducing waste in research design, conduct, and analysis. Lancet 383, 166–175.

15. Costa, Manuello, Ferraro, Liloia, Nani, Fox, Lancaster, Cauda (2021) BACON: A tool for reverse inference in brain activation and alteration. Hum Brain Mapp 42, 3343–3351.

16. Dunson (2001) Commentary: Practical advantages of Bayesian analysis of epidemiologic data. Am J Epidemiol 153, 1222–1226.

17. Gurrin, Kurinczuk, Burton (2000) Bayesian statistics in medical research: An intuitive alternative to conventional data analysis. J Eval Clin Pract 6, 193–204.

18. Lopes, Müller, Ravishanker (2007) Bayesian Computational Methods in Biomedical Research, University of Connecticut, Department of Statistics.

19. Gupta (2012) Use of Bayesian statistics in drug development: Advantages and challenges. Int J Appl Basic Med Res 2, 3–6.

20. Cauda, Nani, Liloia, Manuello, Premi, Duca, Fox, Costa (2020) Finding specificity in structural brain alterations through Bayesian reverse inference. Hum Brain Mapp 41, 4155–4172.

21. Ferreira, Barthoulot, Pottecher, Torp, Diemunsch, Meyer (2020) Theory and practical use of Bayesian methods in interpreting clinical trial data: A narrative review. Br J Anaesth 125, 201–207.

22. Kelter (2020) Bayesian alternatives to null hypothesis significance testing in biomedical research: A non-technical introduction to Bayesian inference with JASP. BMC Med Res Methodol 20, 142.

23. Costa, Cauda (2022) A Bayesian reanalysis of the phase III aducanumab (ADU) trial. J Alzheimers Dis 87, 1009–1012.

24. Jeffreys (1961) The theory of probability, Clarendon, Oxford.

25. Doody, Raman, Farlow, Iwatsubo, Vellas, Joffe, Kieburtz, He, Sun, Thomas, Aisen, Siemers, Sethuraman, Mohs (2013) A phase 3 trial of semagacestat for treatment of Alzheimer’s disease. N Engl J Med 369, 341–350.

26. Doody, Thomas, Farlow, Iwatsubo, Vellas, Joffe, Kieburtz, Raman, Sun, Aisen, Siemers, Liu-Seifert, Mohs (2014) Phase 3 trials of solanezumab for mild-to-moderate Alzheimer’s disease. N Engl J Med 370, 311–321.

27. Salloway, Sperling, Fox, Blennow, Klunk, Raskind, Sabbagh, Honig, Porsteinsson, Ferris, Reichert, Ketter, Nejadnik, Guenzler, Miloslavsky, Wang, Lu, Lull, Tudor, Liu, Grundman, Yuen, Black, Brashear (2014) Two phase 3 trials of bapineuzumab in mild-to-moderate Alzheimer’s disease. N Engl J Med 370, 322–333.

28. Siemers, Sundell, Carlson, Case, Sethuraman, Liu-Seifert, Dowsett, Pontecorvo, Dean, Demattos (2016) Phase 3 solanezumab trials: Secondary outcomes in mild Alzheimer’s disease patients. Alzheimers Dement 12, 110–120.

29. Decourt, Drumm-Gurnee, Wilson, Jacobson, Belden, Sirrel, Ahmadi, Shill, Powell, Walker, Gonzales, Macias, Sabbagh (2017) Poor safety and tolerability hamper reaching a potentially therapeutic dose in the use of thalidomide for Alzheimer’s disease: Results from a double-blind, placebo-controlled trial. Curr Alzheimer Res 14, 403–411.

30. Ostrowitzki, Lasser, Dorflinger, Scheltens, Barkhof, Nikolcheva, Ashford, Retout, Hofmann, Delmar, Klein, Andjelkovic, Dubois, Boada, Blennow, Santarelli, Fontoura (2017) A phase III randomized trial of gantenerumab in prodromal Alzheimer’s disease. Alzheimers Res Ther 9, 95.

31. Egan, Kost, Tariot, Aisen, Cummings, Vellas, Sur, Mukai, Voss, Furtek, Mahoney, Harper Mozley, Vandenberghe, Mo, Michelson (2018) Randomized trial of verubecestat for mild-to-moderate Alzheimer’s disease. N Engl J Med 378, 1691–1703.

32. Honig, Vellas, Woodward, Boada, Bullock, Borrie, Hager, Andreasen, Scarpini, Liu-Seifert, Case, Dean, Hake, Sundell, Poole Hoffmann, Carlson, Khanna, Mintun, DeMattos, Selzler, Siemers (2018) Trial of solanezumab for mild dementia due to Alzheimer’s disease. N Engl J Med 378, 321–330.

33. Lawlor, Segurado, Kennelly, Olde Rikkert MGM, Howard, Pasquier, Börjesson-Hanson, Tsolaki, Lucca, Molloy, Coen, Riepe, Kálmán, Kenny, Cregg, O’Dwyer, Walsh, Adams, Banzi, Breuilh, Daly, Hendrix, Aisen, Gaynor, Sheikhi, Taekema, Verhey, Nemni, Nobili, Franceschi, Frisoni, Zanetti, Konsta, Anastasios, Nenopoulou, Tsolaki-Tagaraki, Pakaski, Dereeper, de la Sayette, Sénéchal, Lavenu, Devendeville, Calais, Crawford, Mullan (2018) Nilvadipine in mild to moderate Alzheimer disease: A randomised controlled trial. PLoS Med 15, e1002660.

34. Egan, Kost, Voss, Mukai, Aisen, Cummings, Tariot, Vellas, van Dyck, Boada, Zhang, Li, Furtek, Mahoney, Harper Mozley, Mo, Sur, Michelson (2019) Randomized trial of verubecestat for prodromal Alzheimer’s disease. N Engl J Med 380, 1408–1420.

35. Wessels, Tariot, Zimmer, Selzler, Bragg, Andersen, Landry, Krull, Downing, Willis, Shcherbinin, Mullen, Barker, Schumi, Shering, Matthews, Stern, Vellas, Cohen, MacSweeney, Boada, Sims (2020) Efficacy and safety of lanabecestat for treatment of early and mild Alzheimer disease: The AMARANTH and DAYBREAK-ALZ Randomized Clinical Trials. JAMA Neurol 77, 199–209.

36. Lakens, McLatchie, Isager, Scheel, Dienes (2018) Improving inferences about null effects with Bayes factors and equivalence tests. J Gerontol B Psychol Sci Soc Sci 75, 45–57.

37. Kass, Raftery (1995) Bayes factors. J Am Stat Assoc 90, 773–795.

38. Harrison, Owen (2016) Alzheimer’s disease: The amyloid hypothesis on trial. Br J Psychiatry 208, 1–3.

39. Drachman (2014) The amyloid hypothesis, time to move on: Amyloid is the downstream result, not cause, of Alzheimer’s disease. Alzheimers Dement 10, 372–380.