Abstract
Background:
Many post hoc analyses of clinical trials in Alzheimer’s disease (AD) and mild cognitive impairment (MCI) are conducted in small Phase 2 trials. Subject heterogeneity may lead to statistically significant post hoc results that cannot be replicated in larger follow-up studies.
Objective:
We investigated the extent of this problem using simulation studies mimicking current trial methods with post hoc analyses based on ApoE4 carrier status.
Methods:
We used a meta-database of 24 studies, including 3,574 subjects with mild AD and 1,171 subjects with MCI/prodromal AD, to simulate clinical trial scenarios. Post hoc analyses examined if rates of progression on the Alzheimer’s Disease Assessment Scale-cognitive (ADAS-cog) differed between ApoE4 carriers and non-carriers.
Results:
Across studies, ApoE4 carriers were younger and had lower baseline scores, greater rates of progression, and greater variability on the ADAS-cog. Up to 18% of post hoc analyses for 18-month trials in AD showed greater rates of progression for ApoE4 non-carriers that were statistically significant but unlikely to be confirmed in follow-up studies. The frequency of erroneous conclusions dropped below 3% with trials of 100 subjects per arm. In MCI, rates of statistically significant differences with greater progression in ApoE4 non-carriers remained below 3% unless sample sizes were below 25 subjects per arm.
Conclusions:
Statistically significant differences for ApoE4 in post hoc analyses often reflect heterogeneity among small samples rather than true differential effect among ApoE4 subtypes. Such analyses must be viewed cautiously. ApoE genotype should be incorporated into the design stage to minimize erroneous conclusions.
INTRODUCTION
As the number of failed trials of potential disease-modifying agents for Alzheimer’s disease (AD) and mild cognitive impairment (MCI) has grown, post hoc subgroup analyses are often conducted to identify groups who are more likely to respond to the therapy or to show disease progression [1]. The apolipoprotein E ɛ4 (APOE ɛ4) genotype is one of the most commonly collected biomarkers in clinical trials and the biomarker on which post hoc analyses are most often performed, as it is the major genetic risk factor for AD, associated with both increased risk for developing AD and relatively earlier age of onset of late-onset AD [2]. It is also associated with increased risk for developing MCI and progression from MCI to AD [3]. Larger, population-based studies have generally shown an association between APOE ɛ4 and more rapid progression of cognitive decline in AD [4].
Experts have advocated enrichment of clinical trial samples in AD and MCI by considering APOE genotype status [5]. The efficacy of therapeutic agents for AD and MCI is initially demonstrated in smaller Phase 2 ‘proof of concept’ clinical trials, which typically do not incorporate analyses based on APOE genotype into the design stage. Rather, APOE subgroup analyses are conducted post hoc, and suggestions of differential effects based on APOE genotype may be used to guide inclusion criteria for larger Phase 3 trials. Such post hoc analyses may have lower power to detect treatment effects due to reduced sample sizes among subgroups and the need to correct for multiple testing [6]. However, post hoc analyses of Phase 2 data may also incorrectly conclude that treatment effects exist for several reasons, including imbalance in subject characteristics between subgroups (particularly for small samples); unrecognized interactions between subject characteristics and treatment; and variances among subgroups in the Phase 2 trial that are markedly different from the variances in the Phase 3 trial, as the former may deviate more from the population values than the latter due to its smaller sample size [6–8]. Furthermore, the course of cognitive decline in AD is highly variable among individuals, even after accounting for APOE genotype [9]. This raises the possibility that post hoc analyses of clinical trials based on APOE genotype could lead to erroneous conclusions, though the hazards of such analyses are often underappreciated [10].
At least two phase 3 clinical trials using APOE as a selection criterion have been undertaken based on post hoc analysis of phase 2 trials. Post hoc analysis of phase 2 data for rosiglitazone showed a treatment effect in APOE ɛ4 noncarriers on the Alzheimer’s Disease Assessment Scale-cognitive (ADAS-cog) at the highest treatment doses, but no effect in the carriers [11]. A stratified randomization of APOE ɛ4 carriers and noncarriers was used in the follow-up phase 3 trial, but results were ultimately negative [12]. Separate trials of APOE ɛ4 carriers and noncarriers were undertaken in a phase 3 trial for the amyloid antibody bapineuzumab based on post hoc analysis of phase 2 trial data suggesting that the latter showed a treatment effect on the ADAS-cog while the former did not [13]. However, results of both trials were also negative [14]. Finally, post hoc analysis of phase 3 data for intravenous gammaglobulin (IVIG) showed a treatment effect in APOE ɛ4 carriers, but not noncarriers, on the ADAS-cog at the higher dosage [15]. Further trials of IVIG have been discontinued, and it is unclear if a trial based on APOE status will be undertaken.
Stone et al. [4] examined two industry-sponsored trials (Merck Protocol 091/rofecoxib [16] and Protocol 030/MK-677 growth hormone secretagogue [17]) and found that post hoc analyses based on APOE genotype showed more rapid progression with the ɛ4 allele, but warned that smaller trials (<100 APOE ɛ4 subjects per arm) should be interpreted with caution. Although the general assertion that increasing sample size reduces the probability of false positive results is unquestioned, it is not clear that firm conclusions regarding necessary sample sizes can be reached from the analysis of only two trials. Clinical trials in MCI and AD present many challenges for statistical analysis, with a complex pattern of dropouts and missing data [18, 19]. The primary outcome instrument, the ADAS-cog, has several psychometric shortcomings [20, 21] and only modest interrater reliability [22]. Such factors make extrapolation based on treatment effects from a small number of studies potentially problematic. To address this concern, we empirically tested the recommendation of Stone et al. [4] by simulating clinical trial scenarios of MCI or AD patients across a broad range of sample sizes, using a recently developed meta-database of studies from the Alzheimer’s Disease Cooperative Study (ADCS) [23], Alzheimer’s Disease Neuroimaging Initiative (ADNI) [24], and Coalition Against Major Diseases (CAMD) [25].
MATERIALS AND METHODS
Study overview and participants
Participants for simulations were drawn from a meta-database consisting of 23 ADCS and CAMD studies and ADNI, representing both clinical trials and observational studies in AD, MCI, and normal individuals (National Institutes of Health grant R01 AG037561) [26]. The primary outcome measure was the ADAS-cog [27], which evaluates memory, reasoning, orientation, praxis, language, and word finding difficulty, and is scored from 0 to 70 errors. Of the available AD and MCI studies, 15 had both ADAS-cog ratings and APOE genotyping (see Table 1). Clinical assessments were done at 6-month intervals over the first 2 years.
For inclusion, all studies had to base the diagnosis of AD on NINCDS-ADRDA criteria [28], with the additional requirement of minimal severity based on clinical ratings in the ADCS studies. These were Mini-Mental State Examination (MMSE) [29] scores between 14 and 26 (DHA, HC), 12 and 28 (CE), 12 and 26 (LL), 13 and 26 (PR), and 12 and 20 (VN) (i.e., mild to moderate severity across trials). Diagnosis of MCI required a CDR score of 0.5 with the memory box scored at 0.5 or greater, and delayed recall from the Logical Memory II subscale of the Wechsler Memory Scale–Revised [30] of ≤8 for 16 or more years of education, ≤4 for 8–15 years, or ≤2 for 0–7 years. Patients had to be largely intact with regard to general cognition and functional performance, and could not qualify for a dementia diagnosis. Participants with AD or MCI in most of the trials analyzed could continue using marketed anti-dementia drugs if they had been on stable doses prior to entry; such participants were not excluded from the simulations.
Simulation methods
Simulations were conducted under a detailed protocol [31], similar to our previously published approach [26, 32], to reflect the placebo arm of typical clinical trials of an experimental drug for amnestic MCI or AD with one treatment and placebo group, 1:1 allocation ratio, and parameters selected to be consistent with previously published trials [33, 34] and ADNI. This simulation approach allows for construction of clinical trials that are not identical to the original studies, but are expected to reflect the subject composition of future clinical trials as well.
For each trial scenario, a separate set of subjects was constructed by randomly choosing from the meta-database with replacement, i.e., subjects from the dataset could be present in the simulated groups more than once. Sample sizes of 25 to 300 per group were used; trial durations of 6, 12, 18, and 24 months were considered; and the ADAS-cog was the primary outcome. Both the placebo group and treatment group from the original trials were included in simulating placebo subjects, as no treatment effect was found in the original trials. To be consistent with the current practice of examining APOE status post hoc, stratified sampling within APOE categories was not performed. The outcome was the score for the subject at the specified time point in the meta-database, with random error of mean 0 and standard deviation 1 added to minimize ties. The outcome was then rounded to the nearest 1/3 point to yield plausible ADAS-cog scores. Dropout rates of 20% and 40% were incorporated into the scenarios.
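The resampling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the study (which was run in R): the pool of scores is hypothetical, the function name `simulate_arm` is ours, and dropout is assumed to occur completely at random, which the text does not specify.

```python
import random

def simulate_arm(pool, n_per_arm, dropout_rate, rng):
    """Draw one simulated arm from `pool`, a list of observed outcome scores."""
    arm = []
    for _ in range(n_per_arm):
        score = rng.choice(pool)              # sampling with replacement
        noisy = score + rng.gauss(0.0, 1.0)   # error with mean 0, SD 1 to reduce ties
        rounded = round(noisy * 3) / 3        # round to the nearest 1/3 point
        # Assumption: dropout is completely at random; a dropout is stored as None.
        arm.append(None if rng.random() < dropout_rate else rounded)
    return arm

rng = random.Random(2013)                        # fixed seed for reproducibility
pool = [14.33, 18.0, 22.33, 25.67, 27.0, 30.0]   # hypothetical ADAS-cog scores
arm = simulate_arm(pool, n_per_arm=25, dropout_rate=0.20, rng=rng)
```

Because subjects are drawn with replacement, the same score can appear several times in one simulated arm, which is why the small random error is added before rounding.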
Patients were classified as APOE ɛ4 carriers if they had one or more copies of the APOE ɛ4 allele, i.e., having genotypes of ɛ2/ɛ4, ɛ3/ɛ4, or ɛ4/ɛ4; and APOE ɛ4 noncarriers if they had no copies of the APOE ɛ4 allele, i.e., having genotypes of ɛ2/ɛ2, ɛ2/ɛ3, or ɛ3/ɛ3. As the effect of APOE was unaltered in the simulations, greater progression would be expected in ɛ4 carriers, consistent with population-based studies. Three sensitivity analyses were performed. The first excluded individuals with an APOE ɛ2 genotype, as there is less consensus on the effects of APOE ɛ2 on progression, so that APOE ɛ4 carriers would have genotypes of ɛ3/ɛ4 or ɛ4/ɛ4 and APOE ɛ4 noncarriers would have a genotype of ɛ3/ɛ3. The second excluded subjects in the treatment arm of original clinical trials in the meta-database. The third excluded subjects belonging to minority groups, as these constituted only 5% of the overall sample and may differ in rates of progression.
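The carrier classification, including the ɛ2 exclusion used in one sensitivity analysis, amounts to a simple rule. The sketch below is illustrative only: the function name `classify_apoe`, the ASCII genotype strings such as "e3/e4", and the `exclude_e2` flag are assumptions made for this example.

```python
def classify_apoe(genotype, exclude_e2=False):
    """Return 'carrier', 'noncarrier', or None (excluded) for a genotype like 'e3/e4'."""
    alleles = genotype.lower().split("/")
    if len(alleles) != 2 or any(a not in {"e2", "e3", "e4"} for a in alleles):
        raise ValueError(f"unrecognized genotype: {genotype!r}")
    # Sensitivity analysis: drop anyone with an e2 allele.
    if exclude_e2 and "e2" in alleles:
        return None
    # One or more e4 alleles makes the subject a carrier.
    return "carrier" if "e4" in alleles else "noncarrier"
```

With `exclude_e2=True`, only ɛ3/ɛ4 and ɛ4/ɛ4 remain as carriers and only ɛ3/ɛ3 remains as a noncarrier, matching the sensitivity analysis described above.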
Statistical analysis
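The primary analyses were conducted using a mixed effects linear model (random coefficients model) [35], which adjusts for missing data, to test for differences in the slopes (rate of change) of the ADAS-cog between APOE ɛ4 carriers and noncarriers. The mixed effects model was employed as it utilizes data from all participants (rather than just completers), minimizes bias, and better controls for Type I error in the presence of missing data [36]. For each simulated trial, a full model was constructed with group effect (APOE ɛ4 carriers or noncarriers), visit effect, and group by visit interactions, with age as a covariate, and a reduced model with visit and age effects. Thus, for participant $i = 1, 2, \ldots, n$ at visit $j = 1, 2, \ldots, n_i$, the full model was

$$Y_{ij} = \beta_0 + \beta_1\,\mathrm{group}_i + \beta_2\,\mathrm{visit}_{ij} + \beta_3\,(\mathrm{group}_i \times \mathrm{visit}_{ij}) + \beta_4\,\mathrm{age}_i + b_{0i} + b_{1i}\,\mathrm{visit}_{ij} + \varepsilon_{ij}$$

and the reduced model was

$$Y_{ij} = \beta_0 + \beta_2\,\mathrm{visit}_{ij} + \beta_4\,\mathrm{age}_i + b_{0i} + b_{1i}\,\mathrm{visit}_{ij} + \varepsilon_{ij}$$

where $Y_{ij}$ is the ADAS-cog score, $\mathrm{group}_i$ indicates APOE ɛ4 carrier status, $\mathrm{visit}_{ij}$ is the time of the visit, $\mathrm{age}_i$ is the age of the participant, $b_{0i}$ and $b_{1i}$ are the participant-specific random intercept and slope, and $\varepsilon_{ij}$ is the residual error.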
Ten thousand simulations were done for each scenario so that estimates of power could be obtained to 4 digits. Power was calculated as the proportion of the 10,000 simulated trials per trial scenario having a p-value ≤ 0.05 for the mixed model analysis and a one-sided p-value ≤ 0.025 for the Wilcoxon test. Analyses were performed using version 2.15.3 of the R programming environment [38]. Mixed model analyses were performed using version 3.1-89 of the nlme package for R [39].
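The power calculation is a simple proportion, and its Monte Carlo precision follows from the binomial distribution. The sketch below is an illustration rather than the study code, and the function names `estimated_power` and `worst_case_se` are ours. With 10,000 simulations the proportion is resolved in steps of 1/10,000, while its Monte Carlo standard error is at most 0.005, reached when the true proportion is 0.5.

```python
import math

def estimated_power(p_values, alpha=0.05):
    """Proportion of simulated trials with p <= alpha, plus its Monte Carlo SE."""
    n = len(p_values)
    hits = sum(1 for p in p_values if p <= alpha)
    power = hits / n
    se = math.sqrt(power * (1.0 - power) / n)   # binomial standard error
    return power, se

def worst_case_se(n_sims):
    """Largest possible Monte Carlo SE of a proportion, attained at 0.5."""
    return math.sqrt(0.25 / n_sims)

# Toy input: four hypothetical p-values, two of which are <= 0.05.
power, se = estimated_power([0.01, 0.20, 0.05, 0.60])
```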
The rate of progression in APOE ɛ4 carriers was estimated by combining the 13 AD studies used in the simulations using random effects meta-regression [40]. For each study, a mixed effects model was constructed as above to estimate the rate of progression in APOE ɛ4 carriers relative to APOE ɛ4 noncarriers. Estimates were then combined using a random effects meta-regression model, with weights inversely proportional to the standard error for each study. A random effects model was chosen over a fixed effects model as the former provides better estimates of the overall effect when study heterogeneity is present and allows generalization of results to a wider population [41]. Funnel plots were used to assess systematic reporting bias in studies [42], and study heterogeneity was assessed using Cochran’s Q [43]. All meta-analyses were performed using the metafor package in R [44].
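To make the pooling step concrete, the following Python sketch combines per-study estimates under a random effects model and computes Cochran's Q. It is not the metafor analysis reported here. It assumes the DerSimonian-Laird estimator of between-study variance (metafor's default estimator is restricted maximum likelihood) and conventional inverse-variance weights, and the function name `random_effects_pool` is ours.

```python
import math

def random_effects_pool(estimates, std_errors):
    """Pool study estimates; returns (pooled, pooled_se, cochran_q, tau2)."""
    variances = [se ** 2 for se in std_errors]
    w = [1.0 / v for v in variances]                  # fixed effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    df = len(estimates) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # DerSimonian-Laird between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]    # random effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    pooled_se = math.sqrt(1.0 / sum(w_star))
    return pooled, pooled_se, q, tau2
```

When the study estimates agree closely, Q falls below its degrees of freedom, the between-study variance is truncated to zero, and the result coincides with the fixed effects estimate.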
Standard protocol approvals, registrations, and patient consents
All study procedures were approved by local institutional review boards. The analyses for this study were exempted from informed consent requirements by the IRB.
RESULTS
Patient characteristics
Across studies, 60% of AD subjects and 54% of MCI subjects were APOE ɛ4 carriers (Tables 2 and 3). For both AD and MCI, APOE ɛ4 carriers and noncarriers were similar on most demographic and clinical characteristics, being predominantly Caucasian, married, and highly educated. Slightly more than half of MCI participants were males, while slightly more than half of AD participants were females. APOE ɛ4 noncarriers were older than carriers in both AD and MCI groups.
For both MCI and AD groups, APOE ɛ4 carriers had worse performance on the ADAS-cog at baseline and all subsequent time points. The variability (standard deviation) was also higher for APOE ɛ4 carriers compared to noncarriers.
Rates of progression
Meta-regression of AD studies in the meta-database showed more rapid progression among APOE ɛ4 carriers of approximately 0.582 points/year on the ADAS-cog (p = 0.18), though the rate varied among studies (Supplementary Figure 1). The test of heterogeneity was not significant (p = 0.660), but Cochran’s Q can often fail to achieve significance with a small number of studies. Funnel plots suggested that less precise studies may be likely to favor more rapid progression in APOE ɛ4 noncarriers, though no significant evidence of publication bias was seen (Supplementary Figure 2).
Outcomes
The percentage of simulated trials with greater mean progression in APOE ɛ4 noncarriers (running counter to results from population-based studies and likely representing erroneous findings) is shown across a range of sample sizes and trial durations (Fig. 1). Up to 25% of 24-month MCI trials and 37% of 18-month AD trials had a greater rate of progression in noncarriers, although many of these differences were not statistically significant. The percentage of simulated trials in which the greater rate of progression in APOE ɛ4 noncarriers achieved statistical significance is shown in Fig. 1(b) and (d). AD trials had a substantial proportion of statistically significant results with greater progression in APOE ɛ4 noncarriers; for 18-month trials, the proportion was 21% with 25 subjects in the placebo arm, 8% with 100 placebo subjects, and less than 2% with 225 placebo subjects. As expected, the proportion of significant results with greater progression in APOE ɛ4 noncarriers decreased with increasing sample size, indicating that trial results contradicting population-based studies are due to small-sample characteristics that differ from the population. The proportion of significant results with greater progression in APOE ɛ4 noncarriers also decreased with increasing trial duration, although larger decreases were seen with increasing sample size. MCI trials had more favorable results, with only the smallest trials exhibiting a statistically significant result with greater progression among APOE ɛ4 noncarriers; the proportion was 6% for trials with 25 placebo subjects and less than 1% for trials with 75 or more placebo subjects (Fig. 1).
Analyses conducted using nonparametric Wilcoxon tests were similar to the primary analyses (data not shown), as were analyses with 20% dropouts (Fig. 2). Sensitivity analyses excluding APOE ɛ2 genotypes, minority participants, and individuals in the treatment arm of the original trials in the meta-database did not show any meaningful difference in results, although the percentage of 18-month AD trials that showed greater progression in APOE ɛ4 noncarriers was higher when the treatment arm of the original trials was excluded (Supplementary Figures 3–5).
DISCUSSION
This study provides an empirical evaluation of the use of post hoc analyses based on APOE genotype carrier status in AD and MCI clinical trials. As expected, most post hoc analyses did not achieve statistical significance for the effect of APOE genotype, particularly in smaller trials. However, for AD trials that did demonstrate a statistically significant difference in progression based on APOE genotype, a substantial proportion (up to 21% of trials with 18-month durations) had results that showed more rapid progression in APOE ɛ4 noncarriers, despite considerable evidence from population-based studies that carriers should have more rapid progression. Thus, even though the Type I error for each individual trial is controlled at α = 0.05 (or 0.025), the statistically significant sample estimate of the rate of decline is not consistent with the population value for the rate in up to 21% of trials examined. This discrepancy occurs primarily with small subgroups, in which an unusual set of subjects can be drawn by chance from the larger population of potential subjects, and such results are unlikely to be replicated in follow-up trials. These results provide a stronger justification for the recommendations of Stone et al. [4] that post hoc analyses based on APOE ɛ4 carrier status be confined to AD trials with at least 100 carriers present.
These results have significant implications for interpretation of post hoc analyses of AD clinical trials and the design of future trials based on post hoc comparisons. The findings presented here represent large-sample statistical results that model the effect of repeating a clinical trial multiple times. In reality, it cannot be determined if post hoc analysis of any one clinical trial has reached erroneous conclusions, although a probability of error can be assigned using these simulations as a guide. With a high proportion of simulations showing results in the opposite direction from epidemiological studies, it is likely that a large number of post hoc analyses reported in the literature or used internally in the design of future trials are incorrect. Given the considerable amount of time and resources involved in executing clinical trials, more rigorous methods of incorporating APOE genotype status or other subgroups into analyses of small trials are necessary.
For MCI trials, there was again a substantial proportion (up to 25% of trials with 24-month durations) where the median change for APOE ɛ4 noncarriers exceeded that of carriers. However, the percentage of trials with statistically significant results showing more rapid progression among APOE ɛ4 noncarriers was much smaller, less than 1% when there were 50 or more placebo subjects in the trial. This may reflect that individuals with MCI show less departure from population values than individuals with AD, or that the longer duration of MCI trials provides more accurate estimates of the slope (rate of change) in this population. It may also reflect the simulation methodology, as there were only two MCI trials with APOE genotyping data available. Inclusion of additional trials into the simulation might reveal increased individual differences between trials or between centers within trials, leading to a larger proportion of post hoc analyses with conclusions showing more rapid progression among APOE ɛ4 noncarriers. Thus, the lower percentage of erroneous conclusions in the MCI studies relative to AD may not hold if a larger number of studies are included in the former.
A particular strength of this study is the use of a meta-database of several clinical trials and observational studies across a population of more than 3,500 individuals. Inclusion of a large number of participants across several studies greatly increases generalizability, in contrast to the analysis by Stone et al. [4] that was limited to analysis of two trials. The meta-analysis of the studies used in the simulations confirms that the sample is comparable to most population-based studies, which have found more rapid progression among APOE ɛ4 carriers. Furthermore, the simulation strategy allows one to analyze results of hypothetical future trials with similar characteristics to studies in the meta-database, rather than being limited to analysis of trials already conducted. Simulations also allow analysis of a wide range of trial sample sizes, so that a more precise recommendation on the minimum number of subjects for a post hoc analysis can be made.
Nevertheless, results of these analyses are not without limitations. Most significantly, the simulation designs are based on the current standard of a double-blind, placebo-controlled, randomized clinical trial. While this approach should ensure that the results are broadly applicable to AD and MCI clinical trials, use of alternative trial designs (e.g., adaptive designs with sample size re-estimation) is growing [45]. It is unclear if results here would apply when such departures from standard trial design are incorporated. Second, this analysis did not distinguish between individuals based on the number of APOE ɛ4 alleles present, and only utilized a single biomarker. Although this approach is consistent with most post hoc analyses of clinical trials, separate analyses for homozygotes, who show a greater rate of progression than heterozygotes in population-based studies, may yield additional power and reduce the number of erroneous conclusions. The inclusion of other biomarkers, such as amyloid imaging, may also improve the accuracy of post hoc analyses but was not evaluated in our simulations. Third, although we confirmed that APOE ɛ4 is associated with more rapid decline in our sample, other studies (not necessarily using the ADAS-cog as an outcome) have not [46]. If APOE ɛ4 is truly not associated with the rate of decline, the rates of false positives in our study would not be accurate, though the warning against post hoc analyses would still hold. It is also possible that raters were aware of the APOE ɛ4 status of the subjects, so that systematic biases in the rating of carriers and noncarriers were present. Finally, as is typical for most AD clinical trials, the vast majority of subjects in these simulations were Caucasian, making extrapolation to other racial and ethnic groups difficult.
In sum, this study demonstrates one of the significant drawbacks of post hoc analyses based on APOE ɛ4 carrier status, particularly for AD clinical trials: a sizeable number of such analyses are likely to yield incorrect conclusions, with no clear method for determining if any one particular trial is erroneous. Thus, published reports of differential effects between APOE ɛ4 carriers and noncarriers from post hoc analyses must be viewed with a high degree of skepticism. Analyses based on APOE genotypes should be incorporated into the design stage of clinical trials when appropriate for the drug being tested or in targeted clinical trial designs [47, 48]. Care must be taken in interpretation of post hoc analyses of clinical trials in AD and MCI to avoid planning follow-up trials around findings from analyses that ultimately prove to be mistaken.
ACKNOWLEDGMENTS
Funding for this report was provided by NIH R01 AG037561 (LSS, REK, GW, GRC), NIH P50 AG05142, ADRC (LSS). Data used in the preparation of this study were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI, NIA U01 AG024904) database (adni.loni.ucla.edu), the Alzheimer’s Disease Cooperative Study (NIH AG10483), and the Coalition Against Major Diseases (CAMD).
