Abstract
Heritability analysis of complex traits/diseases is commonly performed to obtain illustrative information about the potential contribution of genetic factors to their phenotypic variances. In this study, we investigated the narrow-sense heritability (h2) of Alzheimer’s disease (AD) using genome-wide single-nucleotide polymorphism (SNP) data from three independent studies within the linear mixed models framework. Our meta-analyses showed that the estimated h2 values (and their standard errors) of AD on the liability scale were 0.280 (0.091), 0.348 (0.113), and 0.389 (0.126) assuming AD prevalence rates of 10%, 20%, and 30% at ages 65+, 75+, and 85+ years, respectively. We also found that chromosomal regions containing two or more AD-associated SNPs at p < 5E-08 could collectively explain 37% of the additive genetic variance of AD in our samples. AD-associated regions in which at least one SNP had attained p < 5E-08 explained 56% of the additive genetic variance of AD. These two sets of regions harbored 3% and 11% of the SNPs in our analyses, respectively. Likewise, the chromosomal regions containing two or more and one or more AD-associated SNPs at p < 5E-06 accounted for 72% and 94% of the additive genetic variance of AD, respectively, and harbored 27% and 44% of the SNPs in our analyses. Our findings showed that the overall contribution of additive genetic effects to AD liability was moderate and age-dependent. They also supported the importance of focusing on known AD-associated chromosomal regions to investigate the genetic basis of AD, e.g., through haplotype analysis, analysis of heterogeneity, and functional studies.
INTRODUCTION
Alzheimer’s disease (AD) is the most common neurodegenerative disorder and the most common cause of dementia in people older than 65 years [1]. Late-onset AD is believed to occur sporadically with a complex inheritance pattern [2]. As AD is a complex polygenic disease characteristic of the post-reproductive period, its phenotypic variance is likely attributable to the combined effects of genetic and non-genetic factors and their interactions. For example, the Apolipoprotein E (APOE) ɛ4 allele, which has been known for decades for its remarkably strong association with AD [3, 4], is considered a risk factor rather than a causal factor for AD because even APOE ɛ4 homozygous individuals may remain AD-free until extreme old age [4, 5]. This implies that the effect of the strongest genetic risk factor for AD, the APOE ɛ4 allele, can be modulated by interactions with the environment and/or other genes.
Still, the field intends to characterize the genetic variance of AD by evaluating the narrow-sense heritability (h2), the fraction of phenotypic variance attributed to a ‘pure’ additive genetic component. The premise of h2 analyses is to quantify the upper limit of the phenotypic variation of AD that can be collectively caused by additive genetic effects across the genome. Nevertheless, care must be taken when interpreting the results of such analyses because the additive genetic variance of a trait may be confounded by non-additive genetic (i.e., dominance and epistatic) and environmental components [6–9], which may arise, for instance, from the lack of controlled environmental conditions or of an appropriate study design [10, 11].
There are various methods for estimating h2, such as twin-based analyses or linear mixed models (LMMs). In general, LMMs-based methodology is preferred because it relies on more relaxed assumptions regarding the relatedness of individuals in the sample and provides a framework to model non-additive genetic components [12], environmental factors, and gene-environment interactions (e.g., by including a gene-by-environment (GxE) relationship matrix [13] or covariates such as age, sex, or epigenetic modifications in the model [9]), which in turn may attenuate the bias in the estimated additive genetic component. The h2 of AD was estimated to be 58%–74% in twin studies [14, 15] and 24%–53% by LMMs-based methods [16, 17]. In addition, the APOE locus, coded by the rs429358 and rs7412 single-nucleotide polymorphisms (SNPs), was estimated to explain 13.42% of the phenotypic variance of AD. Also, SNPs within ±50 kb of the APOE locus and 27 other well-known AD-associated genes were estimated to collectively account for 31.54% of the phenotypic variance of AD [17]. The differences between h2 estimates from twin studies and SNPs-based models suggest that there might be more genes and genetic variants that confer risk of AD.
In this study, we investigated the SNPs-based h2 of AD in three independent cohorts using the genomic best linear unbiased prediction (G-BLUP) methodology [18–20], which integrates realized genomic relationship matrices, containing the observed proportion of identical-by-descent (IBD) loci that each pair of individuals shares across the genome [19, 21], into the LMMs framework. In particular, we evaluated the proportions of the additive genetic variance of AD that could be explained by chromosomal regions containing previously reported AD-associated SNPs at genome-wide (i.e., p < 5E-08) and/or suggestive (i.e., p < 5E-06) levels of significance.
METHODS
Study participants
Our genetic analyses were performed using the genotype and phenotype information available for 2,544 AD cases and 11,739 controls from three independent studies: 1) the original and offspring cohorts from the Framingham Heart Study (FHS) [22, 23], 2) the Health and Retirement Study (HRS) [24], and 3) the National Institute on Aging’s Late-Onset Alzheimer’s Disease Family Study (NIA-LOADFS) [25]. All three studies were conducted under institutional review board (IRB) guidelines, and their data can be accessed by qualified researchers, upon approval by the local IRB, through the dbGaP repository (https://www.ncbi.nlm.nih.gov/gap) and the University of Michigan restricted access webpage (http://hrsonline.isr.umich.edu/index.php?p=data).
The LOADFS and FHS studies include families and singletons, whereas the HRS cohort is a population-based study. Most AD cases were diagnosed based on clinical findings and routine neurologic examinations (e.g., the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association (NINCDS-ADRDA) criteria [26]) without the aid of histopathologic findings (i.e., brain biopsy) or biomarkers. The AD cases and healthy controls were directly identified by the LOADFS and FHS researchers. In the HRS cohort, cases and controls were determined from the International Classification of Diseases, Ninth Revision (ICD-9) codes in the Medicare claims available for the study subjects (i.e., ICD-9 code 331.0). All three studies predominantly enrolled individuals of Caucasian ancestry, and only these individuals were analyzed here. We excluded the third-generation cohort of FHS to make the ages of participants comparable among the three datasets. Based on the inclusion criteria, 3,716, 4,409, and 6,158 subjects from LOADFS, FHS, and HRS, respectively, were included in our analyses. Basic demographic information about the study participants is presented in Table 1 (also see [2]). As shown in Table 1, the female-to-male ratios in all three studies were larger than one, and the AD cases were on average around 7–17 years older than controls. Also, while LOADFS was primarily initiated to study the genetic basis of AD, the FHS and HRS studies were primarily intended to investigate cardiovascular diseases and age-related health/well-being/economic issues, respectively. As a result, most AD cases in our study were contributed by LOADFS (i.e., 72.72%, 16.23%, and 11.05% of AD cases came from LOADFS, FHS, and HRS, respectively).
Table 1. Basic demographic information about datasets included in the genetic analyses
FHS, Framingham Heart Study; HRS, Health and Retirement Study; LOADFS, National Institute on Aging’s Late-Onset Alzheimer’s Disease Family Study; Female%, the percentage of females in the study; Age ± SD, the average age and its standard deviation.
Heritability estimates
Our analyses were performed on around 1.5–2 million genotyped and imputed SNPs located on autosomal chromosomes after removing low-quality SNPs (i.e., minor-allele frequencies <0.01, missing rates >5%, Hardy-Weinberg equilibrium test p < 1E-06, and, for imputed SNPs, squared correlation coefficient (r2) between the imputed and expected true genotypes <0.7) and subjects (i.e., call rates <95%) [27]. Also, for the LOADFS and FHS datasets, which had family-based designs, SNPs and subjects/families with Mendel error rates >2% were removed. The numbers of SNPs used in our analyses are shown in Table 2.
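The SNP-level filters listed above can be expressed as a simple predicate. The following is a minimal sketch, not the pipeline used in the study: the thresholds come from the text, while the record layout and field names (`maf`, `missing_rate`, `hwe_p`, `imputed`, `impute_r2`) and the example records are hypothetical.

```python
# Thresholds as stated in the text; all field names below are hypothetical.
MIN_MAF = 0.01          # minor-allele frequency
MAX_MISSING = 0.05      # per-SNP missing rate
MIN_HWE_P = 1e-6        # Hardy-Weinberg equilibrium test p-value
MIN_IMPUTE_R2 = 0.7     # imputation quality, applied to imputed SNPs only


def passes_qc(snp):
    """Return True if a SNP record survives the filters described above."""
    if snp["maf"] < MIN_MAF:
        return False
    if snp["missing_rate"] > MAX_MISSING:
        return False
    if snp["hwe_p"] < MIN_HWE_P:
        return False
    if snp["imputed"] and snp["impute_r2"] < MIN_IMPUTE_R2:
        return False
    return True


# Made-up records for illustration only.
snps = [
    {"id": "snp_a", "maf": 0.12, "missing_rate": 0.01, "hwe_p": 0.40,
     "imputed": False, "impute_r2": None},
    {"id": "snp_b", "maf": 0.004, "missing_rate": 0.00, "hwe_p": 0.80,
     "imputed": False, "impute_r2": None},   # fails the MAF filter
    {"id": "snp_c", "maf": 0.30, "missing_rate": 0.02, "hwe_p": 0.25,
     "imputed": True, "impute_r2": 0.55},    # fails the imputation filter
    {"id": "snp_d", "maf": 0.25, "missing_rate": 0.03, "hwe_p": 0.10,
     "imputed": True, "impute_r2": 0.92},
]

kept = [s["id"] for s in snps if passes_qc(s)]
print(kept)
```

Subject-level call-rate filtering and the Mendel-error filter for the family-based cohorts would be applied separately and are not shown.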
Table 2. Numbers (and percentages) of SNPs included in the genetic analyses
SNP, single-nucleotide polymorphism; FHS, Framingham Heart Study; HRS, the University of Michigan Health and Retirement Study; LOADFS, National Institute on Aging’s Late-Onset Alzheimer’s Disease Family Study; R6, Regions containing at least one AD-associated SNP at p < 5E-06; R6T, Regions containing at least two AD-associated SNPs at p < 5E-06; R8, Regions containing at least one AD-associated SNP at p < 5E-08; R8T, Regions containing at least two AD-associated SNPs at p < 5E-08.
In each cohort, the G-BLUP methodology [18–20] was used to evaluate the SNPs-based h2 of AD by fitting an LMM in which the top 5 principal components (PCs), birth year, and sex of subjects were considered as fixed-effects covariates, and the additive genetic effects of individuals were considered as a random effect. The additive genetic effects were modeled by including a normalized marker-based additive relationship matrix [20, 28] generated over the SNPs that passed the quality control criteria. The elements of this matrix represented the realized (i.e., observed) proportions of IBD alleles for pairs of individuals across the SNP loci under consideration [19, 21]. The general structure of the fitted LMM was as follows:
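In conventional matrix notation, a model of the kind described in this paragraph can be sketched as below. The symbols are assumed here for illustration and are not taken from the source.

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{a} + \mathbf{e},
\qquad
\mathbf{a} \sim N\!\left(\mathbf{0},\, \mathbf{G}\sigma_a^2\right),
\qquad
\mathbf{e} \sim N\!\left(\mathbf{0},\, \mathbf{I}\sigma_e^2\right)
```

In this sketch, y is the vector of AD status, X is the design matrix for the fixed effects (top 5 PCs, birth year, and sex) with coefficients β, a is the vector of additive genetic effects with incidence matrix Z, G is the marker-based additive relationship matrix, and e is the residual vector.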
To address the confounding effects of shared environment in the case of the LOADFS and FHS datasets that had family-based designs, another relationship matrix was included in the LMM to capture the GxE interactions for related subjects. The elements of a GxE matrix were the same as those in the additive relationship matrix for family members and zero otherwise [13].
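The two relationship matrices described above can be illustrated on a toy genotype table. This is a sketch under stated assumptions: the additive matrix is built with a widely used centering-and-scaling scheme (allele counts centered by twice the allele frequency, cross-products divided by 2Σp(1−p)), which may differ from the normalization the authors used, and the genotypes and family labels are made up.

```python
def additive_grm(genotypes):
    """Marker-based additive relationship matrix from 0/1/2 allele counts.

    genotypes: list of individuals, each a list of allele counts per SNP.
    The centering/scaling scheme is an assumption, not the authors' code.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Frequency of the counted allele at each SNP.
    freqs = [sum(ind[j] for ind in genotypes) / (2.0 * n) for j in range(m)]
    # Scaling constant: 2 * sum over SNPs of p * (1 - p).
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    centered = [[ind[j] - 2.0 * freqs[j] for j in range(m)] for ind in genotypes]
    return [
        [sum(a * b for a, b in zip(centered[i], centered[k])) / scale
         for k in range(n)]
        for i in range(n)
    ]


def family_gxe_matrix(grm, family_ids):
    """Copy GRM entries for pairs in the same family; set all others to zero."""
    n = len(grm)
    return [
        [grm[i][k] if family_ids[i] == family_ids[k] else 0.0 for k in range(n)]
        for i in range(n)
    ]


# Toy data: 4 individuals x 5 SNPs; the first two share a (hypothetical) family.
genotypes = [
    [0, 1, 2, 1, 0],
    [0, 1, 2, 0, 1],
    [2, 1, 0, 1, 1],
    [1, 0, 1, 2, 2],
]
family_ids = ["fam1", "fam1", "fam2", "fam3"]

G = additive_grm(genotypes)
E = family_gxe_matrix(G, family_ids)
```

Singletons keep only their diagonal element in the second matrix, because each one forms a family of size one.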
The restricted maximum likelihood (REML) method [29] was applied to estimate the variance-covariance components of the fitted LMMs, which were then used to estimate the additive genetic effects of individuals (i.e., best linear unbiased predictions or BLUPs) and the h2 values (i.e., the portion of phenotypic variance explained by the additive genetic variance).
The estimates of h2 of AD from the three cohorts under consideration were then combined by an inverse-variance meta-analysis [32] to obtain an h2 meta-estimate.
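For reference, a fixed-effect inverse-variance meta-analysis weights each cohort estimate by the reciprocal of its squared standard error. The notation below is assumed for illustration, and the fixed-effect form is itself an assumption, since the text does not state which variant was used.

```latex
\hat{h}^2_{\mathrm{meta}} = \frac{\sum_{i=1}^{3} w_i\, \hat{h}^2_i}{\sum_{i=1}^{3} w_i},
\qquad
w_i = \frac{1}{\mathrm{SE}_i^{2}},
\qquad
\mathrm{SE}_{\mathrm{meta}} = \sqrt{\frac{1}{\sum_{i=1}^{3} w_i}}
```

Under this scheme a cohort with a small standard error dominates the pooled value, while cohorts with imprecise estimates contribute little.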
Once the additive genetic variance of AD was estimated using all SNPs located on the autosomal chromosomes, we further investigated the fraction of the genetic variance that could be explained by the SNPs located within ±1 Mb of SNPs previously associated with AD at the genome-wide (i.e., p < 5E-08) or suggestive (i.e., p < 5E-06) significance levels. The list of previously AD-associated SNPs was obtained from the NHGRI-EBI GWAS catalog (release of December 2018) [33]. This resulted in 1,032 AD-associated SNPs at p < 5E-06, of which 253 were associated with AD at p < 5E-08. We considered the ±1 Mb up/downstream region of any of these SNPs as an AD-associated chromosomal region and investigated the extent to which such regions may contribute to the additive genetic variance of AD. We contrasted four alternative scenarios, analyzing chromosomal regions that contained: 1) at least two AD-associated SNPs at p < 5E-08 (i.e., R8T regions), 2) at least one AD-associated SNP at p < 5E-08 (i.e., R8 regions), 3) at least two AD-associated SNPs at p < 5E-06 (i.e., R6T regions), and 4) at least one AD-associated SNP at p < 5E-06 (i.e., R6 regions). For these analyses, two additive relationship matrices, one generated using SNPs in the chromosomal regions under consideration and the other using the remaining SNPs, were included in the models. The fixed-effects covariates and shared environment effects were also modeled as explained above.
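The four region sets can be illustrated with a small sketch. It assumes one plausible reading of the definitions: a ±1 Mb window around an associated SNP counts toward a set when at least `min_hits` associated SNPs (the index SNP included) on the same chromosome fall inside that window. The exact rule used in the study is not spelled out in the text, and the chromosome positions and p-values below are made up.

```python
WINDOW = 1_000_000  # +/- 1 Mb flanking distance, as in the text


def qualifying_windows(hits, p_threshold, min_hits):
    """Return (chrom, start, end) windows around sufficiently supported hits.

    hits: list of (chrom, position, p_value) tuples for catalog SNPs.
    The counting rule is an assumption made for this illustration.
    """
    significant = [(c, pos) for c, pos, p in hits if p < p_threshold]
    windows = []
    for chrom, pos in significant:
        n_inside = sum(
            1 for c2, pos2 in significant
            if c2 == chrom and abs(pos2 - pos) <= WINDOW
        )
        if n_inside >= min_hits:
            windows.append((chrom, max(0, pos - WINDOW), pos + WINDOW))
    return windows


def in_regions(chrom, pos, windows):
    """True if an analysis SNP falls inside any of the given windows."""
    return any(c == chrom and lo <= pos <= hi for c, lo, hi in windows)


# Hypothetical catalog entries: (chromosome, position, p-value).
hits = [
    ("19", 45_000_000, 1e-20),
    ("19", 45_500_000, 1e-9),
    ("11", 60_000_000, 2e-8),
    ("2", 127_000_000, 3e-7),
]

R8T = qualifying_windows(hits, 5e-8, min_hits=2)
R8 = qualifying_windows(hits, 5e-8, min_hits=1)
R6T = qualifying_windows(hits, 5e-6, min_hits=2)
R6 = qualifying_windows(hits, 5e-6, min_hits=1)
```

SNPs for which `in_regions` returns True would feed the first relationship matrix, and the remaining SNPs the second.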
RESULTS
The SNPs-based h2 values and their SEs estimated on the observed scale of AD in the LOADFS, FHS, and HRS cohorts were 0.270 (0.099), 0.081 (0.068), and 0.054 (0.062), respectively, when all SNPs on autosomal chromosomes were included in the genetic analyses. The estimated
Table 3 summarizes the results of h2 analyses once the estimated variance components were transformed to the liability scale assuming population prevalence rates of AD were 10%, 20%, or 30% at different ages (i.e., 65+, 75+, and 85+ years, respectively [1]). The adjusted
Table 3. Estimated SNPs-based narrow-sense heritability (h2) values (and their standard errors) of Alzheimer’s disease (AD) on the liability scale assuming three population prevalence rates
SNP, single-nucleotide polymorphism; FHS, Framingham Heart Study; HRS, the University of Michigan Health and Retirement Study; LOADFS, National Institute on Aging’s Late-Onset Alzheimer’s Disease Family Study; R6, Regions containing at least one AD-associated SNP at p < 5E-06; R6T, Regions containing at least two AD-associated SNPs at p < 5E-06; R8, Regions containing at least one AD-associated SNP at p < 5E-08; R8T, Regions containing at least two AD-associated SNPs at p < 5E-08.
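The step from observed-scale to liability-scale estimates can be sketched as follows. The sketch assumes the widely used transformation for ascertained case-control data, in which the observed-scale h2 is multiplied by [K(1−K)]² / [z² P(1−P)], where K is the population prevalence, P is the proportion of cases in the sample, and z is the standard normal density at the liability threshold. The text does not spell out the transformation, so this choice is an assumption. Case counts are back-calculated from the percentages and sample sizes given in the Study participants section and are therefore approximate, and standard errors are assumed to scale by the same factor.

```python
from math import sqrt
from statistics import NormalDist


def liability_factor(K, P):
    """Multiplier converting observed-scale h2 to the liability scale."""
    threshold = NormalDist().inv_cdf(1.0 - K)  # liability threshold for prevalence K
    z = NormalDist().pdf(threshold)            # normal density at the threshold
    return (K * (1.0 - K)) ** 2 / (z ** 2 * P * (1.0 - P))


def inverse_variance_pool(estimates, ses):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, sqrt(1.0 / sum(weights))


# cohort: (observed h2, SE, approximate cases, subjects); counts back-calculated.
COHORTS = {
    "LOADFS": (0.270, 0.099, 1850, 3716),
    "FHS": (0.081, 0.068, 413, 4409),
    "HRS": (0.054, 0.062, 281, 6158),
}


def pooled_liability_h2(K):
    """Transform each cohort estimate for prevalence K, then pool them."""
    estimates, ses = [], []
    for h2_obs, se_obs, cases, total in COHORTS.values():
        factor = liability_factor(K, cases / total)
        estimates.append(h2_obs * factor)
        ses.append(se_obs * factor)
    return inverse_variance_pool(estimates, ses)


for K in (0.10, 0.20, 0.30):
    est, se = pooled_liability_h2(K)
    print(f"K={K:.2f}: h2={est:.3f} (SE {se:.3f})")
```

Under these assumptions the pooled values come out close to the meta-estimates reported in the Abstract. Because the multiplier grows with K over this range, the same observed-scale estimates map to larger liability-scale values at higher assumed prevalence.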
The genetic analyses were then performed to examine the fraction of the genetic variance of AD that could be attributed to SNPs within the ±1 Mb up/downstream regions of previously discovered AD-associated SNPs. As seen in Table 3, the chromosomal regions that contained two or more AD-associated SNPs at the genome-wide level of significance (i.e., p < 5E-08) accounted for 36.87% of the additive genetic variance of AD (i.e., R8T regions). The regions with one or more AD-associated SNPs at p < 5E-08 collectively explained 56.22% of the additive genetic variance of AD (i.e., R8 regions). We also found that the regions that contained at least two AD-associated SNPs at the suggestive level of significance (i.e., p < 5E-06) accounted for 72.43% of the additive genetic variance of AD (i.e., R6T regions). Finally, the chromosomal regions with at least one AD-associated SNP at p < 5E-06 explained 93.96% of the additive genetic variance of AD (i.e., R6 regions). These four alternative sets of chromosomal regions harbored around 3%, 11%, 27%, and 44% of the SNPs in our analyses, respectively (Table 2).
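With two additive relationship matrices in the model, one natural way to express the reported percentages is as the share of the total additive variance carried by the regional component. The notation below is assumed for illustration and is not taken from the source.

```latex
\text{fraction explained by region set } R
= \frac{\hat{\sigma}^2_{a,R}}{\hat{\sigma}^2_{a,R} + \hat{\sigma}^2_{a,\bar{R}}}
```

Here the numerator is the variance component estimated for the relationship matrix built from SNPs inside the region set R, and the second term in the denominator is the component for the matrix built from the remaining SNPs.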
DISCUSSION
In this study, we investigated the fraction of the phenotypic variance of AD that might be explained by its additive genetic variance. The whole-genome SNPs-based h2 values of AD were estimated in three independent datasets and then combined by an inverse-variance meta-analysis. We found that the meta-analysis h2 estimates on the liability scale were 28.0%, 34.8%, and 38.9% assuming population prevalence of AD to be 10%, 20%, and 30%, approximately corresponding to the AD prevalence rates at ages 65+, 75+, and 85+ years [1]. The increase in the prevalence of AD with age reflects the age-dependent liability thresholds [34] at which AD develops. In theory, the age-dependent liability can be mediated by alterations in genetic, environmental, and/or GxE effects over the life course. The increase in the estimated h2 of AD with disease prevalence suggested that the overall additive genetic contribution to AD liability can differ across the lifespan. This is in agreement with previous reports suggesting that the effects of individual genetic factors associated with complex traits are age-dependent, i.e., their effects may appear at certain ages [35, 36] or even be opposite in different age periods [37].
The estimated h2 values in our analyses (i.e., h2 = 28%–39%) were lower than the values predicted in previous twin studies (i.e., h2 = 58%–74%) [14, 15], amounting to 38%–67% of the predicted values. The difference between the predicted and explained h2 is known as the missing heritability problem. The literature discusses several potential causes of the missing heritability, including the inflation of the estimates of the additive genetic variance in twin studies by other factors such as non-additive genetic effects, epigenetic modifications, and/or environmental factors [6–9]. The three datasets analyzed here have different designs from those used by twin studies; i.e., the HRS cohort is a population-based study gathering data mostly from independent subjects, and LOADFS and FHS are family-based studies providing data for mixtures of singletons and two/three generations of mostly small families. Therefore, different designs may partially account for the discrepancy in the estimated h2 values between our study and previous studies. In fact, the degree of relatedness of individuals in the data may have a direct impact on the estimates of genetic parameters. For instance, having data from twins, large numbers of full-sibs, or multi-generational families may result in higher estimates of h2 because such datasets provide more inter-connections in the elements of the genomic relationship matrix. Therefore, the additive effect for each individual would be determined based on several highly correlated response values, and the variance components would be estimated more precisely. In such cases, information from both family structure and linkage disequilibrium (LD) among markers is exploited by the LMMs framework for estimating the genetic parameters of interest [38].
On the other hand, in cohorts with independent individuals or more distant relatives, the parameters are mainly estimated using the LD structure of the data, as the kinship matrix provides less information regarding the additive genetic relationships among individuals [38, 39]. However, it should be noted that, as the degree of relatedness among individuals increases, controlling for environmental confounders becomes more important for obtaining accurate estimates of h2 because of the higher possibility of confounding the estimates of the additive genetic component with shared environmental effects.
Also, the missing heritability might partially be accounted for by the density of markers used to estimate the SNP-based h2 of AD. It has been suggested that many common variants with infinitesimal effects and several rare variants of moderate to large effect sizes may causally contribute to the genetic architecture of complex diseases such as AD [40, 41]. Therefore, it is expected that using denser genotype data may capture a larger fraction of the genetic variance due to LD between the discovered/undiscovered AD causal variants and the SNPs included in the heritability analysis. For instance, our results were consistent with those from two previous studies that estimated the SNP-based h2 of AD to be 24% [16] and 33.12% [42] using around half a million and two million SNPs, respectively. However, Ridge et al. (2016) demonstrated that 53.24% of the phenotypic variance of AD can be explained by genotype information from more than 8.7 million SNPs [17], more than 4 times the number of SNPs used in our analyses.
Most AD cases in our study were contributed by LOADFS, in which the case-to-control ratio was nearly one. The AD cases constituted 9.4% and 4.6% of the analyzed subjects in the FHS and HRS studies, respectively. Therefore, the genetic analyses of these two datasets may suffer from insufficient statistical power, which was reflected in the larger standard errors of the estimated liability-scale h2 values compared to those obtained from LOADFS. This in turn may slightly bias the meta-analysis results and partially explain the smaller h2 estimates compared to twin studies.
It has been previously reported that the SNPs mapped to 28 well-known AD-associated genes, several of which were replicated in independent studies, could account for 59.25% of the genetic variance of AD [17]. However, due to the heterogeneity underlying the genetic architecture of AD, it is important to extend such analyses to other chromosomal regions whose association signals were not universally replicated or were detected only at the suggestive level of significance (i.e., p < 5E-06). Therefore, we investigated the fraction of the additive genetic variance of AD in our samples that can be attributed to these regions. Our analyses demonstrated that SNPs within the regions with at least two (i.e., R8T regions) and at least one (i.e., R8 regions) AD-associated SNPs at p < 5E-08, reported in prior studies, could collectively explain 36.87% and 56.22% of the additive genetic variance of AD, respectively. The R8T and R8 regions harbored 3% and 11% of the SNPs in our analyses, respectively. These findings corroborated the results from the aforementioned Ridge et al. study [17]. Interestingly, when AD-associated regions with SNPs at the suggestive level of significance were analyzed, we found that only small fractions of the additive genetic variance of AD remained unexplained, as SNPs within the R6T and R6 regions (i.e., 27% and 44% of the SNPs in our analyses) accounted for 72.43% and 93.96% of the genetic variance of AD, respectively. These findings suggested that the AD-associated loci that did not pass the genome-wide significance threshold of 5E-08 in previous studies are not necessarily false-positive findings. Instead, more rigorous studies with larger sample sizes or less heterogeneous samples may help discover stronger association signals for them. Also, our findings suggested that some additional variants contributing to the genetic basis of AD could exist in the up/downstream regions of the already discovered AD-associated markers at p < 5E-06.
For instance, these can be some SNPs with small effect sizes which require very large samples to be detected, or complex haplotypes affecting AD susceptibility [40, 43].
In summary, we found that common SNPs could moderately contribute to the genetic architecture of AD, explaining up to 39% of its phenotypic variance on the liability scale in our samples and 38%–67% of its predicted h2 in twin studies. As with any random variable, the h2 estimates may vary among samples because AD is a genetically heterogeneous complex trait and, therefore, its genetic architecture is not universal. Consequently, between-study differences in h2 estimates of AD may, in part, reflect specifics of the investigated populations. In addition, differences in study designs (e.g., sample sizes, marker density, degree of relatedness of subjects, etc.) can contribute to the differences in h2 estimates. Our analyses demonstrated that the additive genetic contribution to AD liability did not remain constant across the lifespan; instead, it increased with the increase in AD prevalence at older ages. Of note, we found that the ±1 Mb flanking regions of AD-associated SNPs could account for major fractions of the additive genetic variance of AD in our samples (i.e., up to 56% for AD-associated regions at p < 5E-08 and up to 94% for AD-associated regions at p < 5E-06). The fractions of the additive genetic variance of AD explained by AD-associated regions at the genome-wide and suggestive significance levels differed by 38 percentage points. This difference highlighted the importance of additive contributions to the genetic basis of AD by discovered/undiscovered variants in the regions that had not attained p < 5E-08 in previous GWAS. These findings may have implications for future studies of AD, as they supported the importance of focusing on known AD-associated chromosomal regions through more rigorous methods, such as haplotype analysis, analysis of heterogeneity, deep sequencing, and functional studies, in order to investigate the genetic architecture of AD.
Despite its rigor, this study (like similar studies) has some limitations. We focused on the analysis of the additive genetic variance of AD, disregarding the dominance and epistatic variance components. This was not meant to undermine potential non-additive contributions to the genetic variance of AD. However, it should be noted that the analysis of non-additive genetic variance components of a trait using the LMMs framework requires data with a specific design (e.g., a large number of large full-sib families), which may not be feasible in human studies [12]. Also, the cases and controls in the selected studies (i.e., LOADFS, FHS, and HRS) were mostly identified based on clinical criteria without the aid of histopathologic findings (i.e., brain biopsy). It has been suggested that histopathologic findings from brain biopsies may increase the accuracy of AD diagnosis. Therefore, future analysis of large samples of histopathologically diagnosed AD cases and healthy controls may help to obtain more accurate estimates of AD heritability [44].
ACKNOWLEDGMENTS
This research was supported by Grants from the National Institute on Aging (P01AG043352 and R01AG047310). The funders had no role in study design, data collection and analysis, decision to publish, or manuscript preparation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors declare no competing interests.
This manuscript was prepared using limited access datasets that are available through dbGaP repository (https://www.ncbi.nlm.nih.gov/gap) for qualified researchers (accession numbers: phs000168.v2.p2 (LOADFS), phs000007.v28.p10 (FHS), and phs000428.v2.p2 (HRS)) and through the University of Michigan. The authors thank Dr. Arseniy P. Yashkin for help preparing the HRS phenotypes.
Funding support for the Late Onset Alzheimer’s Disease Family Study (LOADFS) was provided through the Division of Neuroscience, NIA. The LOADFS includes a genome-wide association study funded as part of the Division of Neuroscience, NIA. Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by Genetic Consortium for Late Onset Alzheimer’s Disease. This manuscript was not prepared in collaboration with LOADFS investigators and does not necessarily reflect the opinions or views of LOADFS.
The Framingham Heart Study (FHS) is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195 and HHSN268201500001I). This manuscript was not prepared in collaboration with investigators of the FHS and does not necessarily reflect the opinions or views of the FHS, Boston University, or NHLBI. Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL-64278. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University. Funding for CARe genotyping was provided by NHLBI Contract N01-HC-65226. Funding support for the Framingham Dementia dataset was provided by NIH/NIA grant R01 AG08122. Funding support for the Framingham Inflammatory Markers was provided by NIH grants R01 HL064753, R01 HL076784 and R01 AG028321. Funding support for the Framingham C-reactive Protein dataset was provided by NIH grants R01 HL064753, R01 HL076784 and R01 AG028321. Funding support for the Framingham Adiponectin dataset was provided by NIH/NHLBI grant R01-DK-080739. Funding support for the Framingham Interleukin-6 dataset was provided by NIH grants R01 HL064753, R01 HL076784 and R01 AG028321.
The Health and Retirement Study (HRS) genetic data is sponsored by the National Institute on Aging (grant numbers U01AG009740, RC2AG036495, and RC4AG039029) and was conducted by the University of Michigan. This manuscript was not prepared in collaboration with HRS investigators and does not necessarily reflect the opinions or views of HRS.
