Abstract
Cohort studies investigating aging and dementia require APOE genotyping. We compared directly measured APOE genotypes to ‘hard-call’ genotypes derived from imputing genome-wide genotyping data from a range of platforms using several imputation panels. Older GWAS arrays imputed to 1000 Genomes Project (1KGP) phases and the Haplotype Reference Consortium (HRC) reference panels were able to achieve concordance rates of over 98% with stringent quality control (hard-call-threshold 0.8). However, this resulted in high levels of missingness (>12% with 1KGP and 5% with HRC). With recent GWAS arrays, concordance of 99% could be obtained with relatively lenient QC, resulting in no missingness.
INTRODUCTION
APOE is a major genetic risk factor for late-onset Alzheimer’s disease (AD), explaining almost 30% of the population-attributable risk, and is by far the greatest contributing genetic risk factor of all variants identified from genome-wide association studies (GWAS) [1]. APOE ɛ3 is the most common allele and is considered risk-neutral. Heterozygotes carrying one APOE ɛ4 risk allele have four times the risk of AD, and homozygotes carrying two APOE ɛ4 alleles have 15 times the risk, compared to APOE ɛ3 homozygotes [2]. A third, low-frequency allele, APOE ɛ2, confers decreased risk, but with a more subtle effect [2]. ApoE plays a major role in the regulation of lipid and lipoprotein levels in the blood [3].
APOE genotype is often directly genotyped. As many large cohorts carry out genome-wide genotyping using arrays, it would be cheaper and more convenient to ascertain APOE genotype from these already available data. If the required variants are not directly included on the array, their genotypes can be estimated by imputation using a reference dataset [4]. Array-based genotyping (using the Affymetrix Kaiser Axiom array) imputed to the 1KGP Phase 1 panel has previously been shown to give 90% agreement with direct genotyping for ɛ2/ɛ3/ɛ4 genotypes and 93% agreement for predicting ɛ4 status, but with high levels of missing data [5]. Radmanesh et al. found similar results for the Illumina HumanHap610 array imputed to the 1KGP Phase 1 panel [6]. Agreement between imputed and directly genotyped data for the APOE SNPs rs7412 and rs429358 was 0.9–0.94 (kappa coefficients), but with up to 19% missing. Changing imputation algorithm parameters reduced the level of missingness to 5–9% but also reduced the accuracy. Here we extend these analyses and report the concordance between imputed and directly measured APOE genotypes for a range of genotyping arrays and imputation reference panels in a large sample.
METHODS
Comparison of measured and imputed APOE genotypes showing concordance and missingness. A) for first-generation array (based on HapMap references) datasets; B) second-generation array datasets (based on 1KGP references)
% Con, percentage concordance; % Miss, percentage missing; ESS, effective sample size; 1KGP Phase 1, 1000 Genomes Phase 1 Version 3 reference panel; 1000G Phase 3, 1000 Genomes Phase 3 version 5 reference panel; HRC, Haplotype Reference Consortium Release 1 reference panel. *The Genotype Hard Call Probability Threshold is the threshold above which conversion of genotype call from dosage to hard call allele count format was carried out. Values below the threshold are called as missing.
Participants in this study comprise individuals recruited for studies led by the Genetic Epidemiology group at QIMR Berghofer Medical Research Institute. The sample set is made up of Australians of European ancestry, including twins and their relatives, who volunteered for studies on risk factors or biomarkers for physical or psychiatric conditions (described in [7, 8]).
APOE genotyping was performed using TaqMan SNP genotyping assays on an ABI Prism 7900HT and analyzed using SDS software (Applied Biosystems). The three main APOE alleles (ɛ2, ɛ3, and ɛ4) differ at two residues and are defined by a two-SNP haplotype. The SNPs rs429358 and rs7412 were genotyped by allelic discrimination assays based on fluorogenic 5’ nuclease activity, and the APOE allele was inferred from the two SNP genotypes.
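The inference of APOE alleles from the two SNPs can be sketched as follows. This is a minimal illustration, not the pipeline used in the study. It assumes forward-strand alleles (rs429358 T/C and rs7412 C/T), the standard haplotype definitions (ɛ2 = T-T, ɛ3 = T-C, ɛ4 = C-C), and that the very rare rs429358-C/rs7412-T haplotype is absent, so that a double heterozygote is called ɛ2/ɛ4. The function name `apoe_genotype` is hypothetical.

```python
from typing import Optional


def apoe_genotype(rs429358: str, rs7412: str) -> Optional[str]:
    """Infer an unphased APOE genotype from two SNP genotypes, e.g. "TC" and "CC".

    Assumed haplotypes (rs429358 allele, rs7412 allele):
    e2 = (T, T), e3 = (T, C), e4 = (C, C).
    Returns None for malformed input or for allele combinations that
    cannot be explained by e2/e3/e4 alone.
    """
    a, b = rs429358.upper(), rs7412.upper()
    if len(a) != 2 or len(b) != 2 or not set(a + b) <= {"C", "T"}:
        return None
    n_e4 = a.count("C")  # each rs429358-C allele marks one e4 haplotype
    n_e2 = b.count("T")  # each rs7412-T allele marks one e2 haplotype
    if n_e2 + n_e4 > 2:
        return None  # would require the rare C-T haplotype (assumed absent)
    n_e3 = 2 - n_e2 - n_e4
    alleles = ["e2"] * n_e2 + ["e3"] * n_e3 + ["e4"] * n_e4
    return "/".join(alleles)


if __name__ == "__main__":
    for snps in [("TT", "CC"), ("TC", "CC"), ("TC", "CT"), ("TT", "TT")]:
        print(snps, "->", apoe_genotype(*snps))
```

Because the two SNPs are unphased, the double heterozygote is the only ambiguous case; the sketch resolves it as ɛ2/ɛ4 under the stated assumption.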
Genome-wide genotyping was available for the samples on a range of arrays. 4190 individuals had GWAS data from the Illumina chips designed using the HapMap references (317K, 370K, 610K, 660K). 3385 individuals had GWAS data from the more recent Illumina arrays designed using the 1KGP data (Core+Exome, PsychArray, OmniExpress). Rs429358 is not directly genotyped on any of the arrays used. Rs7412 is genotyped on the Illumina Core+Exome and the PsychArray but was excluded when these data were combined with data from other chips prior to imputation. Both datasets were imputed to three different imputation panels: 1KGP Phase 1 (Version 3, Nov 2010), 1KGP Phase 3 (Release 5, May 2013) [9], and HRC Release 1 [10]. Imputation was carried out using the University of Michigan Imputation Server with standard protocols (https://imputationserver.sph.umich.edu) [11]. APOE SNP data were extracted from the imputed datasets. Results from imputation are in dosage format, which yields continuous values that take into account any uncertainty in the number of alleles. Dosage scores were converted to hard-call allelic counts (0, 1, or 2 copies of the alternate allele). Genotypes whose probability was not above the hard-call threshold were coded as missing. The lower the hard-call threshold, the fewer the missing values but the higher the chance of incorrect hard calls. Therefore, a balance is required to obtain optimum allele calls with an acceptable error rate. A modified version of the DosageConvertor software (http://genome.sph.umich.edu/wiki/DosageConvertor) was used to convert the genotype probabilities to the best-guess genotype. The genotype with the highest probability in the VCF file is selected, subject to that genotype probability being above the hard-call threshold probability of 0.4. The conversion was repeated using thresholds of 0.6, 0.8, and 0.9. Where rs7412 was directly assayed on the array but excluded prior to imputation, there was good concordance between the assayed and imputed values (>99% for all thresholds).
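The hard-call rule described above can be sketched as follows. This is an illustration of the thresholding logic only and is not the DosageConvertor code. The function name `hard_call` and the example probabilities are hypothetical, and the strict "greater than" comparison is an assumption based on the wording "above the hard-call threshold".

```python
from typing import Optional, Sequence


def hard_call(genotype_probs: Sequence[float], threshold: float = 0.4) -> Optional[int]:
    """Convert three genotype probabilities into a hard call.

    genotype_probs holds P(0 copies), P(1 copy), P(2 copies) of the
    alternate allele. Returns the most probable allele count if its
    probability is above the threshold, otherwise None (missing).
    """
    if len(genotype_probs) != 3:
        raise ValueError("expected exactly three genotype probabilities")
    best = max(range(3), key=lambda i: genotype_probs[i])
    return best if genotype_probs[best] > threshold else None


if __name__ == "__main__":
    probs = (0.05, 0.75, 0.20)  # made-up probabilities for one sample at one SNP
    for t in (0.4, 0.6, 0.8, 0.9):
        print(f"threshold {t}: call = {hard_call(probs, t)}")
```

With these example probabilities the heterozygous call survives at thresholds of 0.4 and 0.6 but becomes missing at 0.8 and 0.9, which is the trade-off between missingness and accuracy described in the text.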

Comparison of measured and imputed APOE genotypes showing concordance and missingness. (See Table 1 legend for abbreviations).
The accuracy of the APOE genotyping using the TaqMan SNP genotyping assays was assessed from 3576 duplicate DNA samples, where the genotyping error rate was 0.2%.
For the HapMap-based arrays, the APOE genotypes were well imputed using the 1KGP imputation panels (R2 values 0.79–0.84) and the HRC panel (R2 values 0.99). Concordances with direct genotypes are shown in Table 1 and Fig. 1. Concordance and missingness rates vary between genotypes, due to differences in imputation accuracy between the SNPs used to infer the allele. For both the 1KGP imputation panels, concordance rates >98% could only be reached with high levels of missing data. This missingness rate was driven by the rarer ɛ4 and ɛ2 alleles. Using a hard-call threshold of 0.8 resulted in 98% concordance for both the 1KGP Phase 1v3 and 3v5 imputation panels but yielded high levels of missingness (12% and 14%, respectively). Minor differences between the 1KGP Phase 1 and Phase 3 reference panels are likely due to an increase in sample size and the inclusion of additional world populations in the Phase 3 reference, as APOE genotype distribution varies by ancestry [12]. The HRC reference panel improved concordance and missingness, with 99% concordance and 5% missingness at the calling threshold of 0.8. Using a very conservative calling threshold of 0.9 did not markedly improve the concordance and greatly increased the level of missing data.
As one would expect, the use of the more recent 1KGP-based arrays improved the calling accuracy of imputed APOE genotypes (Table 1 and Fig. 1). The imputation accuracy for all panels increased to R2 values of 0.95–0.99. The calling accuracy was also good for all panels; using a relatively lenient hard-call threshold of 0.4 resulted in concordance rates of 98.9 to 99.3% with no missingness. Unsurprisingly, given the extremely high concordance, increasing the hard-call threshold to 0.8 yielded little increase in accuracy overall (concordance rates 99.3 to 99.4%). However, there was a marked improvement in the calling of the rarer alleles: for example, the accuracy of the ɛ2 homozygote from the 1KGP Phase 3v5 imputed data increased from 92.3% at a hard-call threshold of 0.4 to 95.8% at a threshold of 0.8 (although this resulted in a corresponding increase in missingness).
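The two summary measures reported in Table 1 can be computed as in the sketch below. It assumes that concordance is calculated among samples with a non-missing imputed call and that missingness is the proportion of all samples without a call; the text does not spell out these denominators, so they are assumptions. The function name and the toy genotypes are hypothetical and are not study data.

```python
from typing import Optional, Sequence, Tuple


def concordance_and_missingness(
    measured: Sequence[str], imputed: Sequence[Optional[str]]
) -> Tuple[float, float]:
    """Return (% concordance among called samples, % missing among all samples).

    measured: directly genotyped APOE genotypes, one per sample.
    imputed:  hard-called imputed genotypes, with None for missing calls.
    """
    if len(measured) != len(imputed) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    called = [(m, i) for m, i in zip(measured, imputed) if i is not None]
    pct_missing = 100.0 * (len(measured) - len(called)) / len(measured)
    if not called:
        return float("nan"), pct_missing
    pct_concordant = 100.0 * sum(m == i for m, i in called) / len(called)
    return pct_concordant, pct_missing


if __name__ == "__main__":
    # Toy example: four samples, one missing call and one discordant call.
    measured = ["e3/e3", "e3/e4", "e2/e3", "e3/e3"]
    imputed = ["e3/e3", "e3/e4", None, "e3/e4"]
    con, miss = concordance_and_missingness(measured, imputed)
    print(f"concordance {con:.1f}%, missing {miss:.1f}%")
```

Running the same function on the subset of samples with a given measured genotype would give per-genotype figures of the kind discussed for the rarer ɛ2 and ɛ4 carriers.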
We went on to use these results to provide acceptably accurate APOE genotype data for our cohort study PISA (Prospective Imaging Study of Aging). We had a total of 19,449 samples of European ancestry with no direct APOE genotype but with imputed GWAS data available. Samples were genotyped on either HapMap or 1KGP-based arrays (N = 10,544 and 8,905, respectively). We conservatively selected samples requiring direct APOE genotyping using HRC imputed data with a genotype hard-call threshold of 0.9. Samples were selected if the imputed APOE genotype was missing or if the concordance was <99.3% in the genotype group. This included all imputed APOE genotypes of ɛ2ɛ2, ɛ2ɛ4, and ɛ4ɛ4 for the HapMap-based arrays and genotypes of ɛ2ɛ2 for the 1KGP-based arrays. A total of 1,255 samples required genotyping from the HapMap-based arrays, and 126 from the 1KGP-based arrays. From the HapMap-based arrays, 1,120 had available DNA and were directly APOE genotyped. Concordance between imputed and directly genotyped data was close to that predicted from the analysis described above (Table 1), with 97.0, 97.7, and 96.8% concordance for ɛ2ɛ2, ɛ2ɛ4, and ɛ4ɛ4, respectively. From the 1KGP-based arrays, 64 had available DNA for direct genotyping, and concordance between imputed and directly genotyped ɛ2ɛ2 genotypes was 100%, again close to the predicted concordance (Table 1). Our final HapMap-based array dataset included APOE genotypes accurate to ≥99.4% concordance for 10,409 samples, of which 10.8% were individually genotyped, and our final 1KGP-based array dataset included APOE genotypes accurate to ≥99.3% concordance for 8,804 samples, of which 0.7% were individually genotyped.
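The selection rule and the reported proportions can be checked with a short sketch. The function names are hypothetical, the 99.3% cutoff and the sample counts are taken from the text, and the group concordance passed to the rule is whatever value the earlier comparison gave for that imputed genotype on that array type (the values used in the example calls are placeholders apart from 97.0).

```python
from typing import Optional


def needs_direct_genotyping(
    imputed_genotype: Optional[str],
    group_concordance_pct: float,
    cutoff_pct: float = 99.3,
) -> bool:
    """Flag a sample for direct genotyping if its imputed call is missing
    or if its imputed genotype group has concordance below the cutoff."""
    return imputed_genotype is None or group_concordance_pct < cutoff_pct


def percent(part: int, whole: int) -> float:
    """Percentage of `whole` made up by `part`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


if __name__ == "__main__":
    # Array totals reported in the text add up to the full sample.
    print(10_544 + 8_905)                      # total samples with imputed data
    # Share of each final dataset that was individually genotyped.
    print(percent(1_120, 10_409))              # HapMap-based arrays
    print(percent(64, 8_804))                  # 1KGP-based arrays
    # Example decisions under the rule (99.5 is a placeholder value).
    print(needs_direct_genotyping(None, 99.5))     # missing call
    print(needs_direct_genotyping("e2/e2", 97.0))  # low-concordance group
    print(needs_direct_genotyping("e3/e3", 99.5))  # retained as imputed
```

The two percentages reproduce the 10.8% and 0.7% quoted for the final HapMap-based and 1KGP-based datasets.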
DISCUSSION
In agreement with previous analyses [5, 6], using a large sample of European ancestry, we have shown that use of GWAS data from HapMap-based arrays imputed to 1KGP reference panels can give reasonably accurate APOE genotype calls, but at the expense of increasing missingness biased towards the rarer alleles of greatest interest. We showed that use of the HRC reference panel improves the calling accuracy. Finally, the newer GWAS arrays based on the 1KGP data resulted in excellent imputation of the APOE genotypes on all the imputation panels examined.
As the strongest known genetic risk factor for AD, APOE genotype is routinely required in cohort studies investigating aging and dementia-related phenotypes. In addition, APOE genotype is increasingly used for selecting individuals at high risk of AD for longitudinal studies investigating early-stage dementia and for selecting individuals for enrolment into early intervention clinical trials. The information presented here is useful for cohorts where GWAS data are available, to aid in the decision of whether additional genotyping or new imputation analysis is required to obtain reasonably accurate APOE genotypes. For studies carrying out association testing where maximizing power is the priority, the use of imputed APOE genotypes could be sufficient, but where accuracy is paramount, direct genotyping may be required. Using our own data as an example, we were able to ascertain APOE genotype data using historic GWAS data and a minimal amount of additional genotyping.
ACKNOWLEDGMENTS
For the GWAS datasets, we acknowledge funding from the Australian National Health and Medical Research Council (NHMRC grants 241944, 389875, 389891, 389892, 389938, 442915, 442981, 496739 and 552485), US National Institutes of Health (NIH grants AA07535, AA10248, AA014041, AA011998, AA013320, AA013321, AA017688, DA012854, NEI- 1R01EY018246-03) and the Australian Research Council (ARC grant DP0770096). MKL is supported by the Prospective Imaging Study of Aging (PISA), NHMRC grant 1095227.
