Abstract
Background:
Significant work has identified genetic variants conferring risk and protection for Alzheimer’s disease (AD), but functional effects of these variants is lacking, particularly in under-represented ancestral populations. Expression studies performed in easily accessible tissue, such as whole blood, can recapitulate some transcriptional changes occurring in brain and help to identify mechanisms underlying neurodegenerative processes.
Objective:
We aimed to identify transcriptional differences between AD cases and controls in a cohort of diverse ancestry.
Methods:
We analyzed the protein coding transcriptome using RNA sequencing from peripheral blood collected from 234 African American (AA) (115 AD, 119 controls) and 240 non-Hispanic Whites (NHW) (121 AD, 119 controls). To identify case-control differentially expressed genes and pathways, we performed stratified, joint, and interaction analyses using linear regression models within and across ancestral groups followed by pathway and gene set enrichment analyses.
Results:
Overall, we identified 418 (291 upregulated, 127 downregulated) and 488 genes (352 upregulated, 136 downregulated) differentially expressed in the AA and NHW datasets, respectively, with only 16 genes commonly differentially expressed in both ancestral groups. Joint analyses provided greater power to detect case-control differences and identified 1,102 differentially expressed genes between cases and controls (812 upregulated, 290 downregulated). Interaction analysis identified only 27 genes with different effects in AA compared to NHW. Pathway and gene-set enrichment analyses revealed differences in immune response-related pathways that were enriched across the analyses despite different underlying gene sets.
Conclusion:
These results support the hypothesis of converging underlying pathophysiological processes in AD across ancestral groups.
INTRODUCTION
Alzheimer’s disease (AD) is highly heritable with a complex genetic etiology [1–3] as at least 30 genetic loci have been associated with AD risk and/or protection [4–8]. Though AD affects individuals of all races and ethnicities, most genomics studies overall [9, 10] including in AD have focused largely on non-Hispanic white (NHW) individuals, though a growing body of work has extended to others ancestral groups including African Americans (AA) [11–15] and Hispanics [16–18]. Importantly, these studies suggest that in diverse populations the genetic architectures of AD overlap, but only partially. For example, the risk of AD associated with the APOE ɛ4 allele is significantly higher in NHW compared to AA [19–22], and this effect is related to the genetic local ancestry around the APOE gene [23]. Moreover, there are ancestry-specific coding and non-coding risk variants in genes such as ABCA7 [12, 14].
These studies have been critical to identify loci associated with AD. However, the associated variants are predominantly not protein coding, making their biological impact difficult to determine. While a “nearest gene” annotation approach has generally been used, there are several examples of DNA variants regulating genes that are far from the associated marker [24, 25]. Therefore, extending AD genomic studies to gene expression and gene regulation is critical to understanding the potential functional impact of genetic variants influencing AD risk/protection. Transcriptomic analyses have been performed in AD showing case-control gene expression [26–29] and splicing differences [30, 31] including differentially expressed genes that are established AD candidate genes such as ABCA7 [32], BIN1 [33], CD33 [34], and FRMD4A [35]. However, the results tend to rely on targeted or low sample size studies, do not widely replicate across studies, and all have focused on NHW, highlighting a need for genome-wide transcriptomic studies of larger sample size and in diverse ancestral populations.
Here, we report a transcriptome-wide analysis of gene expression differences between individuals with AD and cognitively intact controls in whole blood from AA and NHW populations. Given the challenge of obtaining autopsy AD confirmed brain tissue, particularly from groups traditionally underrepresented in genetic studies [36], peripheral blood was used as a easily accessible source of RNA. Previous studies have suggested that gene expression in blood can often recapitulate expression in brain in neurodegenerative conditions making it a useful source of material in this study [37–39]. Moreover, molecular pathways implicated by previous genomic studies of AD including endocytosis, lipid metabolism, and immune response consist of genes highly expressed in blood. In particular, genetic studies have implicated microglia, the immune cell in the brain as several associated genes are highly expressed in blood (including CR1, SPI1, MS4A, TREM2, ABCA7, CD33, and INPP5D) [40, 41]. Evidence for the role of myeloid lineage cells have been further implicated by gene-expression [42] and gene regulatory studies of AD avariants in myeloid enchancers [43] and as monocyte-specific eQTLs [44].
Therefore, here we test the hypothesis that gene expression changes exist in blood between AD and controls and that there will be differences shared and others specific to each ancestry. Further, we hypothesize that while the genes altered in expression may be unique to populations, the underlying pathways of these genes could converge suggesting a shared molecular etiology indicated gene-expression changes in blood.
MATERIALS AND METHODS
Sample ascertainment
All participants in this study, or a legal representative, provided informed consent with oversight by the appropriate Institutional Review Board (University of Miami Institutional Review Board protocol #20070307). Participant demographics are shown in Table 1 and include 465 individuals (AA – 113 with AD, 118 cognitively intact controls; NHW – 116 with AD, 118 controls) ascertained by the John P. Hussman Institute for Human Genomics (HIHG) at the University of Miami Miller School of Medicine (Miami, FL), North Carolina A&T State University (Greensboro, NC), and Case Western Reserve University (Cleveland, OH).
Description of the subjects used in this study. Age is at the time of examination where the blood draw for sequencing occurred
Participants were ascertained as part of ongoing genetic studies of AD and included both cases (>65 years of age of onset) and controls (>65 years of age at age of exam). Individuals who presented as presumptive cases (e.g., clinical diagnosis of AD or significant memory problems) were evaluated using a standard protocol comprised of a clinical interview that included a medical-family history, cognitive screening measures [45–47], the LOAD neurocognitive battery [48], and the Clinical Dementia Rating scale [49]. Individuals who presented as controls were evaluated using a standard protocol comprised of a clinical interview that included a medical-family history and the Mini-Mental State Examination (MMSE) or the Modified Mini-Mental State [45–47].
All participants were adjudicated by a clinical panel with expertise in AD related disorders and classified as AD according to standard criteria developed by the National Institute of Aging and the Alzheimer’s Association [50]. Controls were adjudicated based on information derived from the clinical interview and scores from the Mini-Mental State Examination (cutoff<25) [45, 46] or the Modified Mini-Mental State (cutoff<87). We excluded individuals with clinical histories that suggested cognitive problems that could be better accounted for by other causes (e.g., stroke, head trauma, or major psychiatric problems to name a few). In addition, we excluded individuals with known genetic causes or extended pedigrees suggestive of possible genetic causes. Genetic ancestry for each participant was confirmed using EIGENSTRAT analysis of existing genome-wide genotyping data [51, 52].
RNA sequencing
RNA was extracted from previously frozen whole blood collected in PAXgene RNA tubes (PreAnalytiX, Hombrechtikon, Switzerland) with the automated QIAsymphony (Qiagen, Germantown, MD) and stored at –80°C. RNA integrity (RIN) scores and RNA concentration were determined with BioAnalyzer 2100 (Agilent Technologies, Santa Clara, CA) and the Qubit RNA High Sensitivity Assay Kit (Thermo Fisher Scientific, Waltham, MA). Only specimens with RIN ≥5 were included for further analysis. Total RNA was prepared for sequencing using the NuGEN Universal Plus mRNA-Seq with globin and ribosomal depletion (NuGEN, San Carlos, CA). Sequencing was performed on the Illumina HiSeq3000 (Illumina, San Diego, CA) with paired end 125 bp read reactions.
Primary RNAseq Bioinformatics
Raw FASTQ reads were processed by a bioinformatics pipeline including adapter trimming by TrimGalore (v0.4.2) (https://github.com/FelixKrueger/TrimGalore), alignment with the STAR alignerv2.5.0a [53] to the hg19 human reference, and gene counts quantified against the GENCODEv19 gene annotation release using the GeneCounts module implemented in STAR. RNAseq alignment quality control metrics were extracted with the CollectRnaSeqMetrics module implemented in Picardv1.103 (http://broadinstitute.github.io/picard). To test whether the total number of reads differed based on covariates (age, sex, ethnicity, RIN, and/or affection status), we tested for associations between the total number of uniquely mapped reads and each covariate independently using univariate linear models. We then tested all covariates together in a multivariate model. Only RIN was significantly associated with decreased read counts in both the univariate and multivariate model and was thus included as a covariate in all differential expression models (Supplementary Table 1). Genes with raw read count <15 in at-least 25% samples from each ethnicity were considered as a low signal and filtered out from further analysis.
Differential gene expression
The gene counts for each sample were transformed and normalized using the variance-stabilizing transformation method implemented in the Bioconductor package DESeq2v2.2 [54] in Rv3.4.3. Multi-dimensional scaling on the transformed counts was performed to identify expression outliers and/or cryptic sample clustering was with the plotMDS() function from the limmav3.42.2 package [55] implemented in edgeRv3.6 software [56]. PlotMDS calculates the Euclidean distance between each pair of samples and plots these in a two-dimensional space.
Association between known covariates including last age at examination, collection center, sex, RIN score, and sequencing flow-cell batch with status was analyzed using univariate generalized linear models, and the covariates that were significantly associated (p < 0.05) with affection status (AD versus control) were included in the differential expression analysis model. RIN score and age were associated with AD status in both ethnic groups while collection center was associated with status only in the AA dataset (Supplementary Table 2). Linear regression models within DESeq2 were used to test differentially expressed genes between AD cases and controls within each ethnicity in a stratified analysis. In addition, we used a model including all sample data with a one-factor design including ancestry as a co-variate (joint analysis) and an interaction two-factor design with disease status and ancestry as the two factors to identify genes with different significant effects in the ancestry. In all analyses, genes with a false discovery rate (FDR) corrected p-value≤0.05 were considered significantly differentially expressed.
Pathway enrichment analysis
To identify functional groups enriched among differentially expressed genes, we performed two complimentary analyses. First, we tested enrichment in Gene Ontology Biological Process (GO:BP) terms using the GOseq R package v1.30.1 [57]. The background gene set for this analysis was the same set of gene used in the differential expression analysis, namely all genes with more than 15 reads in at-least 25% the individuals. We considered pathways with a corrected p-value≤0.05 to be significantly enriched. Secondly, we performed Gene Set Enrichment Analysis (GSEA) [58] on genes with a nominal p≤0.05 ranked by fold change against the Hallmark pathway gene set of the Molecular Signatures Database (MSigDB) [59]. The GSEA pre-ranked analysis was done with 1000 permutations and genes were weighted by fold-change. We considered categories enriched with family-wise error rate (FWER) ≤0.05. FWER was used because it is a more conservative estimation of enrichment beyond background [58, 60]. For both enrichment analyses, we considered only ontologies or categories with gene set size 20-300 to allow for more robust biological interpretation.
RESULTS
RNAseq quality control
RNAseq data was of high quality with the total number of reads produced for each sample was 46.1±8.1 million on average (Supplementary Figure 1). A multiple regression model of number of reads using factors status, age, sex, ethnicity, and RIN score indicated no significant difference in the number of reads based on sex (p = 0.097), age (p = 0.20), status (p = 0.36), or ethnicity (p = 0.93). There was a trend in the model for fewer overall reads depending on RIN score (p = 0.03). Overall, an average of 88.5±3.8% of total reads mapped uniquely to the reference genome. Only 7.8±3.2% of reads were removed from the downstream analysis due to alignment to multiple locations in the genome and 3.2±2.4% that failed to align at all. Of the 20,332 protein coding genes in the GENCODE v19, ∼65% were detected in at least 25% of the samples in each dataset (13,486 genes in AA, 13,388 genes in NHW, 13,435 in the joint and interaction dataset). Multi-dimensional scaling analysis showed no obvious differences between samples based on status, ethnicity, or sex (Supplementary Figure 2).
Stratified differential expression analysis
A total of 418 and 488 genes were identified to be differentially expressed in the AA and NHW datasets, respectively. Out of 418 differentially expressed genes in AA, 291 genes were upregulated and 127 were downregulated in AD cases compared to controls. In the NHW dataset, 352 genes were upregulated and 136 genes were downregulated. The 10 most significantly different genes by FDR in each ethnicity are listed in Table 2. A complete list of analyzed genes is provided in Supplementary Table 2. Interestingly, only 16 differentially expressed genes (Table 3) were common between ancestral groups. All of those showed the same direction (15 upregulated and one downregulated).
A) Ten most significantly differentially expressed genes between AD and controls in NHW. B) Ten most significantly differentially expressed genes between AD and controls in AA. Genes are selected based on FDR significance. FC is the linear fold-change in AD compared to controls
Sixteen genes significant by FDR in both AA and NHW. FC is the linear fold-change in AD compared to controls
Status and ethnicity joint and interaction differential expression analysis
Since there were few overlapping genes in the stratified by ethnicity analysis, we performed a joint analysis using all samples in a one-factor design with ancestry as a co-variate. In this analysis, a total of 1,102 genes showed a significant difference including 290 downregulated and 812 upregulated. The 10 most significantly different genes by FDR in each direction are listed in Table 4. A complete list of all analyzed genes in this analysis is provided in Supplementary Table 3. All sixteen consistently differentially expressed genes from the stratified analysis (Table 3), were also significant in the joint analysis and all with the same direction of change. However, the overall overlap of the joint analysis was stronger with the NHW than with AA (Supplementary Figure 3), suggesting the possibility of unique underlying genes in AA pathophysiology of AD.
A) Ten most significantly differentially expressed upregulated genes between AD and controls in joint analysis model. B) Ten most significantly differentially expressed downregulated genes between AD and controls in joint analysis model. Genes are selected based on FDR significance. FC is the linear fold-change in AD compared to controls
In a final analysis, we set out to identify where the effect of AD was significantly different across ancestries in a two-factor interaction model. There were 27 such genes identified with 15 decreased (negative fold change) and 12 increased (positive fold change) in AA cases versus controls relative to NHW cases versus controls. The most extreme example of these types of relationships include MACROD2 which is strongly decreased in AA cases versus controls (stratified p-value = 0.0004, FC = –1.96) but strongly increased in NHW cases versus controls (stratified p-value = 0.029, FC = 1.51). Others such as LCN2 which shows strong increased in AA cases versus controls (stratified p-value = 0.0032, FC = 2.64) but no change in NHW cases versus controls (stratified p-value = 0.723, FC = 1.07). The full list of genes analyzed with this interaction analysis is included in Supplementary Table 3.
Differential expression of AD candidate genes
We determined whether any of the differentially expressed genes are established AD candidate genes (APP, PSEN1, PSEN2) or have been implicated previously by case-control genome-wide association (GWAS) studies in NHW [8] or AA [14, 61]. Among all differentially expressed genes, two have previously been implicated with single-variant genome-wide significant association in GWAS in NHW: SPI1 (FDR = 0.034, FC = 1.21) and EPHA1 (FDR = 0.044, FC = 1.22). None of the genes implicated by single marker association in AA GWAS show significantly altered expression. However, CR1 shows differential expression in AA in our analysis (FDR = 0.035, FC = 1.30) and was suggested as a potential candidate by gene-based association in an AA GWAS study [14]. In the joint analysis, four genes previously implicated by GWAS association or gene-based association are differentially expressed. These include PTK2B (FDR = 0.0043, FC = 1.08) and SPI1 (FDR = 0.0084, FC = 1.18), which are genes near genome-wide significant loci in NHW GWAS [8], and STARD10 (FDR = 0.0060, FC = 1.20) and VRK3 (FDR = 0.028, FC = 1.08), implicated in a recently submitted large AA GWAS [61].
Pathway enrichment analysis
We first utilized GOseq to detect Gene Ontology Biological Process (GO:BP) terms enriched among differentially expressed genes in AA and NHW datasets separately and then for the genes significant in the joint analysis. Though the specific sets of genes were different in each ethnicity’s dataset, immune response related pathways were consistently enriched with differentially expressed genes in the AA, NHW, and joint sets of genes. The ten most enriched biological process pathways with FDR ≤0.05 are presented in Tables 5A, 5B, and 5C and the complete set of pathways analyzed including the specific genes in each enriched set are in Supplementary Table 3.
Most enriched Gene Ontology Biological Processes A) Top ten gene ontology biological processes (GO:BP) enriched in the stratified AA analysis. B) Top ten gene ontology biological processes (GO:BP) enriched in the stratified NHW analysis. C) Top ten gene ontology biological processes (GO:BP) enriched in the joint analysis. #InSet refers to the number of differentially expressed genes included in the gene set that belong to the GO:BP term. #In Cat is the total number of genes in the GO:BP category
Next, we utilized gene set enrichment analysis (GSEA) on genes that were nominally differentially expressed (p≤0.05) weighted by fold change for each gene. We performed this analysis in AA and NHW datasets separately and then again for the genes significant in the joint analysis. The ten most enriched upregulated and downregulated sets with FDR ≤0.05 are presented in Tables 6A, 6B, and 6C.
All Gene Set Enrichment Analysis categories with FWER ≤0.05 A) Enriched in the stratified AA analysis. B) Enriched in the stratified NHW analysis. C) Enriched in the in joint analysis. Name is the Category ID. Size is the number of differentially expressed genes in the category. NES is the normalized enrichment score reported by GSEA. Nom. p is the nominal enrichment p-value. FWER is the familywise-error rate estimated probability that the normalized enrichment score represents a false positive finding [58]
DISCUSSION
Overall, we identified 418 and 488 differentially expressed genes between AD individuals and controls in in the AA and NHW datasets in a stratified analysis. Analysis of all individuals together correcting for ancestry as a co-variate revealed a larger set of differentially expressed genes with 1,102 significantly different genes, 290 downregulated and 812 upregulated; however, the overlap was stronger with the NHW stratified analysis suggesting some shared overall mechanisms between NHW and AA, but more unique genes in the AA population. Relatively few of the differentially expressed genes have been implicated directly in AD etiology by association studies. Notably though, NHW AD cases have increased expression of SPI1 and decreased expression of SPI1 have been causally linked to protection against AD [42]. Several other of the most significant genes have been implicated by functional and/or previous transcriptomic studies. For example, our finding of CD19 significantly decreased in AD versus controls among AA (FDR = 0.003, FC = –1.55) is consistent with previous findings that CD19 + B lymphocytes are decreased in AD patients as B cell subpopulations are remodeled and CD19 + B cells are important in brain neuro-inflammation response [63]. Similarly, differentially expressed in NHW dataset was RER1 (FDR = 0.015, FC = 1.07) which has been shown, when increased in level, to alter the processing of the amyloid precursor protein and increasing amyloid-β (Aβ) production [64].
Notably, only 16 genes were differentially expressed in both ancestries when comparing stratified analyses. Some of these also have functional implications in AD. For example, MMP9 (AA; FDR = 0.003, FC = 1.96, NHW; FDR = 0.031, FC = 1.36), APMAP (AA; FDR = 0.009, FC = 1.25; NHW; FDR = 0.024, FC = 1.17), and LTB4R (AA; FDR = 0.009, FC = 1.2, NHW; FDR = 0.036, FC = 1.14) have been implicated in functionally altering Aβ levels [65–68]. Furthermore, two genes (MMP9 and CRISPLD2) have been shown to be altered in expression in previous NHW blood by microarray analyses [69].
Despite the few overlapping genes between ancestries, the differentially expressed genes in each were enriched for inflammatory and immune response genes in both stratified and joint analyses. Both innate and adaptive immune function are important processes in AD pathogenesis [70, 71] and differential expression of immune genes has been noted in studies of AD brain tissue from NHW including cytokines, cell adhesion, and interferon response [72]. In fact, a recent study has indicated that differentially expressed immune pathways are consistent between brain and blood in AD [73]. The gene ontology enrichment of both the NHW and AA strongly implicate the adaptive immune response as humoral and B-cell related immune process are enriched in AA while interferon response is implicated in NHW. While the specific genes and pathways differ between the ethnicities, a convergence of underlying mechanism related to immune response is suggested by enrichment of the differentially expressed genes across ethnicities.
There are limitations to our approach. For example, though expression studies have shown correlation in gene expression between brain and peripheral blood and DNA variant-expression relationships often hold true between the two tissues [37, 77], these are not completely correlated and thus brain specific pathways will not be represented here. In addition, whole blood represents a heterogeneous mixture of cell types (e.g., monocytes, neutrophils, B-cells, memory T-cells, helper T-cells, etc.). As such, our analyses represent the amalgamation of the expression of all these cell types. This results in two limitations. First, that proportions of these cell types could differ due to unknown co-variates. However, we expect that our large sample size can account to remove most of this potential bias. In addition, some effects of underlying genetic risk variants for AD have shown to have very cell specific effects, for example only in myeloid cells [42]. In whole blood, these cell specific effects could be masked by other cell types. Evolving computational tools for deconvolution and molecular tools such as single cell RNA sequencing could be employed to overcome this limitation in future studies.
Moreover, comprehensive environmental exposure information is not always available and could be reflected in blood expression patterns. While this study is well powered to detect differential expression at this sample size, heterogeneity in many of these underlying external factors may influence expression changes. However, functional genomic approaches, including transcriptomics, is crucial to extend genetic discoveries to strategies for diagnosis and therapeutic intervention at preclinical stages of the disease. In particular, identification of immune pathways dysregulated in disease has been noted clinically with neuroinflammation as a hallmark [78] and much research has implicated the immune system in AD pathophysiology [70]. As such, immune modulation has been proposed as a therapeutic option including vaccination [79] and small molecule modulation including some in current clinical trials including G-CSF (NCT01617577, NCT03656042) and GM-CSF (NCT01409915). A better understanding of these immune processes could lead to more targeted interventions in this important molecular pathway.
Existing GWAS and sequencing studies strongly implicate genes involved in molecular processes including immune response, lipid metabolism, and endocytic processes. Extension of these findings to gene expression have been hampered in that previous transcriptomic studies in AD have generally had limited sample size, lower throughput technologies, and inconsistent tissue types used. Importantly, these studies have also focused on individuals of NHW backgrounds thereby missing potential ancestry specific gene expression effects [26–29]. Here, we have addressed some of these issues here by performing, to our knowledge, the largest protein-coding transcriptomic analysis in AD and first done so in a multi-ethnic sample cohort including both NHW and AA. Interpretation of the consequences of individually differentially expressed genes between AD and control in these two ancestral backgrounds in terms of AD risk or protection is challenging. However, the convergence of functional pathways of these genes strengthen and enhance our understandings of the underlying pathophysiological processes in AD.
Footnotes
ACKNOWLEDGMENTS
This work was funded by NIH/NIA grants U01AG052410-01 and RF1AG054074-01. We thank the Alzheimer’s disease community including patients and family members who agreed to participate in the study and make this research possible. We acknowledge the Center for Genome Technology of the John P. Hussman Institute for Human Genomics for performing RNA extraction, storage and allocation and for performing RNA library preparation and sequencing.
