Abstract
Background:
Alzheimer’s disease (AD) is one of many common neurodegenerative diseases without ideal treatment, but early detection and intervention can prevent the disease progression.
Objective:
This study aimed to identify AD-related glycolysis gene for AD diagnosis and further investigation by integrated bioinformatics analysis.
Methods:
122 subjects were recruited from the affiliated hospitals of Ningbo University between 1 October 2015 and 31 December 2016. Their clinical information and methylation levels of 8 glycolysis genes were assessed. Machine learning algorithms were used to establish an AD prediction model. Receiver operating characteristic curve (AUC) and decision curve analysis (DCA) were used to assess the model. An AD risk factor model was developed by SHapley Additive exPlanations (SHAP) to extract features that had important impacts on AD. Finally, gene expression of AD-related glycolysis genes were validated by AlzData.
Results:
An AD prediction model was developed using random forest algorithm with the best average ROC_AUC (0.969544). The threshold probability of the model was positive in the range of 0∼0.9875 by DCA. Eight glycolysis genes (GAPDHS, PKLR, PFKFB3, LDHC, DLD, ALDOC, LDHB, HK3) were identified by SHAP. Five of these genes (PFKFB3, DLD, ALDOC, LDHB, LDHC) have significant differences in gene expression between AD and control groups by Alzdata, while three of the genes (HK3, ALDOC, PKLR) are related to the pathogenesis of AD. GAPDHS is involved in the regulatory network of AD risk genes.
Conclusion:
We identified 8 AD-related glycolysis genes (GAPDHS, PFKFB3, LDHC, HK3, ALDOC, LDHB, PKLR, DLD) as promising candidate biomarkers for early diagnosis of AD by integrated bioinformatics analysis. Machine learning has the advantage in identifying genes.
INTRODUCTION
Alzheimer’s disease (AD) is one of many common neurodegenerative diseases, with cognitive impairment as the main clinical symptom [1]. Since the pathogenesis of AD is unclear, the main treatment focuses on symptomatic treatment which can only improve a patient’s cognitive outcome [2, 3]. Recent studies have found that intervention and prevention before the onset of AD may delay or even prevent the occurrence and development of the disease. Therefore, finding effective biomarkers for early diagnosis and early prevention of AD are particularly important [4].
DNA methylation is one of the most common epigenetic modifications and is a reliable biomarker for the etiological study of AD [5, 6]. Indeed, DNA methylation has become a widely used tool in the study of AD pathogenesis and early diagnosis [7]. Recent evidence suggests that changes in glucose metabolism occur at an early stage before the onset of AD and are correlated with clinical disorders in dementia [8, 9]. Although many studies have found biomarkers for early diagnosis of AD [10–15], the diagnostic value of a single biomarker is not enough. Because there are few prediction models of AD based on machine learning [16–20], we use machine learning algorithms to construct a prediction model of AD based on the DNA methylation levels of glycolytic genes and various clinical indicators. We established a disease risk factor model by SHapley Additive exPlanations (SHAP) based on the prediction model of the relationship between glycolysis and AD. Moreover, we showed that utilizing this model can identify additional important factors which affect the occurrence and development of AD. Eight AD-related glycolysis genes (GAPDHS, PFKFB3, LDHC, HK3, ALDOC, LDHB, PKLR, DLD) were identified, which were promising candidate biomarkers for early diagnosis of AD by integrated bioinformatics analysis and guide future research in the field of neurodegeneration.
This study evaluated clinical information, blood biochemical indicators, and glycolysis gene methylation indicators of 122 subjects. An AD prediction model based on machine learning was developed and validated by receiver operating characteristic curve (ROC_AUC) as well as decision curve analysis (DCA) for early identification of AD patients at a high risk of the disease more accurately. Compared with other models based on machine learning or deep learning, our model performed better in predicting AD with higher accuracy and ROC_AUC. More importantly, the AD risk factor model by SHAP could be used to effectively find novel influencing factors leading to AD in order to improve disease prevention, diagnosis, and treatment. In addition, our study used the glycolysis gene methylation indicators in blood samples, which was more economical than alternative approaches such as PET or MRI. Meanwhile, this paper verified the DNA methylation level of glycolysis genes in the blood samples of the two groups, which provided ideas for further exploring how glycolysis affects the occurrence and development of AD based on epigenetics.
METHODS AND MATERIALS
Study design
A hospital-based case-control study was used in this paper. Our study included 60 AD patients who were treated or hospitalized in the affiliated hospitals of Ningbo University from October 1, 2015 to December 31, 2016. They were all Han nationality, aged 53–96 years (80.05±8.78 years), including 30 males and 30 females. The patients included in this study met the diagnostic criteria of the AD guidelines. At the same time, 62 age-matched non-AD patients without family history of dementia were randomly selected from these hospitals as the control group, including 43 males and 19 females, aged 62–93 years (78.89±8.45). The flow of this study was shown in Supplementary Figure 1. Informed consent forms approved by the Ethics Committee of Medical College of Ningbo University were signed with all the subjects.
Analysis of microarray dataset
Differential gene expression of the AD study (GSE132903) [21] was measured by Illumina HumanHT-12 v4.0 expression beadchip. This sample was composed of temporal gyrus samples from Alzheimer’s disease (AD = 97) and non-dementia controls (ND = 98). We obtained the mRNA expression data of AD patients from Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/). Differentially expressed mRNAs were obtained by using “limma R” packages in R software with log2 Fold Change (FC) > |0.5| and adj.p.val < 0.01. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis were carried out with the “clusterProfiler R” package of R software (version 3.5.3).
Collection of blood samples and clinical data
Two venous blood samples were collected from each of the 122 subjects on an empty stomach in the early morning. One was used for the detection of general biochemical indexes, and the other whole blood was anticoagulated and stored in a refrigerator at –80°C for subsequent genomic DNA extraction. Clinical information was also collected from all subjects.
Determination of glycolysis gene DNA methylation level
DNA was extracted by the magnetic beads method (Lab-Aid 820 automatic nucleic acid extractor, Zhishan Biotechnology, Xiamen, China) after centrifugation. The CpG islands near the promoter region of glycolysis genes were searched by PubMed and UCSC database, the gene structure and CpG Island (CGI) were found, and the corresponding DNA sequences were obtained to design gene primer sequences (Supplementary Table 1). Bisulfite transformation method was used to transform genomic DNA (EZ DNA Methylation-Gold™ kit) for accurate and rapid bisulfite methyl modification of DNA. Fluorescence quantitative methylation specific PCR was used to detect the level of methylation (Roche LightCycler®480 instrument and LightCycler®480 SYBR Green I Master Mix kit).
Missing value processing
The clinical data and the DNA methylation data of glycolysis genes were merged. The missing values of 8 glycolysis genes methylation indicators were less than 5.74%, while missing values of some clinical biochemical indicators were more than 20%. Features with missing values of more than 20%were removed, and then multiple imputation was used based on 5 replicates and a chained equation approach method in the R MI procedure was used to account for missing data. The difference between the data after interpolation and the data before interpolation was verified.
Statistical analysis and visualization
R software (version 4.0.2) was used for statistical analysis, the difference was statistically significant when p < 0.05. Visualization was performed by the“ggplot2 R” package. Statistical method: Continuous variables were expressed by X±S or M (IQR) according to whether they were normally distributed or not. Two independent samples t-test was used to compare normal distribution variables or continuous variables with data converted into normal distribution. Wilcox test was used for data that did not conform to normal distribution after conversion, and chi-square test was used for comparison between classified variables.
Data normalization and machine learning feature extraction
We used the python (v3.6.2) data analysis module pandas [22] to read and normalize the clinical indicators and glycolysis gene DNA methylation level data of all samples, and then used the Scikit-learn library [23] gradient lifting (Gradient Boosting machine, GBM) algorithm to remove the features that had no contribution to the model diagnosis results. Finally, we used the Scikit-learn library to divide the data into a 70%training set and a 30%test set (random_state = 0). The difference between the training set and test set was verified.
Construction and evaluation of AD prediction model
Mainstream machine learning algorithms including support vector machine (linear kernel function Linear-SVM, Gaussian kernel function RBF- SVM), Naive Bayes, random forest (n_estimators = 100), KNN (n_neighbors = 9), Logistic Regression, and Decision Tree were used by the Scikit-learn library to establish prediction models. Then 5-fold cross-validation was used to test each algorithm one by one, and the ROC_AUC of these models was compared. Finally, the random forest algorithm was selected to establish AD prediction model, the GridSearchCV of the Scikit-learn library was used to optimize the model in order to get the optimal parameters.
The optimal parameters of the random forest were obtained as follows: (Random Forest Classifier (bootstrap = True, class_weight = None, criterion = ’gini’, max_depth = None, max_features = ’auto’, max_leaf_nodes = None, min_impurity_decrease = 0.0, min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 160, n_jobs = 1, oob_score = False, random_state = 0, verbose = 0, warm_start = False)
Decision curve analysis for model assession
The test set was assessed by DCA for model clinical application prospect.
Select the optimal model and establish the AD risk factor model
We used the optimal algorithm which was obtained by cross-validation to develop our AD prediction model, and then used the Scikit-learn library GridSearchCV method to adjust parameters in order to obtain optimal parameters of the model. We finally used the SHAP library (https://github.com/slundberg/shap) to establish an AD risk factor model.
The SHAP library in Python has embedded algorithms, and SHAP values can interpret machine learning models as an effective method. The SHAP value can also decompose, predict, and display the impact of each feature on the disease. In other words, the SHAP algorithm gives the extent to each feature in the model which contributes positively or negatively to each prediction result.
Development and validation of a multigene containing nomogram
Nomogram, which included several lines corresponding to certain clinical parameters, was widely used to predict the incidence of patients in a clinical environment [24]. In order to compare the advantages of machine learning models in clinical applications, we used univariate logistic regression to analyze the correlation between clinical indicators, glycolysis genes DNA methylation levels and AD, then tried to construct a multi-gene nomogram by incorporating the features of p < 0.05 in the results of univariate logistic regression analysis into multivariate logistic regression (Supplementary Table 2) and form a nomogram. Finally, we performed DCA on the test set to evaluate the clinical application prospects of model.
Validation of gene expression of AD-related glycolysis genes by AlzData
The gene expression of AD-related glycolysis genes between AD and normal brain tissues was determined using normalized brain gene expression profiling from AlzData (http://www.alzdata.org). AlzData is a useful database that provides a large number of human brain gene expression profiling [25, 26].
Construction of transcription factor regulatory network
The transcriptional factor regulatory network which plays a key role in regulating glycolytic related genes was constructed by using TRRUST database [27] (https://www.grnpedia.org/trrust/) and visualized by Cytoscape3.7.2 software.
TF-gene interactions
We performed TF-gene interaction analysis with integration of common AD risk genes and 8 glycolysis genes by NetworkAnalyst (https://www.networkanalyst.ca/) [28].
AD risk genes: APP, PSEN1, PSEN2, APOE, SORL1, ABCA7, TREM2, PLCG2, BDNF.
Glycolysis genes: GAPDHS, PKLR, PFKFB3, LDHB, LDHC, HK3, ALDOC, DLD.
RESULTS
Analysis of microarray dataset
By using the “limma R” package in R software, we identified a total of 703 transcripts which were highly differentially expressed (adj.p.val < 0.01; log2 Fold Change (FC) > |0.500|) from the GSE132903 dataset. GO analysis suggested that gene enrichment changed at the epigenetic modification level of DNA (Fig. 1), and KEGG analysis showed that the genes were enriched in the glycolysis pathway (Fig. 2). The volcano and heatmap of GSE132903 were presented in Supplementary Figures 2 and 3, respectively.

Gene Ontology (GO) enrichment analysis for the genes in the GSE132903 dataset DEG. A) Biological process of GO analysis. B) Cellular component of GO analysis. C) Molecular function of GO analysis.

KEGG enrichment analysis for the genes in the GSE132903 dataset down DEG.
Missing value processing
Features with missing values of more than 20%were removed, and then multiple imputation was used. An overview of the sample data was shown in the baseline table (Supplementary Table 3). The difference between the data before interpolation and after interpolation was verified, and the results showed that there was no significant difference between the two data sets (Supplementary Table 4).
Clinical data of subjects
Compared with the control group, there was no significant difference in age, height, body mass index (BMI), serum total cholesterol (TC), triglyceride (TG) and high-density lipoprotein cholesterol (HDL-C), hypertension, and diabetes (T2DM) between the AD group and the control group (p > 0.05). The body weight and smoking level of the AD group were lower than those of control group. The detailed results were shown in Supplementary Table 3.
Association between glycolysis gene DNA methylation level and AD
The Wilcoxon (Supplementary Table 3) showed that the DNA methylation levels of 8 glycolysis genes (PFKFB3, DLD, LDHB, ALDOC, PKLR, LDHC, HK3, GAPDHS) in the AD group were significantly different from those in control group (p < 0.001). Compared with the control group, the DNA methylation levels of PFKFB3, GAPDHS, LDHB, ALDOC, LDHC, and HK3 in the AD group were significantly higher, while the methylation levels of PKLR and DLD in the AD group were lower than those in control group (Fig. 3).

Association between DNA methylation of eight genes and the risk of AD.
Machine learning feature extraction
The Python data analysis module, pandas, was used to read the data of 122 samples, and the Scikit-learn module gradient lifting (Gradient Boosting machine, GBM) algorithm (threshold = 0.9999) was used to remove the three features of “type 2 diabetes”, “gender”, and “smoking”, which had no contribution to the prediction of the model, and retained the following 16 features: “GAPDHS, PKLR, PFKFB3, LDHC, DLD, HDL-C, ALDOC, TC, LDHB, TG, BMI, Height, Weight, Age, Hypertension, HK3” (Supplementary Table 5, Fig. 4A, B). The data was divided into a 70%training set and a 30%test set by Scikit-learn, the difference between training set and test set was then verified. The results showed that there was no significant difference between training set and test set (Supplementary Table 6).

Feature Importance.

Features required for 0.9999 of cumulative importance.
Construction and evaluation of AD prediction model
The Python machine learning module Scikit-learn was used to establish the AD prediction model with a variety of mainstream machine learning algorithms, and 5-fold cross-validation was carried out to obtain the model accuracy and ROC_AUC (Supplementary Table 7, Fig. 5A, B), in which the random forest model ROC_AUC was the highest scoring, with an average ROC_AUC of 0.969544. Finally, the random forest algorithm was chosen to establish the model.

Machine learning model 5-fold cross-validation accuracy.

Machine learning model 5-fold cross-validation ROC-AUC.
Decision curve analysis for model assessment
We carried out DCA within the threshold probability range of 0∼0.9875 (Fig. 6), the net benefit of the AD prediction model was higher than that of the two extreme curves and greater than 0. The optional threshold probability range was larger and relatively safe, indicating that the model meets the requirements of clinical applications.

Decision Curve Analysis of the clinical use (test set). The abscissa is the threshold probability, that is, the probability that the patient is diagnosed as AD in the AD disease diagnosis model, and the ordinate is the net benefit. The dotted lines represent the clinical diagnostic model of AD disease, and the two solid lines represent two extremes. The horizontal solid line indicates that all samples are negative, all people do not need treatment, and the net benefit is 0. A slanted solid line indicates that all samples are positive, all people need to be treated, and the net benefit is a backslash with a negative slope.
Development of AD risk factor model by SHAP
We used a SHAP library to establish an AD risk factor model (Fig. 7A). According to Fig. 7A, the global explanation of the AD risk factor model obtained by SHAP is as follows: For the SHAP values on the X axis in Fig. 7A, all the values on the left indicate the reasons for the negative movement of the predicted values, and the values on the right indicate the reasons for the positive movement of the predicted values. On the Y axis is the important response of these features to the impact of the disease.

SHAP summary plots. GAPDHS, PKLR, PFKFB3, LDHB, LDHC, HK3, DLD, and ALDOC in the legend refer to their DNA methylation levels, respectively. A) SHAP Disease Risk Factor Model. It tells which features are most important, and also their range of effects over the dataset. The color represents the feature value (red high, blue low). This reveals for example that a high GAPDHS heightens the predicted AD risk. B) Explainability of individual cases (SHAP of sample AD-14). C) Explainability of individual cases (SHAP of sample NC-42). D) Explainability of individual cases (SHAP of sample NC-35).
The scatters shown on the right side of the X-axis (Fig. 7A) play a positive role in promoting the occurrence and development of AD while the scatters shown on the left side of the X axis play a negative role.
According to the AD risk factor model by SHAP, we found that the DNA methylation changes of GAPDHS, PKLR, PFKFB3 and other 5 glycolysis genes play important roles in the occurrence and development of AD.
Figure 7B, C, and D show the contribution of SHAP values of all features of the AD samples (sample number AD-14) and the normal samples (sample number NC-42 and NC-35), respectively. The SHAP values in red show all the factors that promote the progression of AD, while values in blue denote factors which negatively influence AD progression. Figure 7B (AD-14) shows that the patient has a 98%chance of developing AD. It also shows the most important feature values that promote AD, such as GAPDHS = 0.06696, PKLR = 0.00722, PFKFB3 = 0.2096, etc. The factors that promote the development of AD in this sample are GAPDHS, PKLR, PFKFB3, HK3, LDHB, LDHC, and DLD, in which GAPDHS, PKLR, and PFKFB3 are the main factors. Figure 7C shows 1%of the sample (NC-42) is likely to develop AD, GAPDHS, PKLR, PFKFB3, ALDOC, and LDHB are factors that inhibit the development of AD, among which GAPDHS, PKLR and PFKFB3 are still the main factors. Figure 7D shows sample of NC-35 has a 4%chance of developing AD, where GAPDHS, PKLR, and PFKFB3 are the main factors that inhibit the progression of AD, while HK3 is the only factor driving the progression of the sample to AD. We also developed a SHAP interactive panel of AD diagnosis with all case (Fig. 8).

SHAP interactive panel of AD diagnosis with all case according to similarity. Red shows attributions that push the score higher, while blue shows the opposite.
Clinical application of the multigene panel
Univariate logistic regression was used to analyze which features would affect the occurrence of AD (Supplementary Table 8), including Sex, Weight, PFKFB3, DLD, LDHB, PKLR, HK3, GAPDHS, and Smoke (p < 0.05). Multivariate regression analysis showed that Smoke, PFKFB3, and HK3 were important factors affecting the diagnosis of AD (p < 0.05). In the nomogram, each variable corresponds to a score on the points line, and the sum of the scores corresponding to all variables also has a score on the “Total points” line. For example, sample number AD-85 has an AD incidence probability of 0.88 and requires active treatment (Supplementary Figure 4). We carried out DCA (Supplementary Figure 5), the net benefit of the nomogram model is better within the threshold probability range of 0∼0.76, but lower than the DCA of the model built by the machine learning algorithm (Fig. 6).
Validation of gene expression of AD-related glycolysis genes by AlzData
The five genes, PFKFB3, DLD, ALDOC, LDHB, and LDHC, have significant differences in gene expression between AD and normal groups (Supplementary Figure 6). HK3 has a high correlation with Aβ and tau, while ALDOC has a correlation with Aβ, PKLR, and APOE interact in protein-protein interaction network (Supplementary Table 9).
Key transcription factor regulatory network
A transcriptional factor regulatory network that plays a key role in regulating 79 glycolysis-related genes was constructed by the TRRUST database and Cytoscape. The results suggest that transcription factors PPARGC1A, SREBF1, NR0B2, MYC, ZBTB7A, HIF1A, ATF2, USF2, JUN, CEBPA, SP1, RELA, USF1, and FOS have regulatory effects on several glycolysis-related genes, including PFKFB3, HK3, PKLR, and LDHC (Fig. 9).

Transcription factor regulatory network.
TF-gene interactions
A TF-gene regulatory network was constructed. Figure 10 represents the interaction network between 7 AD risk genes (ABCA7, APOE, PSEN1, PSEN2, SORL1, PLCG2, BDNF) and 7 glycolysis genes (GAPDHS, PKLR, PFKFB3, LDHB, LDHC, ALDOC, DLD).

TF-gene regulatory network between AD risk genes and glycolysis genes. Red node represents gene. Blue node represents transcription factor. AD risk genes: PSEN1, PSEN2, APOE, SORL1, ABCA7, PLCG2, BDNF. Glycolysis genes: GAPDHS, PKLR, PFKFB3, LDHB, LDHC, ALDOC, DLD.
DISCUSSION
The pathogenesis of AD is closely related to epigenetics, and DNA modifications such as methylation can provide more evidence for the etiology of AD. A large number of studies have shown that epigenetic changes in AD risk genes may become potential biomarkers for AD [7, 29]. Our research group has also found a number of DNA methylation risk genes associated with AD [30–32]. Although it is unclear what role gene methylation plays in the pathogenesis of AD, searching for epigenetic methylation markers for AD in simple and easily available human blood samples will improve our understanding for the early prediction and treatment of AD.
Many studies have identified a close relationship between AD and glycolysis. In the pathological characteristics of AD, the amount of amyloid-β plaques (Aβ) deposition is associated with decreasing aerobic glycolysis activity in glucose metabolism. This decrease in aerobic glycolysis activity occurs about 10–15 years prior to cognitive decline, suggesting that the dysfunction of energy metabolism may be earlier than the onset of AD [33]. Imaging studies have shown that lower aerobic glycolysis was associated with higher tau deposition in MCI (mild cognitive impairment, pre-AD) populations [34]. Researchers have also found that glycolysis not only plays a role in tumors [35–38], but also in the development of neurodegenerative diseases [39]. With the help of KEGG and GO analysis of gene expression profiles (GSE132903 dataset in GEO database), we knew from the KEGG pathway that the differential genes in AD and control groups were not only significantly enriched in glycolysis pathways which was consistent with the reported relationship between AD and glycolysis [33, 34], but also in calcium signaling pathway, MAPK signaling pathway, mTOR signaling pathway, and other pathways. As the cellular Ca2 + signals regulate several important processes in neurons [40, 41], some studies provided evidences which indicated the dyshomeostasis of Ca2 + may play a key role in the pathogenesis of AD [42, 43]. Recent studies have indicated that p38 MAPK could orchestrate diverse events related to AD, such as neuroinflammation, tau phosphorylation, neurotoxicity, and synaptic dysfunction [44]. It was also reported that the mammalian target of rapamycin (mTOR) may play a role in Aβ and tau which induced AD [45]. A research demonstrated the dysfunction of NRGs/ErbB-dependent synaptic plasticity may be a potential mechanisms leading to different neurological diseases [46]. Since the glycolysis in AD was rarely studied, we conclude that DNA methylation of glycolysis genes may plays an important role in the occurrence and development of AD based on the literatures and bioinformatics analysis.
Molecular pathogenesis and machine learning
We collected clinical data and blood samples from control and AD patients to assess the DNA methylation level of glycolysis genes. We analyzed the DNA methylation level of 79 glycolysis-related genes in the two groups of samples, and screened 8 related genes, of which 6 were upregulated and 2 were downregulated in AD.
Next, we constructed a novel AD prediction model using the DNA methylation level of glycolysis genes and various clinical indicators. In this model, 8 glycolysis genes selected from the glycolysis pathway in the previous work and dozens of clinical indicators were integrated.
We wanted to find the most important features in the data set for further study of AD. Based on the AD prediction model using a random forest algorithm, we established the disease risk factor model by SHAP to find the most relevant factors affecting the occurrence and development of AD [47–49]. We found that the 8 glycolysis genes were all ranked in the top 10 of most relevant factors contributing to AD disease risk. According to previous reports, several glycolysis genes have been associated with the pathogenesis of AD [50–55]. It has been found that glyceraldehyde-3-phosphate dehydrogenase (GAPDH), a key enzyme in glycolysis pathway, was overexpressed in AD, resulting in an increase in Aβ deposition [56]. GAPDH can also be used in the early diagnosis and treatment of AD [57]. An APP/PS1 double transgenic AD mouse model shows that the decrease in lactic acid content and lactic acid transporter in the brain may block the transport of lactic acid from glial cells to neurons, resulting in lactic acid deficiency in neurons, and eventually leading to aberrant neuronal energy metabolism and apoptosis [55]. Fructose-6-phosphate-2-kinase/fructose-2 Biphosphatase 6-bisphosphatase 3 (PFKFB3) is a key enzyme in glycolysis which plays an important role in cell cycle progression and the prevention of apoptosis. It has been reported that inhibition of PFKFB3, in human fetal astrocyte culture can increase the accumulation of Aβ in and around astrocytes. Furthermore, PFKFB3 levels are associated with the increase in astrocyte and Aβ plaque load in AD transgenic mice (TgCRND8) which overexpress human Aβ in an age-dependent manner [50]. Analysis of microglial cells from wild-type mice and APP/PS1 double transgenic mice showed that genotype-related glycolysis increased with increased expression of PFKFB3 and ferritin [58]. In a model of cerebral ischemia/reperfusion injury of primary mouse cortical neurons, YY1 and GAS5 bind to the PFKFB3 promoter region to promote PFKFB3 expression and neuron glycolysis, resulting in the aggravation of neuronal apoptosis induced by OGD/R [59]. Current reports assessing PFKFB3 mainly focus on the changes of PFKFB3 gene expression level, resulting in oxidative stress, inflammation, energy metabolism disorders, and eventually to neuronal apoptosis and cognitive impairment [50, 60]. There are few studies assessing PFKFB3 at the epigenetic level.
The AD risk factor model established by the SHAP algorithm can help us to identify features that play an important role in the development of AD [47–49]. This method has great importance for AD diagnosis and treatment. In the feature interpretation of SHAP model, the factors which promote the development of the patient (AD-14) to AD are GAPDHS, PKLR, PFKFB3, HK3, LDHB, LDHC, and DLD, in which GAPDHS, PKLR, and PFKFB3 are the main factors (Fig. 7B). In the normal sample (NC-42), GAPDHS, PKLR, PFKFB3, ALDOC, and LDHB are the factors that inhibit the progression to AD, and GAPDHS, PKLR, and PFKFB3 are still the main factors (Fig. 7C), which may provide a direction for the accurate treatment of the disease. At the same time, we found that in normal samples (NC-35), GAPDHS, PKLR, and PFKFB3 are the main factors which inhibit the progression of AD, while HK3 is the only factor which promotes the progression of AD (Fig. 7D). According to this SHAP model, we may accurately guide AD treatment options which modify epigenetics such as diet adjustment, in order to delay or prevent the occurrence of disease [61–63].
We also see that the three genes, GAPDHS, PKLR, and PFKFB3, impart a variable influence in AD pathogenesis (Fig. 7B, C). The expression values of SHAP in sample number AD-14 indicate an increase in the chance of AD, while in sample number NC-42 indicate a reduced chance of AD. It also shows that although the SHAP expression values of GAPDHS, PKLR, and PFKFB3 in sample NC-35 indicate a reduction in the chance of AD, the SHAP expression of HK3 indicates an increased risk of AD. Taken together, these results suggest that even in a normal population, there are some factors which can affect the process of AD, and these influencing factors vary from person to person. By establishing the SHAP model of glycolysis gene methylation in AD based on a random forest algorithm, we can effectively predict AD and find the important influencing factors leading to the disease. This strategy can provide accurate guidance and individualized treatment for the prevention, diagnosis, and treatment of the disease.
Through methylation-specific PCR (MSP) detection in blood samples from control and AD groups, we found that the DNA methylation levels of GAPDHS, PFKFB3, LDHC, HK3, ALDOC, LDHB, PKLR, and DLD gene promoter regions in AD group were significantly different from those in the control group. CpG loci showed a significant correlation with AD, suggesting that the DNA methylation abnormalities of these genes were related to AD and may be involved in the pathogenesis of AD. Interestingly, by assessing transcription factor regulatory networks we found that PFKFB3, HK3, PKLR, and LDHC were regulated by key transcription factors. Verified by AlzData, 5 of the 8 glycolytic genes mentioned above have expression differences at the transcriptional level between AD and control group, including PFKFB3, DLD, ALDOC, LDHB, and LDHC. This observation explains to some extent the significant difference of DNA methylation in the promoter regions of these genes between control and AD groups. In addition, we integrated 8 glycolysis genes and common AD risk genes [64, 65] to construct a TF-gene regulatory network, and found the interaction network between AD risk genes (ABCA7, APOE, PSEN1, PSEN2, SORL1, PLCG2, BDNF) and glycolysis genes. Therefore, further study is needed to explore the possible underlying mechanism of glycolysis genes in the pathogenesis of AD.
Advantages of machine learning in clinical use
Machine learning algorithms are limited in their scope, because the targets and data used in their development would affect the model. This paper abandoned the subjective selection of a machine learning algorithms, and instead used a variety of machine learning algorithms to develop the model. The best model was selected by assessing ROC_AUC cross-validation scores, which reduced the interference of artificial subjective factors on model selection. Finally, DCA was used to assess the model. Our prediction model established by a random forest algorithm had the highest ROC_AUC, and the net benefit rate of AD prediction assessed by DCA was much higher than that without model prediction, indicating that the model had significant clinical application value. The combination of physiological, biochemical, epigenetic characteristics and machine learning algorithms illustrates the potential for more clinical diagnoses in the future, as well as personalized general treatments.
Previous studies have shown that the diagnostic value of a single biomarker for AD is limited [11, 67]. An alternative method to develop an AD prediction model is deep leaning. However, deep learning is dependent on high-end hardware and suffers from poor performance when given small data sets. Compared with deep learning [68], the traditional machine learning method has many advantages including fast computing speeds and a more interpretable model. Considering the limited sample and data in this study, we used the traditional machine learning algorithm to establish our AD prediction model and introduced a variety of traditional machine learning algorithms for cross-validation. We selected the random forest algorithm due to its high ROC_AUC score, which indicated that it was optimal for model development.
In order to prove the accuracy of the prediction model, we compared our model with other machine learning prediction models or deep learning prediction models of AD. One study used deep learning (DL), extreme gradient enhanced (XGBoost), and random forest (RF) to analyze and model plasma metabolites in cognitive normal people and AD patients, and validated internally using 10-fold cross-validation [19]. Another study proposed a multi-mode recurrent neural network (a deep learning method) to predict the conversion from MCI to AD, and used 5-fold cross-validation to evaluate the model [69]. Ning An’s research proposed a three-tier framework based on deep learning for AD classification and used 5-fold cross-validation [70]. The optimal model selected in Lena Scheubert’s study is the AD prediction model based on linear support vector machine and has been assessed by 3-fold cross-validation [20]. However, compared to these alternative approaches, our model showed higher accuracy and ROC_AUC. In addition, none of the above models were assessed by DCA. We believe these alternative AD prediction models failed match our ROC_AUC score due to insufficient optimization of their model parameters. In the most recent study [71], multi-mode data integration was used to establish an AD prediction model, and even after 10-fold cross-validation the ROC_AUC of the machine learning model was still lower than that of our model. Furthermore, there were also no DCA accession conducted in their studies. In addition, multimodal data will increase the patients’ cost of detection. The advantages of our prediction model include good prediction accuracy and ROC_AUC, utilization of DCA for clinical application, and the use of blood samples which are easily available, less invasive, and less expensive. Taken together these benefits result in a test which is easier for patients to accept.
Finally, we found that in the univariate logistic regression analysis of this case (Supplementary Table 8), the correlation p value of LDHC and ALDOC were more than 0.05. As we use logical regression to make our prediction model, we usually first use univariate logistic regression analysis, and then remove the features which contain a correlation of p > 0.05 as the next step of multivariable logistic regression analysis. However, in the SHAP model of this case (Fig. 7A), LDHC and ALDOC are important disease risk factors. If we delete these features of p > 0.05 after univariate logistic regression analysis by routine analysis methods, and then do the next multivariable logistic regression analysis, we will miss the opportunity to find the important factors affecting the occurrence and development of AD. In addition, the DCA of machine learning is better than that of nomogram, which is helpful for better application in clinical use. This case shows that using machine learning models to extract features can more effectively find disease risk factors and biomarkers than previous strategies.
Our study had several limitations. First, the small number of samples may limit our interpretation of the model, more samples should be collected from different regions. Second, some clinical indicators such as C-reactive protein were not used due to a large amount of missing data (> 20%) or unavailability of data. This exclusion could result in an underestimation of the predictive effect of these clinical indicators. Although we have developed a model that can predict AD, as a larger patient population is applied to this model, other features may be prioritized, which enhances the diagnostic potential of pre-AD or AD patients. This is indeed the advantage of using machine learning models, because with the growth of the number of features and samples, the model continues to learn and develop and would predict more accurately. Therefore, in a later study, we will need to include more samples and AD-related risk indicators during model training in order to improve the prediction accuracy of the model and increase its interpretability from both biological and population perspectives.
In conclusion, after integrating machine learning and combined analysis, we have developed an effective prediction model (glycolysis gene methylation prediction model of AD) and identified GAPDHS, PFKFB3, LDHC, HK3, ALDOC, LDHB, PKLR, and DLD, which have been little characterized previously to be associated with AD. These genes may be considered as potential biomarker for early diagnosis of AD or even therapeutic target. Moreover, we found that machine learning performed better than nomograms in identifying genes. The SHAP model based on machine learning may find potential biomarkers of AD, which maximizes the interpretability of the model and provides a powerful tool for precision medicine. This study highlights the important role glycolysis genes play in the pathogenesis of AD and warrants further investigation.
Footnotes
ACKNOWLEDGMENTS
This work was supported by the National Key Research and Development Program of China (2016YFC1305900), National Natural Science Foundation of China (82001155), Natural Science Foundation of Ningbo(2019A610295, 2019A610290, 202003N4244), the major fund project of Ningbo Science and Technology Bureau (2019B10034), Ningbo science and technology plan project (202002N3165).
