Abstract
Background:
The current standard for Alzheimer’s disease (AD) diagnosis is often imprecise, as with memory tests, and invasive or expensive, as with brain scans. However, the dysregulation patterns of miRNA in blood hold potential as useful biomarkers for the non-invasive diagnosis and even treatment of AD.
Objective:
The goal of this research is to elucidate new miRNA biomarkers and create a machine-learning (ML) model for the diagnosis of AD.
Methods:
We utilized pathways and target gene networks related to confirmed miRNA biomarkers in AD diagnosis and created multiple models to use for diagnostics based on the significant differences among miRNA expression between blood profiles (serum and plasma).
Results:
The best performing serum-based ML model, trained on filtered disease-specific miRNA datasets, was able to identify miRNA biomarkers with 92.0% accuracy, and the best performing plasma-based ML model, trained on filtered disease-specific miRNA datasets, was able to identify miRNA biomarkers with 90.9% accuracy. Through analysis of AD-implicated miRNA, thousands of descriptors based on target genes and pathways were created, which can then be used to identify novel biomarkers and strengthen disease diagnosis.
Conclusion:
Development of a ML model including miRNA and their genomic and pathway descriptors made it possible to achieve considerable accuracy for the prediction of AD.
INTRODUCTION
Recent research on miRNA has drawn significant attention to the value of miRNA as biomarkers, since miRNA are key regulators of gene expression and are often implicated in the abnormal growth or function of certain cells, as is the case for cancer [1], cardiovascular [2], and neurodegenerative diseases [3]. We work along these lines to utilize specific miRNA and their individual features to help in Alzheimer’s disease (AD) diagnostics. According to the Alzheimer’s Association, in 2021 there were over 6 million US adults living with AD, and by 2050 this number is projected to rise to nearly 13 million [4]. Monitoring miRNA dysregulation can be used to elucidate the details of many biological processes relevant to disease pathogenesis. In our case, the pathology and mechanisms of AD, along with the dysregulation of miRNA, are leveraged as indicators for the presence of the disease.
Analysis for the presence of AD and the implicated miRNA biomarkers can be conducive towards the diagnosis of AD and the understanding of its pathogenesis, which proceeds from preclinical AD to mild cognitive impairment to mild dementia to moderate dementia then severe dementia [5]. The progression of AD is often marked by increasingly severe memory loss and overall neurocognitive dysfunction. Some signs for moderate to severe AD are mood or personality changes, hallucinations, delusions, paranoia, repetitive movements and agitation, inability to communicate, seizures, and weight loss [6]. Although the precise details of AD pathogenesis are unclear—as AD is a heterogeneous disease reliant on heredity, neurotransmitter, immune, and environmental factors—presently, the amyloid-β (Aβ) theory and tau protein theory are two of the main theories commonly used to explain AD pathogenesis [7]. The Aβ and tau protein theory attribute AD pathogenesis to “extracellular aggregates of Aβ plaques and intracellular neurofibrillary tangles made of hyperphosphorylated tau protein” in the human brain [8]. Prior research has indicated that diverging AD-related neuropathologies, such as Aβ and tau pathology, are the cause of common disturbances in neuronal miRNA expression [9], thereby introducing miRNA as a potential biomarker for the development of AD.
Approximately 2600 mature microRNAs (miRBase v.22) have been reported to be encoded by the human genome [10]. These microRNAs have been demonstrated to play major regulatory roles in developmental processes such as metabolism, cell proliferation, apoptosis, developmental timing, neuronal cell fate, and neuronal gene expression (as reviewed in [11]) by regulating gene expression post-transcriptionally. Typically, miRNAs will bind to the 3′-UTR (untranslated region) of their target mRNAs and “repress protein production by destabilizing the mRNA and translational silencing” [12]. Both the deficiencies and the excesses of miRNAs have been linked to a number of clinically important diseases including myocardial infarctions and various cancers [13]. Single point mutations in miRNA or the miRNA’s target or epigenetic silencing of miRNA transcription units are all mechanisms by which the functions of miRNA in a cell are affected [13]. miRNA biomarkers for AD are to be related to the pathology and mechanism of AD, such as the abnormal development of Aβ plaques. It has been established that abnormalities in certain miRNA expression, such as the loss of the miR-29 cluster, are shown to be associated with increased beta-amyloid precursor protein-converting enzyme (BACE1) expression and Aβ levels in sporadic AD patients [14].
The current state of AD diagnosis is indecisive, as AD is only diagnosed with complete certainty after death through a brain autopsy [15], and expensive, as often multiple specialists are required to confirm the diagnosis and brain scans could be costly. However, miRNA holds potential as a useful biomarker for non-invasive diagnosis and treatment of AD since they could be easily detected in a variety of biofluids including cerebrospinal fluid (CSF) and blood [16]. miRNA profiling has become a useful diagnostic tool in AD treatment and with methods such as microarray analysis and polymerase chain reaction (PCR) the abundance of miRNA in AD patients can be elucidated. Additionally, miRNA biomarkers provide a few other advantages over other biomarker candidates including protein and metabolite biomarkers: 1) novel miRNA biomarkers would be more easily discovered by genomic methods like oligonucleotide microarrays and deep sequencing, which possess a higher throughput than mass spectrometry, the predominant technique for both metabolite and protein biomarker discovery; 2) though no approach is available to detect low abundant proteins or metabolites, low abundant miRNA biomarkers can be readily amplified and then detected by real-time quantitative PCR (qPCR), an already FDA-approved approach in clinical tests; 3) earlier diagnosis may be more easily attained using miRNA biomarkers because of their upstream position in regulation cascades, when compared to proteins and metabolites [17]. Certainly, the identification of AD early on in its pathogenesis would be ideal via miRNA profiling of an AD patient’s plasma, serum, and CSF. Treatments that are known to be effective for reducing early to moderate AD symptoms and can help to slow the progression of AD may be more effective when administered earlier [18]. miRNA expression profiles from AD patients can be integrated with data from healthy patients to identify specific AD miRNA biomarkers. 
As for the specific type of miRNA profiles, miRNA from the blood (plasma or serum) would provide the miRNA that are most directly involved in the disease pathogenesis. Therefore, by elucidating disease mechanisms, miRNA profiling is a promising direction for AD diagnosis and treatment [19].
Machine learning in biomarker identification
Usage of artificial intelligence in biomarker-based diagnostics has rapidly proliferated within the last fifteen years. Early efforts in biomarker discovery were often centered around applications in cancer diagnosis and prognosis [20]. In a review conducted by Cruz and Wishart [20], most studies surrounding the applications of machine learning (ML) in cancer were increasingly dependent on protein biomarkers and microarray data and heavily reliant on existing machine-learning algorithms [20].
Currently, ML is used in various biomedical fields including disease diagnosis, prevention, therapeutics, and chemical toxicity risk assessment [21]. Additionally, as established by Zhang and coauthors [21], ML has been used increasingly in biomarker discovery, especially for classification in assessing an array of candidate biomarkers by discriminating between control groups and treatment groups through certain biomarker characteristics. Similarly, such classification techniques have been applied using miRNA as biomarkers. As demonstrated by Khoulenjani and coauthors [22], data-mining techniques and ML methods can be utilized to extract critical pieces of information from the relationship between cancer and miRNA to diagnose cancer. Likewise, similar concepts have been translated to AD diagnosis: in the work of Zhao and coauthors [23], a 70 AD:30 control split comprised of 96 samples yielded 76% accuracy, and an improved accuracy of 85.7% was obtained when multivariate random forest statistical analysis was applied to construct and test a miRNA signature for AD identification [23]. These early results signify the potential of ML with input miRNA for AD diagnosis while also highlighting the need for larger datasets with less bias and higher accuracy.
Studies connecting miRNA to AD
Using human 5S ribosomal RNA and 13 brain-enriched miRNAs, which were spotted onto GeneScreen Plus nylon membranes, Lukiw [24] employed DNA arrays to analyze and evaluate the expression of a subset of 12 miRNAs in the AD hippocampus in comparison with non-demented controls and fetal brain [24]. The results of the expression profiling showed that miR-128a, miR-9, and miR-125b, while not elevated in the control, were elevated in the hippocampus of AD patients. In a following study by Lukiw and Pogue [25], cultured human fetal brain-derived primary neural cells were shown to be induced for production of reactive oxygen species (ROS) in the presence of metal salts such as aluminum and iron sulfates, which “induce genes in cultured human brain cells that exhibit expression patterns similar to those observed to be up-regulated in moderate- to late-stage AD” [25]. These cells were demonstrated to have increased expression of miR-128, miR-9, and to a lesser extent miR-125b, suggesting that ROS influences AD brain through pathways specifically mediated by miRNAs such as those pathways that redirect brain cell fate towards progressive dysfunction and apoptotic cell death [25].
The connection between microRNAs and AD pathogenesis was established through accumulating evidence linking miRNAs to the expression of amyloid-β protein precursor (AβPP) and BACE1 [19]. Previous studies have demonstrated that increased AβPP levels can lead to the development of early-onset dementias, including AD. Patel and coauthors [26] utilized human cell lines to determine that overexpression of miRNAs hsa-mir-106a and hsa-mir-520c results in translational repression of AβPP mRNA and significantly reduces AβPP protein levels, signifying the first demonstration that miRNAs have the ability to regulate levels of human AβPP [26]. Similarly, in sporadic AD brains, Yang and coauthors [27] detected the upregulation of BACE1, which cuts AβPP in the first step of Aβ formation, and of its enzymatic activities. Later, Hébert and coauthors [14] linked miR-29a, miR-29b-1, and miR-9 to the regulation of BACE1 expression, utilizing a cell-culture model to assert a potential causal relationship between the expression levels of miR-29a/b-1 and Aβ generation wherein the loss of certain microRNAs can lead to increased Aβ and BACE1 levels in sporadic AD [14]. These results provide supportive evidence of the influence that miRNA, along with their associated pathways and genes, have on AD pathogenesis, specifically on Aβ plaque formation.
The subset of miRNA relevant in AD pathogenesis was expanded upon with the work of Liu and coauthors [28], who employed qRT-PCR and western blot analysis to show that miR-106b targets Fyn to inhibit Aβ1-42-induced tau phosphorylation at Tyr18, elucidating a new molecular mechanism behind the hyperphosphorylation of tau to form neurofibrillary tangles [28]. This further supports the role of miRNA in AD development, particularly in the hyperphosphorylation of tau that leads to neurofibrillary tangles.
Differences between miRNA biomarkers of various body fluids
The benefits of using miRNAs as biomarkers for disease diagnosis have been previously established: besides altered expression under different disease states, miRNAs are also very accessible and allow relatively non-invasive sample collection due to their presence and high stability in biofluids such as blood, urine, and saliva even after collection. Through methods like quantitative real-time PCR, microarrays, and next generation sequencing, miRNAs are relatively easy to work with and assay [29]. Furthermore, miRNA expression analyses are comparatively cheap and could be readily applied for “in-vitro diagnostic testing by molecular diagnostics and CLIA (Clinical Laboratory Improvement Amendments) laboratories” [30].
For neurodegenerative diseases such as AD, due to the proximity that human CSF has to the diseased tissue, CSF samples have the advantage of presenting a more stable set of biomarkers from the brain than that of blood samples [31]. However, CSF samples are much more invasive to obtain and unless there is a significant need, most patients are reluctant to go through with a lumbar puncture (spinal tap). Though blood serum contains miRNA signals from all tissues in the body, it is more readily available and thus much less invasive [31]. miRNAs have demonstrated their capacity as non-invasive blood- and serum-based biomarkers for numerous cancer and non-cancer human pathologies [32]. For miRNA in AD specifically, a study performed by Leidinger and coauthors [30] used a signature of 12 differentially expressed blood-based miRNAs to distinguish with high accuracies between AD patients and healthy controls, thereby establishing the potential for AD diagnosis through non-invasive blood drawing procedures.
Furthermore, differences in miRNA data between serum and plasma samples necessitate the identification of biomarkers specific to blood parts. As Wang and coauthors [33] described, there are differences between the expressions of miRNA from serum and from plasma: generally higher miRNA concentrations were observed in serum samples when compared to corresponding plasma samples [33]. Therefore, separate considerations must be made for processing of miRNA data depending on the different sample types.
As reviewed by Wei and coauthors [34], the updated list of miRNA implicated in AD has grown extensively, supporting both the diagnostic and therapeutic applications of potential miRNAs. However, despite the increased discovery of miRNA biomarkers, further miRNA profiling that takes into account miRNA features such as target genes and pathways may elucidate new putative markers more relevant to the molecular mechanisms of AD pathogenesis. Thus, there is a compelling need to develop more reliable methods of identifying AD miRNA biomarkers. Through the collection and analysis of certain miRNA and their features, common attributes can be identified, allowing for specific targeting of diagnosis and treatment for the disease. The goal of our study was to analyze blood-based miRNA sets implicated in AD, show their distinguishing and common features, and create a ML model that would be able to distinguish AD patients from healthy control subjects while accounting for the differential expression of key miRNA in plasma and serum samples.
METHODS
We used the following programs and databases for miRNA analysis and ML model development: miRPathDB [35, 36], GeneCards [37, 38], and Waikato Environment for Knowledge Analysis (WEKA) [39, 40]. The flowchart of methods is shown in Fig. 1.

Overview of the methods of the study. A selection of miRNA related to AD were found and verified from multiple published papers, analyzed using KEGG for roles within the AD pathway, paired with their corresponding attributes and then inputted into WEKA programs to build identification models. Attribute filtering was used to significantly reduce the number of attributes in the training set and then multiple classification methods were run to compare accuracies.
Our study began with the selection of dysregulated blood-based miRNA significantly related to the development and pathogenesis of AD, along with a selection of random miRNA that have not been implicated in AD. miRNA significantly associated with the development and pathogenesis of AD were extracted based on sample type: with information about miRNA from serum sources extracted from [41–50] while information about miRNA from plasma sources were extracted from [51–57] (as described in a review by Nagaraj and coauthors [58]). The completed training and testing sets of miRNA contain only miRNA that were shown to be significantly associated with AD. Selected miRNA were then attributed their corresponding miRNA features: experimentally proven target genes found using miRPathDB [35, 36], and related pathways were found using GeneCards [37, 38]. This initial dataset was later split and trimmed depending on the sample type (postmortem, exosome, serum, or plasma)—of which serum and plasma became the final dividers (Fig. 2). These separate serum- and plasma-selected miRNA sets were analyzed and cross-referenced using KEGG [59–61]. KEGG elucidated the role of the selected miRNA within the greater context of known AD pathways, as well as which genes are targeted by dysregulated miRNA present in both fluids. We tested the performance of five different ML methods—specifically the Multilayer Perceptron (MLP) classifier, the Naïve Bayes (NB) classifier, the Random Tree (RT) classifier, the Random Forest (RF) classifier, and the ZeroR (ZR) classifier—in WEKA [39, 40] to create a miRNA-based sample-specific model for identification of and differentiation between AD and control.

Overview of the fluid-specific divisions of the study. Following splits to exclude postmortem and exosome samples, two sets specific to fluid type were created (serum and plasma). Each training set is evaluated using multiple classifiers and 10-fold cross validation. The testing proceeded in two stages: one in which models were tested with confirmed AD miRNA biomarkers (which was further split into “clean” testing sets with no overlap and “natural” testing sets with some overlap) and another in which models were tested with miRNA biomarkers from another disease (coronary artery disease).
Kyoto Encyclopedia of Genes and Genomes (KEGG) database
The KEGG database [59–61] (Kanehisa Lab) features a vast collection of comprehensive pathway maps illustrating molecular interaction, reaction, and relation networks relating to metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases, and drug development. KEGG Pathway has been referenced across hundreds of papers and includes 801,598 pathway maps and references as of 2021. Each pathway map includes enzyme–enzyme interactions, gene expression relations, protein–protein interactions, and references to more in-depth information for each object entry.
miRPathDB
miRPathDB 2.0 [35, 36] is a dictionary database that provides extended analysis functionality and intuitive visualizations for 27,452 (candidate) miRNAs, 28,352 targets, and 16,833 pathways for Homo sapiens. Each entry is supplemented with locational and sequence information as well as genomic targets (which include differential labeling between predicted and experimental evidence), significant pathways, and similar miRNAs. The software’s miRNA target data is retrieved from the June 2019 versions of MiRanda, miRTarBase, and TargetScan.
GeneCards
GeneCards 5.3 [38] is a searchable, integrative database of 330,601 entries that provides comprehensive, user-friendly information on all annotated and predicted human genes from over 190 data sources. Each entry includes genomic, transcriptomic, proteomic, genetic, clinical, and functional information about both protein-coding and RNA genes including lncRNAs, piRNAs, and other ncRNAs. The software draws pathway information from external sources such as the NCBI BioSystems Database and KEGG.
Machine-learning analysis
ML analysis was performed on the miRNA training and testing datasets with the WEKA program environment [39, 40]. WEKA is an open-source workbench that encompasses a variety of tools for data preparation and filtering, classification and pattern recognition, clustering, association rule mining, and visualization.
Attributes for the miRNA were drawn from the miRPathDB [35, 36] and GeneCards [37, 38] databases which introduced over 10,000 attributes corresponding to an individual miRNA’s validated target genes and significantly overrepresented pathways. These genomic and pathway-centric attributes highly contribute to the construction of a ML classifier, as they are the source for the vast amount of information necessary for effective pattern recognition. Specifically, for each miRNA the constructed descriptors set includes a unique set of validated target genes and pathways.
In order to create a comprehensive list of all the unique descriptors from each miRNA, we developed code to standardize the attributes for each instance of miRNA. By iterating through a unique dictionary list of genes and pathways (extracted from the collection of target genes and pathways each miRNA has), a value of 0 or 1 is assigned to each miRNA for each gene and pathway. Through this, a data set is constructed one column at a time, with each column representing either a genomic attribute (for example, AβPP) or a pathway attribute (for example, Parkinson’s disease) and each row representing a single miRNA and its 0 or 1 descriptor values. “1” indicates that the miRNA possesses that column’s gene target or pathway, while “0” indicates that the miRNA does not possess that column’s gene target or pathway. Details of our code can be found here: https://github.com/amywx08/github-public.
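The 0/1 encoding described above can be illustrated with a minimal sketch. It is not the authors' code from the linked repository; the function name `build_descriptor_table`, the input layout, and the placeholder miRNA names are assumptions made only for illustration.

```python
# Hypothetical input: each miRNA mapped to its target genes and pathways.
# Names and values are placeholders, not data from the study.
mirna_features = {
    "miR-A": {"genes": {"APP", "BACE1"}, "pathways": {"Alzheimer's disease"}},
    "miR-B": {"genes": {"BACE1"}, "pathways": {"Parkinson's disease"}},
    "miR-C": {"genes": set(),
              "pathways": {"Alzheimer's disease", "Parkinson's disease"}},
}


def build_descriptor_table(features):
    """Return (columns, rows): one 0/1 value per miRNA per gene or pathway."""
    # Unique, sorted "dictionary" of every gene and pathway seen in any miRNA.
    genes = sorted({g for f in features.values() for g in f["genes"]})
    pathways = sorted({p for f in features.values() for p in f["pathways"]})
    columns = [("gene", g) for g in genes] + [("pathway", p) for p in pathways]

    rows = {}
    for name, f in features.items():
        row = []
        for kind, value in columns:
            owned = f["genes"] if kind == "gene" else f["pathways"]
            row.append(1 if value in owned else 0)  # 1 = miRNA has this descriptor
        rows[name] = row
    return columns, rows


columns, rows = build_descriptor_table(mirna_features)
for name, row in rows.items():
    print(name, row)
```

Each row can then be written out together with a class label (AD or non-AD) in whatever tabular format the downstream tool expects.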
Final genomic attributes include certain miRNA gene targets such as APP, BACE1, MAP2K1, GSK3B, and AKT3, all of which are implicated in the development and pathogenesis of AD (Fig. 4A, B), as well as miscellaneous gene targets that either have no direct linkage with AD, such as TNK2, SLCGA15, and GTF2H3, or play an important role in diseases similar to AD, such as HIP1, which encodes a key protein in Huntington’s disease that initiates apoptosis [62]. HIP1 and other included gene targets like it might play similarly significant roles throughout the pathogenesis of other neurodegenerative diseases (including AD). Final pathway attributes reflected the implication of the miRNA in a variety of pathways including those implicated in neurodegeneration and brain dysfunction, such as Parkinson’s disease, Huntington’s disease, AD, schizophrenia, anxiety, and cocaine dependence. Additional pathway attributes reflected the role of individual miRNA in cancer pathways such as MicroRNAs in cancer, pancreatic cancer, prostate cancer, brain cancer, and lung cancer, as well as gene expression-related pathways, such as gene silencing by miRNA, miRNA regulation of DNA damage response, and RNA processing. Representatives for these genomic and pathway descriptors are shown in Tables 1 and 2.
Representatives of included genomic attributes
Representatives of included pathway attributes
Each of these genomic and pathway attributes (exemplified in the righthand columns of Tables 1 and 2) are nominal descriptors. Within each inputted data set, if a miRNA has “0” for a genomic attribute value (for example, AβPP), the miRNA does not target that specific gene (0 = does not target AβPP). If a miRNA has “1” for a genomic attribute value, the miRNA does target that specific gene (1 = does target AβPP). For pathway attribute values (for example, Parkinson’s disease), “0” means that the miRNA is not involved in a specific pathway (0 = not involved in Parkinson’s disease), while “1” means that the miRNA is involved in that specific pathway (1 = involved in Parkinson’s disease).
Following the creation of these sets, the InfoGain attribute evaluator and the ranker search method were utilized to rank the significance of all these attributes by the amount of information gained with respect to class. After InfoGain attribute filtering, using a threshold of 0.0306, the number of attributes for the training dataset was reduced from 11,731 attributes to 704 for the final serum model (excluding both postmortem and exosome samples) and using a threshold of 0.07, 11,253 attributes decreased to 54 for the final plasma model (excluding both postmortem and exosome samples). The Multilayer Perceptron classifier (MLP) method and the Random Forest classifier (RF), both of which performed with the highest accuracy, were used to build robust models via the training sets and the corresponding testing sets. Other methods and classifiers including NB, RT and ZR were attempted as well.
The model was trained to detect the patterns of potential miRNA blood-based biomarkers of AD for the purpose of diagnosis with specificity for plasma-based miRNA biomarkers and serum-based miRNA biomarkers. Each separated dataset was then individually analyzed using various algorithms of ML classification. To demonstrate the validity of these models, we tested the created ML models with both “clean” test data (which had no overlap with the corresponding training set) and “natural” test data (which included some overlap with the corresponding training set) on both trained serum and plasma models. An overview of the workflow is shown in Fig. 2. As a second test to further demonstrate the validity of the model, we tested the model on an independent set of miRNA proven to be significantly associated with AD and a set of miRNA proven to be significantly associated with another disease (i.e., coronary artery disease, CAD). Model performance was evaluated by measuring accuracy, which here is understood as the number of correctly classified instances divided by the total number of instances.
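The accuracy measure defined above, combined with k-fold cross-validation, can be sketched in a few lines. This is an illustrative sketch and not the WEKA procedure: the fold assignment (instance i goes to fold i mod k), the helper names, and the majority-class baseline standing in for a ZeroR-style classifier are all assumptions.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Correctly classified instances divided by the total number of instances."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def k_fold_indices(n, k):
    """Assumed fold assignment: instance i belongs to fold i % k (no shuffling)."""
    return [[i for i in range(n) if i % k == fold] for fold in range(k)]


def majority_baseline_cv(labels, k=10):
    """Pooled accuracy of a majority-class baseline under k-fold cross-validation."""
    predicted, actual = [], []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train = [lab for i, lab in enumerate(labels) if i not in held_out]
        # Predict the most frequent training label for every held-out instance.
        majority = Counter(train).most_common(1)[0][0]
        predicted += [majority] * len(test_idx)
        actual += [labels[i] for i in test_idx]
    return accuracy(predicted, actual)


# Placeholder labels for a balanced set of 50 AD and 50 non-AD instances.
labels = ["AD"] * 50 + ["control"] * 50
baseline = majority_baseline_cv(labels, k=10)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

On a perfectly balanced set, a baseline of this kind cannot exceed chance level, which is why such a classifier is useful only as a reference point for the others.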
RESULTS
The selection of AD-dysregulated miRNA was obtained from studies focused on miRNA taken from serum sources [41–50] and studies focused on miRNA taken from plasma sources [51–57] (as described in a review by Nagaraj and coauthors [58]). These miRNA are categorized below in Fig. 3 based on blood fluid type. These miRNAs are identical to those used for pathway analysis and to build the final machine-learning models discussed later in this section.
Discrepancies between the serum and plasma miRNA profiles can be explained by the differences in molecular composition between the two fluids, as human plasma retains the clotting factor fibrinogen, while human serum does not [63]. As asserted by Wang and coauthors [33], the repertoire of circulating miRNA might be altered by the coagulation process as RNA is released. This coagulation process intensifies variations between samples on the observed proteome, thereby complicating comparisons and data analysis [33]. Similar conclusions have been the basis for other studies surrounding miRNA expression in disease. For instance, Mompeón and coauthors [64] found that methodological variances in studies of cardiovascular disease compromised the potential of miRNA diagnostic and prognostic methods. They concluded that plasma and serum exhibited different circulating miRNA expression in non-ST-elevation myocardial infarction, implying that studies that did not use the same starting fluid could not be compared [64]. Therefore, for the purposes of AD diagnosis, the need to separate dysregulated miRNA by their sample type is clear.
Furthermore, when the serum miRNA biomarkers for AD from Fig. 3 were cross-referenced against 50 serum miRNA used for Parkinson’s disease diagnosis [65–70], represented in Supplementary Table S1, there was only an 8% similarity. Therefore, the overlap is small enough for differentiation in our analysis.
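A percentage overlap of this kind can be computed with a simple set intersection. The sketch below assumes that similarity is the share of the reference set also found among the candidates; the text does not state the exact definition used, and the miRNA names are placeholders.

```python
def percent_overlap(candidates, reference):
    """Percentage of the reference set that also appears among the candidates."""
    shared = set(candidates) & set(reference)
    return 100.0 * len(shared) / len(set(reference))


# Placeholder names: a reference set of 50 entries, 4 of which are also candidates.
reference_set = [f"miR-ref-{i}" for i in range(50)]
candidate_set = reference_set[:4] + [f"miR-other-{i}" for i in range(46)]

print(percent_overlap(candidate_set, reference_set))  # 4 shared of 50 -> 8.0
```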

Similarities and differences between the AD-implicated miRNA dysregulated in serum and dysregulated in plasma.
Pathway analysis
Additionally, we connected the miRNAs represented in Fig. 3 to their validated target genes, implicated pathways, and biological role in AD pathogenesis and development. The manually constructed pathway maps available in the KEGG database [59–61] were utilized to analyze the involvement of the selected miRNAs in AD. hsa05010 was used in particular to quantify this connection as it illustrates how Aβ is connected to pathological effects within neurons via the alteration of protein signal transduction leading to the formation of Aβ aggregates—a key indicator of AD development—mitochondrial dysfunction, and cell death [71]. Therefore, the validated target genes of the selected miRNA from each fluid were cross-referenced against those present in the AD pathway. Our results demonstrate that many of the genes present in the AD pathway are targeted by the selected miRNA, some even by multiple miRNA. Details for each fluid are illustrated in Fig. 4A and 4B and Supplementary Table S2.
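The cross-referencing step amounts to counting, for every gene in the pathway, how many of the selected miRNA list it as a validated target. The following is a minimal sketch under that reading; the function name and the small example inputs are hypothetical and do not reproduce the KEGG hsa05010 gene list.

```python
def count_targeting_mirna(mirna_targets, pathway_genes):
    """For each pathway gene, count the selected miRNA that target it."""
    counts = {gene: 0 for gene in pathway_genes}
    for targets in mirna_targets.values():
        # Only genes that belong to the pathway are counted.
        for gene in set(targets) & set(pathway_genes):
            counts[gene] += 1
    return counts


# Placeholder inputs for illustration only.
pathway_genes = ["APP", "BACE1", "GSK3B"]
mirna_targets = {
    "miR-A": ["APP", "BACE1", "TNK2"],
    "miR-B": ["BACE1"],
}

counts = count_targeting_mirna(mirna_targets, pathway_genes)
print(counts)
```

Genes with a count of zero correspond to pathway genes targeted by none of the selected miRNA, and higher counts correspond to genes targeted by several.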

AD pathway with the target genes of the selected serum miRNA marked by stars. The schema represents the AD pathway in KEGG. Genes that are marked by a darker red star have fewer of the selected miRNA targeting them, while genes with a brighter red star have more of the selected miRNA targeting them. Those genes with no stars are targeted by none of the selected miRNA.

AD pathway with the target genes of the selected plasma miRNA marked by stars. The schema represents the AD pathway in KEGG. Genes that are marked by a darker blue star have fewer of the selected miRNA targeting them, while genes with a brighter blue star have more of the selected miRNA targeting them. Those genes with no stars are targeted by none of the selected miRNA.
Machine-learning analysis
The miRNA in this study were characterized using their target genes [35, 36] and implicated pathways [37, 38]. These descriptors were preprocessed using the InfoGain filtration technique from WEKA to strengthen the information gain and connections present in the data while also eliminating any noise from redundant or non-important variance patterns. The InfoGain attribute evaluator determines the worth of an attribute (e.g., a specific target gene or pathway) by measuring the information gain with respect to the class (AD or non-AD), where InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute) and H denotes entropy. A threshold is then used to remove attributes that do not have an information gain beyond a certain number. Using WEKA [39, 40], we created disease-recognition models for AD and explored various classification techniques including NB, ZR, RF, RT, and MLP. The results of the tests are as follows.
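The information-gain formula and the threshold filter can be made concrete with a short sketch. It is a plain re-implementation for illustration, not WEKA's own code; the helper names, the strict "greater than" comparison, and the toy attribute table are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(labels) in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(attribute_values, labels):
    """InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute)."""
    total = len(labels)
    conditional = 0.0
    for value in set(attribute_values):
        subset = [lab for a, lab in zip(attribute_values, labels) if a == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def filter_attributes(table, labels, threshold):
    """Keep attributes whose gain exceeds the threshold (strictness is assumed)."""
    return {name: values for name, values in table.items()
            if info_gain(values, labels) > threshold}


# Toy data: four instances, two binary descriptors (placeholders).
labels = ["AD", "AD", "control", "control"]
table = {
    "gene_X": [1, 1, 0, 0],     # separates the classes perfectly
    "pathway_Y": [1, 0, 1, 0],  # carries no class information
}

kept = filter_attributes(table, labels, threshold=0.0306)
print(sorted(kept))
```

In this toy table the first descriptor has a gain of one bit and the second a gain of zero, so only the first survives the filter.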
Performance for diagnostics using serum-based biomarkers
To address the discrepancy between serum and plasma miRNA profiles, a large compilation of AD-dysregulated miRNA (with the exclusion of postmortem and exosome samples) was extracted from a review by Nagaraj and coworkers [58] and split by sample type (serum sources [41–50] and plasma sources [51–57]). The resulting serum dataset included 100 miRNA (with 50 miRNA being AD biomarkers and 50 not), which was then filtered with a threshold of 0.0306 to leave 704 attributes. The cross-validation test yielded an average accuracy of below 80% for the ZR and NB classifiers, while RT yielded an average accuracy of 82%, RF yielded an average accuracy of 86%, and MLP yielded an average accuracy of 92%. A visual comparison of the accuracy of the five classifiers is shown in Fig. 5 and the corresponding area under the ROC curve (AUC) values are illustrated in Supplementary Figure S1. Additional attempts at increasing accuracy were made by varying the threshold levels for InfoGain attribute filtering, as shown in Supplementary Figure S2.

Performance of each classifier on a serum-based biomarker dataset with 50 AD and 50 non-AD miRNA (with the exclusion of postmortem and exosome samples) is visually displayed by a bar graph for the accuracies of the five classifiers.
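To make the cross-validation procedure concrete, the following is a minimal Python sketch of stratified 10-fold cross-validation with a ZeroR-style majority-class baseline. It is an illustration under stated assumptions rather than the WEKA code used in the study: the fold assignment is a simple round-robin without shuffling, and the helper names are hypothetical. Only the 50/50 class balance is taken from the serum dataset described above.

```python
from collections import Counter


def stratified_folds(labels, k):
    """Assign sample indices to k folds, spreading every class evenly."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def zero_r_fit(train_labels):
    """ZeroR-style baseline: always predict the most frequent training class."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validate(labels, k=10):
    """Mean accuracy of the majority-class baseline over k stratified folds."""
    accuracies = []
    for test_indices in stratified_folds(labels, k):
        held_out = set(test_indices)
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        prediction = zero_r_fit(train_labels)
        correct = sum(1 for i in test_indices if labels[i] == prediction)
        accuracies.append(correct / len(test_indices))
    return sum(accuracies) / len(accuracies)


# Class balance of the serum dataset: 50 AD and 50 non-AD miRNA.
labels = ["AD"] * 50 + ["non-AD"] * 50
mean_accuracy = cross_validate(labels, k=10)
print(f"{mean_accuracy:.1%}")
```

On a balanced dataset a majority-class baseline cannot exceed chance in this setup, which gives a reference point against which the accuracies of the other classifiers can be read.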
Performance for diagnostics using plasma-based biomarkers
For the resulting plasma dataset, 88 miRNAs were included (44 AD biomarkers and 44 not). These were then filtered with a threshold of 0.07 to leave 54 attributes. The cross-validation test yielded an average accuracy of below 80% for the ZR and NB classifiers, while RT yielded an average accuracy of 88.6%, RF yielded an average accuracy of 90.9%, and MLP yielded an average accuracy of 85.2%. A visual comparison of the accuracy of the five classifiers is shown in Fig. 6 and the corresponding AUC values are illustrated in Supplementary Figure S3. Additional attempts at increasing accuracy were made by varying the threshold levels for InfoGain attribute filtering, as shown in Supplementary Figure S4.

Performance of each classifier on a plasma-based biomarker dataset with increased miRNA (with the exclusion of postmortem and exosome samples) is visually displayed by a bar graph for the accuracies of the five classifiers.
Validation via outside testing sets
AD testing set on serum
To further ensure the reliability of the serum model, we ran additional tests using miRNA shown to be dysregulated in the serum of AD patients. The miRNAs for the serum testing sets (“clean” and “natural”) were extracted from a paper published by Wu and coauthors [72], which identified miR-146a-5p, miR-106b-3p, miR-195-5p, miR-20b-5p, miR-497-5p, miR-125b-3p, miR-29c-3p, miR-93-5p, and miR-19b-3 to be significantly dysregulated in the serum of AD patients [66]. The “natural” test set is simply all the miRNA extracted from a study separate from those used for the training set, whereas the “clean” test set includes no miRNA that overlap with the training set. Here, the “clean” test set contained six miRNA with no overlap with the serum training set used previously, and the “natural” test set contained seven miRNA with some overlap (>35%) with that training set. Both datasets were compatible with the previous training set in attribute number (704 descriptors) and type. When run with MLP, the best performing classifier for the serum training set, the “clean” test set yielded 83.3% accuracy while the “natural” test set yielded 88.9% accuracy (Fig. 7).

Accuracies of the clean and natural testing sets when supplied to the trained serum and plasma models.
AD testing set on plasma
The same procedure was run to ensure the reliability of the plasma model. The miRNAs for the plasma testing sets, both “clean” (no overlap with the plasma training set) and “natural” (some overlap with the plasma training set), were extracted from a paper published by Cosín-Tomás and coauthors [73], which identified hsa-miR-142-3p, hsa-miR-15b-5p, hsa-miR-545-3p, hsa-miR-34a-5p, hsa-miR-29b-5p, and hsa-miR-29b-3p to be significantly dysregulated in the plasma of AD patients [73]. The additional tests included a “clean” test set of 10 miRNA with no overlap with the plasma training set used previously and a “natural” test set of 9 miRNA with some overlap (>35%) with that training set. Both datasets were compatible with the previous training set in attributes (54 descriptors). When run with RF, the best performing classifier for the plasma training set, the “clean” test set yielded 78.6% accuracy while the “natural” test set yielded 85.7% accuracy (Fig. 7).
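The distinction between “natural” and “clean” test sets comes down to a set operation on miRNA identifiers, which the following Python sketch illustrates. The test set lists the plasma miRNAs named above, but the training-set membership shown here is hypothetical and chosen only for illustration; the function names are likewise assumptions.

```python
def overlap_fraction(test_mirnas, training_mirnas):
    """Fraction of distinct test miRNA that also occur in the training set."""
    test = set(test_mirnas)
    return len(test & set(training_mirnas)) / len(test)


def clean_subset(test_mirnas, training_mirnas):
    """Drop every test miRNA that is already present in the training set."""
    training = set(training_mirnas)
    return [mirna for mirna in test_mirnas if mirna not in training]


# "Natural" test set: all miRNA taken from the external study.
natural = [
    "hsa-miR-142-3p", "hsa-miR-15b-5p", "hsa-miR-545-3p",
    "hsa-miR-34a-5p", "hsa-miR-29b-5p", "hsa-miR-29b-3p",
]

# HYPOTHETICAL training-set members, for illustration only.
training = ["hsa-miR-15b-5p", "hsa-miR-34a-5p", "hsa-miR-545-3p", "hsa-miR-320a-5p"]

clean = clean_subset(natural, training)
print(overlap_fraction(natural, training), clean)
```

Evaluating on the clean subset gives the stricter estimate, since none of its miRNA were seen during training.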
Prediction performance for other disease biomarkers
A few tests were run to demonstrate that both the serum and the plasma models are specific to AD and detect only AD-related miRNA rather than those of some other disease. For this, we submitted miRNA datasets of another disease to the corresponding classifier model, taking into account the sample profile (serum or plasma) of the non-AD disease miRNA submitted. The selected disease was CAD. CAD’s miRNA biomarkers, which were extracted from a review by Melak and Bayes [74], were input into the corresponding classifier model. The results were as expected: the testing set of 14 miRNA dysregulated in CAD serum resulted in a 42.9% average accuracy of AD prediction, and the testing set of miRNA dysregulated in CAD plasma resulted in a lower average accuracy of 33.3% (Fig. 8). These accuracies, under or near the 50% chance level, support the efficacy of the two classifier models.

Accuracies of the CAD serum and plasma testing sets when supplied to the trained serum and plasma models.
Comparing attribute contribution
When the previous serum-based biomarker model was run with isolated features (pathways only and target genes only), the resulting accuracies (Fig. 9) indicated that a model with both attribute types yields the greatest performance (92%).

Accuracies on the serum-based biomarker model with attribute isolation and attribute combination, as obtained from 10-fold cross-validation.
When the previous plasma-based biomarker model was run with isolated features (pathways only and target genes only), the resulting accuracies (Fig. 10) indicated that a model with both attribute types yields the greatest performance.

Accuracies on the plasma-based biomarker model with attribute isolation and attribute combination, as obtained from 10-fold cross-validation.
Adjusting for limitations in miRNA count
AD miRNA were randomly chosen and removed from the serum-based biomarker dataset and then re-entered into the model for 10-fold cross-validation. This process was repeated with increasing numbers of AD miRNA, and the results are represented in Fig. 11. Using 10 AD miRNA, an 80% accuracy level can still be maintained.

Accuracies of the serum-based biomarker model with increasing AD miRNA, as obtained from 10-fold cross-validation.
AD miRNA were randomly chosen and removed from the plasma-based biomarker dataset and then re-entered into the model for 10-fold cross-validation. This process was repeated with increasing numbers of AD miRNA, and the results are represented in Fig. 12. Using 24 AD miRNA, an 87.5% accuracy level can still be maintained.

Accuracies of the plasma-based biomarker model with increasing AD miRNA, as obtained from 10-fold cross-validation.
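The subsampling experiment summarized above can be sketched in Python as follows. This is only an outline under assumptions not stated in the text: the non-AD miRNA are kept unchanged for every subset size, the identifiers are placeholders, and placeholder_evaluate is a hypothetical stand-in for training a classifier and running 10-fold cross-validation.

```python
import random


def subsample_curve(ad_mirnas, non_ad_mirnas, sizes, evaluate, seed=0):
    """Score the model on a random AD subset of each requested size."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    curve = {}
    for size in sizes:
        kept_ad = rng.sample(ad_mirnas, size)  # random subset, no replacement
        curve[size] = evaluate(kept_ad, non_ad_mirnas)
    return curve


# Placeholder identifiers, not real miRNA names.
ad = [f"ad_{i}" for i in range(50)]
non_ad = [f"ctrl_{i}" for i in range(50)]


def placeholder_evaluate(kept_ad, controls):
    """HYPOTHETICAL stand-in for 10-fold cross-validation of a classifier.

    It returns the AD share of the reduced dataset so the sketch runs;
    a real experiment would return the cross-validated accuracy instead.
    """
    return len(kept_ad) / (len(kept_ad) + len(controls))


curve = subsample_curve(ad, non_ad, sizes=[10, 20, 30, 40, 50],
                        evaluate=placeholder_evaluate)
print(curve)
```

Passing the evaluation step in as a function keeps the sampling logic independent of the classifier, so the same loop could be reused for the serum and plasma models.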
DISCUSSION
In the past decades, miRNAs have been increasingly recognized as a key tool for deeper insight into a wide variety of biological processes, including the development of many human diseases such as cancer, cardiovascular diseases, and neurodegenerative diseases (not limited to AD) [1–3]. For neurodegeneration specifically, the misregulation of miRNA through errors in the microRNA pathway has been shown to be essential in the pathogenesis of multiple neurodegenerative conditions including AD, amyotrophic lateral sclerosis, and frontotemporal dementia [75]. While the details of AD pathogenesis and development remain unclear, two potential mechanisms have emerged: the Aβ theory and the tau protein theory. The Aβ theory attributes the development of AD to the accumulation of Aβ plaques, and the tau protein theory attributes AD pathogenesis to neurofibrillary tangles of hyperphosphorylated tau protein.
To better understand how the dysregulation of certain miRNA may lead to the development of AD, selected miRNAs and pathway analysis were used to elucidate the relationship between miRNA and gene targets within the broader context of an AD pathway. Surprisingly, although the selected miRNA (identified in Fig. 3) for either fluid make up only a portion of all miRNA implicated in AD pathogenesis (without accounting for dysregulated miRNA from postmortem and exosome studies), more than half of the genes present in the pathway were targeted by a selected miRNA. Specifically, casein kinase 2 (CK2) was determined to be targeted by ten unique dysregulated plasma miRNAs (Fig. 4A), which are hsa-miR-320a-5p, hsa-miR-320a-3p, hsa-miR-2110, hsa-let-7b-5p, hsa-miR-10b-5p, hsa-miR-301a-3p, hsa-miR-1260a, hsa-miR-103a-3p, hsa-miR-193a-5p, and hsa-miR-186-5p. ERK1/2 was determined to be targeted by eight unique dysregulated serum miRNAs (Fig. 4B), which are hsa-miR-106a-5p, hsa-miR-106b-5p, hsa-miR-143-3p, hsa-miR-335-5p, hsa-miR-143-5p, hsa-miR-137-3p, hsa-miR-9-3p, and hsa-miR-93-3p. As expected, CK2 is implicated in pathways of neurodegeneration (multiple diseases) including not only AD but also syndromic neurodevelopmental disorder and the prion disease pathway; prion disease, similar to AD, is caused by abnormal accumulation of protein in the brain, eventually causing memory impairment, personality changes, and difficulties with movement. Additionally, ERK1/2 is implicated in similar neurodegenerative pathways (AD, prion disease, etc.) but also in many pathways related to neuron function (such as axon guidance, neurotrophin signaling pathway, glutamatergic synapse, cholinergic synapse, serotonergic synapse, etc.) and cancer (colorectal cancer, renal cell carcinoma, pancreatic cancer, endometrial cancer, prostate cancer, thyroid cancer, etc.).
Both of these genes play a role in the PD-L1 expression and PD-1 checkpoint pathway in cancer, which may provide some insight into the pathogenesis of AD. However, deeper analysis must be conducted before any conclusions are drawn.
Additionally, miRNA constellations indicate alignment with amyloid generation. In particular, eight miRNAs of the elucidated biomarkers target AβPP, the gene encoding the amyloid precursor protein, which gives rise to the Aβ peptide that forms amyloid plaques. There are six miRNAs in serum that target AβPP: hsa-miR-106a-5p, hsa-miR-106b-5p, hsa-miR-455-3p, hsa-miR-497-5p, hsa-miR-101-3p, and hsa-miR-20a-5p. There are two miRNAs in plasma that target AβPP: hsa-miR-320a-5p and hsa-miR-15b-5p. This supports the hypothesized AD pathology in which AD arises from extracellular aggregates of Aβ in the brain. Further, these biomarkers suggest the involvement and possible interaction of miRNAs in both serum and plasma in the development of AD.
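The per-gene counts discussed above amount to inverting a list of miRNA-target interactions into a gene-centered index. The Python sketch below shows one way to do this, restricted to the ERK1/2 and AβPP interactions named in the text; the data layout and the function name are assumptions made for illustration and do not describe the tools used in the study.

```python
from collections import defaultdict

# miRNA reported in the text as targeting each gene, grouped by blood fraction.
reported = {
    ("serum", "ERK1/2"): [
        "hsa-miR-106a-5p", "hsa-miR-106b-5p", "hsa-miR-143-3p", "hsa-miR-335-5p",
        "hsa-miR-143-5p", "hsa-miR-137-3p", "hsa-miR-9-3p", "hsa-miR-93-3p",
    ],
    ("serum", "AβPP"): [
        "hsa-miR-106a-5p", "hsa-miR-106b-5p", "hsa-miR-455-3p",
        "hsa-miR-497-5p", "hsa-miR-101-3p", "hsa-miR-20a-5p",
    ],
    ("plasma", "AβPP"): ["hsa-miR-320a-5p", "hsa-miR-15b-5p"],
}

# Flatten into (fluid, miRNA, gene) triples, the form an interaction table usually takes.
interactions = [(fluid, mirna, gene)
                for (fluid, gene), mirnas in reported.items()
                for mirna in mirnas]


def mirnas_per_gene(triples):
    """Index the set of distinct miRNA targeting each (fluid, gene) pair."""
    index = defaultdict(set)
    for fluid, mirna, gene in triples:
        index[(fluid, gene)].add(mirna)
    return index


index = mirnas_per_gene(interactions)
counts = {key: len(mirnas) for key, mirnas in index.items()}
abpp_total = index[("serum", "AβPP")] | index[("plasma", "AβPP")]
shared_serum = index[("serum", "ERK1/2")] & index[("serum", "AβPP")]
print(counts, len(abpp_total), sorted(shared_serum))
```

With these inputs the index recovers the counts given in the text (eight serum miRNA for ERK1/2, and six serum plus two plasma miRNA for AβPP), and the intersection shows which serum miRNA are listed for both genes.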
Moreover, issues with AD diagnosis are abundant. Besides being imprecise, since AD is only diagnosed with complete certainty after death through a brain autopsy [15], diagnosis can also be costly and invasive, as multiple specialists often need to be consulted and brain scans must be taken. To solve not only diagnosis problems but also problems surrounding treatment and drug discovery, miRNA and ML have been used increasingly in biomarker identification [21]. These techniques have been successfully applied to diagnosing cancer [22] and even AD [23], but a problem persists specifically for the identification of blood-based biomarkers: the differences between blood parts. There are proven connections between the source from which a miRNA blood sample is taken and its measurements; therefore, decreasing confounding between blood sample types necessitates the creation of two separate models, one trained on serum data and another trained on plasma data. Of the data selected for both these fluid types, surprisingly, very few of the dysregulated miRNA overlapped (Fig. 3), further supporting the idea of different miRNA profiles for different blood parts. This lack of overlap contrasts with the targeted genes, however: serum and plasma miRNA share many of the same gene targets within the AD pathway (Table 3). When run, both of the models (serum and plasma) built on gene target and pathway descriptors performed well in both training and testing (10-fold cross-validation and supplemental test sets), achieving over 90% in 10-fold cross-validation and over 85% in additional testing with the outside datasets. This indicates that a relatively accurate tool for AD diagnosis could be created by integrating pathway and genomic data while maintaining the differentiation of miRNA expression between blood sample profiles (serum, plasma).
Notably, adding more miRNA between the first and second versions of the models decreased the overall accuracy for 10-fold cross-validation, but the addition was needed to address robustness and decrease the likelihood of misrepresentation, as a small training sample can inflate results and lead to an overestimation of the model’s actual performance.
The creation of a model that can find putative miRNA biomarkers for AD with regard to blood sample type is important not only for the identification of novel biomarkers for diagnostics, but also for categorizing data from a yet unspecified blood part. In the future, a similar method, adapted for therapeutic rather than diagnostic miRNA, could in theory be used to elucidate miRNAs for therapy. Additionally, this study excluded the groups of miRNA AD biomarkers discovered from postmortem and exosome samples for the sake of purity, but a better understanding of the innate differences between exosomal samples and other blood parts, and between postmortem and non-postmortem blood samples, will benefit not only the study of AD but any study that relies on blood-based miRNA biomarkers. Furthermore, both the pathway analysis and the creation of an identification model were founded on the understanding that there are inherent differences between the miRNA profiles of serum and plasma, a concept supported in a previous study by Wang and coauthors [33]. Although putative associations suggest that the coagulation of blood may cause these observed differences [33], it remains unclear why these differences arise and how they may be connected to the pathogenesis of AD via the targeting of different pathway steps. It may be possible that certain steps require the involvement of serum-centric miRNA while others require plasma-centric miRNA, but a much deeper analysis must be performed before any conclusions are drawn.
Just as the more than 100 types of cancer share many common features (growing blood vessels, mutation in the TP53 gene), the various neurodegenerative diseases also share substantial commonalities, such as protein accumulation and neuronal dysfunction. Moving forward, AD miRNA biomarkers can be cross-referenced with those of similar neurodegenerative diseases, such as Parkinson’s disease and Lewy body dementia, to elucidate AD-specific miRNAs. Furthermore, a more sophisticated model for these AD-specific miRNAs may take into account dysregulation levels compared to a healthy control baseline and specific miRNA function (amyloid generation, tau phosphorylation, neuroinflammation) to more closely monitor an individual’s AD progression. In future work, we will address the roles of miRNA at specific stages of neurodegeneration, such as amyloid generation, tau phosphorylation, and neuroinflammation.
DISCLOSURE STATEMENT
The authors’ disclosures are available online (https://www.j-alz.com/manuscript-disclosures/21-5502r1).
