Abstract
Background:
Many neurocognitive and neuropsychological tests are used to classify early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), and Alzheimer’s disease (AD) from cognitive normal (CN). This can make it challenging for clinicians to make efficient and objective clinical diagnoses. It is possible to reduce the number of variables needed to make a reasonably accurate classification using machine learning.
Objective:
The goal of this study was to develop a deep learning algorithm to identify a few significant neurocognitive tests that can accurately classify these four groups. We also derived a simplified risk-stratification score model for diagnosis.
Methods:
Over 100 variables that included neuropsychological/neurocognitive tests, demographics, genetic factors, and blood biomarkers were collected from 383 EMCI, 644 LMCI, 394 AD patients, and 516 cognitive normal from the Alzheimer’s Disease Neuroimaging Initiative database. A neural network algorithm was trained on data split 90% for training and 10% testing using 10-fold cross-validation. Prediction performance used area under the curve (AUC) of the receiver operating characteristic analysis. We also evaluated five different feature selection methods.
Results:
The five feature selection methods consistently yielded the top classifiers to be the Clinical Dementia Rating Scale - Sum of Boxes, Delayed total recall, Modified Preclinical Alzheimer Cognitive Composite with Trails test, Modified Preclinical Alzheimer Cognitive Composite with Digit test, and Mini-Mental State Examination. The best classification model yielded an AUC of 0.984, and the simplified risk-stratification score yielded an AUC of 0.963 on the test dataset.
Conclusion:
The deep-learning algorithm and simplified risk score accurately classifies EMCI, LMCI, AD and CN patients using a few common neurocognitive tests.
Keywords
INTRODUCTION
Dementia is a neurodegenerative disease characterized by progressive memory loss as a result of neuronal cell death. More than 47 million people worldwide live with dementia and by 2050 that number is expected to increase to 131 million [1]. The most common type of dementia is Alzheimer’s disease (AD), and mild cognitive impairment (MCI) is often seen as risk state of progression to AD. The latter can be subdivided into early mild cognitive impairment (EMCI) and late mild cognitive impairment (LMCI), as defined in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database [2]. While there is no cure for dementia, early diagnosis may enable lifestyle changes (such as diet and exercise), neurocognitive enrichment, and therapeutic treatment that may temporarily improve symptoms or slow the rate of decline of symptoms, thereby improving the quality of life [3].
The core clinical criteria for the diagnosis of MCI and AD are neuropsychological tests [4, 5]. Fluid and imaging biomarker tests, such as cerebrospinal fluid markers and p-tau, may in some cases supplement standard clinical tests in specialized clinical settings [6]. A large array of neurocognitive tests are currently used to detect cognitive impairment and classify among normal controls (CN), EMCI, LMCI, and AD [7, 8]. Many studies have identified a few top classifiers using logistic regression and machine learning methods [9–18]. Some studies have also used MRI and genetic data in conjunction with neurocognitive measures for classification [19, 20]. However, most of these studies to date performed binary classification (i.e., between CN and AD or CN and MCI) [10, 21]. Classifying CN, EMCI, LMCI, and AD remains challenging. Deep learning is increasingly being used in medicine, including classification of diseases to aid diagnosis [22–24]. Deep learning, or machine learning in general, uses algorithms to learn the relationship amongst different data elements to inform outcomes. In contrast to traditional analysis methods (such as logistic regression), the specific relationships amongst different input variables with outcome variables do not need to be explicitly specified a priori. Neural networks, for example, are made up of a collection of connected nodes that model the neurons present in a human brain [25]. Each connection, like the synapses in a brain, transmits and receives signals to other nodes. Each node and the connections it forms are initialized with weights which are adjusted throughout training and create mathematical relationships between the input data and the outcomes. Deep learning is well-suited to analyze complex and large datasets where input and output variables cannot be readily parameterized. The goal of this study was to compare different feature-selection algorithms and develop a deep-learning algorithm to identify the top neurocognitive test scores that accurately classify normal control, early MCI, late MCI and Alzheimer’s disease. From these findings, we further constructed a novel simplified risk score model to classify normal, EMCI, LMCI and AD for clinical use.
METHODS
Study population
Data used in this study was obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). Patients were taken from the ADNI1, ADNIGO, ADNI2, and ADNI3 patient sets. Figure 1 shows the flowchart for patient selection. The inclusion criteria were a confirmed diagnosis from screening to the baseline visit and the exclusion criteria were greater than 20% of patient data missing. The total sample size in the study was 1,937 patients, with 1,743 being randomly assigned to the training dataset and 194 being assigned to the testing dataset before any feature selection or feature engineering was performed. The specific dataset used was from the ADNI database, which is a multi-institutional data source, with built-in datasets. Future studies will use independent datasets outside of the ADNI database. Of the 1,937 participants that met the inclusion criteria, 516 patients were diagnosed as CN, 383 were diagnosed as EMCI, 644 were diagnosed as LMCI, and 394 were diagnosed as AD.

Flowchart of patient selection.
Data preprocessing
We evaluated about 100 input variables (i.e., test scores, demographic information, and biomarkers). Correlation matrix analysis showed that 47 variables had a correlation coefficient above 0.5 and were determined to be correlated, which merited exclusion from further analysis. In addition, 29 variables were missing in > 20% of patients and they were also excluded from analysis. For the rest of the variables, missing data (most of which had < 10% data missing) was imputed with Classification and Regression Trees (CART) using Multivariate Imputation by Chained Equations (MICE) in R, a statistical analysis software (version 4.0.0) [26]. Although regional volumes were available through the Free Surfer pipeline, > 30% were missing and regional volumes were thus not included in the analysis. Intracranial volume was included with imputation because < 20% was missing.
The following neurocognitive tests, demographics, comorbidities and other variables were used in our analysis. The neuropsychological scores included ADAS11 (Unweighted sum of 11 items from The Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-Cog)), ADAS13 (Unweighted sum of 13 items from ADAS-Cog), ADASQ4 (Score from Task 4 (Word Recognition) of the Alzheimer’s Disease Assessment Scale (ADAS)), CDRSB (Clinical Dementia Rating Scale - Sum of Boxes), FAQ (Functional Activities Questionnaire), LDELTOTAL (Delayed total recall), MMSE (Mini-Mental State Examination), RAVLT_forgetting (Rey’s Auditory Verbal Learning Test –Forgetting score), RAVLT_immediate (Rey’s Auditory Verbal Learning Test –Immediate Recall score), RAVLT_learning (Rey’s Auditory Verbal Learning Test –Learning Score), RAVLT_percentage_forgetting (Rey’s Auditory Verbal Learning Test –Percent Forgetting), TRABSCOR (Trail Making Test Part B Time), mPACCdigit (Modified Preclinical Alzheimer Cognitive Composite with Digit test), and mPACCtrailsB (Modified Preclinical Alzheimer Cognitive Composite with Trails test). The extracted imaging parameters that were used included intracranial volume (ICV), volume of ventricles, and whole brain volume. APOE4 status was also included. The demographics included age, sex, race, ethnicity, education level (PTEDUCAT). The outcomes were 4 diagnosis classes: AD, LMCI, EMCI, and CN based on comprehensive clinical diagnosis as provided in the dataset.
Neural network model
Ranking of feature importance was first conducted among cognitive tests, demographic information, genetic tests, and extracted biomarkers. Five different feature selection methods were utilized to identify the most predictive variables: Information Gain, Boruta Random Forest, Recursive Feature Elimination with the Random Forest Classifier, Logistic Regression with LASSO/L1 regularization, and Permutation Importance in Keras. The scikit-learn library was used for Recursive Feature Elimination and Logistic Regression analyses. The Boruta package in R was used for Boruta Random Forest and Weka for Information Gain [27]. To conduct Permutation Importance analysis, a separate neural network was trained with all features available rather than just the top few variables. This network consisted of 5 layers, a BatchNormalization layer followed by 2 fully connected (FC) dense layers followed by a dropout layer and finally another fully connected dense layer. The first FC layer consisted of 24 neurons with the ReLU activation function and the second FC layer consisted of 16 neurons with the ReLU activation function. The dropout layer had a dropout rate of 0.20. The last layer consisted of 4 neurons with the Softmax activation function for multiclass classification. The model was compiled with categorical cross entropy loss with the ADAM optimizer and a learning rate of 0.001 [28]. The top predictors were those that demonstrated statistical significance.
For the deep learning model, a Multi-layer Perceptron (MLP) neural network was constructed with two fully connected dense layers for classification followed by a dropout layer and finally a fully connected dense layer. The first two FC layers contained 8 neurons along with the ReLU activation function. The next layer was a dropout layer with dropout rate of 0.15. The last layer contained 4 neurons and used the Softmax activation function for multiclass classification. The model was compiled with categorical cross entropy loss with the ADAM optimizer and a learning rate of 0.003. Additionally, while testing the classification accuracy of the variables selected by Permutation Importance analysis, a BatchNormalization layer was added as the first layer of the network. The top predictors extracted from the global feature selection analysis were used as input for the neural network and the output was the diagnosis class. The dataset was split into 90% training data and 10% testing data using 10-fold cross validation while training the neural network. Diagnosis results were categorized by multiclass classification.
Risk score model development
A simplified risk score model was constructed using the top 5 global variables (ca. cognitive test scores) identified by the different feature selection methods as followed: 1) For each variable, scores were plotted for the 4 classes of diagnosis and cutoff points were chosen to maximize separation amongst the 4 classes. 2) The cutoff points were then used to construct a point value system for each cognitive test’s score range. This was done by fitting the top cognitive tests in the training dataset against the diagnosis outcome using a Generalized Linear Model (GLM). 3) The GLM then assigned risk score points for each of the cognitive tests score ranges. A higher number of points for a given score range means that the patient is more likely to have AD. 4) A composite risk score from the sum of the top 5 variables’ risk scores was constructed for each patient. 5) The risk score model was then tested on an independent testing dataset and evaluated using ROC analysis. Risk scores of the testing dataset were plotted for the 4 classes with interpolation smoothing.
We chose only the top five variables because: i) they are manageable for creating the risk score, ii) feature importance dropped significantly after the first five features for multiple machine-learning methods, which provided further validation for the selection of features and avoid potential bias, and iii) limiting to a few features (instead of all features) prevents overfitting in training the neural network and the risk score models.
Performance evaluation and statistical analysis
Statistical analyses were performed using SPSS v26.Frequencies and percentages for categorical variables between the stages of AD were compared in a pair-wise fashion using χ2 tests. Continuous variables, which were denoted as median (IQR), were first tested for normality with the Lilliefors corrected Kolmogorov-Smirnov test. If they were shown to not have a normal distribution, further comparison was done in a pair-wise fashion between groups using the nonparametric Kruskal-Wallis test. p-values < 0.05 were considered statistically significant.
ROC analysis was used to evaluate the performance of the NN and the risk score model, in which training data was first split into 90% for training and 10% for testing using 10-fold cross validation and then tested on an independent testing set. The AUC calculation was binary, in which one class was contrasted with the rest of the classes (one versus rest) and this was repeated for each of the 4 classes. The sensitivity and specificity reported were taken as an average of the binary sensitivity and specificity of each class. The 95% Confidence Interval (CI) for the AUC was obtained through bootstrapping the neural network’s predictions 1000 times.
RESULTS
Table 1 shows the demographic data for CN (n = 516), EMCI (n = 383), LMCI (n = 644), and AD (n = 394) groups. Age was not significantly different between groups except between CN and LMCI and between the LMCI and AD groups. Race and ethnicity did not differ significantly between groups. The median education level did not differ significantly between any pair except between AD versus the other classes.
Demographic information, neurocognitive tests, MRI-extracted biomarkers, and genetic factors among CN, EMCI, LMCI, and AD. Continuous variables are expressed as median (IQR) and the pairwise Kruskal Wallis test is employed. The χ2 test is used to identify significance between classes of categorical variables. p-values displayed are with Bonferroni’s correction. Pairwise comparisons are represented by symbols, where * indicates a statistical difference between the CN and EMCI groups, ** indicates a statistical difference between the CN and LMCI groups, *** indicates a statistical difference between the CN and AD groups, $ indicates a statistical difference between the EMCI and LMCI groups, $$ indicates a statistical difference between the EMCI and AD groups, and # indicates a statistical difference between the LMCI and AD groups
aPercentages were based on the total number of individuals in each diagnosis class that had a specific number of APOE4 alleles. bComorbidity data was not available for 101 CN patients, 77 EMCI patients, 85 LMCI patients, and 60 AD patients.
With a few exceptions, all neurocognitive test scores and mPACC tests showed pairwise differences between groups. The MRI-extracted parameters were significant between CN and AD and between CN and LMCI, but not significant between the other pairwise comparison. APOE4 was significant different in all pairwise comparisons. Sleep apnea and depression were the only significant comorbidities between groups.
Figure 2 shows the results of the rankings by importance from the 5 feature selection methods performed on the training dataset and Table 2 lists the top 9 features. CDRSB was the most frequently identified top feature amongst the top 5 feature selection methods (5 out of 5), followed by LDELTOTAL (4 out of 5), mPACCdigit (4 out of 5), mPACCtrailsB (3 out of 5), MMSE (2 out of 5).

Feature ranking for (A) Information Gain, (B) Recursive Feature Elimination with the Random Forest classifier, (C) Boruta Random Forest, (D) Permutation Importance, and (E) Logistic Regression with LASSO/L1 regularization.
Top 9 clinical variables ranked by 5 feature selection methods
ADAS11, Unweighted sum of 11 items from The Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-Cog); ADAS13, Unweighted sum of 13 items from ADAS-Cog; ADASQ4, Score from Task 4 (Word Recognition) of the Alzheimer’s Disease Assessment Scale (ADAS); CDRSB, Clinical Dementia Rating - Sum of Boxes Score; FAQ, Functional Activities Questionnaire; ICV, Intracranial Volume; LDELTOTAL, Delayed Total Recall; MMSE, Mini-Mental State Examination; PTEDUCAT, Education Level; RAVLT_forgetting, Rey’s Auditory Verbal Learning Test–Forgetting score; RAVLT_immediate, Rey’s Auditory Verbal Learning Test–Immediate Recall score; RAVLT_percentage_forgetting, Rey’s Auditory Verbal Learning Test–Percent Forgetting; mPACCdigit, Modified Preclinical Alzheimer Cognitive Composite with Digit test; mPACCtrailsB, Modified Preclinical Alzheimer Cognitive Composite with Trails test; Ventricles, Volume of Ventricles; WholeBrain, Volume of Whole Brain
Neural network model for classification
Classification was performed using the top 5 features. The performances on the testing data for the 5 methods are summarized in Table 3. AUCs for the Information Gain, Boruta Random Forest, Recursive Feature Elimination with the Random Forest Classifier, Logistic Regression with LASSO/L1 regularization, and Permutation Importance were 0.978, 0.984, 0.983, 0.906, and 0.982, respectively, on the testing dataset. The classifier selected by Boruta Random Forest performed the best in terms of AUC, but the classifier selected by Recursive Feature Elimination performed better in terms of accuracy, sensitivity, and specificity. By comparison, classification using CDRSB, LDELTOTAL, mPACCdigit, mPACCtrailsB, and MMSE individually yielded an AUC of 0.8899, 0.8957, 0.8619, 0.8624, and 0.7808 respectively, on the testing dataset.
Performance of the variables selected from the 5 feature selection methods in our NN. Bracket values indicate 95% confident intervals
Risk score model
We then developed a simplified risk score model using the same top 5 variables from our deep-learning analysis. Figure 3 shows an example of the CDRSB cognitive test scores for 4 different classes. The cutoff points that maximized separation between the 4 classes were 0.1, 1.3, and 3.5, between CN and EMCI, between EMCI and LMCI, and between LMCI and AD group, respectively. The point values for individual test score ranges are summarized in Table 4. The composite score from the top 5 cognitive tests (CDRSB, LDELTOTAL, mPACCdigit, mPACCtrailsB, and MMSE) were constructed using a GLM. The classification results on the testing dataset are shown in Fig. 4. The risk score system classified the four groups accurately. The performance of the risk score model yielded an AUC of 0.963 [95% CI: 0.945-0.975], sensitivity of 88.06% and specificity of 96.16%, and accuracy of 89.18% for the testing dataset.

Scatterplot of CDRSB scores versus patient diagnosis in the training dataset. The black dashed lines represent the cutoff points that maximize separation between diagnosis classes.
Points given by the risk score model for each cognitive test and per diagnosis class on the training dataset

Composite risk score stratification. The scores ranged from 0 to 11, with 11 indicating the greatest risk for developing Alzheimer’s and 0 indicating the lowest.
DISCUSSION
This study developed a deep-learning algorithm to identify the top neurocognitive test scores that accurately classify normal control, early MCI, late MCI, and Alzheimer’s disease. Multiple feature selection methods identified essentially the same set of top variables, providing further corroboration. CDRSB was identified to be a top feature, followed by LDELTOTAL, mPACCdigit, mPACCtrailsB, and MMSE for classification of disease subtypes. Performance indices of the deep-learning model and the simplified score system were highly accurate in classifying the four groups. The best model yielded an AUC of 0.984, and the simplified risk stratification score yielded an AUC of 0.962 for classification on the test dataset. We concluded that only a few neurocognitive tests are needed to accurately classify normal control, early MCI, late MCI and AD.
Clinical diagnosis of AD and MCI involves a large collection of clinical variables. They include two major neurocognitive tests: CDR and MMSE. MMSE and CDR are useful to distinguish between CN and AD but less so among EMCI, LMCI, and CN. We found that CDR was among the top performers to classify among CN, EMCI, LMCI, and AD, but MMSE was not.
The ADNI dataset consists of a large array of neurocognitive tests that are not currently being used in clinical settings but could have future applications. With the advances in computing, it becomes possible to use machine learning to analyze the large array of neurocognitive tests to accurately classify CN, EMCI, LMCI, and AD.
Although CDR and MMSE were used in the clinical diagnosis, there are other variables that were highly ranked, thereby providing insights into specific domain of cognitive dysfunction. It is possible that high performance was dominated by a few variables of the same cognitive test group. However, we used correlation-matrix analysis to remove variables that are highly correlated. Specifically, CDR and MMSE was found to be weakly correlated. This is not surprising as CDR and MMSE measure different dimensions of cognitive function. Our approach selected a small set of top predictors among many that are highly predictive of outcome.
The key findings are: 1) our NN model was able to diagnose 4 classes, which is not commonly done, 2) our NN model performance is comparable to literature, 3) the combined top neurocognitive scores performed better in distinguishing CN, EMCI, LMCI, and AD than individual scores. Taken together, our NN model and risk score can ultimately improve classification or diagnosis accuracy because it uses multiparametric data. ML can also incorporate longitudinal multiparametric data to predict disease progression.
The top 5 classifiers were all neurocognitive test scores. The CDRSB is rated along 6 domains of functioning, with each domain being rated on a 5-point scale, and the global CDRSB being a function of the scores from these 6 domains [29]. A higher CDRSB indicates more severe impairment. LDELTOTAL measures episodic memory and performance is measured primarily through the amount of a story that is remembered [30]. A lower LDELTOTAL score indicates more severe impairment. The mPACCdigit test measures working memory by asking the patient to repeat back a sequence of digits of increasing length, until they are not able to. The mPACCtrails B test determines performance of processing speed with a smaller score indicating more severe impairment. Lastly, the MMSE, one of the most clinically used battery tests, is a 30-question questionnaire that is used to screen for dementia and includes tasks that involve registration, recall, and attention. Individually, these tests all perform well in separating AD from CN individuals, but struggle to diagnose MCI subtypes, Indeed, we found classification using all top 5 variables and the derived risk score system outperformed classification using CDRSB, LDELTOTAL, mPACCdigit, mPACCtrailsB, and MMSE individually.
It is interesting to note that intracranial volume (ICV), volume of ventricles, and whole brain volume by MRI were not highly ranked as classifiers of the 4 classes. Volumetric differences could readily distinguish between CN and AD groups but might not readily differentiate between CN and MCI or between MCI subclasses [31]. Other studies investigating hippocampal volume for classifying dementia subtypes showed promise, but it is still challenging for hippocampal volume to accurately classify between CN and MCI patients [32]. We did not include hippocampal volume because it was not readily available in the dataset.
Many studies have previously examined neurocognitive tests and identified a few top neurocognitive classifiers using non-machine learning methods, but they are not discussed here [14–16, 18] (see reviews [15, 16]) By comparison, only few studies utilized (mostly supervised learning) machine learning methods and some of these studies combined neurocognitive tests and MRI regional brain volumes as input variables [9–13, 17] (Table 5). So et al. used a two-stage approach to classification with the first stage identifying the most important subsections of the MMSE and the second stage used subsections of the Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) assessment [13]. They achieved up to 97% classification accuracy in Stage 1 with an MLP and 75% classification accuracy in Stage 2 with support vector machine. Lins et al. investigated a Brazilian dataset and utilized gender, age, study time (in years), AD8, MMSE, CDR, and SVFT scores, and two genetic markers (CYP46A1 and APOE4) [12]. They tested the predictive power using the Random Forest, support vector machine, and Stochastic Gradient Boosting classifiers along with an MLP neural network. They achieved a maximum binary classification accuracy between dementia and CN patients of 96% using the CDR and CYP46 features. Stamate et al. identified mPACCdigit, mPACCtrailsB, and LDELTOTAL as the top classifiers [17]. They combined these scores with PET and MRI data and achieved an AUC of 0.88 for the binary classification of NC versus dementia. Chiu et al. developed NMD-12, a 12-question questionnaire that was shortened from the original 45 question questionnaire from the HAICDDS project by the Information Gain algorithm [11]. They showed that this test performed better than the commonly used MMSE and MoCA tests with an AUC of 0.94 for discriminating between CN and MCI patients and 0.97 for MCI and dementia patients. Zhu et al. analyzed a Taiwan cohort and ranked the relative importance of neuropsychological tests using Information Gain, Random Forest, and the Relief algorithm [9]. They classified normal, MCI, very mild dementia, and dementia. They selected a few top ranked features, and their optimized algorithm had an accuracy of 0.81 using Relief feature selection followed by classification with MLP method. Gill et al. investigated an MRI-based feature and Modified Barthel Index Score (activities of daily living) for binary classification between CN and MCI [10]. They used supervised machine learning and found the AUC to be 0.86.
Comparison of machine learning studies in classifying different forms of dementia. VMD, very mild dementia; SVFT, Semantic Verbal Fluency Test; AD8, Dementia Screening Interview
In sum, our results are comparable or compared favorably with previous studies although comparisons were not made on the same datasets. Our study is novel in that we employed a deep learning method, applied to a large and multi-center ADNI data-set with commonly used measures. We also classified amongst four groups instead of commonly used binary classification in most previous studies (i.e., between normal controls versus AD, or normal controls versus MCI).
This study has several limitations. In this cohort white and non-Hispanic/Latino ethnicities are overrepresented. Some of the comorbidities, such as sleep apnea and depression, showed significant differences between the groups. We also did not include imaging variables. Although ADNI is a multi-center dataset, further testing of additional cohorts are needed for generalization. Additionally, further evaluation of independent datasets, including prospective studies, would improve generalizability of these findings.
An eventual goal of our and other similar appro-aches is to ultimately create an automated machine learning algorithm and a derived simplified risk score system to help physicians to make more streamlined and accurate diagnoses. Machine learning approaches can help physicians by offering an objective initial assessment and possibly a second opinion of the diagnosis. Moreover, in some other fields of medicine, machine learning can already accurately estimate risk for coronary heart disease [33] and the detection of lung nodules on chest X-rays [34]. In addition to approximating physician skills, machine learning can also detect novel relationships not readily apparent to human perception, especially in large, complex, and longitudinal datasets.
CONCLUSION
This study developed a deep-learning algorithm and a simplified risk score to identify the top neurocognitive test scores to classify normal control, early MCI, late MCI, and AD. We concluded that only a few neurocognitive tests are needed to accurately classify normal control, early MCI, late MCI, and AD. Accurate and early diagnosis may lead to better management of the diseases, including interventions that improve symptoms or slow the rate of decline of symptoms.
DISCLOSURE STATEMENT
Authors’ disclosures available online (https://www.j-alz.com/manuscript-disclosures/20-1438r2).
