Abstract
Background:
Subjective cognitive decline (SCD) may represent a preclinical stage of Alzheimer’s disease (AD). Predicting progression of SCD patients is of great importance in AD-related research but remains a challenge.
Objective:
To develop and implement an ensemble machine learning (ML) algorithm to identify SCD subjects at risk of conversion to mild cognitive impairment (MCI) or AD.
Methods:
Ninety-nine SCD patients were included. Thirty-two progressed to MCI/AD, while 67 remained stable. To minimize the effect of class imbalance, both classes were balanced, and sensitivity was taken as the evaluation metric. Bagging and boosting ML models were developed using socio-demographic and clinical information, Mini-Mental State Examination and Geriatric Depression Scale (GDS) scores (feature-set 1a); socio-demographic characteristics and neuropsychological test scores (feature-set 1b); and regional magnetic resonance imaging grey matter volumes (feature-set 2). The most relevant variables were combined to find the best model.
Results:
Good prediction performances were obtained with feature-sets 1a and 2. The most relevant variables (variable importance exceeding 20%) were age, GDS, and grey matter volumes measured in four cortical regions of interest. Their combination yielded the ensemble ML model with the best classification performance (highest sensitivity and specificity), Extreme Gradient Boosting with over-sampling of the minority class, with sensitivity = 1.00, specificity = 0.92 and area-under-the-curve = 0.96. The median values based on fifty random train/test splits were sensitivity = 0.83 (interquartile range (IQR) = 0.17), specificity = 0.77 (IQR = 0.23) and area-under-the-curve = 0.75 (IQR = 0.11).
Conclusion:
A high-performance algorithm that could be translatable into practice was able to predict SCD conversion to MCI/AD by using only six predictive variables.
INTRODUCTION
Alzheimer’s disease (AD) is the most common type of dementia and is considered one of the most devastating neurodegenerative disorders, with great social, economic, and clinical impact. The natural history of AD is divided into three phases: the preclinical phase, where the pathogenic mechanisms of the disease have started but no objective cognitive decline can be identified; the prodromal phase, where mild cognitive symptoms appear and a clinical diagnosis of mild cognitive impairment (MCI) can be established; and the dementia phase, where cognitive decline interferes with daily living activities [1]. Importantly, some people in the preclinical phase of AD experience mild cognitive symptoms, mainly memory problems, with a normal performance on cognitive tests adjusted for age and education. Notably, AD has been found to change brain structure and function several years before the appearance of the first symptoms of cognitive decline [2, 3].
The concept of subjective cognitive decline (SCD) was first introduced in the early 1980s to define an early stage of AD and was initially assessed using the Global Deterioration Scale. In recent years, SCD has received various labels, including subjective cognitive complaint, subjective memory complaint, subjective cognitive impairment, subjective memory impairment, subjective memory decline and others. In 2014, a consensus terminology and a conceptual framework for research on SCD related to AD were proposed by the SCD Initiative. This framework unified the multiple descriptors into a single term, SCD, which was defined as a subjectively experienced decline in cognitive capacities in the absence of objectively measurable neuropsychological deficits [4–6].
Regarding longitudinal evidence, a multi-center observational study with a 3-year follow-up from the German Dementia Competence Network found that 22% of SCD subjects progressed to MCI and 12% to AD dementia, while only 3.7% developed a non-AD dementia [7]. Another prospective single-center clinical study found that, of a total of 122 subjects with SCD, 39% converted to MCI and 10% to AD dementia at 48 months [8]. A recent 7-year follow-up study [9] showed that 26 out of 109 SCD subjects converted to MCI, 15 progressed to AD dementia, and 68 remained stable.
Thus, some subjects with SCD remain stable, with no further progression to MCI or AD dementia [10]. Health problems other than neurodegenerative diseases can cause SCD, and these do not lead to AD [11]. Beyond SCD and MCI, several attempts have been made to identify subject characteristics that may improve the prediction of progression to AD dementia. The early identification of brain dysfunctions in individuals likely to develop AD is among the greatest challenges of current research in the field of AD [12–14]. The earlier a person receives a correct diagnosis, the sooner help can be provided. As such, researchers have lately focused on developing approaches to predict the risk of AD in potentially early stages of the disease. Investigations have considered a wide variety of potential predictors, such as socio-demographic and clinical characteristics, cognitive performance in different domains, neuropsychiatric symptomatology, cardiovascular indexes, dietary and life habits, structural and functional neuroimaging investigations, and several fluid biomarkers [15–21].
Structural MRI is considered an integral part of the clinical assessment of patients with suspected AD. MRI-based measures of atrophy in medial temporal lobe brain regions, such as the hippocampus and entorhinal cortex, are regarded as valid markers of neurodegeneration, and atrophy rates can be predictive of progression from MCI to AD [22]. In the early stages of the disease, the structural changes are more pronounced in the entorhinal cortex, whereas as the disease progresses the hippocampus appears to provide higher accuracy in discriminating AD patients from both MCI subjects and normal controls [23]. Grey matter loss in other brain areas has also been reported, although the evidence is less well established (for a review, see [24]).
It is increasingly recognized that better predictive capability can be achieved by models that simultaneously exploit the information coming from several predictors, and machine learning (ML) can be used to create such models [25]. As a branch of artificial intelligence, ML refers to analytical algorithms that iteratively learn from data, identify patterns, and allow researchers to make inferences and find insights [26]. ML techniques can be used to integrate and interpret complex health data in scenarios where traditional statistical methods fall short [27]. Different combinations of the above-mentioned predictors have been applied in published studies attempting to predict MCI conversion to AD dementia. The results vary broadly among studies, ranging from some that achieved performances just above chance to a few showing high accuracy levels [28–31].
Ensemble methods [32] are based on the hypothesis that combining multiple weak models can often produce a much more accurate model. Most of the time, these weak models do not perform well by themselves, either because they have a high bias or because they have too much variance. The idea of ensemble methods is to reduce the bias and/or variance of such weak learners by combining several of them into an ensemble model that achieves better performance. There are three major kinds of ensemble learning methods: bagging, boosting, and stacking. Bagging stands for Bootstrap Aggregation and is a technique that draws different sets of data from a given training set by sampling with replacement (bootstrapping). A model is trained on each of the resampled sets, and the results are aggregated. To aggregate the outputs of the weak learners, bagging uses majority voting (the most frequent prediction among all predictions) for classification and averaging (the mean of all the predictions) for regression.
The idea of boosting is to train weak learners sequentially, each trying to correct the errors of its predecessor, so that in the process they are combined into a strong learner. Finally, stacking consists of training a meta-model to output a prediction based on the predictions of different weak models.
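As an illustration of the bagging scheme described above, the following minimal Python sketch draws bootstrap replicates, fits one weak learner per replicate, and aggregates by majority vote. It is not the code used in this study (which relied on R and caret); the single-feature threshold "stump", the toy data format, and all function names are hypothetical choices made only for the example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (one bootstrap replicate)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Hypothetical weak learner: threshold one feature midway between class means.

    rows is a list of (x, label) pairs with label 0 or 1.
    """
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    if not xs0 or not xs1:
        # Degenerate replicate containing a single class: always predict it.
        only = 1 if xs1 else 0
        return lambda x: only
    m0 = sum(xs0) / len(xs0)
    m1 = sum(xs1) / len(xs1)
    threshold = (m0 + m1) / 2
    if m1 >= m0:
        return lambda x: 1 if x > threshold else 0
    return lambda x: 1 if x < threshold else 0


def bagging_fit(rows, n_models=25, seed=0):
    """Train one weak learner per bootstrap replicate."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Aggregate the weak learners' predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]
```

An odd number of models is used so that a binary vote cannot end in a tie.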
To our knowledge, despite the huge research effort, no gold-standard ensemble ML algorithm is available to predict SCD progression and clinical translation is still lacking. Considering all the above-mentioned issues, a series of ensemble ML algorithms were developed and cross-validated within a sample of subjects with SCD whose diagnostic follow-up was from 2001 to 2019. Our specific aims were to find the most relevant variables from the best algorithms in terms of sensitivity and specificity and combine these variables to find the algorithm(s) with the best performance.
MATERIALS AND METHODS
Participants
This study was approved by the Ethics Research committee of the Clínica Universidad de Navarra (Spain) and all participants signed a written informed consent.
Data were collected from 309 subjects retrospectively selected and evaluated in the Memory Clinic at Clínica Universidad de Navarra between 2001 and 2017. To be eligible for the study, subjects had to meet the following criteria: 1) clinical diagnosis of SCD according to Jessen criteria [4] and 2) a high-quality brain magnetic resonance imaging (MRI) study.
The initial assessment included a medical history review, an interview with a family member or friend, and a general and neurological examination. All participants underwent laboratory tests (full blood count, biochemistry, vitamin B12, serum folate, glucose, lipids, syphilis serology, and thyroid function), a neuropsychological assessment and a brain MRI, which were reviewed in a multidisciplinary consensus meeting to determine a clinical diagnosis. Patients with 1) MCI or dementia, 2) major neurological or systemic illness that could cause cognitive impairment, 3) current or past major psychiatric disease (for example, schizophrenia, major depression, or bipolar disorder), 4) history of alcohol or substance abuse, 5) significant MRI abnormalities (brain tumors, large cerebral infarct, or bleeding), or 6) past head trauma with loss of consciousness, were excluded.
Until December 2019, 99 out of 309 participants were followed longitudinally and completed two or more visits. In follow-up visits, patients were evaluated by a neurologist and a neuropsychological assessment was performed to establish clinical progression: 32 participants progressed to MCI or AD dementia and were labelled as SCD converters (class 1), and 67 remained stable and were labelled as SCD non-converters (class 0). The diagnosis was established according to the recommendations from the National Institute on Aging-Alzheimer’s Association workgroups on diagnostic guidelines for Alzheimer’s disease [33, 34].
Feature sets
Three feature sets were used to investigate their predictive value in the conversion from SCD to MCI/AD dementia: socio-demographic and clinical information, Mini-Mental State Examination (MMSE) and Geriatric Depression Scale (GDS) scores (feature set 1a); socio-demographic characteristics and scores on a battery of neuropsychological tests (feature set 1b); and regional MRI grey matter volumes (feature set 2).
Feature set 1a contained 12 input variables (9 categorical and 3 numerical), feature set 1b contained 40 input variables (2 categorical and 38 numerical), and feature set 2 contained 120 numerical input variables. Each feature set also included the binary categorical output variable (Conversion). The input variables of feature sets 1a, 1b and 2 can be consulted in Supplementary Tables 2–4, respectively.
Measures
Socio-demographic and clinical information included age at diagnosis, gender, education level, treatment with antidepressants and neuroleptics, presence of hypertension, diabetes mellitus, cerebrovascular disease, ischemic cardiopathy, dyslipidemia, and smoking habits.
As part of the standard clinical protocol, all subjects underwent a neuropsychological assessment to evaluate cognitive status at baseline with a comprehensive test battery that evaluated the following domains: Global cognitive function: MMSE [35]; Function on daily living activities: Interview for Deterioration in Daily Living [36]; Depression: GDS [37]; Episodic verbal memory (word list learning, recall and recognition) and episodic visual memory (figure recall): Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) [38] and Free and Cued Selective Reminding Test (FCSRT) [39]; Attention and executive function: Trail Making Test parts A and B [40], Calibrated Ideational Fluency Assessment (CIFA) [41], Stroop Color and Word Test [42], and Digit Span Forward and Backward test [43]; Language: CIFA and Boston Naming Test [44].
The MRI studies were performed using two different 1.5 T MRI scanners (Siemens Symphony and Aera). The MRI examination included a high-resolution T1-weighted 3D MPRAGE sequence with inversion time: 1100 ms, repetition time: 1900 ms, and echo time: 4.0 ms. T1-weighted images had different resolutions because of the various clinical protocols used throughout the 16-year period: 1.0×1.0×1.5 mm³ or 0.5×0.5×1.0 mm³.
T1-weighted images were processed using Statistical Parametric Mapping (SPM version 12, Wellcome Trust Centre for Neuroimaging, University College London, UK) executed in MATLAB (Mathworks, MA). Processing included segmentation to extract grey matter probability maps and normalization using DARTEL. Average grey matter volumes were computed from the normalized probability maps for the 120 regions of interest (ROIs) defined in the automated anatomical labelling atlas 2 (AAL2) (https://www.gin.cnrs.fr/en/tools/aal/).
Statistical analysis
In all analyses, p < 0.05 was taken to indicate statistical significance. Continuous and categorical variables are expressed as the median and absolute number, respectively. Differences in categorical variables between groups were tested with the χ² test. Differences in continuous variables between two groups were tested with the unpaired Student’s t test when normally distributed and the Mann–Whitney test when non-normally distributed.
Data pre-processing
This study was conducted using open-source R version 4.2.2 (CRAN, https://cran.r-project.org/) in RStudio 2022. After loading the packages, the .xlsx files were read and categorical variables were converted to factor variables.
We created a function to identify and remove variables in which a single value exceeded a predetermined frequency (>70%), and a function to replace outliers that fell below Q1 − 1.5×IQR or above Q3 + 1.5×IQR, where IQR is the interquartile range, that is, the distance between the first (Q1) and third (Q3) quartiles.
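The outlier rule above can be sketched in a few lines of Python. This is only an illustration, not the function used in the study: the text does not state what value replaced an outlier, so clipping to the nearest fence is an assumption, as are the "inclusive" quartile convention and the function names.

```python
import statistics


def iqr_fences(values):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    # Assumption: quartiles computed with the "inclusive" interpolation method.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def clip_outliers(values):
    """Replace values outside the fences with the nearest fence (assumed rule)."""
    lower, upper = iqr_fences(values)
    return [min(max(v, lower), upper) for v in values]
```

For example, `clip_outliers([10, 12, 13, 14, 15, 16, 18, 60])` leaves the first seven values unchanged and pulls the last one down to the upper fence.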
Missing values can arise because data were not recorded or, as often happens in clinical practice, because a doctor chooses not to perform a specific test when a patient is too tired or is achieving a low score in a related test.
It was decided a priori to remove variables with >30% of missing values. Only the Digits variable exceeded the cut-off and was not used in our analysis.
We built a recipe by adding several transformation steps. The first of them was the imputation of variables with missing values, with the mean for numerical variables and the mode (the most frequently occurring value) for categorical variables.
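A minimal Python sketch of the imputation step described above (mean for numerical variables, mode for categorical ones). The representation of missing values as `None` and the function name are assumptions made for the example; the study performed this step inside an R recipe.

```python
import statistics


def impute(column, kind):
    """Fill missing entries (None) with the mean or the mode of the observed ones.

    kind is "numeric" (mean imputation) or "categorical" (mode imputation).
    """
    observed = [v for v in column if v is not None]
    if kind == "numeric":
        fill = statistics.mean(observed)
    else:
        fill = statistics.mode(observed)
    return [fill if v is None else v for v in column]
```

When imputation is combined with a train/test split, the fill value would normally be estimated on the training data only and then reused for the test data.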
Sometimes it makes sense to recode a variable, for example educational level (Supplementary Figure 1), by combining values into a smaller number of categories when the original categories contain too few cases to be reliable. An alternative to combining categories may be to eliminate some from the analysis. This may be necessary when there are too few cases in some categories or when combining them does not make sense.
Educational level was grouped into two categories (“1–2” and “3–4”) based on the Spanish educational system, which ranges from no education/primary school to university education. The first category included no education or primary school and pre-professional or pre-university education, and the second included higher professional education or university education and PhD. We used the dplyr::recode_factor() function to recode educational levels.
The skimr package provided key descriptive stats for each variable.
Standardization, one hot encoding, and multicollinearity
We added new transformation steps to our recipe. Numerical variables were centered (the mean was subtracted) and scaled (divided by the standard deviation), and dummy variables were created. We omitted one dummy variable for each categorical variable because the last category is already indicated by zeros on all the other dummy variables. Including it would only add redundant information, resulting in multicollinearity.
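The two transformations above can be illustrated with a short Python sketch. It is a simplified stand-in for the recipe steps, not the study's code; the function names are hypothetical, the sample standard deviation (n − 1 denominator) is assumed, and the alphabetically last level is taken as the omitted reference category.

```python
import statistics


def standardize(values):
    """Center on the mean and scale by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: n - 1 denominator
    return [(v - mean) / sd for v in values]


def dummy_encode(values):
    """One-hot encode a categorical variable, omitting the last level.

    Returns (rows, kept_levels); the omitted level is encoded as all zeros.
    """
    levels = sorted(set(values))
    kept = levels[:-1]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return rows, kept
```

A variable with k categories therefore yields k − 1 dummy columns, which avoids the exact linear dependence that a full set of k columns would introduce.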
To address multicollinearity, defined as the non-independence of input variables, first, pair-wise correlated continuous variables were filtered. A threshold-correlation coefficient of 0.5 was used for feature set 1b, and of 0.75 for feature sets 1a and 2.
We complemented this step by calculating the variance inflation factor (VIF). Variables with VIF values greater than or equal to 5, which indicate high correlation between input variables, were dropped. Because several packages provide different implementations of the VIF function, we specified the VIF function from the car package [45].
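For reference, the standard definition of the VIF for the j-th input variable is given below, where R_j^2 denotes the coefficient of determination obtained by regressing that variable on all the other input variables (the symbols are introduced here for illustration).

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}}
\qquad\Longrightarrow\qquad
\mathrm{VIF}_j \ge 5 \iff R_j^{2} \ge 0.8
```

Under this definition, the cut-off of 5 corresponds to a variable for which at least 80% of the variance is explained by the remaining inputs.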
In data pre-processing, 3 of a total of 11 predictors, 20 of a total of 40 predictors and 96 of a total of 120 predictors were removed from feature set 1a, 1b and 2, respectively.
Data splitting, training control, class imbalance, cross-validation, hyperparameters optimization and prediction on the test data
Prior to model creation, the data were split into training (80%) and testing (20%) datasets using the caret::createDataPartition() function, which takes as input the Conversion variable and, as the p argument, the percentage of data that should go into training. It returns the row numbers that should form the training dataset. In addition, we set list = FALSE to prevent the result from being returned as a list (we wanted a matrix). The advantage of using caret::createDataPartition() is that it preserves the proportion of the categories in the Conversion variable.
We specified a ctrl object, created with the caret::trainControl() function, to define the resampling scheme. In caret::trainControl(), we set method = “cv” and number = 10 for 10-fold cross-validation (bootstrap is the default resampling approach, but cross-validation was used instead), ctrl$sampling = “down”, “up” or “smote”, and two options that are required to use the receiver operating characteristic (ROC) as the metric: classProbs = TRUE and summaryFunction = twoClassSummary. To build better classification models with higher sensitivity, we over-sampled (“up”) the minority class, under-sampled (“down”) the majority class, and generated synthetic samples (“smote”, Synthetic Minority Over-sampling Technique), which combines over-sampling with under-sampling.
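The idea of a class-preserving 80/20 split can be sketched in Python as follows. This only mimics the behaviour described above and is not caret's implementation; the function name and the rounding rule for the per-class training size are assumptions.

```python
import random


def stratified_split(labels, p=0.8, seed=0):
    """Return (train_idx, test_idx), sampling a fraction p within each class."""
    rng = random.Random(seed)
    train = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(p * len(idx))  # assumption: nearest-integer rounding
        train.extend(idx[:n_train])
    train.sort()
    in_train = set(train)
    test = [i for i in range(len(labels)) if i not in in_train]
    return train, test
```

With 67 non-converters and 32 converters, this sketch assigns 54 + 26 = 80 subjects to training and 13 + 6 = 19 to testing, so both subsets keep roughly the original class proportions.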
Direct training on a small sample can result in overfitting, with poor generalization to new, unseen testing data. To mitigate this, algorithm training was done using K-fold cross-validation (K = 10). In this technique, the original data are randomly divided into K subsets of equal size. A single subset is used as validation data and the remaining K–1 subsets are used as training data; the process is repeated K times so that each subset serves once as validation data.
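A minimal Python sketch of how K-fold cross-validation partitions the row indices. It is illustrative only: the function name is hypothetical and, unlike caret's fold creation, the folds here are not stratified by class.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (training_idx, validation_idx) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield training, validation
```

Each observation appears in exactly one validation fold, and the K validation scores are typically averaged to estimate performance for a given hyperparameter setting.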
Imbalanced data are data in which the number of instances of one class is significantly higher than that of any other class. This causes poor performance in most ML algorithms because they are biased toward the majority class. The exploratory data analysis showed that 67.7% of subjects did not convert to MCI/AD dementia and 32.3% converted. Although this is not surprising, it makes the data slightly imbalanced, which motivated the “up”, “down” and “smote” sampling options set in the ctrl object.
The ctrl object returned by the trainControl() function was passed as a parameter to the caret::train() function. In the caret::train() function, we implemented two bagging ML methods, Random Forest (“rf”) and Treebag (“treebag”), and one boosting ML method, Extreme Gradient Boosting (“xgbTree”).
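The “up” and “down” strategies can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the caret implementation: the function name is hypothetical, and SMOTE (which synthesizes new minority samples) is not covered.

```python
import random


def resample(rows, labels, mode, seed=0):
    """Balance classes by random over-sampling ("up") or under-sampling ("down")."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    sizes = [len(members) for members in by_class.values()]
    target = max(sizes) if mode == "up" else min(sizes)
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        if len(members) < target:
            # Over-sampling: keep all rows and add random draws with replacement.
            extra = [rng.choice(members) for _ in range(target - len(members))]
            chosen = members + extra
        elif len(members) > target:
            # Under-sampling: keep a random subset without replacement.
            chosen = rng.sample(members, target)
        else:
            chosen = list(members)
        out_rows.extend(chosen)
        out_labels.extend([y] * target)
    return out_rows, out_labels
```

Resampling is meant to be applied to training data only (within each cross-validation fold), so that the validation and test sets keep the original class distribution.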
We found the best hyperparameters for our models by using the tuneGrid parameter of the caret::train() function. The hyperparameters or tuning parameters in Extreme Gradient Boosting were:
nrounds: It gives the maximum number of iterations. We set nrounds = 500.
eta: It controls how much information from a new tree will be used in the boosting. This parameter must be bigger than 0 and limited to 1. Big values of eta result in a faster convergence and more overfitting problems. Small values may need too many trees to converge. We set eta = c(0.01, 0.05, 0.1).
max_depth: It controls the maximum depth of the trees. Overfitting can be avoided with a smaller depth of the tree. We set max_depth = c(2, 3, 4, 5).
gamma: It prevents overfitting. We set gamma = 0.
colsample_bytree: It corresponds to the fraction of variables to use. We set colsample_bytree = 1.
min_child_weight: We set min_child_weight = 1.
subsample: It controls the number of samples given to a tree. We set subsample = 1.
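To make the size of the search space explicit, the tuning grid above can be enumerated with a short Python sketch (the study passed the equivalent grid to caret through tuneGrid; the helper name `expand_grid` is hypothetical).

```python
from itertools import product

# Candidate values for each Extreme Gradient Boosting hyperparameter,
# as listed in the text.
grid_spec = {
    "nrounds": [500],
    "eta": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4, 5],
    "gamma": [0],
    "colsample_bytree": [1],
    "min_child_weight": [1],
    "subsample": [1],
}


def expand_grid(spec):
    """Return one dict per combination of the candidate values."""
    names = list(spec)
    return [dict(zip(names, combo)) for combo in product(*(spec[n] for n in names))]


grid = expand_grid(grid_spec)
```

Only eta and max_depth vary, so the grid contains 3 × 4 = 12 candidate settings, each of which is evaluated by cross-validation.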
The Random Forest model creates many decision trees and, at each split of each tree, the algorithm randomly selects a subset of predictors, which helps to de-correlate the trees. The number of randomly selected predictors, called mtry in caret, is a tuning parameter and affects the performance of the model.
For each different value of mtry, caret performs a separate cross-validation. Once caret determines the best value for the tuning parameter, it runs the model one more time using this value. All the observations and the best value for the tuning parameter are stored in finalModel. On the other hand, Treebag selects all the predictors and does not need any hyperparameter.
We used the predict() function (dispatched to caret’s method for train objects) with the trained model to predict values for the testing dataset, which comprised 20% of the data.
Ranked correlations of class 1, variable importance, variable selection, and stacking
We used the lares::corr_var() function to explore the relationships between the target variable (class 1) and the remaining variables, and the caret::varImp() function to determine the most important variables for our algorithms. The output of varImp() was passed to the caret::dotPlot() function to generate visualizations.
The variables of the models with the best performance were ranked by importance in descending order, and only the variables with importance exceeding 20% were kept. We built new models based on all retained variables and selected those with the highest sensitivity and specificity. The variables of these new best models with near-zero importance were removed and the algorithms were trained again. Finally, we selected the best model with the smallest number of variables.
We tried to improve our results by stacking some algorithms. The stacking method was divided into three parts. First, the dataset was divided into training subsets and a testing subset. Second, the first-level learners (ML algorithms chosen for their performance) were trained and tested. Their outputs were then used as variables to create a new dataset for training the meta-learner (or meta-classifier), with the same class labels as the original dataset. Third, the meta-learner was trained on this new dataset. This method raises two design questions: 1) which algorithms should be used as first-level learners and 2) which algorithm should be used as meta-learner.
We used the caretEnsemble::caretList() function to build a list of uncorrelated first-level learners with the same trControl and resampling method (down, up or smote), which was later passed to the caretEnsemble::caretStack() function. We set the “glm” algorithm as meta-learner and used the predefined hyperparameters in 3.4.
Performance metrics: Confusion matrix, ROC, and area under the curve
The performance of the models was evaluated using the confusion matrix to derive sensitivity and specificity measures (for details see Supplementary Material Section 1).
ROC analysis was also used for evaluating and visualizing the performance of classifiers. We started the evaluation of classifiers creating a prediction object with the ROCR::prediction() function. Using this object and the ROCR::performance() function, we also created a performance object. Then, we calculated and plotted the ROC curve, and the Area Under the Curve (AUC) (more details can be found in Supplementary Material Section 2).
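As a complement to the ROC analysis described above, the AUC can be illustrated in Python through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as one half). This sketch is not the ROCR computation used in the study, and the function name is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.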
Random splits and optimal thresholds
Because each model had been selected on the basis of a single test set, we added this analysis, which may be of value for future studies in which the sample size is small and resampling alone is not enough.
With the algorithm found to be optimal (“Extreme Gradient Boosting with over-sampling of the minority class”, see Table 6), we ran a performance analysis with 50 different seeds (i.e., 50 random train/test splits). In each iteration, the following steps were carried out: the training set was pre-processed; the algorithm was fitted on the training set; the testing set was transformed using the parameters learned during the pre-processing of the training set; and, finally, the performance of the model was evaluated on the testing set. To evaluate performance, the AUC was calculated, as well as sensitivity and specificity at an optimal cutoff point defined by the Youden index.
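The Youden index mentioned above is J = sensitivity + specificity − 1, and the optimal cutoff is the threshold that maximizes J. A minimal Python sketch follows; the function name, the convention that scores at or above the threshold are classified as positive, and the tie-breaking rule (lowest such threshold) are assumptions made for the example.

```python
def youden_threshold(scores, labels):
    """Return (J, threshold, sensitivity, specificity) at the maximal Youden index.

    A case is predicted positive when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens = tp / n_pos
        spec = tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best
```

In a train/test setting, the cutoff would usually be chosen on training (or validation) predictions and then applied unchanged to the test set.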
Performance metrics of ensemble Machine Learning algorithms, trained with left hippocampus, vermis_7, right precentral gyrus, right cerebellum_3, age and GDS variables
Bold type indicates the algorithms with the best performance in each class.
RESULTS
Feature sets 1a and 2 allowed us to develop ML models that had the ability to accurately predict SCD conversion.
Detecting multicollinearity using pair-wise correlation coefficients and variance inflation factor
No high correlation between independent variables was found in the feature set 1a. By contrast, several pairs of independent variables with high positive/negative correlation were observed in the feature sets 1b and 2.
Although complete elimination of multicollinearity was not possible, the degree of multicollinearity was reduced by, first, removing highly correlated numerical independent variables and, second, reducing the VIFs of the remaining variables to acceptable levels below 5 (Tables 1 and 2).
Variable names of feature set 1b with VIF values that fall below the threshold 5
Variable names of feature set 2 with VIF values that fall below the threshold 5
The variable names correspond to the labels in the AAL2 atlas.
Descriptive statistics
Descriptive statistics of non-correlated numerical variables are reported in the Supplementary Tables 5–7. Statistics of numerical variables are reported before the standardization was applied. Descriptive statistics of categorical variables are displayed in Supplementary Table 8. It is worth noting that participants with SCD progression were older than those without SCD progression (p < 0.05). Significant differences were also detected in education level and grey matter volume in the following regions: left hippocampus, right precentral gyrus, and right thalamus (p < 0.05).
Performance of classifiers
We analyzed the effects of the down/up/smote-sampling procedures on classifier performance by comparing the results obtained from training with and without handling the data imbalance. Non-balanced algorithms (originals) trained with feature sets 1a, 1b and 2 achieved sensitivities that were clearly lower than their specificities (for example, the specificity of the original Random Forest model was 1, but its sensitivity was only 0.17). Detailed quantitative results are presented in Table 3.
Performance metrics of ensemble Machine Learning models, trained with feature sets 1a, 1b and 2
For each of the feature sets, we selected the classifiers with the best performance based on one train/test split. The classifiers chosen were Random Forest down for feature set 1a and Xgbtree smote for feature set 2, since they had the best outcomes in comparison to the others. The decision was based on the sensitivity and specificity values, which are reported for each of the classifiers in Table 3. For more detailed results of the 10-fold cross-validation, see Supplementary Figures 2–4. All models showed low sensitivity with feature set 1b, which underlines its limited predictive value. Figure 1 shows the ROC curves for the best model on each feature set.

Comparison of ROC curves for the best models on feature sets 1a (Random Forest down), 1b (Random Forest smote) and 2 (Xgbtree smote).
Importance of the variables
In the current study we decided to focus only on the predictors selected by the ensemble ML algorithms with the best performance. While batteries of neuropsychological tests (feature set 1b) were not particularly predictive, feature set 1a and feature set 2 were more relevant for the prediction of SCD conversion.
As shown in the variable importance analysis of Figs. 2 and 3, the top variables (>20% importance) were age, GDS, education and gender, provided by the Random Forest down algorithm trained with feature set 1a, and left hippocampus, vermis_7, right precentral gyrus and right cerebellum_3 volumes, provided by the Xgbtree smote algorithm trained with feature set 2. These variables were used as inputs to create a new feature set.

Random Forest down top variable importance of feature set 1a.

Xgbtree smote top variable importance of feature set 2.
Table 4 shows that the highest pair of sensitivity and specificity values (1.00 and 0.92, respectively) was achieved by the Random Forest up and Xgbtree up algorithms trained with the top variables left hippocampus, vermis_7, right precentral gyrus, right cerebellum_3, age, GDS, education, and gender.
Performance metrics of new ensemble Machine Learning algorithms trained with top variables left hippocampus, vermis_7, right precentral gyrus, right cerebellum_3, age, GDS, education, and gender
Bold type indicates the algorithms with the best performance.
Table 5 shows the performance metrics of Random Forest up and Xgbtree up algorithms trained after removing variables with near-zero importance.
Performance metrics of Random Forest up and Xgbtree up algorithms after removing variables with near-zero importance
Confusion matrix
Using left hippocampus, vermis_7, right precentral gyrus, right cerebellum_3, age and GDS as inputs, the best classifier, Xgbtree up, generated the following confusion matrix: TP = 6, TN = 12, FP = 1 and FN = 0 (Fig. 4). This classifier correctly predicted all positive instances (6 out of 6) and most negative instances (12 out of 13).

Confusion matrix of Xgbtree up algorithm, trained with left hippocampus, vermis_7, right precentral gyrus, right cerebellum_3, age and GDS variables.
As a last step, we plotted the ROC curve and calculated the AUC, which was 0.96 (Fig. 5).
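The sensitivity and specificity reported for this classifier follow directly from the confusion matrix counts given in the text (TP = 6, TN = 12, FP = 1, FN = 0). A minimal Python check, with hypothetical function names:

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


# Counts taken from the confusion matrix of the Xgbtree up classifier.
sens = sensitivity(tp=6, fn=0)   # 6 / 6
spec = specificity(tn=12, fp=1)  # 12 / 13
```

Here `sens` equals 1.00 and `spec` rounds to 0.92, matching the values reported for the test set of 19 subjects.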

Receiver operating characteristic curve of Xgbtree up algorithm, trained with left hippocampus, vermis_7, right precentral gyrus, right cerebellum_3, age and GDS variables.
Stacking
Table 6 shows that all algorithms except Treebag up achieved a sensitivity of 1.00, with specificities ranging from 0.54 to 0.92.
The score to beat was the Xgbtree up specificity of 0.92. We defined two levels: the first with the uncorrelated Random Forest up and Xgbtree up algorithms, and the second with the meta-learner glm. Treebag down was not considered because of its correlation with Random Forest up.
With stacking, our meta-learner did not generalize better than the single algorithm Xgbtree up (sensitivity = 1, specificity = 0.92); that is, it did not make better predictions on unseen data (sensitivity = 1, specificity = 0.85). This suggests that it is sometimes hard for a meta-learner to capture patterns that the baseline algorithms have not already captured.
Random splits and optimal thresholds
The generalizability of the results obtained with the Xgbtree up algorithm was tested in an iterative procedure, starting with 50 random train/test splits. The median performance metrics were: sensitivity of 0.83 (IQR, 0.17), specificity of 0.77 (IQR, 0.23), and AUC of 0.75 (IQR, 0.11).
The results of each of the random splits can be found in Supplementary Table 9.
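The repeated-split procedure can be sketched as follows. This Python example is a minimal illustration under stated assumptions: the 20% test fraction, the stratified sampling, the helper names, and the example metric values are all hypothetical, and the quartile convention used for the IQR may differ from the one used in the study.

```python
import random
import statistics


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), sampling each class separately."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def median_iqr(values):
    """Median and interquartile range (inclusive quartile method)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1


# 32 converters (1) and 67 stable subjects (0), as in this cohort.
labels = [1] * 32 + [0] * 67
rng = random.Random(0)  # fixed seed so the splits are reproducible
splits = [stratified_split(labels, 0.2, rng) for _ in range(50)]

# A model would be trained and evaluated once per split; the values below
# are hypothetical per-split sensitivities used only to show the summary step.
example_sensitivities = [0.5, 0.6, 0.7, 0.8, 0.9]
med, iqr = median_iqr(example_sensitivities)
```

Reporting the median with the IQR, rather than a single split, gives a less optimistic and more stable picture of performance when the test set contains only a handful of positive cases.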
DISCUSSION
Given a group of subjects with SCD, which of them will develop MCI/AD dementia, and which will not? Answering this question was the aim of this study. This binary classification task has been efficiently addressed by supervised ML techniques such as Bagging, Boosting and Stacking.
Multicollinearity, a state in which two or more independent variables are highly correlated, can lead to misleading results when determining which inputs can be used to predict the dependent variable in a model [46]. It creates a problem because the inputs are not actually independent; they therefore compete in the variable importance analysis. Since multicollinearity diagnostic measures have different detection criteria, multiple diagnostics need to be studied. The most widely used and suggested are: pair-wise correlation coefficient, R-squared value [47], VIF, tolerance limit [48], eigenvalues [49], condition number and condition index [50], Leamer’s method [51], Klein’s rule [52], the red indicator [53], and Theil’s measure [54]. There is no clear-cut criterion for evaluating multicollinearity in models. Threshold values of many diagnostic measures are subjective, as no standard values exist. Moreover, different multicollinearity detection methods are not comparable with each other. For these reasons, many analysts rely on more than one multicollinearity diagnostic measure.
To evaluate binary classifications and their confusion matrices, researchers can employ several statistical rates, according to the goal of the experiment. Despite this being a crucial issue in ML, no widespread consensus has yet been reached on a single preferred measure. Accuracy (the ratio between the number of correctly classified samples and the overall number of samples) computed on confusion matrices has been (and still is) among the most widely adopted metrics in binary classification tasks. However, when the dataset is imbalanced (the number of samples in one class is much larger than the number of samples in the other classes), accuracy can no longer be considered a reliable measure, because it provides an overoptimistic estimation of the classifier’s ability on the majority class. The skewed distribution makes many ML algorithms less effective in predicting the minority class: it is usually easier for most classifiers to correctly classify samples from the majority class, and our non-balanced models indeed predicted more negative than positive cases. Most standard algorithms expect a balanced class distribution, so when a non-balanced dataset is used, these algorithms provide unfavorable results. Hence, we observed that adopting pre-processing strategies in datasets containing classes that are under-represented in comparison to others may introduce important benefits for data analysis.
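The point about accuracy can be made concrete with the class sizes of this cohort. In the Python sketch below, a trivial rule that labels every subject as stable reaches an accuracy of about 0.68 while detecting no converters. The second part shows random over-sampling of the minority class, which is one common balancing strategy; the helper names and the integer stand-ins for feature rows are hypothetical, and this is not the authors' R implementation.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p == 1 for p in preds_for_positives) / len(preds_for_positives)


# Class sizes from this study: 32 converters (1) and 67 stable subjects (0).
y_true = [1] * 32 + [0] * 67
y_majority = [0] * len(y_true)  # trivial rule: always predict "stable"

acc = accuracy(y_true, y_majority)      # 67/99, roughly 0.68
sens = sensitivity(y_true, y_majority)  # 0.0: no converter is detected


def oversample_minority(rows, labels, rng):
    """Duplicate minority rows at random until every class matches the largest."""
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement
        new_rows.extend(extra)
        new_labels.extend([cls] * len(extra))
    return new_rows, new_labels


rows = list(range(len(y_true)))  # hypothetical stand-ins for feature rows
balanced_rows, balanced_labels = oversample_minority(rows, y_true, random.Random(0))
```

In practice, resampling of this kind is normally applied to the training portion only, so that duplicated subjects do not leak into the test set.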
Our best Xgbtree up algorithm, based on a reduced set of six variables (age, GDS, right precentral gyrus, left hippocampus, right cerebellum_3 and vermis_7), demonstrated better or similar predictive performance compared to previously published algorithms [55]. The prediction results of this classifier are heavily dependent on the values chosen for the tuning parameters. Although we adopted a parameter optimization procedure based on grid search, a more exhaustive study of how parameter optimization affects the classifier’s performance, combined with the application of other optimization techniques, could lead to even better performance. However, this analysis is beyond the scope of our work.
It is also important to note that Extreme Gradient Boosting is functionally similar to Random Forest, which could explain the similar results observed for these two methods. However, whereas Random Forest builds the trees in parallel and these trees “vote” simultaneously on the preferred class during prediction, Extreme Gradient Boosting creates a series of trees in which each tree incrementally improves the prediction [56].
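The structural difference between the two approaches can be illustrated with a toy example. The Python sketch below uses one-split trees (stumps) on a single hypothetical feature with a numeric target: bagging fits independent stumps on bootstrap samples and averages them, whereas boosting fits each new stump to the residuals left by the previous ones. It is plain gradient boosting with squared-error loss, not Extreme Gradient Boosting itself (which adds regularization and other refinements), and all names and data are illustrative assumptions.

```python
import random


def fit_stump(xs, ys):
    """Least-squares stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # x <= t goes left, x > t goes right
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x identical: fall back to a constant prediction
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def bagging(xs, ys, n_trees, rng):
    """Independent stumps on bootstrap samples; predictions are averaged."""
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_stump(s, x) for s in stumps) / len(stumps)


def boosting(xs, ys, n_trees, lr=0.5):
    """Sequential stumps, each fitted to the residuals of the current model."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = list(range(10))       # hypothetical single feature
ys = [x * x for x in xs]   # hypothetical target
bagged = bagging(xs, ys, n_trees=25, rng=random.Random(0))
boosted_1 = boosting(xs, ys, n_trees=1)
boosted_20 = boosting(xs, ys, n_trees=20)
```

Comparing `mse(boosted_1, xs, ys)` with `mse(boosted_20, xs, ys)` shows the incremental improvement contributed by each additional tree in the series, which has no counterpart in the bagged ensemble, where trees are built without reference to one another.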
As mentioned above, SCD is characterized by self-experienced cognitive decline that is not yet detectable by neuropsychological tests. The tests may not detect early changes in modulation of cognition and behavior, even though these may be noticed by patients and family members [57]. In our study, neuropsychological test scores were not good predictors of conversion to MCI/AD dementia in subjects with SCD. In contrast, in a study by Fleisher et al. [58], neuropsychological tests provided better predictive results than imaging measures in subjects with MCI, and in another study by Clark et al. [59], models developed using only socio-demographic information, clinical information and neuropsychological test scores resulted in an AUC score of 0.87 and a balanced accuracy of 0.84, while including brain imaging did not improve this performance (AUC = 0.81, accuracy = 0.83).
The socio-demographic variables education and gender do not seem to be particularly relevant in predicting the conversion to MCI/AD dementia, as shown by the variable importance analysis. Hu et al. [60] also concluded that education and gender did not seem to significantly influence AD conversion. Age was the only socio-demographic characteristic showing predictive value. GDS score showed predictive value as well. SCD and depressive symptomatology (assessed with the GDS) commonly co-occur in older adults; however, it is unclear whether they are independent risk factors for dementia [61], and the role of depressive symptomatology in current diagnostic criteria of SCD is intensively discussed [62].
MRI-based results are consistent with published findings supporting the crucial role of the hippocampus in MCI/AD dementia prediction. Most of these studies have found a decrease in hippocampal volume [65–68]. Cantero et al. [67] reported an association between lower volumes in specific hippocampal regions and cognitive vulnerability in SCD. As the hippocampus plays an important role in episodic memory [68], it is expected that in the preclinical AD phase, memory performance would be related to anatomical modifications associated with hippocampal volume loss, even if the measured memory changes remain within the range of normality. Another brain area with predictive value was the precentral gyrus. Ribeiro and Filho [69] have reviewed different studies reporting grey matter atrophy in precentral gyrus associated with AD dementia and MCI.
In addition, our results showed the predictive value of grey matter volumes in two cerebellar ROIs. The question that arises concerning the cerebellum is whether cerebellar changes are relevant to predict SCD conversion to MCI/AD dementia or merely a consequence of other pathological events. It seems that as the disease progresses, total cerebellar grey matter volume declines, affecting the vermis and posterior lobe in the early stages of the disease [70]. Jacobs et al. and Guo et al. [71, 72] provided evidence that the cerebellum is linked to specific cognitive circuits which are vulnerable to AD. The insidious decline in monitoring, speed and consistency of information processing and cognitive performance in preclinical AD supports the hypothesis that there is a cerebellar role in the pathophysiology of the disorder. Increased activity in certain brain regions may suggest functional compensation in these areas at the early stage of AD. Regarding brain regions of SCD subjects, the fMRI study by Ying et al. [73] showed a significant increase in the activity of the right posterior lobe of the cerebellum compared to the control group.
Limitations
When the aim of a classifier is to understand the relationship between the independent variables and the dependent variable, the multicollinearity problem needs to be addressed to prevent correlated variables from competing in explaining the dependent variable. It is necessary to define the best threshold at which multicollinearity should be assessed. Although a threshold of 0.75 is the most common, a more restrictive one of 0.5 can also be applied.
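The effect of the chosen threshold can be seen in a small pairwise-correlation filter. The Python sketch below is a simplified greedy filter written for illustration only: the feature names and values are hypothetical, and the rule used (keep a variable only if its absolute Pearson correlation with every variable already kept does not exceed the threshold) is an assumption, not the procedure used in this study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_correlated(features, threshold):
    """Greedy filter: keep a feature only if |r| <= threshold with all kept ones."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical regional volumes for five subjects.
features = {
    "roi_a": [1, 2, 3, 4, 5],
    "roi_b": [2, 4, 6, 8, 10.5],  # almost a multiple of roi_a (r close to 1)
    "roi_c": [1, 3, 2, 5, 3],     # moderately correlated with roi_a (r about 0.64)
}

kept_075 = filter_correlated(features, threshold=0.75)  # ['roi_a', 'roi_c']
kept_050 = filter_correlated(features, threshold=0.50)  # ['roi_a']
```

The moderately correlated variable survives the 0.75 threshold but is removed at 0.5, which shows how the choice of threshold changes the set of variables that later compete in the importance analysis.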
Researchers use confusion matrices to evaluate binary classification problems; therefore, the availability of a unified statistical rate able to correctly represent the quality of a binary prediction is essential. Accuracy, although popular, can generate misleading results on imbalanced datasets because classifiers tend to assign all instances to the majority class. To avoid this problem, some resampling strategy should be used.
Another well-known limitation in clinical research is the size of datasets. Among 99 subjects with a clinical diagnosis of SCD, 32 progressed to MCI/AD dementia (32/99 = 32.32%). A 95% confidence interval for this proportion based on a sample of size 99 ranged from 23.11% to 41.53%. Alternatively, we can express this interval by saying that the proportion was 32.32% with a margin of error of ±9.21%. Correct handling of small datasets in ML is a challenge.
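The reported interval appears consistent with the normal-approximation (Wald) interval for a proportion, p ± z·sqrt(p(1 − p)/n) with z = 1.96. The Python sketch below reproduces the calculation under that assumption; the function name is hypothetical, and the last digit of the upper bound can differ slightly depending on whether the proportion and margin are rounded before being added.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    Returns (proportion, margin_of_error, lower_bound, upper_bound).
    """
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin, p - margin, p + margin


# 32 of the 99 SCD subjects progressed to MCI/AD dementia.
p, margin, lower, upper = wald_interval(32, 99)
print(f"{p:.2%} with a margin of error of {margin:.2%}")
```

A margin of roughly nine percentage points on the conversion rate alone illustrates how much uncertainty a sample of 99 subjects carries before any model is fitted.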
Conclusions
We used ensemble ML techniques to develop algorithms able to identify which variables are important to predict SCD conversion, and we achieved high predictive performance, comparable to the best algorithms available in the literature.
Although regional MRI grey matter volumes are potentially important predictors of MCI/AD dementia, this study shows that predicting progression using only top MRI variables is not the best solution. Our results add value, extending published findings to other brain regions not limited to hippocampus, and more importantly, to the very early SCD stage of the disease. Combining the socio-demographic variable age and the clinical variable GDS with the MRI variables right precentral gyrus, left hippocampus, right cerebellum_3 and vermis_7 grey matter volumes, led to an increase in the performance, although the degree of such increase in performance was dependent on the type of classifier. Our best ML model achieved a classification sensitivity of 1.00, a specificity of 0.92, and an AUC of 0.96 based on one train/test split, or a sensitivity of 0.83 (IQR, 0.17), a specificity of 0.77 (IQR, 0.23) and an AUC of 0.75 (IQR, 0.11) based on fifty random train/test splits.
Since this study showed promising results to predict the conversion from SCD to MCI/AD dementia, future research should focus on using larger samples to reach the level of reliability necessary for its applicability.
Footnotes
ACKNOWLEDGMENTS
The authors thank Miguel Fernández for his assistance with image processing.
FUNDING
The authors have no funding to report.
CONFLICT OF INTEREST
The authors have no conflict of interest to report.
DATA AVAILABILITY
The code used to generate and test the ML models is available on GitHub.
The code:
Feature set_1a_complete.Rmd
Feature set_1b_complete.Rmd
Feature set_2_complete.Rmd
Project name: Machine-Learning-in-Alzheimer-s-Research
Project home page: https://github.com/projectsmartamdn/Machine-Learning-in-Alzheimer-s-Research
Archived version:
Feature set_1a_complete.Rmd
Feature set_1b_complete.Rmd
Feature set_2_complete.Rmd
Operating system(s): macOS
Programming language: R
Other requirements: Excel 365
License: RStudio (version 4.2.2), Microsoft Office 365
