Abstract
Background:
Amyloid-β (Aβ) evaluation in amnestic mild cognitive impairment (aMCI) patients is important for predicting conversion to Alzheimer’s disease. However, Aβ evaluation through Aβ positron emission tomography (PET) is limited due to high cost and safety issues.
Objective:
We therefore aimed to develop and validate prediction models of Aβ positivity for aMCI using optimal interpretable machine learning (ML) approaches utilizing multimodal markers.
Methods:
We recruited 529 aMCI patients from multiple centers who underwent Aβ PET. We trained ML algorithms using a training cohort (342 aMCI from Samsung Medical Center) with two-phase modelling: model 1 included age, gender, education, diabetes, hypertension, apolipoprotein E genotype, and neuropsychological test scores; model 2 included the same variables as model 1 with additional MRI features. We used four-fold cross-validation during the modelling and evaluated the models on an external validation cohort (187 aMCI from the other centers).
Results:
Model 1 showed good accuracy (area under the receiver operating characteristic curve [AUROC] 0.837) in cross-validation, and fair accuracy (AUROC 0.765) in external validation. Model 2 led to improvement in the prediction performance with good accuracy (AUROC 0.892) in cross validation compared to model 1. Apolipoprotein E genotype, delayed recall task scores, and interaction between cortical thickness in the temporal region and hippocampal volume were the most important predictors of Aβ positivity.
Conclusion:
Our results suggest that ML models are effective in predicting Aβ positivity at the individual level and could help the biomarker-guided diagnosis of prodromal AD.
INTRODUCTION
Amnestic mild cognitive impairment (aMCI) is considered a preceding phase of Alzheimer’s disease (AD) dementia. Patients with aMCI progress to AD dementia at a rate of 10–15% per year, averaging around 60% after 5 years [1]. However, progression varies substantially among individuals with aMCI. The deposition of amyloid-β (Aβ), a well-recognized pathological hallmark of AD, is a strong predictor of conversion to AD in patients with aMCI. Only 40–60% of patients with aMCI have been shown to be Aβ-positive [2, 3], and amyloid-positive (Aβ (+)) patients with aMCI have a 4–9 fold higher conversion rate to AD than their amyloid-negative (Aβ (-)) counterparts [4–6]. Therefore, evaluating Aβ in patients with aMCI is important for predicting conversion to AD.
Common methods to evaluate the deposition of Aβ are Aβ positron emission tomography (PET) and Aβ levels in the cerebrospinal fluid (CSF) [7, 8]. Aβ PET is a non-invasive procedure that is more widely available and has a higher reliability with repeated assessment as well as between centers, when compared to the CSF measurement of Aβ. Despite these advantages, Aβ PET has limitations, which include its relatively low availability due to high costs and safety issues including radiation exposure. In particular, primary physicians have not been able to use Aβ PET neuroimaging to determine Aβ positivity in patients with aMCI in the clinical setting. Considering these limitations, a predictive model for Aβ PET positivity in patients with aMCI would be of significant benefit, because it would allow clinicians to identify those at high risk of Aβ deposition. Further, the accessibility of examinations and their cost relative to the expected result should also be considered [9]. Therefore, models should account for cost-effectiveness: such models could include a low-cost model using clinical features, and a high-cost model using medical imaging data such as magnetic resonance imaging (MRI) scans of patients.
Advances in technology have allowed machine learning (ML) to provide alternative prediction models. Supervised learning methods can analyze high-dimensional data and capture complex non-linear interactions, which are not addressed by traditional prediction models [10–13]. However, although ML can yield more accurate predictions, its use in the clinical field remains limited by several disadvantages. In AD research, more traditional methods such as logistic regression are typically used [14–16] to predict Aβ PET positivity rather than recent ML-based prediction models. Regarding the prediction of Aβ PET positivity, no study has compared the accuracy of ML methods with that of traditional methods, although ML models have been compared with one another [17]. Therefore, a comparative study is needed to evaluate the utility of ML methods relative to existing methods [18]. Furthermore, since the internal process of ML is difficult to understand and interpret, interpretable ML methods are also needed. Finally, ML can overfit the data and produce unstable results, so performance must be demonstrated through external validation [19, 20]. However, performing external validation across a health care system is challenging.
The aim of our study was therefore to develop and validate ML approaches to predict Aβ positivity in patients with aMCI. We compared performance across traditional prediction models, such as logistic regression, and ML methods. Furthermore, we evaluated whether adding brain imaging information improved prediction performance. After selecting the model with the highest performance, we included as many variables as possible in the interpretable ML model to identify which variables were important. We hypothesized that various clinical, neuropsychological, and neuroimaging features of aMCI would be associated with Aβ positivity, and that a combination of these features could make it feasible to accurately predict Aβ positivity at the individual level.
MATERIALS AND METHODS
Participants
We recruited 529 patients with aMCI who underwent 18F-florbetaben PET (N = 389) or 18F-flutemetamol PET (N = 140) between February 2015 and June 2019. These patients came from two different cohorts for training and external validation: 342 patients with aMCI from Samsung Medical Center (SMC) for the training cohort, and 187 patients with aMCI from multiple centers (48 patients with aMCI from Kyung Hee University Hospital, and 139 patients with aMCI from Dong-A University Medical Center) for the external validation cohort. All participants with aMCI met the following criteria [1, 21]: 1) subjective memory complaints by the participants or caregiver; 2) objective memory impairment below -1.0 SD on verbal or visual memory tests; 3) no significant impairment in activities of daily living; 4) non-demented. All participants underwent a comprehensive evaluation for dementia, including a clinical interview, neurological examination, standardized neuropsychological tests, blood tests including apolipoprotein E (APOE) genotyping, and brain MRI. We excluded patients who had secondary causes of cognitive deficits confirmed with laboratory tests, such as vitamin B12, syphilis serology, and thyroid/renal/hepatic function tests, as well as those with structural lesions on conventional brain MRI such as territorial infarction, intracranial hemorrhage, brain tumor, hydrocephalus, or severe white matter hyperintensities, according to the modified Fazekas ischemic scale [22]. Patients with other types of degenerative disease such as progressive supranuclear palsy, cortico-basal syndrome, frontotemporal dementia, or Lewy body/Parkinson’s disease dementias were also excluded.
The institutional review boards at SMC approved this study. Written informed consent was obtained from the patients and caregivers.
Neuropsychological tests
All participants underwent the Seoul Neuropsychological Screening Battery 2nd edition (SNSB-II) [23, 24], but a small number of participants could not complete all of the tests. We used the tests that provided numeric scores, including the Digit Span Tests (forward and backward), the Korean version of the Boston Naming Test (K-BNT), the Rey-Osterrieth Complex Figure Tests (RCFT; copying, immediate recall, delayed recall, and recognition), the Seoul Verbal Learning Tests (SVLT; immediate recall, delayed recall, and recognition), the semantic Controlled Oral Word Association Tests (COWAT), and the Stroop Test (color reading). The results with numerical continuous values were used in the analysis.
The Digit Span Tests (forward and backward) were used to evaluate the patient’s attention and working memory, and the K-BNT score is associated with naming ability. The SVLT (immediate recall, delayed recall, and recognition) and RCFT (immediate recall, delayed recall, and recognition) scores are related to verbal memory and visual memory, respectively. The RCFT copying test was used for the evaluation of visuospatial function, and the semantic COWAT and Stroop tests were used to evaluate frontal executive function.
Acquisition of Aβ PET and determination of Aβ PET positivity
All participants underwent Aβ PET: 342 scans (228 18F-florbetaben [FBB] PET and 114 18F-flutemetamol [FMM] PET) were acquired at SMC using a Discovery STe PET/CT scanner (GE Medical Systems, Milwaukee, WI, USA); 139 scans (113 18F-florbetaben PET and 26 18F-flutemetamol PET) at Dong-A University Medical Center using a Biograph mCT Flow scanner (Siemens Medical Solution USA, Knoxville, TN, USA); and 48 18F-florbetaben PET scans at Kyung Hee University Hospital using a Gemini TF16 scanner (Philips Healthcare, Cleveland, OH, USA). For 18F-florbetaben PET and 18F-flutemetamol PET, a 20 min emission PET scan in dynamic mode (consisting of 4×5 min frames) was performed 90 min after an injection of a mean dose of 311.5 MBq 18F-florbetaben and 197.7 MBq 18F-flutemetamol, respectively. Three-dimensional PET images were reconstructed in a 128×128×48 matrix with 2 mm×2 mm×3.27 mm voxel size using the ordered-subsets expectation maximization (OSEM) algorithm (18F-florbetaben, iteration = 4 and subset = 20; 18F-flutemetamol, iteration = 4 and subset = 20). The median time interval between the neuropsychological tests and Aβ PET was two months (interquartile range, 5.5 months).
Aβ PET images were rated by two experienced doctors (one nuclear medicine physician and one neurologist) who were blinded to clinical information, and the images were dichotomized as either Aβ-positive or -negative, using visual reads. They discussed the discordant results regarding the Aβ positivity in order to achieve consensus. 18F-florbetaben PET was classified as positive if the Aβ plaque load on the florbetaben PET scan was visually rated as 2 or 3 on the brain amyloid plaque load scoring system, and 18F-flutemetamol PET was considered positive when one of five brain regions (frontal, parietal, posterior cingulate and precuneus, striatum, and lateral temporal lobes) systematically reviewed using the flutemetamol PET was positive in either hemisphere [16].
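The visual-read rule described above can be summarized as a small decision function. The following is a minimal sketch, not the raters' actual procedure: the function names, the region keys, and the assumption that the plaque load score takes values 1 to 3 are illustrative only.

```python
# Hypothetical region keys for the five regions reviewed on flutemetamol PET.
FMM_REGIONS = (
    "frontal",
    "parietal",
    "posterior_cingulate_precuneus",
    "striatum",
    "lateral_temporal",
)


def fbb_positive(plaque_load_score: int) -> bool:
    """Florbetaben: positive when the plaque load score is 2 or 3 (assumed scale 1-3)."""
    if plaque_load_score not in (1, 2, 3):
        raise ValueError("plaque load score is assumed to be 1, 2, or 3")
    return plaque_load_score >= 2


def fmm_positive(region_reads: dict) -> bool:
    """Flutemetamol: positive when any of the five regions is positive in either hemisphere.

    region_reads maps each region name to a (left_positive, right_positive) pair.
    """
    return any(
        region_reads[region][0] or region_reads[region][1]
        for region in FMM_REGIONS
    )
```

For example, a flutemetamol scan read as positive only in the right striatum would be classified as Aβ-positive under this rule.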
Acquisition of three-dimensional MRI images
We acquired three-dimensional T1-weighted turbo field echo MRI scans from the training cohort, including 342 participants with aMCI at the SMC, using a 3.0T MRI scanner (Philips 3.0T Achieva) with the following imaging parameters: sagittal slice thickness, 1.0 mm with 50% overlap; no gap; repetition time = 9.9 ms; echo time = 4.6 ms; flip angle = 8°; and matrix size = 240 pixels×240 pixels reconstructed to 480 pixels×480 pixels over a field of view of 240 mm.
MRI data processing for cortical thickness measurements
Images were processed using the CIVET anatomical pipeline (version 2.1.0) [25]. The native MRI images were registered to the Montreal Neurological Institute-152 template using a linear transformation [26] and corrected for intensity non-uniformities using the N3 algorithm [27]. The registered and corrected images were divided into white matter, grey matter, CSF, and background. The Constrained Laplacian-based Automated Segmentation with Proximities algorithm [28] extracted the surfaces of the inner and the outer cortices automatically. The inner and outer surfaces had the same numbers of vertices, and there was a close relationship between the counterpart vertices of the inner and outer cortical surfaces. Cortical thickness, which was defined as the Euclidean distance between the linked vertices of the inner and outer surfaces [29], was not calculated in Talairach space but instead in native brain space, due to the limits of linear stereotaxic normalization. As expected, there was a significant positive correlation between cortical thickness and intracranial volume (ICV) in native space [30]. Controlling for ICV, which reflected brain size effect, was necessary to compare cortical thickness among participants. In a previous study [30], our group proposed that the measurement of native space cortical thickness, followed by analyses that include brain size as a covariate, is an efficient method and explained the relationship between cortical thickness and brain size in detail. ICV is defined as the total volume of grey matter, white matter, and CSF, and is calculated by measuring the total volumes of the voxels within the brain mask, which were made by the Functional Magnetic Resonance Imaging of the Brain Software Library Brain Extraction Tool algorithm [31]. 
After extracting cortical surface models from the MRI volumes transformed into stereotaxic space, cortical thickness was measured in native space by applying an inverse transformation matrix to the cortical surfaces and reconstructing them in native space [32].
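The thickness definition used here (Euclidean distance between linked inner and outer surface vertices) and the voxel-based ICV can be illustrated with a few lines of code. This is only a sketch under stated assumptions: the function names are hypothetical, the surfaces are given as plain coordinate lists with matching vertex order, and the actual CIVET pipeline performs these steps internally.

```python
import math


def cortical_thickness(inner_vertices, outer_vertices):
    """Per-vertex thickness: Euclidean distance between linked inner/outer vertices."""
    if len(inner_vertices) != len(outer_vertices):
        raise ValueError("inner and outer surfaces must have the same number of vertices")
    return [math.dist(p, q) for p, q in zip(inner_vertices, outer_vertices)]


def intracranial_volume(n_mask_voxels, voxel_size_mm):
    """ICV in mm^3: number of voxels inside the brain mask times the voxel volume."""
    dx, dy, dz = voxel_size_mm
    return n_mask_voxels * dx * dy * dz


# Toy example with two linked vertex pairs (coordinates in mm).
inner = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
outer = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
thickness = cortical_thickness(inner, outer)
mean_thickness = sum(thickness) / len(thickness)
```

In a downstream analysis, the ICV value would enter the model as a covariate alongside the native-space thickness, as described in the text.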
Machine learning modelling approaches
To identify patients with Aβ positivity, logistic regression and four supervised learning algorithms were used in two-phase modelling. Model 1 incorporated the following variables: age, education, gender, diabetes, hypertension, APOE status, and neuropsychological test scores. Model 2 combined the variables of model 1 with additional MRI features. We used hippocampal volume and cortical thickness in each lobe (cingulate, frontal, temporal, parietal, and occipital) as well as the interaction between hippocampal volume and cortical thickness as additional MRI features. Further, to investigate whether the better performance of model 2 was due to the addition of MRI features to model 1 or was almost entirely attributable to the MRI features themselves, we also examined an MRI model, which included age, education, gender, diabetes, hypertension, APOE status, and MRI features. In addition to logistic regression, we developed and validated the following four ML approaches: extreme gradient boosting (XGBoost), random forest (RF), support vector machines (SVM), and gradient boosting machine (GBM). Previous large-scale studies have consistently suggested these models to be robust ML algorithms [13, 34]. RF is a supervised tree ensemble model that creates multiple decision trees using bootstrap samples and aggregates their decisions by averaging or majority voting [35, 36]. GBM is another tree-based ensemble approach, deriving strong predictions from weak learners [37]. GBM generates accurate classifiers using a linear combination of base classifiers, which are iteratively adjusted by their weights. SVM provides a binary prediction based on the hyperplane with the maximum margin [38]. XGBoost is designed to speed up the computation of tree boosting by automatically using parallel processing [39]. XGBoost delivers high performance and improves model generalization using advanced L1 and L2 regularization.
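To make the relationship between the model 1 and model 2 feature sets concrete, the sketch below extends a model 1 feature record with the MRI features named in this section. It assumes that each interaction term is the simple product of lobar cortical thickness and hippocampal volume; the text does not specify how the interaction was coded, and the function and key names are hypothetical.

```python
LOBES = ("cingulate", "frontal", "temporal", "parietal", "occipital")


def add_mri_features(model1_features, hippocampal_volume, thickness_by_lobe):
    """Return a model 2 feature record: model 1 features plus MRI features.

    Interaction terms are coded as thickness * hippocampal volume (an assumption).
    """
    features = dict(model1_features)  # copy so the model 1 record is left untouched
    features["HV"] = hippocampal_volume
    for lobe in LOBES:
        thickness = thickness_by_lobe[lobe]
        features[f"{lobe}_thickness"] = thickness
        features[f"{lobe}*HV"] = thickness * hippocampal_volume
    return features
```

Keeping the model 1 record unchanged makes it straightforward to fit both feature sets on the same participants and compare them on identical cross-validation partitions.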
Further information regarding the usefulness and effectiveness of these algorithms is provided in Supplementary Method 1. We examined the rate of missingness in each variable to check whether a large portion of observations was missing. In addition, we added dichotomous variables as indicators of missingness, because some ML algorithms can detect patterns in the missingness without imputation. Since no missing patterns were observed, we imputed the very few missing observations. Missing observations for each variable in the training set were assigned a value obtained using K-nearest neighbor (KNN) [40]. KNN is one of the simplest nonparametric methods; it finds the k nearest neighbors of an observation in the training dataset. Nearness is measured by Euclidean distance, and k is a user-defined positive integer, typically a small constant. No highly correlated features or near-zero variance predictors were observed in the model. In classification, imbalance between cases and controls biases predictions toward the majority class, so model performance can be misleading. Moreover, the minority class is often treated as outliers, which decreases prediction accuracy. Therefore, we checked for class imbalance in the training cohort. An imbalanced data set can be managed by adjusting the decision threshold or by rebalanced sampling approaches, such as oversampling, undersampling, and hybrid sampling. We used four-fold cross-validation across 30 repetitions in the training cohort. For a fair comparison, the same cross-validation (CV) data partitions were used across all ML models and performance was estimated as the arithmetic mean of the outcomes. Previous studies have shown that four folds is the optimal K in K-fold CV [41].
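Two steps in this paragraph lend themselves to a short illustration: KNN imputation with Euclidean distance, and repeated four-fold partitions that are shared across models. The sketch below is a simplified stand-in, not the authors' R implementation; the function names, the choice of k, the use of the neighbor mean as the imputed value, and the fixed seed are all assumptions.

```python
import math
import random


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Distance is Euclidean over the features observed in both rows.
    """
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    dist = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


def repeated_kfold(n_samples, n_folds=4, n_repeats=30, seed=0):
    """Return n_repeats shuffled partitions of range(n_samples) into n_folds folds.

    A fixed seed lets every model reuse exactly the same partitions.
    """
    rng = random.Random(seed)
    partitions = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        partitions.append([indices[f::n_folds] for f in range(n_folds)])
    return partitions
```

Generating the partitions once and passing them to every algorithm is one simple way to ensure the comparison across models is made on identical training and validation splits.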
To determine the optimal model parameter values that led to the best performance, we used Hyperband as the parameter search criterion [42], which randomly samples candidate configurations, evaluates them on holdout samples, and searches for the optimal parameters using a bandit-based resource allocation strategy. The optimized model was then evaluated on the test data set, which was the external validation cohort for model 1 and the outer-loop test set from nested cross-validation for model 2. The whole process was repeated 30 times, based on previous guidelines, and the performances were then averaged.
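Hyperband builds on successive halving: many candidate configurations are evaluated with a small budget, the best fraction is kept, and the budget is increased for the survivors. The sketch below shows a single successive-halving bracket only, which is a simplification of the full Hyperband procedure; the `evaluate` callback, the `eta` value, and the budget units are hypothetical.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Run one successive-halving bracket and return the surviving configuration.

    evaluate(config, budget) must return a holdout loss (lower is better).
    Each round keeps the best 1/eta of the configurations and multiplies the budget by eta.
    """
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda config: evaluate(config, budget))
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]
```

In practice, `evaluate` would train the model with the given configuration for `budget` units (for example, boosting iterations) and return the loss on a holdout sample.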
Statistical analysis
Model performance evaluation of all ML approaches
Model performance was evaluated using the Brier score, which is the mean of the squared errors, on the training, validation, and test sets. The prediction performance on the validation and test sets is reported in Table 2 [43]. The Brier score is the average squared difference between the predicted probabilities and the actual outcomes [44]. A smaller Brier score indicates a better prediction [45]. The models were also evaluated using other performance metrics, including area under the receiver operating characteristic curve (AUROC), overall accuracy (ACC), sensitivity (recall), specificity, positive predictive value (PPV or precision), negative predictive value (NPV), F1, balanced accuracy (BA), logarithmic loss, and Kappa [9, 46] (Supplementary Table 2). We compared AUROC across the ML approaches using the DeLong test [18]. The ROC curve in Fig. 2 is a graphical plot that represents the trade-off between sensitivity and specificity for every possible cut-off value. Patients with Aβ (+) aMCI were regarded as the positive class, and those with Aβ (–) aMCI as the negative class. Precision-recall (PR) curves were estimated to compare model 1 and model 2 (Supplementary Figures 1 and 2). The PR curve shows precision across recall levels and provides a more robust measure than AUROC when the data are highly imbalanced. Additional ML performance measures derived from the confusion matrix were calculated at a fixed decision threshold of 0.5 (Supplementary Table 3). The prediction probability plots were created by plotting the observed predictive probability over the range of 0 to 1 in each class (Supplementary Figures 3 and 4).
The predictive probability plot demonstrates that the model performed well with high predictive power to each class. For further details of the definition of other performance metrics, see Supplementary Method 1.
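The main metrics named in this section can be written out directly from their definitions. The following is a minimal sketch with hypothetical function names; it computes the Brier score, the AUROC as the probability that a positive case is ranked above a negative case, and confusion-matrix measures at the fixed threshold of 0.5. It does not guard against empty classes.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auroc(y_true, y_prob):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, PPV, NPV, and balanced accuracy at a fixed threshold."""
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, d in zip(y_true, pred) if y == 1 and d == 1)
    tn = sum(1 for y, d in zip(y_true, pred) if y == 0 and d == 0)
    fp = sum(1 for y, d in zip(y_true, pred) if y == 0 and d == 1)
    fn = sum(1 for y, d in zip(y_true, pred) if y == 1 and d == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,  # also called recall
        "specificity": specificity,
        "ppv": tp / (tp + fp),       # also called precision
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }
```

Here the positive class (label 1) corresponds to Aβ (+) aMCI and the negative class (label 0) to Aβ (–) aMCI.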

Schematic representation of internal-external cross-validation and external validation. Model 1 included clinical demographics, APOE genotype, and neuropsychological test features. Model 2 included features from model 1 and additional brain MRI features.

Receiver Operating Characteristic (ROC) curve of four machine learning approaches using model 1 and model 2. Model 1 included clinical demographics, APOE genotype, and neuropsychological test features. Model 2 included features from model 1 and additional brain MRI features. XGBoost, extreme gradient boosting; RF, random forest; SVM, support vector machine; GBM, gradient boosting machine; logistic, logistic regression.

Most influential input features for GBM across Model 1 and model 2. Model 1 included clinical demographics, APOE genotype, and neuropsychological test features. Model 2 included features from model 1 and additional brain MRI features. A) Top 15 influential features from GBM using baseline predictors; B) Top 15 influential features from GBM using baseline and MRI predictors. SVLT delay, delayed recall task of the Seoul Verbal Learning Test; RCFT delay, delayed recall task of the Rey-Osterrieth Complex Figure Test; KBNT, Korean version of the Boston Naming Test; RCFT immed, immediate recall task of the Rey-Osterrieth Complex Figure Test; RCFT copy, copy task of Rey-Osterrieth Complex Figure Test; COWAT a, Controlled Oral Word Association Test of animals; RCFT recog, recognition task of the Rey-Osterrieth Complex Figure Test; SVLT immed, immediate recall task of the Seoul Verbal Learning Test; CDR SOB, Clinical Dementia Rating sum of boxes; SVLT recog, recognition task of the Seoul Verbal Learning Test; Temporal*HV, interaction between the temporal cortical thickness and hippocampal volume; Parietal*HV, interaction between the parietal cortical thickness and hippocampal volume; HV, hippocampal volume; GBM, gradient boosting machine.
Performance evaluation of the selected ML approach
The ML algorithm with the highest model performance was chosen as the optimal model, and additional performance measures were evaluated. To determine whether a variable had a positive or negative effect on the prediction, we also evaluated partial dependence plots (PDPs). For each model, variable importance was estimated to establish which independent variables were influential for accurate classification [47], using the mean decrease in accuracy (MDA) or the Gini index [35]. Influential variables were ranked by their relative importance values, which are calculated from the difference in error when the variable splits the subset space. We report the 15 most influential features in Fig. 3. In tree-based models such as GBM and RF, when a variable splits the tree, the relative importance value of that variable is estimated from the difference in squared error loss over all trees. A higher relative importance value indicates a greater influence of the variable on classifying Aβ positivity. The PDP is a graphical representation tool proposed by Friedman [37]. It provides information on whether a feature is positively or negatively associated with the final prediction. To avoid over-weighted or under-weighted results, Min-Max normalization [48] was conducted during the PDP process. Details related to the PDP are given in Supplementary Method 2.
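The idea behind a partial dependence curve can be shown in a few lines: the feature of interest is forced to each grid value in turn, and the model's predictions are averaged over all observations. The sketch below is illustrative only; the `predict` callback stands in for any fitted model, and applying Min-Max normalization to the resulting curve is an assumption, since the exact point at which normalization was applied is described in Supplementary Method 2 rather than here.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over all rows with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Toy model: the prediction rises with feature 0, so the curve is increasing.
toy_rows = [[0, 1], [0, 3]]
toy_curve = partial_dependence(lambda r: 2 * r[0] + r[1], toy_rows, 0, [0, 1, 2])
toy_scaled = min_max(toy_curve)
```

A decreasing curve, as reported for the delayed recall scores, would indicate that higher values of the feature are associated with a lower predicted probability of Aβ positivity.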
To compare the demographic and clinical characteristics, a two-tailed t-test was used for continuous features, and a chi-square test was used for categorical features. The level of statistical significance was set as two-tailed p value <0.05. All statistical analyses were conducted in R software, version 3.6.3 (R Project for Statistical Computing) [49].
RESULTS
Study cohort characteristics
The clinical, neuropsychological, and neuroimaging characteristics of the study participants (172 Aβ (–) and 170 Aβ (+) aMCI) in the training cohort are shown in Table 1. There were no differences in the mean age (p = 0.536), proportion of females (p = 0.908), years of education (p = 0.970), or presence of hypertension (p = 0.103) between the Aβ (–) and Aβ (+) aMCI patients. Patients with Aβ (+) aMCI had a higher frequency of APOE4 genotypes (p < 0.001) and higher mean CDR-SOB scores (p < 0.001) than patients with Aβ (–) aMCI.
Demographics and clinical characteristics of the study participants
Values are presented as mean±standard deviation or number (percentage). Aβ (–), amyloid negative; Aβ (+), amyloid positive; N, number of patients whose data were available for analysis; DM, diabetes mellitus; CDR-SOB, Clinical Dementia Rating Sum of Boxes; DSF, Digit Span Test Forward; NP test, neuropsychological tests; DSB, Digit Span Test Backward; K-BNT, Korean version of the Boston Naming Test; RCFT, Rey-Osterrieth Complex Figure Test; SVLT, Seoul Verbal Learning Test; IM, immediate recall; DR, delayed recall; RE, recognition; COWAT, Controlled Oral Word Association Test; CR, color reading; HV, hippocampal volume. *Significant difference at p < 0.05 between Aβ (–) and Aβ (+) in the training set or external validation set. †Significant difference at p < 0.05 between training set and external validation set.
The clinical and neuropsychological characteristics of the study participants (65 Aβ (–) and 122 Aβ (+) aMCI) in the validation cohort are also shown in Table 1. When comparing the characteristics between study participants in the training cohort and external validation cohorts, patients in the training cohort were of an older age (p < 0.001) and had more years of education (p < 0.001) than patients in the external validation cohort.
In our data set, no class imbalance was observed: the positive class was 49.7% and the negative class was 50.3% in the training cohort. Thus, neither rebalanced sampling approaches nor an adjusted decision threshold was applied to the training data set. In the training data set, a total of four missing observations were present in the neuropsychological test score variables, a missing rate of 1.2%. In the test data set, one missing observation was detected, a missing rate of less than 1%. In addition, no specific pattern of missingness was observed.
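The percentages in this paragraph follow from the cohort counts reported above. The short check below reproduces them, assuming that the missing rates are expressed per participant (an assumption; the denominator is not stated explicitly). Variable names are illustrative.

```python
def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


training_n = 172 + 170      # Abeta (-) and Abeta (+) in the training cohort
validation_n = 65 + 122     # Abeta (-) and Abeta (+) in the external validation cohort

positive_share = percent(170, training_n)       # share of the positive class
negative_share = percent(172, training_n)       # share of the negative class
missing_rate_training = percent(4, training_n)  # four missing observations
missing_rate_test = percent(1, validation_n)    # one missing observation
```

With these counts, the class shares come out at roughly one half each, which is why no rebalancing was needed.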
Model performance and comparisons of modelling approaches
Prediction performance metrics and comparisons between logistic regression and the other ML approaches are presented in Table 2. With model 1, GBM (Brier, 0.179; AUROC, 0.837; 95% confidence interval [CI], 0.794–0.881) and XGBoost (Brier, 0.187; AUROC, 0.822; 95% CI, 0.777–0.866) outperformed RF (Brier, 0.205; AUROC, 0.761; 95% CI, 0.708–0.809) and SVM (Brier, 0.208; AUROC, 0.784; 95% CI, 0.701–0.805). With the same validation data set, the AUROC of the ML approaches increased when MRI variables were added. A comparison of performance metrics between model 1 and model 2, in which MRI variables were added, is shown in Supplementary Table 4. In model 2, GBM (Brier, 0.138; AUROC, 0.892; 95% CI, 0.858–0.926) and SVM (Brier, 0.149; AUROC, 0.877–0.841) outperformed XGBoost (Brier, 0.157; AUROC, 0.856; 95% CI, 0.816–0.896) and RF (Brier, 0.166; AUROC, 0.838; 95% CI, 0.796–0.879). Among the five models examined, logistic regression consistently achieved the lowest performance. The AUROC of logistic regression was 0.751 (95% CI, 0.684–0.791) with model 1 variables and 0.784 (95% CI, 0.741–0.827) with model 2 variables. A comparison of AUROC between logistic regression and the other ML approaches by DeLong’s test showed improved performance across all ML approaches. The precision-recall plots (Supplementary Figures 3 and 4) indicated that GBM performed well across the two models. For the MRI model, prediction performance metrics and comparisons between logistic regression and the other ML approaches are presented in Supplementary Table 5.
Comparison of performance metrics across the machine learning models
Model 1 included clinical demographics, APOE genotype, and neuropsychological test features. Model 2 included features from model 1 and additional brain MRI features. Logistic, logistic regression; XGBoost, extreme gradient boosting; RF, random forest; SVM, support vector machine; GBM, gradient boosting machine. a: p values for the DeLong test comparing area under the receiver operating characteristic curves for different models with logistic regression. b: Best performance with respect to the metric (lowest Brier score or highest AUROC [C statistic]). *Significant improvement at p < 0.05 compared to model 1 by DeLong test.
A discrepancy in model performance was noted for all models using the external validation cohort or nested cross-validation strategies. When using model 1 in the external validation cohort, GBM performed better than the other approaches in terms of AUROC (0.765; 95% CI, 0.690–0.840). XGBoost (AUROC, 0.815; 95% CI, 0.759–0.852) and SVM (AUROC, 0.776; 95% CI, 0.713–0.814) also outperformed logistic regression when using the model 2 variables. Logistic regression had the lowest AUROC, 0.672 (95% CI, 0.593–0.751) with model 1 and 0.736 (95% CI, 0.663–0.808) with model 2. Based on these findings, GBM was selected as the optimal model among the five approaches.
Further evaluation of GBM
Figure 3 shows the 15 most influential variables identified by the GBM model, and APOE was selected as the most important feature from both model 1 and model 2. In model 1, scores in the delayed recall tasks of SVLT and RCFT were identified as influential features, followed by K-BNT and Stroop test scores. In model 2, the interaction of cortical thickness of the temporal lobe and hippocampal volume was identified as the most important MRI feature, followed by the interaction of cortical thickness of the parietal lobe and hippocampal volume. Scores in the delayed recall tasks of SVLT and RCFT, and the immediate recall task of RCFT, were also selected as the important features for discriminating amyloid-positive from amyloid-negative patients. Figure 4 represents the PDP of the top three most influential features, which were estimated using the variable importance process of GBM across models 1 and 2. As expected, APOE4 carriers were shown to have a higher risk of amyloid positivity. SVLT delay and RCFT delayed recall scores were negatively related to the accumulation of amyloid. Interaction between the temporal lobe and hippocampal volume was also negatively associated with the accumulation of amyloid.

Partial dependence plots (PDPs) for GBM across Model 1 and model 2. A) PDP of the top three influential features of the GBM using baseline predictors; B) PDP of the top three influential features of the GBM using baseline and MRI predictors. SVLT delay, delayed recall task of Seoul verbal learning test; RCFT delay, delayed recall task of Rey-Osterrieth complex figure test; Temporal*HV, interaction between temporal cortical thickness and hippocampal volume; GBM, gradient boosting machine.
DISCUSSION
In this study, we developed two versions of ML-based prediction models for Aβ positivity using clinical demographics, APOE genotype, neuropsychological tests, and MRI features in participants with aMCI. The major findings of the present study were as follows. First, model 1, which included demographics, APOE genotype, and neuropsychological tests as the predictors, showed good accuracy in the prediction of Aβ positivity (AUROC 0.837) and was validated using an independent population with a fair accuracy (AUROC 0.765). Second, the addition of MRI features to model 1 led to an improvement in the prediction performance. Consequently, model 2 showed good accuracy (AUROC 0.892). Finally, APOE genotype, scores of delayed recall tasks, and interaction between cortical thickness in the temporal region and hippocampal volume were selected as the most important features among the demographics, neuropsychological tests, and MRI findings, respectively. Taken together, our results suggest that these prediction models are effective in predicting Aβ positivity at the individual level.
We selected four ML approaches to develop our prediction models, for the following reasons. First, GBM and RF have consistently outperformed other algorithms in many large-scale studies [13, 34]. Second, generalizability could be improved by comparing their results, because GBM and RF have complementary learning algorithms: GBM performs well even when the dataset is imbalanced but may not cope well with noisy data, whereas RF handles outliers better. Third, we compared the interpretable outputs of the models to address reliability issues. Tree-based models provide interpretable measures such as variable importance and PDP. Variable importance ranks each feature by its importance for the classification. PDP can estimate whether a variable is positively or negatively associated with the prediction using a marginal distribution [50]. Therefore, the interpretable results can be compared across models. Lastly, advanced ML approaches have been developed with the exponential growth of computing systems and cloud platforms. XGBoost is one of these newly developed approaches and has been found to produce consistently strong performance. In this study, we developed and tested an XGBoost model to predict amyloid positivity and compared its performance with that of the other ML models in our clinical setting.
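The partial dependence idea described above can be illustrated with a minimal sketch. It does not reproduce the GBM used in this study: the model `toy_model`, its coefficients, the feature names `apoe4` and `svlt_delay`, and the simulated patients are all hypothetical and serve only to show how a PDP averages predictions over the data while one feature is held at a fixed value.

```python
import math
import random

def toy_model(row):
    """Hypothetical risk model (NOT the study's GBM); coefficients are made up."""
    z = -0.5 + 1.5 * row["apoe4"] - 0.3 * row["svlt_delay"]
    return 1.0 / (1.0 + math.exp(-z))

def partial_dependence(predict, rows, feature, grid):
    """For each grid value, force `feature` to that value in every row
    and average the model predictions (the marginal effect of the feature)."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve

# Simulated patients; the value ranges are illustrative assumptions.
rng = random.Random(0)
rows = [
    {"apoe4": rng.randint(0, 1), "svlt_delay": rng.uniform(0, 12)}
    for _ in range(200)
]

pd_apoe = partial_dependence(toy_model, rows, "apoe4", [0, 1])
pd_svlt = partial_dependence(toy_model, rows, "svlt_delay", [0, 3, 6, 9, 12])

print("PDP for APOE4 (non-carrier, carrier):", pd_apoe)
print("PDP for SVLT delayed recall:", pd_svlt)
```

In this toy setting the curve rises from non-carrier to carrier and falls as the delayed recall score increases, which is the kind of directional reading a PDP supports.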
Our first major finding was that model 1, using demographics, APOE genotype, and neuropsychological tests as predictors, showed good accuracy in the prediction of Aβ positivity (AUROC 0.837) and was validated using an independent population (AUROC 0.765). Several previous studies have proposed predictive models for Aβ positivity based on demographics, APOE genotype, and neuropsychological tests [16, 51–53]. Kander et al. reported an AUROC value of 0.64 using a combination of neuropsychological tests [52]. Our previous study used demographics, APOE genotype, and neuropsychological tests as predictors and achieved an AUROC value of 0.74 [16]. Ezzati et al. developed a machine learning model and reported an AUROC value of 0.72 using similar variables [53]. A common limitation of the previous studies was the absence of external validation using an independent dataset. In the present study, the generalizability of model 1 was supported by external validation: its performance was replicated in the independent dataset, supporting its suitability for the prediction of Aβ positivity in new patients.
The second major finding was that the addition of MRI features to model 1 led to an improvement in the prediction performance, and model 2 showed good accuracy (AUROC 0.892). Considering that the MRI model showed only fair accuracy (AUROC 0.750), the increased prediction performance was attributable to the combined effect of neuropsychological and MRI features. Previous studies did not improve performance by adding MRI information to the demographic and neuropsychological test data [52, 53]. However, recent ML studies have shown that interaction effects may lead to the construction of highly effective ML models in practice, because they can be further expanded in a hierarchical order [54–56]. Unlike models built on previously developed MRI features, our models, which were trained by incorporating the interaction effects as new variables, achieved superior predictive performance. Although model 2 showed higher accuracy than model 1, acquisition of MRI information is relatively difficult and costly compared to baseline demographics and neuropsychological test results alone. Thus, depending on the clinical situation, clinicians could choose one of the two prediction models.
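As an illustration of incorporating an interaction effect as a new variable, the following sketch adds a product term to each patient record. The exact construction used in this study is not restated here, so the plain product, the helper `add_interaction`, the column names, and the numeric values are assumptions made for illustration only.

```python
def add_interaction(rows, a, b, name):
    """Return copies of `rows` with an extra column `name` holding a * b.
    The plain product is an assumed form of the interaction term."""
    return [{**row, name: row[a] * row[b]} for row in rows]

# Hypothetical values (thickness in mm, volume in mm^3), not study data.
patients = [
    {"temporal_thickness": 3.0, "hippocampal_volume": 3000.0},
    {"temporal_thickness": 2.5, "hippocampal_volume": 2400.0},
]

augmented = add_interaction(
    patients, "temporal_thickness", "hippocampal_volume", "temporal_x_hv"
)

for row in augmented:
    print(row)
```

A product term of this kind is low only when both measures are reduced together, so a classifier can split on joint atrophy with a single variable.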
An important finding from our study was that APOE genotype, scores of delayed recall tasks, and the interaction between cortical thickness in the temporal region and hippocampal volume were selected as the most important features among the demographics, neuropsychological test results, and MRI findings, respectively. APOE4 is a well-known risk factor for Aβ deposition [57, 58]. Many previous studies have shown that the presence of APOE4 is associated with Aβ deposition, as measured by amyloid PET or CSF levels of Aβ [2, 59]. These studies were in accordance with our results. Regarding neuropsychological tests, low scores in the delayed recall tasks of the SVLT and RCFT, which were used to evaluate episodic memory, were important predictors of Aβ positivity. It is well known that memory impairment could discriminate a prodromal stage of AD from MCI [60–62]. Previous PET and CSF studies also showed that an impaired memory profile was the most important neuropsychological finding in discriminating between Aβ (+) and Aβ (–) patients with aMCI [52, 63]. In terms of MRI findings, the interaction between cortical thickness in the temporal region and hippocampal volume, rather than absolute cortical thickness in certain regions, was an important feature for the prediction of Aβ positivity. Hippocampal atrophy is one of the best established and validated biomarkers of AD [21, 64] and has been used in previous studies to evaluate the progression of Aβ deposition across the whole AD spectrum [65]. However, hippocampal atrophy is known to lack specificity and sensitivity for Aβ (+) aMCI, since it can be observed in other forms of dementia and MCI including vascular dementia [66], semantic dementia [67], and limbic-predominant age-related TDP-43 encephalopathy (LATE) [68]. In this regard, in addition to hippocampal atrophy, it is important to consider the neocortical atrophy pattern when predicting Aβ positivity.
In the spectrum of AD, neocortical atrophy occurs in the medial temporal area at a very early stage, and spreads throughout the remainder of the neocortex soon after, in the following order: temporal, parietal, and frontal cortices; the primary motor and sensory areas are spared until the late stages of the disease [69–71]. In contrast, vascular dementia leads to neocortical atrophy in the frontal and peri-sylvian regions [72]. Semantic dementia and LATE result in focal cortical atrophy only in the anterior temporal regions and medial temporal regions until the moderate stages, respectively [67, 68]. Thus, hippocampal atrophy along with temporal atrophy may be a more specific imaging biomarker for Aβ (+) aMCI than hippocampal or temporal atrophy alone.
The strength of our study is that we developed and validated risk prediction models of Aβ positivity for aMCI through optimal interpretable ML approaches with multimodal markers. However, our study has several limitations that should be discussed. First, we did not consider the Aβ (–) aMCI patients who would convert to become Aβ positive, and aMCI patients who converted to AD. This limitation is mitigated to a certain extent by the previous studies, which revealed that only a small number of Aβ (–) aMCI patients convert to Aβ (+) status [6, 73] and that annual rates of increase in Aβ uptake are very low [74, 75]. Second, the models need further validation using populations of individuals with cognitive impairment recruited from the primary care setting, because the characteristics of individuals visiting the memory clinic in tertiary hospitals differ from those of individuals visiting primary care providers. Third, although postmortem brain autopsy is regarded as the gold standard for detecting Aβ deposition in the brain, we used the visual assessment of Aβ PET as the standard of truth in our models. A previous study showed that the visual assessment of Aβ PET had high agreement with autopsy findings [76], and many researchers have used the results of Aβ PET as the standard of truth to build prediction models. Moreover, visual assessment was highly concordant with standardized uptake value ratio cut-off categorization for Aβ deposition [77, 78]. Fourth, participants with MCI in our dataset underwent amyloid PET with two different 18F-labelled tracers or with different methods of defining Aβ positivity. However, this limitation might be mitigated by previous results showing that the accuracies of visual assessment and quantitative assessment in evaluating Aβ positivity were comparable [79, 80]. In addition, our recent study investigated the concordance rate for Aβ positivity between FBB and FMM in 107 participants who underwent both FBB and FMM PET for Aβ deposits.
High agreement rates were found between FBB and FMM in visual assessment (94.4%) and SUVR cut-off categorization (98.1%). In addition, both FBB and FMM showed high agreement rates between visual assessment and SUVR cut-off categorization (93.5% in FBB and 91.6% in FMM) [81]. Fifth, we did not use plasma biomarkers, which could contribute to better performance of the prediction model. One recent study showed that a composite plasma biomarker generated by combining the levels of Aβ A4 precursor protein/Aβ42 and Aβ40/Aβ42 could differentiate Aβ (+) from Aβ (–) with good accuracy (AUROC 0.883) [82]. Ashton et al. developed a machine learning model and reported an AUROC value of 0.891 using plasma biomarker variables [17]. Further study is needed to improve the performance of our model by the addition of plasma biomarker features. Sixth, we could not validate model 2 in the external validation set, because we could not obtain the MRI dataset from the other centers. This limitation is mitigated to a certain extent by testing the performance of model 2 on the outer-loop test set from nested cross-validation. Nested cross-validation is a common approach that chooses the classification model and features to represent a given outer loop based on the features that give the maximum inner-loop accuracy [83]. Differential privacy in the outer loop is a related technique for avoiding overfitting that uses a privacy-preserving noise mechanism to identify features that are stable between training and holdout sets [83]. Nevertheless, our study is notable since our prediction models showed relatively high accuracy for Aβ prediction compared to previous studies, and model 1 was reproducible in an independent dataset. Further, we suggest that the two proposed models could be useful in two different situations: model 2 for clinical AD trials, and model 1 for primary care.
In clinical trials targeted at Aβ (+) aMCI patients, model 2 could reduce the number of patients who have to undergo Aβ PET or lumbar puncture. In the primary care setting, model 1 could be applied at the primary examination to identify individuals at high risk of Aβ deposition who should be referred to tertiary hospitals for further evaluation.
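The nested cross-validation mentioned among the limitations can be sketched as follows. This is a minimal illustration, not the study's pipeline: the one-feature k-nearest-neighbour classifier, the candidate values of k, the inner fold count, the simulated data, and the use of accuracy instead of AUROC are all assumptions chosen to keep the example short. The point it shows is that tuning happens only inside the outer-training data, so each outer test fold stays untouched until the final evaluation.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (one feature)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, test, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def split(data, folds, held_out):
    """Return (training rows, held-out rows) for one fold."""
    test = [data[j] for j in folds[held_out]]
    train = [data[j] for f, fold in enumerate(folds) if f != held_out for j in fold]
    return train, test

def nested_cv(data, candidate_ks, outer_k=4, inner_k=3, seed=0):
    outer_folds = k_fold_indices(len(data), outer_k, seed)
    outer_scores = []
    for i in range(outer_k):
        train, test = split(data, outer_folds, i)
        # Inner loop: choose k using the outer-training data only.
        inner_folds = k_fold_indices(len(train), inner_k, seed + 1)

        def inner_score(k):
            scores = [accuracy(*split(train, inner_folds, m), k)
                      for m in range(inner_k)]
            return sum(scores) / len(scores)

        best_k = max(candidate_ks, key=inner_score)
        # Outer loop: evaluate the tuned model on the untouched fold.
        outer_scores.append(accuracy(train, test, best_k))
    return outer_scores

# Simulated two-class data: one noisy feature centred on the class label.
rng = random.Random(42)
data = [(rng.gauss(label, 1.0), label) for label in (0, 1) for _ in range(40)]

scores = nested_cv(data, candidate_ks=[1, 3, 5, 7])
print("outer-fold accuracies:", scores)
print("mean:", sum(scores) / len(scores))
```

The mean of the outer-fold scores is the performance estimate; the value of k selected may differ between outer folds, which is expected in this design.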
In conclusion, we developed two ML models for Aβ prediction and validated these models externally to ascertain their generalizability. In clinical practice, these proposed ML models could help the biomarker-guided diagnosis of prodromal AD through the prediction of Aβ positivity at the individual level.
Footnotes
ACKNOWLEDGMENTS
This study was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI19C1132) and the Building of Artificial Intelligence datasets for Alzheimer’s diagnosis through brain wave imaging, funded by the National Information Society Agency (NIA, Korea).
