Abstract
Background:
The transition from mild cognitive impairment (MCI) to dementia is of great interest to clinical research on Alzheimer’s disease and related dementias. This phenomenon also serves as a valuable data source for quantitative methodological researchers developing new approaches for classification. However, the growth of machine learning (ML) approaches for classification may falsely lead many clinical researchers to underestimate the value of logistic regression (LR), which often demonstrates classification accuracy equivalent or superior to other ML methods. Further, when faced with many potential features that could be used for classifying the transition, clinical researchers are often unaware of the relative value of different approaches for variable selection.
Objective:
The present study sought to compare different methods for statistical classification and for automated and theoretically guided feature selection techniques in the context of predicting conversion from MCI to dementia.
Methods:
We used data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) to evaluate different influences of automated feature preselection on LR and support vector machine (SVM) classification methods, in classifying conversion from MCI to dementia.
Results:
The present findings demonstrate how similar performance can be achieved using user-guided, clinically informed pre-selection versus algorithmic feature selection techniques.
Conclusion:
These results show that although SVM and other ML techniques are capable of relatively accurate classification, similar or higher accuracy can often be achieved by LR, mitigating SVM’s necessity or value for many clinical researchers.
INTRODUCTION
Alzheimer’s disease (AD) is a progressive, age-related, neurodegenerative disease and the most common cause of dementia [1–3]. Behaviorally, AD is commonly preceded by mild cognitive impairment (MCI), a syndrome characterized by declines in memory and other cognitive domains that exceed cognitive decrements associated with normal aging [2, 4]. However, the prodromal symptoms of MCI are not prognostically deterministic: individuals with MCI tend to progress to diagnoses of probable AD at a rate of 8%–15% per year, and many conversions are detectable within 3 years of initial presentation [5–7]. Research efforts to provide new insights into the incidence of MCI-to-AD conversion have focused largely on clinically or biologically relevant features (e.g., neuroimaging markers, clinical exam data, neuropsychological test scores) and on different methods for statistical classification [8].
For clinical researchers, however, there may be a tendency to conflate more sophisticated, novel analytic approaches and the value of multimodal information from neuroimaging and clinical assessment. Moreover, whereas statisticians may inherently understand the comparability of different quantitative approaches, the novelty of both big data and data-driven approaches for studying MCI-to-AD conversion may lead clinical researchers to assume that such data-driven methods are inherently superior to more theoretically grounded approaches. Thus, the value of using extant findings and domain expertise to help guide and constrain the application of newer data-driven approaches capable of capitalizing on emergent big data may be a particularly important consideration for clinical researchers.
Statistical classification in clinical research has traditionally utilized binary logistic regression (LR). However, key attributes of modern clinical and neuroimaging data, including high dimensionality and the presence of ground truth estimates of pathology and diagnosis, provide new opportunities for quantitative research. This has led to a substantial expansion in the use of data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI; http://adni.loni.usc.edu) for quantitative research and methodological development, particularly by researchers utilizing and developing prediction and classification methods in machine learning (ML). Besides LR, the support vector machine (SVM) has quickly become the most common type of ML classifier for diagnostic prediction and classification with ADNI data. In general, LR works well when the classes can be separated reasonably well by a linear boundary and the number of observations is greater than the number of features. Moreover, SVM and LR have shown similar misclassification rates (MCRs) when used to diagnose malignant tumors from imaging data [9, 10].
Indeed, before the rapid expansion of ML research and applied work over the past decade, many clinical researchers and those outside of engineering and mathematically intensive disciplines had little exposure to classification approaches other than LR. Despite its growing popularity, the relative benefits of SVM or other forms of ML [11, 12] over LR for such classification are not always apparent. Although this may be of little surprise to statisticians and quantitative researchers, such perspectives are often lost on clinical researchers, whose implicit beliefs in the superiority of ML are often driven by the volume of publications rather than by training or empirical demonstration.
Most efforts to develop new classification methods for prediction of MCI-to-AD conversion are well suited to integrate measures from multiple sources such as demographics, clinical rating scores, neuropsychological testing, neuroimaging, and genetic markers. However, identifying which combination of features most accurately classifies conversion from MCI to AD is a key challenge for ADNI, and may vary by method. The L1 norm regularization method (i.e., L1) is a widely used feature selection technique for LR and SVM. L1 is popular for addressing circumstances in which the number of features is quite large or even larger than the sample size. Despite some risk of abusing the statistical terminology, the problem is often generically referred to as the “small n, large p” or high dimensional problem. The L1 technique has dual impacts: the algorithm can (i) optimize a higher number of parameters in comparison to sample size, and (ii) reduce the effective number of parameters (i.e., perform variable selection). This powerful technique has been applied to ADNI data with LR [13]. However, L1 and other algorithmic feature selection methods used in ML suffer from one key limitation: they are agnostic to theoretical considerations, and as such, they cannot explain why selected features are meaningful and important to the model. When sampling from a large pool of features, the algorithmic approaches fail to consider prior knowledge of features and their associations with the relevant systems in variable selection. Therefore, domain expertise and prior knowledge may afford additive or differential value for choosing features and interpreting model results over algorithmic feature selection methods alone.
However, most real-world problems occur in the context of additional information about each potential feature and its conceptual relationship with the phenomenon being classified. Other than using L1 feature selection, manually trimming the list of potential predictor variables can also protect against over-fitting, and also offers potential insight into why selected features are important to the model. When guided by prior knowledge, user-guided or ‘manual’ feature selection may be a valuable additional step to help minimize potentially spurious effects. This perspective is frequently lost on applied researchers, as most commonly used variable selection algorithms are context-free—that is, they only look at relationships within the data set, and cannot factor in the wider meanings of variables. Furthermore, this also means that automated algorithms may identify relationships among a large number of predictor variables that are spurious and are unlikely to generalize outside the data set. Although there are a vast number of potential neuroimaging features in ADNI data, the present study focused only on regional brain volumes segmented from structural magnetic resonance imaging (MRI) data, the most common neuroimaging datatype for classifying MCI-to-dementia conversion. In contrast to prior studies that used a limited set of volumetric brain features, the present study utilized data generated by modern multi-atlas segmentation methods and analyses included up to 259 features—anatomically specific gray and white matter volumes. However, the large pool of extant findings from studies evaluating regional brain MRI volumetry in prediction and classification of MCI-to-dementia conversion using both limited and expansive feature sets also provides a valuable set of priors for relevant brain regions [14–19]. 
Thus, applied researchers are often left with the conundrum of more confirmatory approaches that use few regions in classification or more exploratory methods in which prior findings have little value.
The present study addressed two questions regarding commonly used classification approaches for predicting MCI-to-dementia conversion in multi-modal data from ADNI. First, we compared performance accuracy of binary LR with SVM in classifying MCI-to-dementia conversion. Second, we asked if applying prior knowledge in feature selection outperforms algorithmic variable selection alone. We hypothesized that 1) LR would perform comparably to SVM, and 2) user-guided variable selection would outperform algorithmic variable selection alone. This work is intended to demonstrate to clinical researchers the benefit of using ML in an informed fashion, rather than as a ‘black box’ that obscures clear interpretation. Moreover, we wish to emphasize that this study is not meant to highlight a novel innovation in quantitative methods, but rather to provide an important example to applied researchers regarding the comparable value of ML methods and importance of domain expertise in classification with ADNI data.
MATERIALS AND METHODS
Data used in the preparation of this article were obtained from the ADNI database (http://adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. For up-to-date information, see http://www.adni-info.org.
Determination of sensitive and specific markers of preclinical AD and MCI is intended to help researchers and clinicians develop new treatments and monitor their effectiveness, as well as reduce the time and cost of clinical trials. Data in the present study came from all sites across the U.S. and Canada. All ADNI study participants included in the present analyses were between 55 and 90 years old, spoke English or Spanish as their native language, and had a study partner who provided an independent assessment of functioning.
This study used a subset of the 819 participants from ADNI-1 diagnosed with MCI at baseline and for whom the data from demographic, clinical cognitive assessments, APOE4 genotyping, and MRI measurements were also available. To evaluate differences in classification performance due to participant inclusion and drop out, we subdivided the sample into two overlapping groups. After applying other criteria for inclusion, Group One included all patients whose follow-up period was at least 36 months (n = 265); Group Two consisted of all patients with follow-up assessments at 24 months (n = 308). Although the ADNI study protocol includes additional follow-up visits at 6-month intervals, the present study only evaluated baseline data for features (i.e., clinical, neuropsychological, brain volumetric) in classification analyses. In addition, identification of stable versus converting clinical outcomes only considered longer-term outcomes based on assessments at 2 and 3 years after baseline. The final samples included 265 and 308 study participants in Groups One and Two, respectively, who met criteria for inclusion. Both Groups included participants who were stable in their diagnosis (MCI-S) and those who converted to a diagnosis of dementia over the 2 or 3 years (MCI-C). Table 1 shows the participant characteristics. Diagnostic criteria for MCI included a Mini-Mental State Examination (MMSE) score at baseline between 24 and 30, a Clinical Dementia Rating (CDR) score of 0.5, and a subjective memory complaint, in addition to objective memory loss measured by education-adjusted scores on the Logical Memory II subscale of the Wechsler Memory Scale, generally preserved activities of daily living and no dementia. The diagnostic criteria for dementia were an MMSE score between 20 and 26, and a CDR score between 0.5 and 1.0. 
The clinical status of each participant diagnosed with MCI was re-assessed at each follow-up visit and updated to reflect one of several outcomes (e.g., MCI or dementia subtypes). The MCI-C and MCI-S group designations were based on this follow-up clinical diagnosis and coded as 1 for MCI-C or 0 for MCI-S in the classification analyses.
Sample Sizes by Timing and Diagnosis: Group One and Two
Table 1 shows the number of MCI-C, MCI-S, and total subjects in Groups One and Two. MCI-C patients outnumber MCI-S patients in both groups.
Data used in classification
Evaluation of extant reports of common predictors of conversion from MCI to dementia focused on dimensions of neuropsychological test performance, clinical assessment, genetic data, and regional brain volumes. In the present study, we first divided these variables into two sets of features, with all non-brain volumetric variables in one set and all variables representing regional brain volumes in a second set. In addition, we created a third set of features from the volumetry feature set that only included 26 of the 259 brain volumes. Henceforth, we refer to models that only include one of these three feature sets as ‘single-modality,’ whereas models that combine brain and non-brain feature sets are referred to as ‘multi-modal.’
Clinical cognitive assessment and genetic data
We considered a total of 19 clinical features as potential predictors of MCI-to-AD progression in our classification analyses. These included the following assessment scores: the MMSE, CDR-Sum of Boxes, Alzheimer’s Disease Assessment Scale-cognitive sub-scale (ADAS-cog), Functional Activities Questionnaire (FAQ) measures of activities of daily living, Trail Making Test-B (TRABSCOR), the immediate and delayed recall components of the Rey Auditory Verbal Learning Test (RAVLT), the Digit-Symbol Coding test (DIGT), and the Digit Symbol Substitution Test from the Preclinical Alzheimer Cognitive Composite (mPACCdigit). We also considered genotype for carriers of the epsilon-4 allele of the apolipoprotein E (APOE) gene [8] as a genetic predictor in this study. Table 2 summarizes all 19 clinical, demographic, and genetic features used in this study. Preliminary comparison of six clinical and genetic predictors by MCI-C and MCI-S subgroups showed that five of them (APOE4, ADAS4, CDR, MMSE, and RAVLT.learning) differed significantly between the groups, whereas one (SEX) did not. Figures 1 and 2 illustrate the distribution of these predictors for both groups. Overall, in comparison to MCI-S participants, those in the MCI-C group were more cognitively and functionally impaired at baseline, exhibited greater verbal memory impairments, and included a greater proportion of APOE4 carriers.
Clinical Features and Cognitive Assessment Score of Group One
Data are shown for Group One only (n = 265; 36-month follow-up). Values are shown as mean±standard deviation or percentage. Test statistics and p-values for differences between MCI-S and MCI-C are based on (a) t-test or (b) chi-square test. MCI-S, non-progressive MCI; MCI-C, progressive MCI; APOE, apolipoprotein E; MMSE, Mini-Mental State Examination; RAVLT, Rey Auditory Verbal Learning Test (immediate: sum of 5 trials; learning: trial 5 minus trial 1; forgetting: trial 5 minus delayed; perc.forgetting: percent forgetting); DIGT, Digit-Symbol Coding test; TRAB, Trail Making Test; CDRSB, Clinical Dementia Rating-Sum of Boxes; FAQ, Functional Activities Questionnaire (activities of daily living score); ADAS, Alzheimer’s Disease Assessment Scale–Cognitive sub-scale; mPACCdigit, the Digit Symbol Substitution Test from the Preclinical Alzheimer Cognitive Composite.

Comparison of distributions for baseline predictor variables between MCI-S and MCI-C groups. (a) The mean MMSE score in MCI-S is higher than in MCI-C. (b) Mean Learning scores of MCI-C and MCI-S groups are 2.5 and 5.

Comparisons between MCI-S and MCI-C groups on baseline predictor variables. The y-axis of panels (a) through (d) represents the number of participants. Blue and red bars represent non-converters and converters, respectively. Panel (a) shows a greater number of converters than non-converters for both men and women. Panel (b) shows that more than half of MCI-C subjects are APOE4 carriers and approximately 70% of MCI-S subjects are non-carriers. Panel (c) shows that MCI-S subjects have relatively lower CDR scores and MCI-C subjects have higher CDR scores. The number of people in the MCI-C group has a downward trend as CDR score increases. Panel (d) shows that MCI-C subjects have relatively higher ADASQ4 scores. The average ADASQ4 scores of MCI-S and MCI-C subjects are approximately 5 and 8, respectively.
MRI data
Structural MRI data were collected according to the ADNI acquisition protocol using T1-weighted scans (GradWarp, B1 Correction, N3, Scaled) [20]. These data included baseline structural MRI scans of 840 ADNI participants, including 230 diagnosed as cognitively normal, 200 with diagnoses of dementia, and 410 diagnosed with MCI. Processing for region-of-interest (ROI)-based volumetric data used in the present study included brain extraction [21] and a multi-atlas, consensus-based label fusion scheme for anatomical parcellation [22] to generate template-based ROIs deformed to individual subject space. MRI scans were automatically segmented into 145 anatomic ROIs spanning the entire brain. An additional 114 derived ROIs were calculated by combining single ROIs within a tree hierarchy, to obtain volumetric measurements from larger structures [20]. In total, 259 ROIs were measured and used as potential predictors of MCI-to-dementia progression in this study.
One of the goals of this study is to investigate if manually selecting predictors improves a model’s performance. Based on the extant literature [23], we manually selected 26 out of 259 features as theoretically significant predictors of MCI to dementia progression (Table 3) [14–19]. While many brain regions have been reported as showing some relationship to MCI-to-dementia progression, prior reports and reviews clearly implicate hippocampal and entorhinal cortical volumes as markers of such conversion. In addition, we manually selected additional regions based on their common occurrence across reports, including cingulate gyrus, precuneus, amygdala, inferior frontal gyrus, superior parietal lobule, and lobar white matter volumes.
Pre-selected MRI features of Group One
Values are shown as mean±standard deviation or percentage. Test statistics and p-values for differences between MCI-C and MCI-S are based on t-test. MCI-S, non-progressive MCI; MCI-C, progressive MCI; HippoR, Right Hippocampus; HippoL, Left Hippocampus; flWMR, frontal lobe WM right; flWML, frontal lobe WM left; plWMR, parietal lobe WM right; plWML, parietal lobe WM left; tlWMR, temporal lobe WM right; tlWML, temporal lobe WM left; ACgCR, Right ACgG anterior cingulate gyrus; ACgCL, Left ACgG anterior cingulate gyrus; EntR, Right Ent entorhinal area; EntL, Left Ent entorhinal area; MCgCR, Right MCgG middle cingulate gyrus; MCgCL, Left MCgG middle cingulate gyrus; MFCR, Right MFC medial frontal cortex; MFCL, Left MFC medial frontal cortex; OpIFGR, Right OpIFG opercular part of the inferior frontal gyrus; OpIFGL, Left OpIFG opercular part of the inferior frontal gyrus; OrIFGR, Right OrIFG orbital part of the inferior frontal gyrus; OrIFGL, Left OrIFG orbital part of the inferior frontal gyrus; PCgCR, Right PCgG posterior cingulate gyrus; PCgCL, Left PCgG, posterior cingulate gyrus; PCuR, Right PCu precuneus; PCuL, Left PCu precuneus; SPLR, Right SPL superior parietal lobule; SPLL, Left SPL superior parietal lobule.
Method and algorithm
In the following section, we utilize binary LR and SVM classification techniques to investigate which approach yields superior discrimination accuracy in the context of ADNI data. Prior comparisons of LR and SVM have reported that SVM requires fewer variables than LR to achieve an equivalent MCR [10, 24]. These studies also report that SVM performs better than LR with microarray expression data [10]. Furthermore, SVMs have a convenient dual form that gives sparse solutions when the kernel trick is used. In addition, both methods involve minimizing a cost associated with misclassification, based on a likelihood ratio for a probabilistic model. Therefore, LR and SVM share common roots in statistical pattern recognition, which we utilize in the comparison of their performance on multi-modal ADNI data.
Logistic regression
LR is the most commonly used machine learning approach for binary classification. In the past decade, it has been applied to the task of MCI-to-dementia conversion [13, 26]. In the present study, we consider a supervised learning task where we are given M training examples: $D = (x_i, y_i)$, $i = 1, \ldots, M$. Here each $x_i \in \Re^{N}$ is an N-dimensional feature vector, and $y_i \in \{0, 1\}$ is a class label. The goal of LR is to model the probability p of a random variable y being 1 or 0 given the experimental data x. The logistic regression model is defined as follows:
Logit, the natural logarithm of the odds, is the key concept that underlies logistic regression. The equation for LR is:
LR is usually trained by minimizing an error function; an appropriate choice of such a function for binary classification problems is the cross-entropy error:
The total cost over the data $D = (x_i, y_i)$, $i = 1, \ldots, M$, is:
Consider the problem of finding the maximum likelihood estimate (MLE) of the parameters β for the unregularized logistic regression model. To find the optimized weights β, the total cost needs to be minimized. The optimization function can be written:
Solving Equation (6) yields the optimal weights β.
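For reference, the quantities named in this subsection can be written in their standard textbook form. The block below is a sketch in the notation used here ($x_i$, $y_i$, $\beta$, $M$); the explicit intercept $\beta_0$ and the symbols $p_i$, $E_i$, and $J$ are assumptions introduced for illustration, and the original equation numbering is not reproduced.

```latex
% Logistic model: probability that participant i is a converter (y_i = 1)
p_i = P(y_i = 1 \mid x_i) = \frac{1}{1 + \exp\!\left(-(\beta_0 + \beta^{\top} x_i)\right)}

% Logit (log-odds) is linear in the features
\operatorname{logit}(p_i) = \ln\frac{p_i}{1 - p_i} = \beta_0 + \beta^{\top} x_i

% Cross-entropy error for a single training example
E_i(\beta_0, \beta) = -\left[\, y_i \ln p_i + (1 - y_i)\ln(1 - p_i) \,\right]

% Total cost over the data D
J(\beta_0, \beta) = \sum_{i=1}^{M} E_i(\beta_0, \beta)

% Maximum likelihood estimate = minimizer of the total cost
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta} J(\beta_0, \beta)
```

Minimizing $J$ is equivalent to maximizing the Bernoulli log-likelihood of the labels, which is why the cross-entropy minimizer coincides with the MLE.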
Support vector machine
SVM is another classification and regression method that can handle high dimensional feature vectors. Algorithmically, SVMs build optimal boundaries between data sets by solving a constrained quadratic optimization problem [30–34]. The number of studies applying SVM to evaluate classification of conversion from MCI to dementia has grown over the past decade [1, 35–39].
We briefly review the basic support vector machine with a linear kernel (SVM-linear) for classification problems. A linear SVM seeks a separating hyperplane, parameterized by a weight vector β, scaled such that the distance from the closest point of each class to the hyperplane is 1/||β||.
To make the algorithm work for highly correlated features and improve the fitted model’s prediction accuracy, we reformulate our optimization by adding L1-norm of β, i.e., the lasso penalty as follows:
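The penalized objective referred to here is commonly written as shown below. This is a sketch of the standard L1-norm linear SVM, not necessarily the exact expression used in the original equations. It assumes class labels recoded to $\tilde{y}_i \in \{-1, +1\}$ (MCI-S as $-1$, MCI-C as $+1$), a hinge loss, an intercept $\beta_0$, and a penalty weight $\lambda$; all of these are assumptions made for illustration.

```latex
% Separating hyperplane with canonical scaling (margin 1/||beta||_2 on each side)
\beta^{\top} x + \beta_0 = 0, \qquad
\tilde{y}_i\,(\beta^{\top} x_i + \beta_0) \ge 1, \quad i = 1, \ldots, M

% Hard-margin linear SVM: maximizing the margin = minimizing the norm
\min_{\beta_0,\,\beta}\ \tfrac{1}{2}\lVert \beta \rVert_2^2
\quad \text{subject to} \quad \tilde{y}_i\,(\beta^{\top} x_i + \beta_0) \ge 1

% L1-penalized (lasso-type) linear SVM with hinge loss
\min_{\beta_0,\,\beta}\ \sum_{i=1}^{M} \max\!\left(0,\ 1 - \tilde{y}_i\,(\beta^{\top} x_i + \beta_0)\right)
+ \lambda \lVert \beta \rVert_1,
\qquad \lVert \beta \rVert_1 = \sum_{j=1}^{N} \lvert \beta_j \rvert
```

Larger values of $\lambda$ drive more coefficients to exactly zero, which is what allows the penalty to act as a feature selector.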
Experimental design
We built four different classifiers, each designed to classify individual ADNI participants as belonging to either the MCI-C group or the MCI-S group: Classifier 1 is logistic regression (C-LR); Classifier 2 is logistic regression with L1 norm (C-LR-1); Classifier 3 is support vector machine (C-SVM); and Classifier 4 is SVM with L1 norm (C-SVM-1). To test the classifiers’ performance, we constructed five different data sources (Table 4). The first three single-modality data sets included clinical cognitive assessment scores and APOE4 status (CCA), all MRI volumes (ROI-NP), and MRI volumes with preselection (ROI-P), respectively. Two additional multi-modal data sets were constructed by combining the CCA data separately with the ROI-NP and ROI-P data sets (i.e., brain volumes without and with preselection). Notably, the number of MCI-S subjects is 101 (38%) in Group One and 122 (39%) in Group Two, which makes the data rather imbalanced. Because accuracy alone is unreliable for imbalanced data, the present study also assessed additional model performance parameters, including AUC score, sensitivity, and specificity. The prediction procedure consisted of three processing stages for Group One (Time = 36 months) and Group Two (Time = 24 months): 1) split the data into training, validation, and testing sets; 2) train the classifiers using the training set, tune the hyper-parameter using the validation set, and assess the classifiers using the testing set, then train the classifiers again using the L1 norm on the same training set; 3) report the testing accuracy, AUC score, sensitivity, and specificity of each classifier on single-modality data. Specifically, the first stage used 80% of the sample as a training set, while the remaining 20% of the data constituted the testing set. In the second stage, the optimal subsets of features of each data source were determined and chosen following application of the L1 norm.
We then list the top 10 features of each data set for each of the models. In the last stage, we report AUC score, sensitivity (percent of MCI-C subjects correctly classified), and specificity (percent of MCI-S subjects correctly classified) as measures of classification accuracy. To protect against over-fitting and to avoid optimistically biased estimates of model performance, we report 20 measures of predictive performance for each classifier (1–4); for these different partitions of the data, we report the mean and standard deviation of testing accuracy, AUC score, sensitivity, and specificity (Tables 6 and 7). We also investigate the relationship between the number of features and model performance. Finally, we compare the performance of LR with SVM based on their ability to handle problems with a large number of covariates. Figure 3 illustrates the prediction framework.
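The performance measures named above can be computed directly from class labels and classifier scores. The sketch below is illustrative only: the function names (`confusion_counts`, `accuracy`, `sensitivity_specificity`, `auc_score`) are hypothetical, labels follow the coding used here (1 = MCI-C, 0 = MCI-S), and the always-predict-converter baseline uses the Group One class counts (265 participants, 101 of them MCI-S) to show why accuracy alone can mislead on imbalanced data.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) with 1 = MCI-C (positive) and 0 = MCI-S (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def accuracy(y_true, y_pred):
    """Fraction of all participants classified correctly."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fn + tn + fp)


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity: share of MCI-C correctly classified.
    Specificity: share of MCI-S correctly classified."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), tn / (tn + fp)


def auc_score(y_true, scores):
    """AUC as the probability that a random converter receives a higher score
    than a random non-converter (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Baseline with the Group One class counts: label everyone as a converter.
y_true = [1] * 164 + [0] * 101
y_all_converters = [1] * 265
baseline_accuracy = accuracy(y_true, y_all_converters)
baseline_sn, baseline_sp = sensitivity_specificity(y_true, y_all_converters)
print(f"accuracy={baseline_accuracy:.3f}, Sn={baseline_sn:.1f}, Sp={baseline_sp:.1f}")
# prints: accuracy=0.619, Sn=1.0, Sp=0.0
```

A rule that ignores the features entirely reaches about 62% accuracy with zero specificity, which is why sensitivity, specificity, and AUC are reported alongside accuracy.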
Modalities
LR and SVM performance of Group One (Time = 3 years) for models on single and multi-modal feature sets
Predictive performance of LR and SVM (mean±standard deviation) for all models. Performance estimates include testing accuracy (Test Acc %), area under the curve (AUC), sensitivity (Sn), and specificity (Sp). The number (#) of features was determined via (1): Classifier 2; (2): Classifier 4.
LR and SVM performance of Group Two (Time = 2 years) for single-modality and multi-modal data
For each modality, the predictive performance of LR and SVM is shown (mean±standard deviation), including testing accuracy, AUC, sensitivity (Sn), and specificity (Sp). The number (#) of features was determined via (1): Classifier 2; (2): Classifier 4.

Flowchart of the LR and SVM method. A) ROI-P: ROI level data with Pre-selection; B) ROI-NP: ROI level data with No Pre-selection; C) CCAR: Clinical, Cognitive assessments score, APOE4, and ROI level data.
RESULTS
Cross-validation and choice of λ
We adopted 10-fold cross-validation to tune the hyper-parameters for each model, which included dividing the data into separate sets for training and validation. The ratio of cases in training and validation was 8:2. Here, the training set was used to train the model and the validation set was used to select the hyper-parameters. The results of a 10-fold cross-validation run are summarized with the mean and standard deviation of the model skill scores based on testing data. λ denotes the regularization hyper-parameter for both LR-L1 and SVM-L1. To select the optimal λ, we tried different values; results reported here include values of λ = 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8, which were applied to Eqs. (8) and (11). Next, we selected the λ value with the best cross-validation score and used it with Classifiers 2 and 4 to select optimal features. For brevity, the model performance estimates are reported in Tables 6 and 7 for each of the modalities, and the top 10 selected features are reported in Table 5. For example, the best λ for ROI-NP-L1 was 0.01, and the top 3 features selected by LR were left amygdala, right accumbens area, and right middle temporal gyrus. After the hyper-parameters were selected, we again adopted 10-fold cross-validation to avoid optimistically biased estimates of model performance. In each iteration, 212 of the 265 participants were selected by simple random sampling as training cases and the remaining 53 were used as test cases. The approximate 4:1 ratio of training to test cases is, of course, arbitrary.
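The repeated random partitioning and the grid search over λ described above can be sketched as follows. This is an illustrative outline, not the authors' code: `random_split`, `repeated_holdout`, `select_lambda`, and the `evaluate` and `cv_score` callables are hypothetical names, the seed is arbitrary, and the toy `majority_baseline` stands in for a fitted classifier while using only the Group One class counts.

```python
import random
import statistics

# Candidate penalty weights listed in the text.
LAMBDA_GRID = [0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]


def random_split(n_total, n_train, rng):
    """Simple random sampling without replacement into train/test index lists."""
    indices = list(range(n_total))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]


def repeated_holdout(n_total, n_train, n_repeats, evaluate, seed=0):
    """Call evaluate(train_idx, test_idx) on each random partition and return
    the mean and sample standard deviation of the resulting scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train_idx, test_idx = random_split(n_total, n_train, rng)
        scores.append(evaluate(train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)


def select_lambda(grid, cv_score):
    """Return the grid value with the highest cross-validation score."""
    return max(grid, key=cv_score)


# Toy stand-in for a classifier: predict the majority class of the training split.
labels = [1] * 164 + [0] * 101  # Group One counts: 1 = MCI-C, 0 = MCI-S


def majority_baseline(train_idx, test_idx):
    """Test accuracy of always predicting the training-set majority class."""
    n_pos = sum(labels[i] for i in train_idx)
    majority = 1 if n_pos * 2 >= len(train_idx) else 0
    return sum(1 for i in test_idx if labels[i] == majority) / len(test_idx)


# 20 partitions of 265 participants into 212 training and 53 test cases.
mean_acc, sd_acc = repeated_holdout(265, 212, 20, majority_baseline)
print(f"baseline test accuracy: {mean_acc:.3f} ± {sd_acc:.3f}")
```

In practice, `evaluate` would fit a classifier on the training indices and return its AUC or accuracy on the test indices, and `cv_score` would return the cross-validated score obtained with a given λ.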
Top 10 features of Group One obtained by L1 regularization
AccmR, Right Accumbens Area; AmyL, Left Amygdala; HippoL, Left Hippocampus; InfR, Right Inf Lat. Vent; AOrGL, Left anterior orbital gyrus; AnGR, Right angular gyrus; LOrGL, Left lateral orbital gyrus; MOGL, Left middle occipital gyrus; MOrGL, Left medial orbital gyrus; MTGR, Right middle temporal gyrus; PCgGL, Left posterior cingulate gyrus; POR, Right parietal operculum; POrGR, Right posterior orbital gyrus; PrGR, Right precentral gyrus; PTR, Right planum temporale.
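The feature rankings summarized in this table depend on a property of the L1 penalty: it can shrink the coefficients of uninformative features to exactly zero. The sketch below illustrates this with a small proximal-gradient (soft-thresholding) solver for L1-penalized logistic regression. It is not the solver used in the study, which is not specified here; the function names, the use of a mean rather than summed loss, the step size, iteration count, λ value, and the six-row toy data set are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Minimize mean cross-entropy + lam * ||beta||_1 by proximal gradient
    descent. The intercept is not penalized."""
    m, n = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * n
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * n
        for xi, yi in zip(X, y):
            residual = sigmoid(intercept + sum(b * v for b, v in zip(beta, xi))) - yi
            grad0 += residual / m
            for j in range(n):
                grad[j] += residual * xi[j] / m
        intercept -= step * grad0
        # Gradient step on the smooth loss, then soft-threshold for the L1 term.
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return intercept, beta


def rank_features(names, beta):
    """Names of features with nonzero weight, largest |weight| first."""
    kept = [(abs(b), name) for name, b in zip(names, beta) if b != 0.0]
    return [name for _, name in sorted(kept, reverse=True)]


# Toy data: the first feature separates the classes, the second carries no signal.
X = [[-2.0, 0.3], [-1.0, -0.2], [-0.5, 0.1], [0.5, 0.1], [1.0, -0.2], [2.0, 0.3]]
y = [0, 0, 0, 1, 1, 1]
intercept, beta = fit_l1_logistic(X, y, lam=0.05)
print(rank_features(["informative", "noise"], beta))  # prints ['informative']
```

Only the informative feature keeps a nonzero weight, so ranking the surviving coefficients by absolute value yields a short, ordered feature list of the kind reported in Table 5.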
Comparison with different modalities
We compared the performance of each classifier (1–4) on the five different feature sets (Table 4) based on estimates of AUC, sensitivity, and specificity. As shown in Table 6, LR with L1 regularization (Classifier 2) achieved a high AUC of 81.2% and sensitivity of 81.4% on single-modality data (CCA), which is considerably better than the performance of LR on the other four modalities. Similarly, the best AUC and sensitivity achieved by SVM were 81.4% and 81.6%, based on the combination of CCA and SVM-L1. Furthermore, we also found that the highest accuracy achieved by both classifiers without regularization was on the single-modality data (CCA); this indicates that both classifiers perform best on single-modality data.
Comparison of pre-selection and L1 norm
We found that using prior knowledge to inform feature selection improves model performance and protects against over-fitting. As shown in Table 6, model performance (i.e., AUC) on ROI-P (64.3%) and CCAR-P (76.3%) exceeded that on ROI-NP (60.6%) and CCAR-NP (60.1%). With L1 regularization, Classifier 2 had AUC scores of 64.1% and 64.0% on the ROI-NP-L1 and CCAR-NP-L1 data sets, while ROI-P-L1 and CCAR-P-L1 had respective AUC scores of 64.3% and 77.9%; this suggests that user-guided pre-selection improved model performance over the L1 norm alone, most clearly for the multi-modal data. In addition, the SVM classifiers (Classifiers 3 and 4) had results comparable to those of the LR classifiers. First, as with the LR models, the observed AUC estimates for CCAR-P and ROI-P (69.2% and 64.1%, respectively) were superior to AUCs from the CCAR-NP (59.1%) and ROI-NP analyses (61.4%). Classifier 4 exhibited performance on CCAR-P-L1 similar to that of Classifier 2, with an AUC value of 79.6%, higher than the model for CCAR-NP-L1 (74.0%). Therefore, manually selecting features improves model performance whether or not the L1 norm is applied. Second, these results underscore the value of pre-selection, because both LR and SVM models on CCAR-P-L1, with respective AUC estimates of 77.9% and 78.5%, exhibited superior performance over the models without such pre-selection (i.e., LR and SVM on CCAR-NP-L1 had AUC estimates of 64.0% and 74.0%, respectively).
Comparison of groups one and two
In addition to the results from models of Group One (i.e., MCI-to-AD conversion over 36 months), we also evaluated the performance of Group Two (i.e., MCI-to-AD conversion over 24 months) in an effort to gain further insight regarding possible benefits of shorter or longer assessment periods on classification of the progression of MCI to dementia. Table 7 summarizes the predictive performance of LR and SVM for Group Two. As with Group One, we evaluated classifier performance for single- and multi-modality feature sets. The best result was obtained using the SVM-L1 model (Classifier 4) on CCAR-P, with corresponding AUC, Sn, and Sp of 76.2%, 60.1%, and 79.2%, which again supports the hypothesis that manual feature selection improves model performance. However, it warrants mention that all classifiers’ performance on the Group One data exceeded the same classifiers’ performance on the same data sets in Group Two. For example, Classifier 2 of Group One on CCA achieved AUC and Sn values of 81.2% and 83.1%, which is considerably better than the same classifier of Group Two on CCA (i.e., 76.3% and 79.8%). Similarly, Classifier 3 for ROI-NP had an AUC of 61.4% for Group One and 56.6% for Group Two. The experimental results indicated superior model performance on data obtained using longer rather than shorter follow-up periods. Given the uncertainty in conversion, a longer time window for assessment of cognitive and functional change appears to yield more accurate classification.
Comparison of LR and SVM
In addition to comparing classification between different time windows of assessment, we also compared performance differences between LR and SVM. The results, including each model’s ability to address overfitting across modalities, are displayed in Tables 6 and 7 and Figs. 4 and 5. First, neither LR nor SVM works well without L1 penalization, since Classifiers 2 and 4 outperform Classifiers 1 and 3 on the same data set. Second, SVM performs better on MRI data when the L1 feature selection method is employed. Third, LR achieved performance equivalent to that of SVM for “large p” data (ROI-P), as evidenced by respective AUC estimates for Classifiers 1 and 3 of 64.3% and 64.1%. Finally, as shown in Figs. 4 and 5, the SVM method is more stable and robust than LR to a large number of features when n is small. To summarize, the best performance for Group One was achieved by Classifier 4 (SVM with L1 norm) on the multi-modal data (i.e., CCAR-L1), with an AUC of 81.4%.
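The difference between the two L1-penalized classifiers compared above lies in the loss function. As a reminder, the generic textbook objectives are given below; the notation (labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, intercept $\beta_0$, coefficients $\beta$, penalty weight $\lambda$) is assumed here for illustration and may differ from the exact formulation used in the Methods.

```latex
% L1-penalized logistic regression (logistic loss)
\min_{\beta_0,\,\beta}\;
  \frac{1}{n}\sum_{i=1}^{n}
  \log\!\Bigl(1 + \exp\bigl(-y_i(\beta_0 + x_i^{\top}\beta)\bigr)\Bigr)
  + \lambda \lVert \beta \rVert_1

% L1-penalized linear SVM (hinge loss)
\min_{\beta_0,\,\beta}\;
  \frac{1}{n}\sum_{i=1}^{n}
  \max\!\bigl(0,\; 1 - y_i(\beta_0 + x_i^{\top}\beta)\bigr)
  + \lambda \lVert \beta \rVert_1
```

In both cases the term $\lambda \lVert \beta \rVert_1$ drives some coefficients to exactly zero. The hinge loss is zero for cases classified correctly beyond the margin, whereas the logistic loss is positive for every case, so every observation contributes to the LR fit.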

Model performance on ROI feature set by number of features for LR and SVM. Panel (a) shows that AUC with LR rises sharply as the number of features increases from 1 to 30, plateaus at approximately 74% between 30 and 40 features, and drops markedly when the number of features reaches 41. Panel (b) shows that AUC with SVM increased sharply as the number of features grew from 1 to 28 but fluctuated beyond 29. The optimal numbers of ROI features for LR and SVM were 29 and 28, with corresponding optimized AUCs of approximately 74.0% and 78.0%, respectively.

Model performance on CCA feature set by number of features for LR and SVM. Panel (a) shows a marked increase in AUC with LR as the number of features increases from 1 to 5, followed by a slight decrease in testing accuracy when the number of features exceeds 5. Panel (b) shows that AUC with SVM rose sharply as the number of features increased from 1 to 4. The optimal numbers of CCA features for LR and SVM were 5 and 4, with corresponding optimized AUCs of approximately 84.0% and 83.0%, respectively.
DISCUSSION
In this study, we applied two machine learning methods under multiple conditions to test accuracy in distinguishing patients with MCI who progress to clinically defined dementia (MCI-C) from those who remain stable (MCI-S). Using multi-modal data from ADNI, we compared LR and SVM classification accuracy and pre-selection dimensional reduction techniques, i.e., feature selection as informed by prior findings in clinical neuroscience and by L1 norm. Notably, the present results demonstrate important boundaries for applying feature selection techniques in statistical classification of MCI-to-dementia conversion. Specifically, we found that while using L1 for pre-selection can improve accuracy, it also benefits from a more limited, theoretically based set of feature inputs. In addition, we found that model performance benefited from a longer window of assessment. These results have implications for studies utilizing multi-modal data for such classification, including features from clinical neuropsychological assessment, demographic and genetic markers, MRI-based volumetric brain measures, and other modalities.
Comparison of user-defined and L1 pre-selection for LR and SVM classifiers yielded multiple noteworthy findings, consistent with previously published reports [1, 35]. First, the classification results showed that the model using multi-modal data with cognitive, clinical, and volumetric data (CCAR) achieved better classification accuracy than the methods based on a single modality (CCA, ROI). Moreover, the AUC of CCAR based on LR or SVM was either statistically significantly or at least numerically greater than that based on the single-modality models. Based on AUC, the highest accuracy was observed for CCAR data, at 78.5% by L1 SVM and 77.9% by L1 LR. Second, SVM demonstrated several advantages over LR in discriminating MCI-C from MCI-S (Fig. 4). For one, SVM performance tended to be more stable than LR when the number of features was relatively large. In other words, the model performance of SVM on ROI data remained more stable than that of LR when using larger numbers of features without user-defined pre-selection. In particular, SVM performance on ROI data improved as the number of features increased from 20 to 30. In contrast, the LR AUC values for ROI data sets remained fairly static despite increasing the number of features, and LR model performance decreased gradually after the number of ROI features reached 40. Third, the classification results demonstrate that manually selecting features on MRI data not only improved model performance and protected the classifier from overfitting, but also afforded easier interpretation of each selected feature’s contribution to the model. In addition, we show that pre-selection improves performance: Tables 6 and 7 suggest it is the best strategy for obtaining maximum model performance, compared with feature selection based on L1 norm.
The present findings can also be interpreted in the context of other reports over the past decade that investigated the prognostic capacity of brain volumetry data to predict the conversion of MCI to dementia, using either SVM or LR, and that combined volumetry data with other imaging and biomarker modalities, ranging from MRI, functional MRI (fMRI), and PET to cerebrospinal fluid (CSF) protein markers [1, 41–43]. In addition, one can vary the degrees of non-linearity and flexibility in the model by employing different kernel functions. For example, Young et al. (2013) [8] report results from both SVM and Gaussian process (GP) classification on MCI progression in ADNI data using MRI, PET, APOE4, and CSF biomarkers. In contrast to the present study and to other published work that used MCI-C and MCI-S groups as training and test data sets, they trained a classifier to distinguish cognitively normal older adults from those diagnosed as probable AD. They reported that the accuracy using GP, an AUC value of 79.5%, was substantially higher than that obtained using any individual modality or using multi-kernel SVM. Other studies of MCI-to-dementia classification reporting high accuracy have implemented other approaches, such as multiple kernel learning (pMKL) classification techniques using clinical, MRI, and plasma biomarker data. One method using this approach to identify the important features first grouped the data set into five different data sources and then applied a filter-wrapper approach of feature selection techniques in combination with the Joint Mutual Information (JMI) criterion to achieve an AUC of 82% [23].
We also found consistently superior classification performance in patients classified under a longer window of assessment. MCI-to-dementia conversion is a process that can unfold over several years, so reliably tracking an individual from the onset of amnestic MCI to early-stage dementia requires extended follow-up [8, 45]. For the modeled features to be useful for classification, the classes must be well defined, if not orthogonal. However, MCI is not inherently prodromal to dementia: a large proportion of individuals with MCI never progress, either reverting to cognitively normal status or remaining rather stable. Furthermore, others may show early evidence of brain atrophy that precedes cognitive impairment by years. To account for this variable timing, others have employed methods such as supervised learning using time windows [46]; however, even those methods strongly benefit from longer follow-up periods. Thus, MCI is an inherently heterogeneous and poorly defined class, particularly in terms of the relationships between brain characteristics and the likelihood and timing of further cognitive decline.
The brain volumetric data evaluated in the present study were limited to baseline MRI scans. Classifying cognitive decline may benefit from further extending the model to accommodate repeated measurements from longitudinal data. While the inclusion of repeated volumetric data should improve classification accuracy, quantifying the improvements in model performance may also depend on other factors, such as added noise or redundancy from additional brain parameters, or variability in disease progression. In addition, most computational neuroimaging studies in the past few years have utilized features from multiple neuroimaging modalities [5, 47–50]. For example, when Ding et al. applied SVM with PET and MRI data to classify the transition from MCI to AD, they reported sensitivity and specificity of 66.67% and 64.52%, respectively [36]. Beyond PET and structural MRI data, CSF protein markers can be used to predict progression from MCI to AD, as can proteomic, demographic, and cognitive data [38, 52]. By applying LR with L1 norm to CSF markers for classifying individual patients as belonging to either the MCI-C or the MCI-S group, one study reported a sensitivity and specificity of 80% and 75%, respectively [26]. Furthermore, Varatharajah and colleagues (2020) showed that SVM-linear outperforms other advanced classification methods, including the linear classifiers multiple kernel learning (MKL) with linear kernels, SVM with a linear kernel, and generalized linear model (GLM), in predicting transition from MCI to AD [42]. In general, LR works well when the data are linearly separable and the number of observations is greater than the number of features, whereas SVM with a Gaussian kernel is mostly used when the data are not linearly separable. In addition to LR and SVM, deep neural network approaches also offer benefits [41, 53], but have not been applied to ADNI data as extensively as SVM and LR.
Using a novel LR model, an artificial neural network (ANN) model, and a decision tree (DT) model for classifying the progression of MCI to AD, Kuang (2021) reported that the ANN exhibited the highest sensitivity, at 82.1% [43].
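The Gaussian kernel mentioned above in connection with non-linearly separable data can be illustrated with a short sketch. The function name `rbf_kernel`, the bandwidth parameter `gamma`, and the four XOR-style points are assumptions chosen for illustration; they do not correspond to any model or data set in the present study.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# XOR-style toy data: no straight line separates the two classes.
xor_points = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
xor_labels = [0, 0, 1, 1]

# Gram matrix of pairwise similarities; a kernel SVM works with this
# matrix rather than with the raw coordinates.
gram = [[rbf_kernel(p, q) for q in xor_points] for p in xor_points]
for row in gram:
    print([round(value, 3) for value in row])
```

Because the kernel depends only on distances between observations, the resulting decision boundary can bend around clusters of cases, at the cost of an additional tuning parameter (`gamma`) and less directly interpretable coefficients than those of LR.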
In conclusion, models applying prior knowledge for classification and prediction of MCI-to-dementia conversion outperform those without pre-selection. This theoretically guided pre-selection of features from MRI-based regional brain volumes appears to protect the model against over-fitting. In addition, the present findings demonstrate that SVM classifier performance is more stable than LR for dealing with the “large p” problem. Clinical researchers should both note the value of evaluating different classification and pre-selection approaches in application to clinical or research questions and be mindful that not all machine learning techniques are equally beneficial for modeling specific clinical outcomes.
ACKNOWLEDGMENTS
The research is partially supported by NSF-DMS 1945824 and 1924724.
We are grateful to the patients and their families who participated in the ADNI.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health. The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
