Abstract
T1-weighted MRI has been extensively used to extract imaging biomarkers and build classification models for differentiating Alzheimer’s disease (AD) patients from healthy controls, but only recently have brain connectome networks derived from diffusion-weighted MRI been used to model AD progression and various stages of disease such as mild cognitive impairment (MCI). MCI, as a possible prodromal stage of AD, has gained intense interest recently, since it may be used to assess risk factors for AD. Little work has been done to combine information from both white matter and gray matter, and it is unknown how much classification power the diffusion-weighted MRI-derived structural connectome could provide beyond the information available from T1-weighted MRI. In this paper, we investigated whether the diffusion-weighted MRI-derived structural connectome can improve the differentiation of healthy control subjects from those with MCI. Specifically, we proposed a novel feature-ranking method and built classification models from the most highly ranked feature variables to distinguish MCI from healthy controls. We verified our method on two independent cohorts: the second stage of the Alzheimer’s Disease Neuroimaging Initiative (ADNI2) database and the National Alzheimer’s Coordinating Center (NACC) database. Our results indicated that 1) the diffusion-weighted MRI-derived structural connectome can complement T1-weighted MRI in the classification task; and 2) the feature-ranking method is effective, as it identified consistent T1-weighted MRI and network feature variables on ADNI2 and NACC. Furthermore, by comparing the top-ranked feature variables from ADNI2, NACC, and the combined dataset, we concluded that cross-validation using independent cohorts is necessary and highly recommended.
INTRODUCTION
In Alzheimer’s disease (AD) research, mild cognitive impairment (MCI) is commonly studied as a transitional stage between the cognitive decline expected with normal aging and the more serious decline of dementia. It can involve problems with memory, language, attention, and judgment that are greater than normal age-related changes but not severe enough to interfere with daily activities. In [1], the authors analyzed how the five most well-established AD biomarkers evolve in relation to each other and to the onset and progression of clinical symptoms, from healthy controls (HC) to MCI and dementia. Even so, it is not yet fully understood how the brain deteriorates from health to MCI and eventually to dementia, and understanding these mechanisms and processes is vital for evaluating preventions and treatments. In addition, more effective prognosis and evaluation for people with MCI could offer an enormous public health benefit.
Providing detailed information on the brain’s gray matter, T1-weighted magnetic resonance imaging (T1w MRI) has been widely used in discovering biomarkers related to AD. By averaging cortical feature variables in an AD population and in matched elderly controls, [2] identified striking profiles of gray matter loss, anatomical variation, and some evidence of cerebral asymmetries in these deficits. Significant gray matter deficits were observed in a broad anatomical region encompassing bilateral temporal and parietal cortices in patients with AD [3]. Recently, machine learning techniques have been introduced to identify changing patterns of brain structures and improve prognostic accuracy. For example, a support vector machine (SVM) was used [4–6] to differentiate AD from HC with discrete wavelet transform-based features or displacement field-based features, achieving promising classification accuracy. Escudero et al. [7] adopted SVM and logistic regression and obtained reasonable accuracy in classifying individuals as HC versus MCI, and HC versus AD. Ota et al. [8] used voxel-based analyses and SVM-based pattern recognition to analyze 77 three-dimensional T1w MRI data sets from subjects with amnestic MCI and showed a significant cluster of gray matter density reduction in the left hippocampal region. Zhou et al. [9] formulated the disease progression prediction problem as a multi-task regression problem by considering the prediction at each time point as a task. Their results showed that cortical thickness averages from the left middle temporal gyri and the left and right entorhinal cortices, and a measure of white matter volume from the left hippocampus, play significant roles in predicting decline in a key cognitive measure, the ADAS-Cog [10], through the course of progression.
While T1w MRI has provided an effective way to capture structural information on the brain’s gray matter, diffusion-weighted MRI (dMRI) is sensitive to microscopic properties of the brain’s white matter that are not detectable with standard anatomical MRI and is widely adopted in brain research. dMRI can provide maps of connections among different brain regions. This type of map is termed the brain’s “structural connectome” or simply a brain network, which may include hundreds to thousands of nodes indicating brain regions-of-interest (ROIs) and the weighted connections (edges) between them. Brain structural network analysis has been widely used to study various brain diseases [11–13], including AD [14–20]. By using a model of the brain’s structural network, Wang et al. [19] reported that AD patients had significantly decreased nodal efficiency at the regional level, as well as weaker connections in multiple local cortical and subcortical regions, such as the precuneus, temporal lobe, hippocampus, and thalamus. Daianu et al. [16] tested how AD disrupts the ‘rich club’ effect. They found that AD patients had a lower nodal degree in cortical regions implicated in the disease, as expected, while the normalized rich club coefficient was higher in AD. In [14], the authors studied 34 participants, finding decreased fiber density and disrupted connectivity between the hippocampus and posterior cingulate cortex in early AD. The MCI group showed reduced fiber density from the posterior cingulate cortex and hippocampus to the rest of the brain.
The combination of different MRI modalities has been successfully applied for prognosis of some diseases [21–27], including MCI [28–34]. In those studies, the authors combined dMRI-derived features, such as fractional anisotropy (FA) values and the apparent diffusion coefficient (ADC), with T1w features, such as cortical thickness and cortical volume, to distinguish MCI from HC. The results showed promising performance from combining different MRI modalities. However, few existing studies use the dMRI-derived structural connectome together with T1w information to differentiate MCI from HC. This encourages us to explore the effectiveness of combining structural connectome markers and T1w MRI markers. Also, most previous studies validated their method on only one dataset. We believe that cross-validation on multiple datasets can lead to insightful analysis. In this study, we tested the hypothesis that the dMRI-derived structural connectome can supplement T1w MRI by boosting the classification performance in differentiating MCI subjects from HC subjects. Specifically, we extracted dMRI-derived network feature variables, i.e., the connections between 113 ROIs [35], and T1w MRI-derived feature variables, including cortical volume and cortical thickness computed by FreeSurfer [36–39], and investigated the potential benefits of combining these two types of feature variables. However, in this case, the number of possible feature variables we can assess tends to be much larger than the sample size, which may lead to model overfitting [40]. To tackle this, we introduced a new feature variable dimension reduction framework, in which the importance of each feature variable is quantified by a stability score, and a subset of high-scoring feature variables is then selected for the classification task [41, 42].
Stability selection is a powerful approach to measure the quality of feature variables by evaluating how sensitive sparse linear models (e.g., sparse logistic regression) are to each selected feature variable, where good feature variables are those more consistently selected in the final classification model, in a bootstrap process. We evaluated this framework on two independent data sets, i.e., ADNI2 [43] and NACC [44].
MATERIALS AND METHODS
Data
Two independent data sets were analyzed in this study. The first dataset is from ADNI2 and contains 50 HC and 112 MCI subjects. Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. For up-to-date information, see www.adni-info.org. The second dataset is from NACC and contains 329 HC and 57 MCI subjects. Demographic characteristics of the two datasets are summarized in Table 1. dMRI and T1w MRI data were analyzed for each subject. Table 2 summarizes the key data collection parameters for the two cohorts.
Table 1. Demographic information for the two cohorts (ADNI2 and NACC). The p-values for the difference between ADNI2 and NACC are 0.023 for sex and 3.88e-23 for age. The last column is the p-value for the difference between MCI and HC.
Table 2. Parameters for dMRI and T1w MRI data for ADNI2 and NACC.
Feature variables
Two types of feature variables were extracted in this study. The first type comes from the gray matter, using T1w MRI. FreeSurfer was used to extract 136 measurements, comprising cortical volume and thickness for 68 brain ROIs based on the Desikan-Killiany atlas [45]. The second type comes from the dMRI-derived structural connectome, or network. The brain structural connectome was constructed using PICo [46], a whole-brain probabilistic tractography algorithm, and 113 ROIs defined on the Harvard Oxford Cortical and Subcortical Probabilistic Atlas [45, 47–49]. Details of the network computation can be found in [35]. Each subject’s network has a dimension of 113×113, with 6,328 distinct edges connecting the 113 brain ROIs (the edges are not directional, and thus the network is symmetric).
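The edge count follows from the symmetry of the network: an undirected graph on 113 nodes without self-connections has 113 × 112 / 2 = 6,328 distinct edges. The sketch below illustrates one way to turn a symmetric matrix into such an edge vector. The function name and the random matrix are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np

N_ROIS = 113  # number of ROIs stated in the text


def edges_from_network(network: np.ndarray) -> np.ndarray:
    """Return the distinct undirected edges (strict upper triangle) as a 1-D vector."""
    if network.shape[0] != network.shape[1]:
        raise ValueError("network must be square")
    rows, cols = np.triu_indices(network.shape[0], k=1)  # index pairs with row < col
    return network[rows, cols]


# Synthetic symmetric matrix standing in for one subject's connectome.
rng = np.random.default_rng(0)
upper = np.triu(rng.random((N_ROIS, N_ROIS)), k=1)
demo_network = upper + upper.T

edge_vector = edges_from_network(demo_network)
n_edges = edge_vector.size  # 113 * 112 / 2
```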
Harmonization and removal of confounder effects
There are two distinct data sets used in this study. It is likely that the imaging sequences for the two data sets may have different sensitivity to disease effects and different sources of error. Therefore, we adjusted for cohort effects, as well as age and sex. We created an indicator variable differentiating the two data sets with 1 for all subjects from ADNI2 and -1 for all subjects from NACC.
We use the commonly adopted generalized linear squares approach [50, 51] to remove the confounder effects. It assumes that a variable may depend linearly on the confounder variables, so that their effects can be removed by fitting a generalized linear regression. We assume that each feature variable x has an observed value x_obs, given by the original feature value x_ori linearly biased by three confounder variables: age, sex, and cohort index.
We repeat this process for all feature variables derived from both T1w MRI and dMRI. In the following analysis, we use the harmonized feature variables.
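The harmonization step can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes the linear model x_obs = x_ori + β_age·age + β_sex·sex + β_cohort·cohort (the coefficient names are our notation), fits it by ordinary least squares with an intercept as a simple stand-in for the regression described above, and runs on synthetic data.

```python
import numpy as np


def remove_confounds(features: np.ndarray, confounds: np.ndarray) -> np.ndarray:
    """Subtract the fitted linear contribution of the confounds from each feature column.

    features:  (n_subjects, n_features) observed values x_obs
    confounds: (n_subjects, n_confounds), e.g. age, sex, cohort index (+1 / -1)
    """
    n = features.shape[0]
    design = np.column_stack([np.ones(n), confounds])  # intercept + confounds
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    # Remove only the confound terms; the intercept (overall feature level) is kept.
    return features - confounds @ coef[1:]


# Synthetic demo: 120 subjects, 4 features biased by age, sex and cohort.
rng = np.random.default_rng(0)
age = rng.normal(72.0, 7.0, size=120)
sex = rng.integers(0, 2, size=120).astype(float)
cohort = np.where(np.arange(120) < 60, 1.0, -1.0)  # +1 = ADNI2, -1 = NACC
C = np.column_stack([age, sex, cohort])
x_ori = rng.normal(size=(120, 4))
x_obs = x_ori + C @ rng.normal(size=(3, 4))

x_harmonized = remove_confounds(x_obs, C)
```

After this step the harmonized features are uncorrelated with the three confounders in the sample, which is what a linear adjustment can guarantee. Non-linear cohort effects would not be removed.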
Classification modeling via sparse logistic regression
In linear models, sparsity means that a feature variable is deemed irrelevant if its corresponding weight is zero. Such feature variables are discarded and make no contribution to the final classification model. Sparse learning algorithms, such as sparse logistic regression for classification and LASSO [52] for linear regression, are powerful tools for building models from high-dimensional data at low computational cost. Sparsity is achieved by adding a sparsity-inducing regularization term on the weight vector w, such as λ||w||_1, to the objective function, so that the final weight vector, i.e., the model, is sparse with high probability. Let x_i ∈ R^d denote one subject, where d is the number of feature variables used, which will be elaborated later. The binary class label of this subject is denoted by y_i ∈ {-1, 1}, where an MCI subject is denoted as -1 and an HC subject as +1. Given n samples {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, the sparse logistic regression model is trained by minimizing the logistic loss over these samples plus the regularization term λ||w||_1.
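For reference, the L1-regularized logistic loss in its standard textbook form is written out below. This is a sketch under assumptions: the intercept c and the 1/n averaging are our choices, and the authors' exact formulation may differ in those details.

```latex
\min_{w \in \mathbb{R}^d,\; c \in \mathbb{R}} \;
\frac{1}{n} \sum_{i=1}^{n} \log\!\left( 1 + \exp\!\left( -y_i \left( w^{\top} x_i + c \right) \right) \right)
+ \lambda \lVert w \rVert_1,
\qquad
P(y_i = +1 \mid x_i) = \frac{1}{1 + \exp\!\left( -\left( w^{\top} x_i + c \right) \right)}
```

With the label coding used here, P(y_i = +1 | x_i) is the modeled probability that subject i belongs to the HC group.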
If the probability of this subject belonging to the HC group is greater than 0.5, this subject will be labeled as HC. Otherwise this subject will be labeled as MCI.
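A sparse logistic regression of this kind can be fitted in several ways. The sketch below uses proximal gradient descent (ISTA) in plain NumPy, which is our choice for illustration and not necessarily the solver used in the study. The function names, the synthetic data, and the value of λ are all assumptions.

```python
import numpy as np


def soft_threshold(v: np.ndarray, t: float) -> np.ndarray:
    """Proximal operator of t * ||.||_1 (sets small entries exactly to zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fit_sparse_logreg(X, y, lam, n_iter=500):
    """Minimize mean logistic loss + lam * ||w||_1 by proximal gradient descent.

    X: (n, d) features; y: (n,) labels in {-1, +1}. Returns (w, c).
    """
    n, d = X.shape
    w, c = np.zeros(d), 0.0
    # Step size 1/L, where L upper-bounds the Lipschitz constant of the smooth part.
    step = 4.0 / (np.linalg.norm(X, 2) ** 2 / n + 1.0)
    for _ in range(n_iter):
        margin = y * (X @ w + c)
        # Derivative of log(1 + exp(-z)) is -1 / (1 + exp(z)), written stably via tanh.
        g = -y * 0.5 * (1.0 - np.tanh(0.5 * margin))
        w = soft_threshold(w - step * (X.T @ g) / n, step * lam)
        c = c - step * g.mean()  # the intercept is not penalized
    return w, c


def predict_hc_probability(X, w, c):
    """P(y = +1 | x), i.e. the probability of HC under the +1 = HC coding."""
    return 0.5 * (1.0 + np.tanh(0.5 * (X @ w + c)))  # numerically stable sigmoid


def predict_label(X, w, c):
    """+1 (HC) if P(HC) > 0.5, otherwise -1 (MCI)."""
    return np.where(predict_hc_probability(X, w, c) > 0.5, 1, -1)


# Synthetic demo: 50 features, only the first three carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))
signal = X_demo[:, 0] + X_demo[:, 1] - X_demo[:, 2] + 0.3 * rng.normal(size=200)
y_demo = np.where(signal > 0, 1, -1)

w_hat, c_hat = fit_sparse_logreg(X_demo, y_demo, lam=0.05)
n_selected = int(np.count_nonzero(w_hat))
train_acc = float((predict_label(X_demo, w_hat, c_hat) == y_demo).mean())
```

The soft-thresholding step is what produces exact zeros in w, so the non-zero entries of `w_hat` play the role of the selected feature variables.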
Identify stable feature variables from brain connectome edges and T1w MRI for classification analysis
For both the ADNI2 and NACC cohorts, the number of subjects is limited, especially because we need subjects to have both valid T1w MRI and dMRI available. When performing classification modeling, the dimension of the feature variables will be much larger than the sample size for both dMRI and T1w MRI. This leads to the “curse of dimensionality” problem, where our classification models overfit the training data and generalize poorly. Since not all feature variables are related to AD progression, we perform a feature variable selection procedure that ranks all the variables according to their relevance to the classification problem and include only the top-ranked feature variables in our models. There are three types of feature selection models: filter models, wrapper models, and embedded models [53]. Filter models select features based on certain criteria such as Fisher score, mutual information, etc. Their major disadvantage is that they ignore the effects of the selected feature subset on the performance of the learning algorithm. Wrapper models utilize a specific classifier to evaluate the quality of selected features. These methods test the performance of all possible subsets of the features with the classifier to determine the optimal feature subset. The computational cost of wrapper models is quite high when the feature dimension is not small. Embedded models perform feature selection during the construction of the classifier. They share the advantage of wrapper models, in that they include the interaction with the classification model, and the advantage of filter models, in that they are not computationally intensive. One example is sparse logistic regression. Through a theoretically sound convex program, it eliminates features by shrinking the coefficients of irrelevant features to 0.
However, when the noise level is high and the sample size is small relative to the feature dimension, the features selected by sparse logistic regression are unstable and vulnerable to slight perturbations of the data. Therefore, in our work, we adopt stability selection, which retains the advantages of embedded models while reducing false selections.
Given a set of regularization parameters {λ_1, λ_2, …, λ_n}, for each regularization parameter λ we obtain a set of feature variables S_λ that contribute to the final classification model in the corresponding sparse model. Stability selection is a variable selection method based on subsampling in combination with high-dimensional sparse learning algorithms. Instead of selecting one model, stability selection perturbs the data (e.g., by subsampling) many times, and we identify the feature variables that are consistently included in the model, under different values of the parameter λ, across bootstrap datasets [41]. Intuitively, feature variables selected in this way are more consistently relevant to the target problem than feature variables selected only by sparse algorithms. Stability selection works as follows: we first randomly select 50% of the training samples and apply sparse logistic regression to them with regularization parameter λ_i to build a sparse model. Let F denote the whole set of feature variables and f ∈ F denote the index of a particular feature variable in the set. The set of feature variables selected by this model, i.e., those that contribute to it, is denoted by S_λ with λ = λ_i.

Fig. 1. The pipeline of computing the stability score. Warmer colors indicate a higher probability of selection. (a) Calculating the selection probability using different regularization parameters. (b) Illustration of using the selection probability to calculate the stability score.
Then we vary the regularization parameter many times and calculate the selection probability under each of these regularization parameters. From these selection probabilities, a stability score is calculated for each feature variable f.
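A common formulation from the stability selection literature [41] is given below as an assumption about the intended definitions. The symbols γ (number of random subsamples), D_j (the j-th subsample), Π (selection probability) and 𝒮 (stability score) are our notation.

```latex
\hat{\Pi}^{\lambda}_f = \frac{1}{\gamma} \sum_{j=1}^{\gamma} \mathbb{1}\!\left[ f \in S_{\lambda}(D_j) \right],
\qquad
\mathcal{S}(f) = \max_{\lambda \in \{\lambda_1, \dots, \lambda_n\}} \hat{\Pi}^{\lambda}_f
```

Here the selection probability is the fraction of subsamples in which feature variable f receives a non-zero weight at a given λ, and the stability score takes the largest such fraction over all regularization parameters.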
With the stability scores, we can rank the variables and keep only the top k most stable ones, or those whose stability score exceeds a pre-set threshold. The computation of the stability score is shown in the lower portion of Fig. 1. After selecting feature variables by stability score, the feature dimension is drastically reduced. We use the reduced set of feature variables to build our model.
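The procedure can be sketched end to end as below. This is an illustrative sketch, not the authors' code. It assumes the stability score is the maximum selection frequency over the λ grid, uses a compact proximal gradient solver as the sparse learner, and runs far fewer subsampling rounds than the 1,000 reported in the Results so that the demo stays fast.

```python
import numpy as np


def sparse_logreg_weights(X, y, lam, n_iter=300):
    """L1-regularized logistic regression via proximal gradient; returns the weight vector."""
    n, d = X.shape
    w, c = np.zeros(d), 0.0
    step = 4.0 / (np.linalg.norm(X, 2) ** 2 / n + 1.0)
    for _ in range(n_iter):
        g = -y * 0.5 * (1.0 - np.tanh(0.5 * y * (X @ w + c)))  # per-sample loss gradient
        w = w - step * (X.T @ g) / n
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-thresholding
        c -= step * g.mean()
    return w


def stability_scores(X, y, lambdas, n_rounds=100, frac=0.5, seed=0):
    """Stability score per feature: max over lambda of the selection frequency."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = int(round(frac * n))
    counts = np.zeros((len(lambdas), d))
    for _ in range(n_rounds):
        idx = rng.choice(n, size=m, replace=False)  # random 50% subsample
        for k, lam in enumerate(lambdas):
            counts[k] += (sparse_logreg_weights(X[idx], y[idx], lam) != 0)
    selection_prob = counts / n_rounds  # shape (n_lambdas, d)
    return selection_prob.max(axis=0)   # shape (d,)


# Synthetic demo: 40 features, only features 0 and 1 carry signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 40))
y_demo = np.where(X_demo[:, 0] - X_demo[:, 1] + 0.3 * rng.normal(size=120) > 0, 1, -1)

scores = stability_scores(X_demo, y_demo, lambdas=[0.05, 0.1, 0.2], n_rounds=30)
top_k = np.argsort(-scores)[:5]             # top-k rule
above_threshold = np.flatnonzero(scores > 0.6)  # threshold rule
```

Both selection rules mentioned in the text appear at the end: `top_k` keeps a fixed number of variables, while `above_threshold` keeps every variable whose score exceeds a chosen cut-off.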
Classification analysis from T1w MRI and DTI
We use T1w MRI-derived measurements as classification variables for our base model to differentiate HC from MCI. Since the datasets are not balanced, i.e., there are more MCI than HC subjects in ADNI2 and more HC than MCI subjects in NACC, we apply sampling to avoid the bias introduced by the imbalance [54]. There are two sampling strategies: oversampling and subsampling. Our previous research has shown that subsampling is more effective than oversampling in AD-related analysis [55]. Therefore, we randomly subsample MCI for ADNI2 and HC for NACC to make sure the training data for the two groups are balanced, i.e., the numbers of HC and MCI subjects are the same in all three datasets we used. In this way, the classifier is less likely to be biased by the group size. We randomly split the subsampled data into 90% training samples and 10% testing samples. For example, in the NACC data, we have 329 HC and 57 MCI. We sample round(57 × 90%) = 51 MCI samples and the same number of HC samples without replacement as the training data for each iteration, and use the rest as the testing data for this iteration. Similarly, for ADNI2, we sample round(50 × 90%) = 45 HC samples and the same number of MCI samples as the training data for each iteration, and leave the rest as the testing data for that iteration. For the combined dataset, we simply combine the training/testing data of ADNI2 and NACC as the training/testing data. Stability selection is performed in the training phase to obtain a ranked list of T1w MRI variables and dMRI connectome edges. Sparse logistic regression is applied to build a classifier from the training samples using the top feature variables. The classifier is evaluated on the testing samples by computing the AUC. We repeat the experiments for 50 different random splits and compute the average AUC. To investigate whether information from certain dMRI edges can improve the performance of this base model, we propose to include the edges as added classification variables.
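The subsampling and splitting scheme can be sketched as follows. The reading implemented here is an assumption: the larger class is first subsampled to the size of the smaller class, and each class is then split 90/10, so that "the rest" refers to the remainder of the subsampled data. The AUC helper uses the rank-based (Mann-Whitney) definition, and all function names are illustrative.

```python
import numpy as np


def balanced_split(y, rng, train_frac=0.9):
    """Subsample the larger class to the smaller class size, then split each class.

    y: (n,) labels in {-1, +1}. Returns (train_idx, test_idx) as index arrays into y.
    """
    idx_by_class = [np.flatnonzero(y == label) for label in (-1, 1)]
    n_min = min(len(idx) for idx in idx_by_class)
    n_train = int(round(n_min * train_frac))
    train, test = [], []
    for idx in idx_by_class:
        chosen = rng.choice(idx, size=n_min, replace=False)  # shuffled, no replacement
        train.append(chosen[:n_train])
        test.append(chosen[n_train:])
    return np.concatenate(train), np.concatenate(test)


def auc_score(y_true, scores):
    """AUC: probability that a random +1 sample outscores a random -1 sample (ties count 1/2)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == -1]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# Class sizes from the NACC cohort: 329 HC (+1) and 57 MCI (-1).
rng = np.random.default_rng(0)
y_nacc = np.concatenate([np.ones(329, dtype=int), -np.ones(57, dtype=int)])
train_idx, test_idx = balanced_split(y_nacc, rng)
```

With the NACC class sizes this yields 51 training subjects per class, matching round(57 × 90%) = 51 in the text, and 6 test subjects per class. Repeating the call with a fresh random draw gives one of the 50 random splits.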
In this paper, we include the top {5, 10, …, 45, 50} dMRI edge feature variables as additional classification variables to see how the additional dMRI information affects classification. If the classification performance improves, the extra information from the included edges benefits the classification of MCI; if the performance decreases, it does not.
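The nested feature sets used in this comparison can be built as in the sketch below. The base size of 30 T1w variables is taken from the Results section, while the function name and the placeholder feature names are hypothetical.

```python
def augmented_feature_sets(t1w_ranked, dmri_ranked, n_t1w=30, dmri_counts=range(5, 55, 5)):
    """Map each k to the base T1w features plus the top-k dMRI edges.

    Both input lists are assumed to be sorted by decreasing stability score.
    """
    base = list(t1w_ranked[:n_t1w])
    return {k: base + list(dmri_ranked[:k]) for k in dmri_counts}


# Placeholder names standing in for features already ranked by stability score.
t1w_ranked = [f"t1w_{i}" for i in range(136)]
dmri_ranked = [f"edge_{i}" for i in range(6328)]
feature_sets = augmented_feature_sets(t1w_ranked, dmri_ranked)
```

Each entry of `feature_sets` would then be used to train and evaluate one classifier, so that the AUC can be plotted against the number of added dMRI edges.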
RESULTS
Stability selection
We first present the results of stability selection. The regularization parameters we use are 0.01:0.05:0.31 for all dMRI feature variable selections and 0.01:0.01:0.1 for all T1w MRI feature variable selections. We repeat sparse logistic regression 1,000 times on bootstrapped datasets to calculate the stability scores. Figure 2 shows the distribution of stability scores for T1w MRI in the three datasets. From those figures, we see that the stability scores decrease drastically at the beginning and then decrease slowly down to the minimal score. This means that only a small set of feature variables is very stable, and that a considerable number of feature variables provide a certain amount of classification information but are not so stable. There are around 30 feature variables with stability scores larger than 0.6. Hence, these 30 feature variables can be used to train a rather stable model. Some top feature variables, such as the left middle temporal thickness and volume and the right and left entorhinal volume, are also reported in many other papers [56, 57], which supports the validity of the stability selection. In Fig. 3, we present the stability selection results as brain maps.
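Assuming the grids above are written in MATLAB-style start:step:stop notation (an interpretation, since the text does not say so explicitly), they expand to 7 values for dMRI and 10 values for T1w MRI. The helper name below is illustrative.

```python
import numpy as np


def colon_range(start, step, stop):
    """Values start, start + step, ... up to and including stop (start:step:stop)."""
    n = int(np.floor((stop - start) / step + 1e-9)) + 1  # small tolerance for float error
    return start + step * np.arange(n)


dmri_lambdas = colon_range(0.01, 0.05, 0.31)  # 0.01, 0.06, ..., 0.31
t1w_lambdas = colon_range(0.01, 0.01, 0.10)   # 0.01, 0.02, ..., 0.10
```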

Fig. 2. T1w stability scores versus the number of T1w feature variables for the three datasets (ADNI2 only, NACC only, and the combined dataset).

Fig. 3. Brain surface maps showing the spatial distribution of the T1w MRI stability scores for ADNI2 only, NACC only, and the combined dataset.
Figure 4 shows the top 14 feature variables (for visualization purposes only) for each dataset. Among the top 14 feature variables, 7 of those selected on ADNI2 and 8 of those selected on NACC are also selected on the combined dataset. The top two feature variables in the combined dataset, the cortical thickness of the left rostral anterior cingulate and its gray matter volume, are selected in both individual datasets. These two feature variables also have the highest stability scores in the combined dataset. The common parts of NACC and ADNI2 are thus successfully captured by the combined dataset. This suggests that although ADNI2 and NACC are two independent cohorts, they can be successfully combined after adjusting for cohort, age, and sex effects. Since the combined dataset is larger than ADNI2 or NACC alone, models built on the combined dataset are less likely to overfit. However, one top feature variable of the combined dataset, the gray matter volume of the left-hemisphere frontal pole, is not selected in either ADNI2 or NACC.

Fig. 4. Top 14 feature variables selected by the three datasets for T1w MRI.
Figure 5 shows the stability score distribution for dMRI variables from the three datasets. The trend is similar to that of T1w MRI, with a few very stable feature variables, while most feature variables are consistently unimportant for MCI classification. The key difference between the T1w MRI and dMRI datasets is that the latter contain many more feature variables (there are 6,328 feature variables, i.e., connectome edges, in the dMRI datasets). Robust models from sparse logistic regression include only a very small number of variables, so some feature variables have a stability score of 0. These feature variables do not contribute to any sparse model, even though we repeat each model for 1,000 iterations. When the two individual cohorts are combined, the decrease is less gradual than in the individual cohorts, but it is easier to identify the top-ranked feature variables, as fewer stand out at the top. This also happens for the T1w MRI feature variables, meaning that variable selection can benefit from relating the two cohorts. We visualize the connectome edges with the top stability scores in Fig. 6.

Fig. 5. Distribution of dMRI stability scores for the three datasets (ADNI2 only, NACC only, and the combined dataset).

Fig. 6. Illustration of connectome edges with top stability scores (only the top 10 are shown here).
Classification
In this section, we present the results of classifying MCI versus HC on the three datasets. For all three datasets, we first select the top 30 T1w MRI feature variables as the base model. Then we add the top {5, 10, …, 45, 50} dMRI feature variables to the base model to see how dMRI feature variables affect the classification results. We report the average AUC for each dataset. Figure 7 shows the classification results for the three datasets. When adding dMRI feature variables to T1w MRI, the average AUC for NACC and ADNI2 increases by around 5% to 15%. The p-value for model improvement is 0.0128 for ADNI2, 1.84e-17 for NACC, and 1.28e-30 for the combined data when adding 50 dMRI feature variables. Bonferroni correction was adopted to correct for multiple comparisons (ADNI2 only, NACC only, and combined), so the corrected threshold is 0.05/3 ≈ 0.0167. All our reported p-values are less than this corrected threshold. This shows that the combination of dMRI and T1w MRI feature variables performs better than T1w MRI feature variables alone. White matter information captured by brain networks can complement the gray matter measurements from T1w MRI. For the combined dataset, the average AUC also increases, consistently with NACC and ADNI2. This again shows that the two cohorts can be fruitfully combined. We note that for the three datasets, when adding 10 dMRI feature variables to the T1w MRI feature variables, the performance almost reaches its highest point, with p-values of 0.0167 for ADNI2, 4.72e-17 for NACC, and 2.92e-25 for the combined data, which suggests that those 10 dMRI feature variables are the key feature variables for classifying MCI versus HC. Although there are 6,328 dMRI feature variables that provide a thorough description of the brain, around 10 feature variables are sufficient to capture the major differences between MCI and HC. The top 10 feature variables for the three datasets are shown in Table 3.
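The Bonferroni check for the 50-variable models amounts to the short calculation below. It uses only the p-values reported in the text, and the variable names are illustrative.

```python
alpha = 0.05
n_comparisons = 3  # ADNI2 only, NACC only, combined
threshold = alpha / n_comparisons

# p-values reported in the text for adding 50 dMRI feature variables
p_values = {"ADNI2": 0.0128, "NACC": 1.84e-17, "combined": 1.28e-30}
significant = {name: p < threshold for name, p in p_values.items()}
```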
Another observation is that the classification results on ADNI2 have higher variance than those on the other two datasets, as illustrated in Fig. 7. One possible reason for this variance is the sample size: the ADNI2 dataset is quite small compared with the other two datasets, i.e., NACC and the combined dataset. When the sample size is small and the data distribution is complicated, the training data and the testing data may differ considerably between iterations, which leads to large differences in performance across iterations. Therefore, the variance is larger for ADNI2 than for the other two datasets. We also use two other methods to classify MCI versus HC, namely SVM [58] and random forest [59]. The results are shown in Fig. 8. The pattern is similar to that obtained with logistic regression, except on the ADNI2 dataset, where the average AUC first increases and then decreases. That may be caused by overfitting, since the sample size of ADNI2 is too small. For all three datasets, we see that adding appropriate dMRI feature variables can improve classification performance, which is consistent with the results obtained with logistic regression. These results again show that white matter information together with gray matter information describes cognitive decline in more detail than gray matter information alone. To make the results more reliable, we also compare the results obtained with a different number of T1w MRI feature variables. We report the results of using the top 100 T1w MRI feature variables in Fig. 9. The results do not change much compared with those using the top 30 T1w MRI feature variables. Hence, the conclusion does not depend on the threshold for the number of T1w MRI feature variables.

Fig. 7. Classification performance for MCI versus HC using sparse logistic regression. The x-axis represents the number of dMRI feature variables added to the T1w MRI feature variables.

Fig. 8. Classification performance for HC versus MCI using SVM and random forest. The x-axis represents the number of dMRI feature variables added to the T1w MRI feature variables.

Fig. 9. Classification performance for MCI versus HC when using 100 T1w MRI feature variables to build the base model.
Table 3. Top 10 dMRI feature variables identified using the three datasets. Each feature variable represents the connection between ROI1 and ROI2.
In the last experiment, we use features selected in NACC to classify MCI and HC on ADNI2, and features selected in ADNI2 to classify MCI and HC on NACC. The results are shown in Fig. 10. For comparison, Fig. 10 also shows the results of using NACC-selected features to classify HC and MCI on NACC and ADNI2-selected features to classify HC and MCI on ADNI2. For both datasets, the features selected on the dataset itself perform better than the features selected on the other dataset. This implies that the two datasets are different and that the features selected on ADNI2/NACC are not the optimal features for NACC/ADNI2. This is consistent with our observation in Fig. 4 that the high-ranking features for ADNI2 and NACC differ.

Fig. 10. Classification results of using features selected in NACC to classify MCI and HC on ADNI2, and using features selected in ADNI2 to classify MCI and HC on NACC. For comparison, the lines with triangle markers show the results of using NACC-selected features to classify HC and MCI on NACC and ADNI2-selected features to classify HC and MCI on ADNI2.
DISCUSSION
There are prior studies of AD using variable ranking and selection techniques [42, 60–64]. However, a significant amount of that research was done only on ADNI. Few studies have cross-validated the results on independent data sets, and the clinical value of the findings is limited because the variables identified in one group may not be replicated in others. In our classification modeling process, we first ranked the feature variables of T1w MRI and dMRI using their stability scores in NACC, ADNI2, and the combined dataset. Then we selected the top feature variables as the key feature variables to differentiate MCI from HC on the three datasets. We see that some stable variables in the independent groups no longer stand out after we combine the data. Our study draws attention to the existence of data bias and shows evidence of how it could affect the conclusions drawn from an individual cohort. Algorithms that identify key predictor variables typically depend on having a reliable dataset with a large sample size, whereas small sample sizes tend to lead to results with low confidence. By harmonizing the cohorts and combining them into a single dataset, our models are inferred from data with more subjects, so the models are more reliable. In this work, we applied generalized linear regression to harmonize the two datasets. ADNI2 and NACC have many other differences, including voxel size, b-value, and the number of b0 (non-diffusion-sensitized) images. Moreover, the number of diffusion-sensitized images and other data collection parameters, such as repetition time (TR) and echo time (TE), vary between these two cohorts. Each of these differences may contribute to some degree to the differences in the results derived from the two cohorts [65–67].
In order to combine these two datasets and increase the sample size, we summarized all these differences into one cohort index indicator and adopted generalized linear regression to remove all potential confounding factors. However, Fig. 4 shows that among the top 14 features for ADNI2 and NACC, there are only two common features. This may suggest that even though generalized linear models are commonly used to remove cohort bias, they may not be enough to remove all the effects, especially when the effects are non-linear. Instead of seeking to remove cohort differences and use one model, one potential direction is to use domain adaptation [68–70], which assumes that there are distribution differences between the cohorts and learns how to align the two distributions. Domain adaptation aims to transfer models between two data sources that have different distributions. The two domains do not even need to have homogeneous features. We plan to study these approaches in our future work.
Another challenge we addressed in this study concerns model complexity, a central issue that is frequently discussed in data analysis and machine learning. One principle regarding model complexity is that a more complex model needs more training samples to prevent it from overfitting [71]. For commonly used linear models, the model complexity is proportional to the number of predictor variables. This imposed a significant challenge in our analysis: when a large number of dMRI-derived variables is added to the T1w MRI variables, the model complexity increases dramatically and the final classification performance is expected to drop, which may weaken the effects of the added variables, or even completely override any positive effects they could add. The large number of possible predictor variables in connectome networks and the limited number of subjects, even after combining cohorts, can exacerbate such effects. We therefore used stability selection to rank the predictors, which has a proven capability to control error rates from falsely selected variables [41, 42].
Our results show that the improvement in AUC from adding dMRI is more pronounced in the NACC dataset, possibly because NACC participants represent a more general clinical sample that includes vascular disease, in which white matter integrity could play an important role in distinguishing MCI from healthy controls. ADNI, on the other hand, is highly selective and excludes participants with vascular disease, so the additional improvement from adding dMRI is more limited than in the NACC dataset.
For all three datasets, the highest AUC is reached when the top 10 dMRI feature variables are added. Compared with normal aging, MCI patients usually suffer from problems with memory, language, thinking, and judgment [72]. The top 10 dMRI feature variables, listed in Table 3, are closely related to these functions. In NACC, the top-ranked feature variables concentrate on edges among regions related to object recognition (e.g., lateral occipital cortex [73], fusiform gyrus [74], and inferior temporal gyrus [75]) and language comprehension (e.g., middle and superior temporal gyrus [76]), while the top-ranked feature variables derived from the ADNI2 cohort focus on edges among subcortical regions (e.g., insula, amygdala, caudate, and putamen). Putamen abnormality has been reported in MCI patients [77], and atrophy of the insula [78], amygdala [79], and caudate [80] has been found in MCI and AD patients. The top-ranked feature variables derived from the two cohorts are not identical, which is to be expected. As shown in Table 2, there are many differences between the NACC and ADNI2 cohorts. Because our approach is data-driven, each of these differences may influence the downstream feature identification; in other words, the disease-associated feature variables identified from a single cohort (MCI-associated feature variables in this study) are also affected by cohort-specific biases. One such observation is that, in ADNI2 only, all top feature variables are in the left hemisphere. To reduce these cohort-specific biases, we combined the two cohorts and used a generalized linear regression model to remove potential confounds.
After this combination, the top-ranked feature variables cover the subcortical regions (e.g., insula, caudate, and amygdala) found when using ADNI2 only and the superior and inferior temporal gyri found when using NACC only, as well as the parahippocampal gyrus and supramarginal gyrus. Atrophy in the parahippocampal gyrus has been reported as an early biomarker of MCI and AD [81], while the authors of [82] found an increased level of connectivity at the supramarginal gyrus. We believe the feature variables from the combined dataset are less affected by noise and deserve further analysis in future studies.
The stability of selected features and models is a critical issue in data-driven approaches. To classify new samples and benefit clinical practice, we need to handle the variability across data sources and across repeats.
I. Stability among different data sources
Because the ADNI2 and NACC models are quite different, and the derived features differ as well, directly applying either model to a new sample may not be appropriate. One possible solution is to first quantify the likelihood that the new sample belongs to each dataset, i.e., to compare $P(D_{\mathrm{ADNI2}} \mid x)$ and $P(D_{\mathrm{NACC}} \mid x)$, where $x$ is the new sample and $D_{\mathrm{ADNI2}}$ and $D_{\mathrm{NACC}}$ denote the events that the sample comes from ADNI2 and NACC, respectively. Each of these posterior probabilities can be calculated by Bayes’ rule.
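Written out for the two cohorts considered here, Bayes’ rule takes the standard form below. The symbols $P(x \mid D_k)$ (the density of the sample under cohort $k$) and $P(D_k)$ (the prior probability of cohort $k$) are notation assumed for this sketch; how they would be estimated is not specified in the text.

```latex
P(D_k \mid x) =
  \frac{P(x \mid D_k)\, P(D_k)}
       {P(x \mid D_{\mathrm{ADNI2}})\, P(D_{\mathrm{ADNI2}})
        + P(x \mid D_{\mathrm{NACC}})\, P(D_{\mathrm{NACC}})},
\qquad k \in \{\mathrm{ADNI2}, \mathrm{NACC}\}
```

The two posteriors sum to one, so comparing them amounts to checking which one exceeds 0.5.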
II. Stability among different repeats
Interpretability: A consistent set of feature variables is key to gaining insight into the underlying disease pathology. To deal with possible inconsistency in the features selected and the models built in each iteration on one dataset, one common approach is to apply stability selection to the whole ADNI2/NACC data, use the selected features as the feature set, and build a classification model on it. This model can then be used to classify a new patient.
Predictive power: Because of the training/testing strategy used to evaluate machine learning algorithms and select hyper-parameters, it is quite common for the models in each repeat to have different predictive power on different samples (each model sees different training samples during training). There are two ways to reduce this variance. The first is to train the model on the entire dataset once the hyper-parameters have been identified, which yields a single prediction model with the maximum predictive power available from the data at hand. The second uses the fact that models with high variance have little estimation bias (see the bias-variance decomposition in [40]): we can form an ensemble of these models to aggregate their advantages and produce one robust predictive model.
In summary, given a new sample, we first apply the method in II to select features and build models for ADNI2 and NACC separately, using either all available data from each dataset or an ensemble of different models. We then compute the likelihood that the new sample belongs to each dataset and apply the labeling rule in I to label it.
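The two-step procedure summarized here can be sketched as follows. This is an illustration under strong assumptions, not the study's implementation: each cohort's density is modeled as a diagonal-covariance Gaussian, priors are proportional to cohort size, the sample is routed to the more probable cohort, and each cohort's ensemble averages the MCI probabilities of its per-repeat models. All function names are hypothetical.

```python
import numpy as np

def fit_diag_gaussian(X):
    """Per-feature mean and variance; a deliberately simple density model."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def log_density(x, mean, var):
    """Log of a diagonal-covariance Gaussian density evaluated at x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def cohort_posterior(x, cohort_data):
    """P(D_k | x) for each cohort by Bayes' rule, with priors proportional
    to cohort size. cohort_data maps cohort name -> (n_k, p) array."""
    names = list(cohort_data)
    total = sum(len(X) for X in cohort_data.values())
    log_joint = np.array([
        log_density(x, *fit_diag_gaussian(cohort_data[k]))
        + np.log(len(cohort_data[k]) / total)
        for k in names
    ])
    # Normalize in log space to avoid underflow with many features.
    log_joint -= log_joint.max()
    post = np.exp(log_joint)
    return dict(zip(names, post / post.sum()))

def ensemble_predict(x, models):
    """Average the MCI probabilities returned by the per-repeat models."""
    return float(np.mean([m(x) for m in models]))

def classify_new_sample(x, cohort_data, cohort_models, threshold=0.5):
    """Route x to the ensemble of the more probable cohort, then label it."""
    post = cohort_posterior(x, cohort_data)
    best = max(post, key=post.get)
    prob = ensemble_predict(x, cohort_models[best])
    return best, prob, "MCI" if prob >= threshold else "HC"
```

Here `cohort_models` maps each cohort name to a list of callables that return an MCI probability for a sample. A softer alternative would weight the two cohort ensembles by their posteriors instead of picking one.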
In our study, we combined the two imaging modalities in a simple way, i.e., by concatenating the predictors from the two modalities. We expect that more sophisticated algorithms can be designed to fuse the two modalities by exploring structures within the data, and that these may deliver better classification performance. Our study is also cross-sectional. Some studies show that progression or changes in biomarkers are more closely related to the progression of clinical outcomes [83, 84] than biomarkers measured cross-sectionally. Further work is warranted to incorporate changes in biomarkers and to develop more sophisticated algorithms for synthesizing the two modalities.
CONCLUSION
In this paper, we studied how dMRI-derived measurements help to differentiate HC subjects from those with MCI. By adding 50 dMRI feature variables to 30 T1w MRI feature variables, the AUC for classifying HC versus MCI improved significantly, which indicates that white matter information captured by brain networks complements the gray matter measurements from T1w MRI in differentiating HC from individuals with MCI. Since the sample size is limited, we selected feature variables by stability selection, which can control the error rate of false discoveries. We performed experiments on two independent datasets, ADNI2 and NACC. To cross-validate the results, we also harmonized the two datasets using generalized linear regression and ran the same experiments on the combined dataset. The classification performance on the combined dataset is similar to that on the two independent datasets, and the top feature variables selected from the three datasets overlap. The framework we used can also be applied to classify between AD and MCI or between HC and AD. We plan to do this in future work.
ACKNOWLEDGMENTS
This study is funded in part by US Office of Naval Research (N00014-14-1-0631 and N00014-17-1-2265 to JZ), National Science Foundation (IIS-1749940, IIS-1565596 and IIS-1615597 to JZ), National Institute of Biomedical Imaging and Bioengineering (U54 EB020403 to PMT), National Institute on Aging (AG11378 and AG041851 to CRJ, P30AG053760 and P30AG008017 to HD, AG056782 to LZ). We also thank NACC staff for help with the NACC data.
The NACC database is funded by NIA/NIH Grant U01 AG016976. NACC data are contributed by the NIA funded ADCs: P30 AG019610 (PI Eric Reiman, MD), P30 AG013846 (PI Neil Kowall, MD), P50 AG008702 (PI Scott Small, MD), P50 AG025688 (PI Allan Levey, MD, PhD), P50 AG047266 (PI Todd Golde, MD, PhD), P30 AG010133 (PI Andrew Saykin, PsyD), P50 AG005146 (PI Marilyn Albert, PhD), P50 AG005134 (PI Bradley Hyman, MD, PhD), P50 AG016574 (PI Ronald Petersen, MD, PhD), P50 AG005138 (PI Mary Sano, PhD), P30 AG008051 (PI Steven Ferris, PhD), P30 AG013854 (PI M. Marsel Mesulam, MD), P30 AG008017 (PI Jeffrey Kaye, MD), P30 AG010161 (PI David Bennett, MD), P50 AG047366 (PI Victor Henderson, MD, MS), P30 AG010129 (PI Charles DeCarli, MD), P50 AG016573 (PI Frank LaFerla, PhD), P50 AG016570 (PI Marie-Francoise Chesselet, MD, PhD), P50 AG005131 (PI Douglas Galasko, MD), P50 AG023501 (PI Bruce Miller, MD), P30 AG035982 (PI Russell Swerdlow, MD), P30 AG028383 (PI Linda Van Eldik, PhD), P30 AG010124 (PI John Trojanowski, MD, PhD), P50 AG005133 (PI Oscar Lopez, MD), P50 AG005142 (PI Helena Chui, MD), P30 AG012300 (PI Roger Rosenberg, MD), P50 AG005136 (PI Thomas Montine, MD, PhD), P50 AG033514 (PI Sanjay Asthana, MD, FRCP), P50 AG005681 (PI John Morris, MD), and P50 AG047270 (PI Stephen Strittmatter, MD, PhD).
The ADNI dataset collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health. The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
