Abstract
Background
Accurately differentiating stable mild cognitive impairment (sMCI) from progressive MCI (pMCI) is clinically relevant, and identification of pMCI is crucial for timely treatment before it evolves into Alzheimer's disease (AD).
Objective
To construct a convolutional neural network (CNN) model to differentiate pMCI from sMCI integrating features from structural magnetic resonance imaging (sMRI) and positron emission tomography (PET) images.
Methods
We proposed a multi-modal and multi-stage region of interest (ROI)-based fusion network (m2ROI-FN) CNN model to differentiate pMCI from sMCI, adopting a multi-stage fusion strategy to integrate deep semantic features and multiple morphological metrics derived from ROIs of sMRI and PET images. Specifically, ten AD-related ROIs from each modality were extracted as patches and fed into 3D hierarchical CNNs. The deep semantic features extracted by the CNNs were fused through the multi-modal integration module and further combined with the multiple morphological metrics extracted by FreeSurfer. Finally, a multilayer perceptron classifier was used for subject-level MCI recognition.
Results
The proposed model achieved an accuracy of 77.4% in differentiating pMCI from sMCI with 5-fold cross validation on the entire ADNI database. Further, ADNI-1&2 were combined into an independent sample for model training and validation, and ADNI-3&GO were combined into another independent sample for multi-center testing. The model achieved 73.2% accuracy in distinguishing pMCI from sMCI on ADNI-1&2 and 75% accuracy on ADNI-3&GO.
Conclusions
An effective m2ROI-FN model to distinguish pMCI from sMCI was proposed, which was capable of capturing distinctive features in ROIs of sMRI and PET images. The experimental results demonstrated that the model has the potential to differentiate pMCI from sMCI.
Keywords
Introduction
Mild cognitive impairment (MCI) is an intermediate stage between normal aging and Alzheimer's disease (AD). The prevalence of MCI in adults older than 60 is approximately 6.7% to 25.2%, 1 and MCI carries a high risk of progression to clinically probable AD without early intervention. 2 MCI can be further divided into stable MCI (sMCI) and progressive MCI (pMCI). MCI that transitions to AD during the follow-up period (typically 36 months) is classified as pMCI, while MCI that does not transition to AD during the follow-up period is classified as sMCI.3,4 Differentiating sMCI from pMCI is of great significance for the early diagnosis and treatment of AD. At present, the conventional imaging-based MCI recognition methods mainly include radiomics-based models and convolutional neural network (CNN)-based models.
The radiomics-based models generally adopt morphological metrics and texture features from selected dementia-related regions of interest (ROIs) for MCI recognition, which are subsequently modelled by machine learning classifiers. Several studies demonstrated that morphological features of specific anatomical regions may have the potential to identify MCI and AD. Frisoni et al. 5 found that atrophy of medial temporal structures was a feasible diagnostic biomarker for MCI recognition and that the ratio of hippocampus to whole brain could quantify the progression of disease. Gupta et al. 6 found that the morphological characteristics of the hippocampus, amygdala, and entorhinal cortex could serve as effective biomarkers to recognize MCI and AD. Abrol et al. 7 concluded that the hippocampus and amygdala subcortical regions in the medial temporal lobe were the most informative regions for early-stage MCI recognition. Schmitter et al. 8 extracted gray matter volume (GMV) from 9 ROIs for MCI recognition and achieved an accuracy of 71% for MCI versus normal controls (NC). Ma et al. 9 integrated GMV, Jacobian determinant, cortical thickness, sulcus depth, gyrification index, and fractal dimension to discriminate MCI from NC with random forest and obtained an accuracy of 80% for MCI patients. Long et al. 10 combined GMV, white matter volume and cerebrospinal fluid volume to distinguish MCI from NC with a support vector machine and obtained a best accuracy of 92% for MCI identification.
Although these radiomics-based models achieved encouraging results, they usually adopted predefined features for MCI identification, and these selected features may vary across populations. In contrast, CNN-based models can extract deep semantic features from images in an end-to-end manner, which has been demonstrated to be a powerful tool for MCI recognition. Payan et al. 11 constructed a 3D convolutional neural network to recognize MCI and AD, and achieved an accuracy of 92.11% for MCI versus NC. Huang et al. 12 constructed a multi-modal 3D convolutional neural network to identify MCI, integrating two imaging modalities, structural magnetic resonance imaging (sMRI) and 18-Fluoro-DeoxyGlucose positron emission tomography (FDG-PET). They obtained 326 cases of pMCI and 441 cases of sMCI from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, achieving an identification accuracy of 72.2% between sMCI and pMCI. Liu et al. 13 proposed multi-modality cascaded CNNs to differentiate sMCI from pMCI. They selected 76 pMCI and 128 sMCI subjects with PET and sMRI and achieved an accuracy of 82.9% for pMCI versus NC. However, the existing CNN-based models for MCI diagnosis have some limitations. Firstly, some models adopted whole brain images as input for model training, which would introduce potential bias from unrelated brain regions. Secondly, some models utilized only single-modality MRI images, neglecting the informative features from multiple modalities. Thirdly, some models used a simple multilayer perceptron (MLP) layer to fuse deep semantic features from multi-modality images, which would lose the complementary information between modalities. Finally, most CNN-based models did not include morphological metrics, and may therefore lose some of the most informative features for MCI recognition.
In this paper, we proposed a multi-modal and multi-stage ROI-based fusion network (m2ROI-FN) CNN model to differentiate pMCI from sMCI. In this model, the ROI-based hierarchical CNN (ROIH-CNN) was applied to two neuroimaging modalities, sMRI and PET, and the extracted deep semantic features and multiple morphological metrics derived from the two modalities were integrated for final MCI recognition. In particular, a series of cascaded 3D CNNs was applied to extract low-level deep semantic features, and the specially designed multi-modality integration module was used to refine the low-level features from structural MRI and PET into high-level deep semantic features. Finally, an ROI-based MLP classifier was built to fuse the high-level features into global features, and a softmax layer was adopted to generate a subject-level recognition outcome.
Methods
Data collection
Subjects in this study were selected from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). ADNI is a longitudinal multi-center study that aims to explore clinical, imaging, genetic, and biochemical biomarkers for the early detection and tracking of AD. 14 The MCI subjects were further split into two groups, pMCI and sMCI. An MCI patient was labelled pMCI if they converted to AD within 36 months after baseline, and sMCI if they did not progress to AD within 36 months of the baseline visit. In this study, we collected all patients with both T1-weighted structural MRI (volumetric 3D Magnetization Prepared-Rapid Gradient Echo imaging) and FDG-PET images. More details about structural MRI and FDG-PET can be found on the ADNI official website. In total, 1267 subjects were included, comprising 1052 subjects in ADNI-1&2 (sMCI 229, pMCI 153, NC 427, AD 243) and 215 subjects in ADNI-3&GO (sMCI 117, pMCI 27, NC 8, AD 63). The demographic details of the subjects are shown in Table 1.
Demographic information of the sample used (ADNI-1&2, ADNI-3&GO).
Subject age is reported as mean ± standard deviation. The gender is reported as Male/Female. MMSE: Mini-Mental State Examination.
Data processing
T1-weighted MRI images and FDG-PET images were preprocessed with FreeSurfer software. 15 For the T1-weighted MRI images, the ‘recon-all’ processing stream with default parameters was used for skull removal, registration of the images to the MNI standard brain template (MNI305, voxel size: 1*1*1), cortical and subcortical surface reconstruction, and volumetric reconstruction. For the FDG-PET images, the ‘Bet’ command was used for skull removal, and the ‘mri_coreg’ command was used to linearly register the images to the corresponding T1-weighted MRI images (Degrees of Freedom, -dof12). For dynamic FDG-PET images (PET data with multiple frames), all time frames were averaged with the ‘mri_concat’ command. After image preprocessing, Z-score intensity normalization was applied to all voxels.
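The Z-score intensity normalization mentioned above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the pipeline used in the study: the function name `zscore_normalize` is hypothetical, the image is treated as a flat list of voxel intensities, and the statistics are computed over all voxels of one image (the text does not say whether background voxels were excluded).

```python
from statistics import fmean, pstdev


def zscore_normalize(voxels):
    """Rescale intensities to zero mean and unit (population) standard deviation."""
    mean = fmean(voxels)
    std = pstdev(voxels, mu=mean)
    if std == 0:
        # A constant image has no contrast; map it to all zeros (assumed convention).
        return [0.0 for _ in voxels]
    return [(v - mean) / std for v in voxels]


# Toy flattened image with made-up intensities.
image = [10.0, 12.0, 14.0, 16.0, 18.0]
normalized = zscore_normalize(image)
```

After normalization, intensities from different scanners or tracers are on a comparable scale, which is the usual motivation for applying this step before CNN training.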
Morphological metrics
In this study, based on T1-weighted MRI images, five morphological metrics from three cerebral cortex regions and two metrics from two subcortex regions were extracted by FreeSurfer software (version 7.1). Specifically, thickness average (TA), surface area (SA), Gaussian curvature (GC), GMV, and local gyrification index (LGI) were computed from the entorhinal cortex, insula and middle temporal gyrus, while volume in cubic millimeters (Vmm3) and Intensity normMean (IM) were computed from the hippocampus and amygdala. The entorhinal cortex, insula and middle temporal gyrus were segmented according to the subject-level Automatic Parcellation of Cortical (Aparc) atlas, 16 and the hippocampus and amygdala were segmented according to the subject-level Automatic Subcortical Segmentation (Aseg) atlas. To eliminate the influence of intracranial volume (ICV) on Vmm3, we adjusted the Vmm3 metric according to ICV to accommodate individual differences. The adjusted Vmm3 metric can be formulated as:
Anatomical region, morphological characteristics and Patch size corresponding to different ROI.
TA: thickness average; SA: surface area; GC: Gaussian curvature; GMV: gray matter volume; LGI: local gyrification index; Vmm3: volume in cubic millimeters; IM: intensity mean.
m2ROI-FN model
Our m2ROI-FN model is illustrated in Figure 1(A). It is composed of four components, namely ROI-based CNNs for PET images, ROI-based CNNs for MRI images, a multi-modality integration (MMI) module, and an ROI-based MLP classifier. The ROI-based CNNs are used to generate ROI-based low-level semantic features. Each ROI-based CNN consists of a base model and an ROI Convolutional (ROI-Conv) module. The base model consists of four sequential convolutional blocks, each including 3D convolution, batch normalization, ReLU and max-pooling layers, as shown in the blue box in Figure 1(B). The base model structures of different ROIs are completely identical, but the parameters are not shared. The ROI-Conv module receives outputs from the base model to generate ROI-based low-level semantic features, as shown in the green box in Figure 1(B). The module structures corresponding to different ROIs are shown in Table 3.

Our multi-modal and multi-stage region of interest (ROI) based fusion network (m2ROI-FN). Part A denotes the framework of our model, and part B denotes the 3D CNNs, and part C denotes the multi-modality integration (MMI) module. TA: thickness average; SA: surface area; GC: Gaussian curvature; GMV: gray matter volume; LGI: local gyrification index; Vmm3: volume in cubic millimeters; IM: intensity mean.
ROI-Conv module structures.
The Conv3D is utilized to extract local features, and the 3D convolutional operation is expressed as:
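For orientation, a 3D convolution layer is commonly written in the standard form below. The notation is an assumption introduced here for illustration and is not taken from the original formula: x_c is the c-th input feature map, w_{o,c} the kernel linking input channel c to output channel o, b_o the bias, and σ the activation function, with stride 1 and no padding assumed.

```latex
y_{o}(i,j,k) = \sigma\!\left( b_{o} + \sum_{c=1}^{C} \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} \sum_{r=0}^{R-1} w_{o,c}(p,q,r)\, x_{c}(i+p,\; j+q,\; k+r) \right)
```

Here C is the number of input channels and P × Q × R is the kernel size. As in most deep learning frameworks, this form is strictly a cross-correlation, since the kernel is not flipped.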
The specially designed MMI module shown by the part C in Figure 1 was adopted to incorporate the semantic features from the multi-modality images of the same ROI, generating local-stage multi-modality features. The multi-modality integration module consists of sequential fully connected layers with the number of neurons [128, 64], in which 128 stands for 64 MRI features plus 64 PET features. By the usage of integration module, the distinct information between different modality features would be preserved.
Finally, an ROI-based MLP classifier was used for MCI recognition, which consists of fully connected layers and a softmax layer. The fully connected layers are used to concatenate the local-stage multi-modality features with multiple morphological metrics to generate global-stage deep semantic features, and the softmax layer is adopted for MCI and AD recognition. This multi-stage fusion architecture is designed based on the ‘late-fusion’ method. Compared with ‘early-fusion’ methods, which integrate shallow features such as edges or textures, the proposed ROI-based MLP uses the ‘late-fusion’ pattern to capture global task-related features from the local stage to the global stage. The fully connected layers have [678, 100, 2] neurons, where 678 is 64 local features multiplied by 10 ROIs, plus 30 cortical (TA, SA, GC, GMV, LGI from left and right entorhinal cortex, insula and middle temporal gyrus) and 8 subcortical (Vmm3, IM from left and right hippocampus and amygdala) morphological metrics. In this manner, the distinct information from different ROIs can be aggregated to classify MCI and AD patients.
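The input width of 678 follows directly from the counts given above. The short bookkeeping sketch below reproduces the arithmetic; the variable names are illustrative and are not taken from the authors' code.

```python
# Deep semantic features: one 64-dimensional vector per ROI after the MMI module.
N_ROIS = 10                      # 5 anatomical regions x 2 hemispheres
LOCAL_FEATURES_PER_ROI = 64

HEMISPHERES = ["left", "right"]
CORTICAL_REGIONS = ["entorhinal cortex", "insula", "middle temporal gyrus"]
CORTICAL_METRICS = ["TA", "SA", "GC", "GMV", "LGI"]
SUBCORTICAL_REGIONS = ["hippocampus", "amygdala"]
SUBCORTICAL_METRICS = ["Vmm3", "IM"]

deep_features = N_ROIS * LOCAL_FEATURES_PER_ROI
cortical_features = len(HEMISPHERES) * len(CORTICAL_REGIONS) * len(CORTICAL_METRICS)
subcortical_features = len(HEMISPHERES) * len(SUBCORTICAL_REGIONS) * len(SUBCORTICAL_METRICS)

mlp_input_width = deep_features + cortical_features + subcortical_features
print(deep_features, cortical_features, subcortical_features, mlp_input_width)
```

The three partial counts are 640, 30 and 8, which sum to the 678 input neurons of the first fully connected layer.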
Model performance comparison
In this paper, five experiments were designed to evaluate performance of our m2ROI-FN model on the sMCI versus pMCI tasks. The main evaluation metrics included accuracy (ACC), sensitivity (SEN), specificity (SPE), and area under the receiver operating characteristic curve (AUC). The ablation experiments were shown in Table 4.
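As a reminder of how these four metrics are defined, the sketch below computes them for a toy set of predictions. It is a minimal illustration under assumptions: pMCI is taken as the positive class (label 1), the decision threshold of 0.5 is arbitrary, the scores are made up, and AUC is computed with the rank-based (Mann-Whitney) formulation rather than by integrating an ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Return ACC, SEN and SPE for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "ACC": (tp + tn) / len(y_true),
        "SEN": tp / (tp + fn),   # true positive rate
        "SPE": tn / (tn + fp),   # true negative rate
    }


def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 3 positives, 5 negatives, made-up classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
metrics["AUC"] = auc(y_true, scores)
```

With imbalanced groups such as sMCI and pMCI, accuracy alone can be misleading, which is why sensitivity, specificity and AUC are reported alongside it.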
Ablation experiments of M2ROI-FN model.
Hipp: hippocampus; Amy: amygdala; Ins: insula; MTG: middle temporal gyrus; EC: entorhinal cortex; LH: left brain hemisphere; RH: right brain hemisphere. Masked denotes patch masked operation, and Mor denotes the morphological metrics.
To evaluate whether the m2ROI-FN model could effectively fuse the deep semantic features from different modalities and different ROIs, we first compared its performance with a concatenation-based CNN (Cat CNN) model that uses a simple fully connected layer to combine the learned features from multi-modality images (https://github.com/czp19940707/Cat-CNN). Notably, whole brain images were used to train the Cat CNN model. The Cat CNN can be regarded as a simplified version of B2, 12 whose backbone consists of 5 sequential convolutional blocks, a structure also used elsewhere. 17 Identical hyperparameter settings (e.g., learning rate, optimizer, loss function and epochs) and identical subjects were adopted for both our proposed model and the Cat CNN.
The second experiment was designed to verify the influence of different ROI combinations. Firstly, m2ROI-FN models were trained with different single ROI regions as inputs, including bilateral hippocampus (V1), bilateral entorhinal cortices (V2), bilateral amygdalae (V3), bilateral middle temporal gyri (V4), and bilateral insulae (V5). Secondly, m2ROI-FN models were trained with different ROI combinations as inputs, including 4-ROIs (bilateral hippocampus and amygdala–V6), 6-ROIs (bilateral hippocampus, amygdala, and insula–V7), and 8-ROIs (bilateral hippocampus, amygdala, insula, and middle temporal gyrus–V8). Finally, the ROIs in the left and right brain hemispheres were used separately as inputs to train the m2ROI-FN model (right brain hemisphere–V9, left brain hemisphere–V10).
The third experiment was designed to compare model performance between multi-modal fusion and single modality. The m2ROI-FN model was trained on sMRI alone (V11), PET alone (V12) and MRI + PET (V13) in identical subjects.
The fourth experiment was designed to verify the contribution of morphological metrics to recognition. Two m2ROI-FN models were created: (V13) the m2ROI-FN model with no morphological metrics (the V13 in experiments 3 and 4 is identical), and (V15) the m2ROI-FN model with all morphological metrics.
The fifth experiment was designed to ascertain the influence of adjacent voxels in the patch. A masking step was applied to the ROI patches by setting the intensity of voxels inside the patch but outside the ROI to 0 based on the Aparc + Aseg atlas, while the intensity values inside the ROI remained unchanged. (V14) is the m2ROI-FN model trained with masked patches, and (V15) is the m2ROI-FN model trained with unmasked patches (the V15 in experiments 4 and 5 is identical).
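The patch masking step can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' implementation: the function name `mask_patch` is hypothetical, the patch and label volumes are tiny nested lists with made-up intensities, and 17 and 18 are used as the Left-Hippocampus and Left-Amygdala labels from FreeSurfer's default color lookup table.

```python
def mask_patch(patch, labels, roi_labels):
    """Zero every voxel whose atlas label is not in roi_labels; keep the rest."""
    return [
        [
            [value if label in roi_labels else 0.0
             for value, label in zip(value_row, label_row)]
            for value_row, label_row in zip(value_slice, label_slice)
        ]
        for value_slice, label_slice in zip(patch, labels)
    ]


# Toy 1 x 2 x 3 patch and the atlas labels of the same voxels.
patch = [[[1.5, 2.0, 0.7],
          [3.1, 0.2, 4.4]]]
labels = [[[17, 17, 0],
           [18, 17, 0]]]

# Keep only hippocampus voxels (label 17); amygdala and background become 0.
masked = mask_patch(patch, labels, roi_labels={17})
```

In practice the same operation is a single element-wise multiplication of the patch by a binary mask array; the nested-list form is used here only to keep the example self-contained.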
Finally, the m2ROI-FN model was compared with some relevant popular models, which included Liu's combination of ICA and Cox model (ICA-Cox), 18 Suk's hierarchical feature representation and multi-modal fusion model (HFR-MF), 19 Huang's multi-modality 3D convolutional model (B2), 12 Zhu's relational regularization feature selection model (RRF), 20 Hao's multi-modal neuroimaging feature selection with consistent metric constraint model (MMCC). 21
Experiment
For the ADNI dataset, two experiments were designed to distinguish sMCI from pMCI using the constructed model. The ADNI dataset consists of the relatively independent ADNI-1, ADNI-2, ADNI-3, and ADNI-GO datasets. In the first experiment, the complete ADNI dataset was used for model training, and 20% of the samples were randomly selected for model validation. In the second experiment, ADNI-1 and ADNI-2 were combined into an independent sample for model training, with 20% of the ADNI-1&2 dataset randomly selected for validation, and ADNI-3 and ADNI-GO were combined into another independent sample for multi-center testing. The cross-validation predictions were aggregated into overall predictions for the entire dataset (ADNI and ADNI-1&2 columns in Table 5). This ensured that each prediction was made for a sample that was not used in training, providing a reliable estimate of the model's generalization performance. Similarly, the aggregated ROC curve was plotted using the overall predictions from the cross-validation process (see Figures 2 and 3).
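The idea of aggregating out-of-fold predictions into overall predictions can be sketched as below. This is an illustration under assumptions, not the authors' code: the helper names are hypothetical, the split is a plain shuffled 5-fold split without stratification, and a majority-class predictor stands in for the network.

```python
import random
from collections import Counter


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def overall_predictions(n_samples, folds, fit_and_predict):
    """Collect exactly one out-of-fold prediction per sample."""
    overall = [None] * n_samples
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(n_samples) if i not in held_out_set]
        for i, pred in zip(held_out, fit_and_predict(train, held_out)):
            overall[i] = pred
    return overall


# Toy labels and a stand-in model that predicts the majority training label.
labels = [0] * 12 + [1] * 8


def majority_model(train, held_out):
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return [majority for _ in held_out]


folds = kfold_indices(len(labels))
overall = overall_predictions(len(labels), folds, majority_model)
```

Because every sample is held out exactly once, metrics and ROC curves computed on `overall` use only predictions from models that never saw the corresponding sample during training.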

Aggregated ROC curves of m2ROI-FN model with different inputs. V1-15 have completely identical definitions to Table 4.

Without morphological metrics and patch masked operation, aggregated ROC curves of m2ROI-FN model and Cat CNN model.
Classification performance of m2ROI-FN for ablation experiments. ADNI column represents m2ROI-FN model trained on complete ADNI with 5-fold cross validation, ADNI-1&2 column represents model trained on ADNI-1&2 with 5-fold cross validation, ADNI-3&GO column represents model trained on ADNI-1&2 and tested on ADNI-3&GO. The values present in ADNI and ADNI-1&2 columns are overall predictions.
The optimal result is represented in bold.
To tackle the sample imbalance problem, we adopted an oversampling strategy that gives more weight to the group with the smaller sample size (e.g., pMCI). The oversampling strategy was implemented with the PyTorch WeightedRandomSampler. Our model was implemented in Python (version 3.8) with the PyTorch (version 1.9.0) package, and the computing unit contained a single GPU (NVIDIA RTX 3090 24GB). All the trainable parameters were randomly initialized from a standard normal distribution and optimized using standard back-propagation with stochastic gradient descent by minimizing the cross-entropy loss. The loss can be formulated as:
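The oversampling and loss described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: `random.choices` is used as a stand-in for PyTorch's WeightedRandomSampler, inverse-class-frequency weights are assumed, and the loss is assumed to be the standard mean cross-entropy over softmax outputs. The class counts are the pooled sMCI and pMCI totals from the data collection section (229 + 117 and 153 + 27).

```python
import math
import random
from collections import Counter


def sample_weights(labels):
    """Weight each sample by the inverse of its class frequency."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def oversample(labels, n_draws, seed=0):
    """Draw sample indices with replacement so that classes are balanced on average."""
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=sample_weights(labels), k=n_draws)


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


# 0 = sMCI (346 subjects), 1 = pMCI (180 subjects).
labels = [0] * 346 + [1] * 180
draws = oversample(labels, n_draws=10000)
pmci_fraction = sum(labels[i] for i in draws) / len(draws)

# A classifier that outputs 50/50 for every subject has loss log(2).
uninformed_loss = cross_entropy([[0.5, 0.5], [0.5, 0.5]], [0, 1])
```

Without reweighting, pMCI would make up roughly a third of each batch; with inverse-frequency weights, the expected share rises to about one half.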

Loss curves of M2ROI-FN. The blue curve represents the training loss, the yellow curve represents the validation loss, the red curves represent the loss curves smoothed with a window size of 8, and the gray transparent area indicates the standard deviation range of the 5-fold cross-validation loss.
Where
Results
The classification performance of the m2ROI-FN model under different ROI inputs is shown in Table 5, and the aggregated ROC curves are shown in Figure 2. For the same ROI in both hemispheres of the brain, the m2ROI-FN model with amygdala as ROI input acquired the best ACC and AUC (5-fold cross validation achieved ACC 72.8% and AUC 72.7% on the complete ADNI dataset, and ACC 72.1% and AUC 73.6% on ADNI-1&2; multicenter testing achieved ACC 72.7% and AUC 65.2% on ADNI-3&GO; see V2 in Table 5). Compared with 4-ROIs and 6-ROIs, m2ROI-FN with 8-ROIs inputs achieved the best AUC (5-fold cross validation achieved AUC 75% on the complete ADNI dataset and AUC 74.3% on ADNI-1&2; multicenter testing achieved ACC 71.7% and AUC 67.2% on ADNI-3&GO; see V8 in Table 5).
Compared with right brain hemisphere ROIs, left brain hemisphere ROIs achieved better performance in differentiating sMCI from pMCI (5-fold cross validation achieved ACC 75.5%, SEN 89.9%, SPE 51.1% and AUC 74.5% on the complete ADNI dataset, and ACC 71.2%, SEN 86%, SPE 51.7% and AUC 72.6% on ADNI-1&2; multicenter testing achieved ACC 75.3%, SPE 86.9% and AUC 66% on ADNI-3&GO; see V10 in Table 5). In the single-modality recognition task, the m2ROI-FN model achieved comparable accuracy with sMRI or PET, with the latter achieving better performance (5-fold cross validation achieved better ACC 74.1%, SPE 58%, and AUC 73.7% on the complete ADNI dataset, and better SEN 83.5% and AUC 72.3% on the ADNI-1&2 dataset; multicenter testing achieved better ACC 72.4%, SPE 54.4% and AUC 71.6% on ADNI-3&GO; see V12 in Table 5). The multi-modal results achieved better MCI recognition performance than the single-modality results (5-fold cross validation achieved better ACC 72.4% and AUC 73.1% on the ADNI-1&2 dataset; multicenter testing achieved better SEN 80.6% on ADNI-3&GO; see V13 in Table 5). Regarding the influence of the patch masking operation, our model with masked patches achieved the best ACC 77.4%, SPE 62.5%, and AUC 80.14% on the complete ADNI dataset, the best ACC 73.2% and AUC 73.9% on ADNI-1&2, and the best AUC 72.6% on ADNI-3&GO (see V14 in Table 5).
Table 6 shows the performance comparison of our m2ROI-FN model with the Cat CNN model, and the corresponding aggregated ROC curves are shown in Figure 3. For a fair comparison, neither the m2ROI-FN model nor the Cat CNN model used morphological metrics or the patch masking operation. Our model obtained more stable MCI recognition capabilities (5-fold cross validation achieved better SPE 55.1% and AUC 73.6% on the complete ADNI dataset, and better ACC 71.4%, SEN 88.2%, SPE 64.2% and AUC 68.9% on ADNI-1&2; multicenter testing achieved better ACC 71.2%, SEN 80.6% and AUC 65.6% on ADNI-3&GO).
Classification performance between our m2ROI-FN model and cat CNN model. The definitions of ADNI, ADNI-1&2, and ADNI-3&GO column are completely identical to those in Table 5. The values present in ADNI and ADNI-1&2 columns are overall predictions.
The optimal result is represented in bold.
Although it is challenging to make a fair comparison between the proposed model and the reported models due to the different ways of using the ADNI dataset (i.e., the definition of sMCI/pMCI, different enrolled subjects, different splits of training/validation/testing sets), we roughly compared the sMCI versus pMCI recognition performance of the m2ROI-FN model with a series of models that adopted sMRI and PET data. All models used 5-fold cross-validation on the complete ADNI dataset, as shown in Table 7. Although Suk et al.'s 19 HFR-MF and Zhu et al.'s 20 RRF achieved better specificity for MCI recognition (RRF: SPE 94.6%, HFR-MF: SPE 95.2%), our m2ROI-FN achieved the best sensitivity (ACC 82.9%, SEN 80.3%, AUC 87.6%) with a larger sample size.
Classification performance among our m2ROI-FN and several reported models. The values present in ADNI and ADNI-2 columns are overall predictions.
The optimal result is represented in bold. The numbers within the brackets in the “Subjects” column represent the count of subjects with sMCI and pMCI.
Discussion
In this paper, a multi-modal and multi-stage ROI-based fusion network (m2ROI-FN) CNN model to differentiate pMCI from sMCI was proposed by integrating deep semantic features and multiple morphological metrics from two neuroimaging modalities, sMRI and PET. Our proposed fusion network adopted five dementia-related ROIs as the patches for model training, including the entorhinal cortex, amygdala, hippocampus, middle temporal gyrus, and insula. Unlike whole-brain modeling, ROI-based CNN modeling can exclude irrelevant brain regions and reduce redundant information and computational effort. The specially designed MMI module incorporates the low-level features from each modality and generates the high-level local features, while retaining the complementary information of the two modalities. Morphological features and deep semantic features are complementary in the recognition of pMCI and sMCI, and integrating the two types of features can further improve classification performance. The multiple morphological features derived from sMRI images mainly included cortical volume, voxel intensity mean, surface area, cortical thickness average and standard deviation. Finally, an ROI-based MLP classifier was created to fuse the high-level local features from different ROIs into global features for MCI recognition.
Our experiments indicated that adopting AD-related ROIs could improve the accuracy of MCI recognition. We compared the proposed m2ROI-FN CNN with the Cat CNN, which adopted the whole brain images for model training. Our m2ROI-FN achieved better recognition results than Cat CNN, as shown in Table 6. Choosing ROI over whole brain is similar to performing preliminary feature selection, reducing computational complexity and redundant features in unrelated brain regions, resulting in more accurate recognition results.
The feature fusion of the two modal images is helpful to improve the accuracy of MCI classification. We computed the recognition performance of our m2ROI-FN on sMRI, PET and sMRI + PET respectively. It was obvious that combining sMRI and PET has better specificity in distinguishing sMCI from pMCI, and better performance in distinguishing AD from NC, as shown in Table 5 (see V11, V12, and V13). To some extent, this indicates that the MMI module in our model could capture the respective features of sMRI and PET images well. Moreover, the classification performance of PET image is better than that of sMRI image on complete ADNI dataset and multicenter testing, which is consistent with previous research results (Tong et al. 24 , Liu et al. 13 , Huang et al. 12 ). The main reason for this may be that PET can reveal changes in brain function that predate morphological changes.
The integration of deep semantic features and morphological features is helpful to improve the accuracy of MCI classification. We employed the deep semantic features and the combination of deep semantic features and morphological features as inputs to calculate the recognition performance of the m2ROI-FN model, respectively. Among them, m2ROI-FN obtained relatively reliable classification results when combined with morphological feature input, as shown in Table 5 (V13 and V15). Morphological measures can improve the accuracy of MCI and AD recognition. The main reason is that MCI and AD both yield different degrees of morphological feature alterations, and morphological measures extracted based on radiomic analysis provide contributions to the recognition of MCI and AD that are different from deep semantic features.
Applying a mask operation to ROI patches has a significant effect on MCI classification. By masking the ROI based on a brain atlas, redundant information can be further filtered out. After masking ROI patches, MCI recognition performance is improved, as shown in Table 5 (V13 and V14). The main reasons may be twofold. First, masking ROI patches enables the model to focus on the ROI and exclude non-ROI regions that are irrelevant or disruptive to the diagnosis, thereby enhancing features that are highly correlated with MCI and AD diagnosis. Second, setting the voxel value of the non-ROI region to zero can reduce the computational complexity of the model to a certain extent and help to further improve the performance of the model.
Neurodegenerative diseases are usually caused by structural and functional changes in multiple regions of the brain. Capturing and integrating the changing features of multiple brain regions can improve the recognition accuracy of MCI. As shown in Table 5, when the m2ROI-FN model applied each single ROI as input (2-ROI), the recognition performance of the model was limited, among which the recognition performance of hippocampus and amygdala was superior to other ROIs. As the number of input ROIs increases, as shown in Table 5 (V6 to V8 and V13), the recognition performance of the model gradually improves. When all ROIs are selected as model inputs (V13, all-ROI), m2ROI-FN has the best specificity in classification of pMCI versus sMCI and the best performance in classification of AD versus NC. The experimental results show that the feature information of different ROI is complementary, and the integration of these features can improve the recognition performance. Furthermore, although the left and right cerebral hemispheres have similar anatomical structures, the m2ROI-FN model applying the ROI of the left cerebral hemisphere has a higher MCI classification performance than the ROI of the right cerebral hemisphere, as shown in Table 5 (V9 and V10), indicating asymmetric vulnerability of brain morphology and function in patients with MCI.
At present, how to accurately differentiate pMCI from sMCI remains a challenge. The main existing models for MCI recognition include ICA-COX (Liu et al. 18 ), HFR-MF (Suk et al. 19 ), B2 (Huang et al. 12 ), RRF (Zhu et al. 20 ), and MMC-CNN (Liu et al. 13 ). Compared with these models, our m2ROI-FN model achieved relatively good recognition performance in differentiating pMCI from sMCI. Specifically, the B2 model incorporated deep semantic features of multi-modal whole brain images through a simple MLP layer, while our m2ROI-FN model extracts more specific ROI-level features and integrates local-to-global deep semantic features and morphological features. In the MMC-CNN model, the whole brain was evenly divided into 3*3*3 partitions as patches, and a small convolutional layer was designed to fuse multi-modal information. In contrast, our m2ROI-FN model adopted AD-associated ROI patches and incorporated morphological features. In addition, our model was designed with a fully connected layer to integrate multi-modal information instead of a small convolutional layer, which has relatively better representation ability than a small convolutional layer. Compared with the HFR-MF and RRF models, the m2ROI-FN model was relatively less specific in distinguishing pMCI from sMCI, but the HFR-MF and RRF models employed relatively few training samples. In contrast, the m2ROI-FN model utilized more training samples, which made it more robust and reliable.
Although our m2ROI-FN model relatively improves pMCI versus sMCI classification performance, it still has some limitations that need to be improved. Firstly, the ROI mask operation filtered out redundant information and improved the recognition accuracy of the M2ROI-FN. However, FreeSurfer may have a certain bias when segmenting the ROI mask in the central brain area, and inaccurate masks may lead to the loss of valuable features and the introduction of noise. Secondly, we only validated and tested the performance of our model using ADNI, and cross-dataset validation will be used in the future to further verify the robustness and generalization of our model. Thirdly, although multi-modal approaches provide more information, PET is a rare and expensive data modality, which limits the scalability of the method to some extent. Finally, due to data drift between different phases of ADNI, the performance of the M2ROI-FN model declined during testing. Expanding the data and applying data augmentation will be used to further mitigate the issue of data drift.
Conclusion
In this paper, we proposed a multi-modal and multi-stage ROI-based fusion network (m2ROI-FN) CNN model to differentiate pMCI from sMCI. Our model achieved an accuracy of 77.4% in differentiating pMCI from sMCI with 5-fold cross validation on the complete ADNI dataset, 73.2% on ADNI-1&2 and 75% in multicenter testing on ADNI-3&GO. The m2ROI-FN CNN model is capable of capturing distinctive features in ROIs of sMRI and PET images. Different ROIs contribute differently to distinguishing pMCI from sMCI, with the hippocampus and amygdala contributing the most. Mask operations and morphological metrics can effectively improve model recognition performance. By comparing the m2ROI-FN model with different parameter settings, the proposed model shows potential in the challenging task of differentiating pMCI from sMCI.
Supplemental Material
Supplemental material, sj-docx-1-alz-10.1177_13872877241295287 for A multi-modal and multi-stage region of interest-based fusion network convolutional neural network model to differentiate progressive mild cognitive impairment from stable mild cognitive impairment by Zhenpeng Chen, Beier Qi, Bin Jing, Ruijuan Dong, Rong Chen, Pujie Feng, Yilu Shou and Haiyun Li in Journal of Alzheimer's Disease
Footnotes
Acknowledgements
The authors have no acknowledgments to report.
Author contributions
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by Beijing Natural Science Foundation no. L192044; The open fund project of Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application (2023KF05); and sponsored by Beijing Nova Program (20220484211).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The data used in this study were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu). The ADNI data are publicly available to researchers upon application and approval by the ADNI Data Sharing and Publications Committee. Detailed instructions for accessing the ADNI data can be found on the ADNI website.
Supplemental material
Supplemental material for this article is available online.
References
