Deep learning based screening framework for early Alzheimer's disease detection in MRI

Abstract

Women are more severely impacted by Alzheimer's disease (AD), a progressive neurodegenerative illness, particularly in the age range of 45 to 60 years, when early signs are often incorrectly diagnosed. To improve the quality of life and slow the disease progression, it is crucial to detect the condition at this early stage. This research proposes a hybrid screening framework that integrates symptom level awareness, gathered through a structured survey, with deep learning based Magnetic Resonance Imaging (MRI) assessment. A curated dataset of 8511 preprocessed axial brain MRI image slices sourced from Kaggle's OASIS Database was used, categorized into Cognitively Normal, Moderate Impairment, and Mild Cognitive Impairment classes. Features were extracted using three pretrained convolutional neural networks (AlexNet, ResNet50, and DenseNet121), and performance was further improved using ensemble and stacking learning methods. Among the evaluated models, AlexNet, DenseNet121, and ResNet50 achieved accuracies of 0.94, 0.87, and 0.69 respectively, while the ensemble approach improved performance to 0.95 using soft voting and further to 0.97 with a stacking ensemble strategy. Integrated Gradients was used to generate saliency heatmaps that highlight clinically relevant neuroanatomical regions. Furthermore, a symptomatic survey of 169 women from Maharashtra was conducted and analyzed to facilitate scalable early screening. Thus, this study presents a comprehensive, interpretable, and feasible framework for the early detection of Alzheimer's.

Keywords

Alzheimer's disease; early diagnosis; women's brain health; mild cognitive impairment; magnetic resonance imaging; deep learning; ensemble and stacking models; integrated gradients; symptom aware screening

1 Introduction

Alzheimer's disease (AD) is a neurological condition that gradually deteriorates memory, cognitive abilities, and daily functioning. While it primarily affects older adults, studies reveal that women are disproportionately impacted, especially those aged between 45 and 60. Dr Richard Isaacson, director of Florida Atlantic University's Alzheimer's Prevention Clinic, indicates that two out of three individuals with Alzheimer's disease are women.¹,² This finding, combined with previous research, suggests that tailored lifestyle changes, such as diet improvement, physical activity, stress management, and sleep hygiene, can lower risk factors for both genders and prove particularly effective for women.

This highlights the necessity for gender specific approaches to Alzheimer's prevention and diagnosis, especially for women who frequently exhibit early signs of Mild Cognitive Impairment (MCI). Unidentified MCI can lead to irreversible progression to full blown Alzheimer's disease. Thus, early diagnosis and timely intervention are vital for slowing disease development, improving life quality, and mitigating the overall societal and economic burden of the condition.² Despite considerable progress in neuroimaging and diagnostic technologies, accurately diagnosing Alzheimer's disease in its initial phases remains difficult. The current investigation introduces a deep learning based screening model specifically aimed at women aged 45 to 60. This model seeks to identify subtle brain changes related to MCI and early Alzheimer's using Magnetic Resonance Imaging (MRI) images coupled with Convolutional Neural Networks (CNNs).

The proposed system integrates a Self-Administered Gerocognitive Examination (SAGE) questionnaire based cognitive health score with MRI image analysis to enable comprehensive Alzheimer's risk assessment. Individuals with scores above the defined threshold undergo MRI preprocessing and CNN based feature extraction, followed by classification into Alzheimer's, MCI, or Normal categories. Based on the predicted outcome and risk level, the system provides personalized prevention strategies, medical guidance, and periodic monitoring recommendations.

A survey involving 169 women from various regions of Maharashtra supported the development of the proposed framework by capturing real world cognitive and behavioral trends. The questionnaire, adapted from the SAGE,³ included over 20 questions related to early neurodegenerative and menopausal symptoms. Responses were analyzed using graphs and tables to identify significant patterns, and a cognitive health score was derived to indicate whether MRI evaluation may be needed.

The system's performance was evaluated using standard metrics such as accuracy, precision, recall, and F1-score to ensure a balanced and reliable assessment. Based on insights from both the survey responses and model predictions, preventive recommendations were proposed specifically for the target group. Although Alzheimer's disease has been widely studied, limited research focuses on women, despite their increased vulnerability due to hormonal, neurological, and psychological changes during this life stage. The proposed model adopts convolutional architectures commonly used in state-of-the-art (SOTA) Alzheimer's MRI studies; the novelty lies in strengthening the evaluation framework through survey analysis, followed by screening oriented feasibility using CNN models. The proposed system adopts a two stage framework comprising a women focused questionnaire based screening followed by MRI based analysis. The first stage employs the SAGE questionnaire to identify, at an early stage, women who may be at risk of cognitive decline, after which selected participants undergo MRI scanning. Previous longitudinal studies have shown that the SAGE questionnaire can facilitate earlier detection of cognitive decline, supporting its role as an initial screening component.³ In the second stage, the MRI data are analyzed using a deep learning model to detect potential abnormalities associated with Alzheimer's progression. This model is not gender specific, since it is not trained on gender specific MRI datasets. The two stage approach therefore enables targeted screening of women aged 45 to 60 years while applying general deep learning models for MRI classification.

The remainder of this paper is organized as follows. Sections 2 and 3 present the literature review and the methods and materials employed in the research. Section 4 presents the results concerning the CNN architectures and their validation. Section 5 details the discussion and limitations, while Sections 6 and 7 conclude the paper by outlining future work and summarizing the primary conclusions, including the significance of detecting early cognitive impairment.

2 Literature survey

The literature review highlights recent deep learning approaches for Alzheimer's detection, focusing on MRI based CNNs, recurrent models, attention mechanisms, and ensemble techniques. While these methods show strong performance, most studies are based on general populations and do not consider sex specific risks or symptom progression, particularly in women (Table 1).

Recent advances in biomedical modeling have explored hybrid approaches that integrate mathematical disease modeling with machine learning techniques. For instance, A Hybrid Approach to Heart Disease Prediction Using a Fractional-Order Mathematical Model and Machine Learning Algorithm¹⁵ uses fractional order differential equations to model the temporal evolution of physiological parameters and combines it with a decision tree classifier on the UCI Cleveland dataset for accurate prediction. This highlights the effectiveness of integrating dynamic modeling with interpretable machine learning for improved clinical decision making.

However, most approaches overlook gender-specific biological factors and rarely focus on women, despite evidence of differing disease progression. Many studies rely primarily on imaging data, with limited integration of symptom based or clinical information, restricting holistic risk assessment. Explainability is also often lacking, with few studies using interpretability methods such as Integrated Gradients. These gaps motivate the proposed symptom aware hybrid screening framework, which combines symptom based assessment with MRI based deep learning for early detection of Alzheimer's disease. Unlike purely physiological models, this approach integrates neuroimaging with patient reported cognitive indicators, providing a more comprehensive and practical screening system.

3 Methods and materials

Figure 1 illustrates the complete proposed framework for early Alzheimer's detection, beginning with SAGE based cognitive screening and progressing to AI driven risk prediction and prevention. The details of this framework are divided into Sections A, B, and C as follows.

Figure 1.

Work flow of the proposed framework for early Alzheimer's detection (figure conceptualized and designed by the authors; generated using OpenAI based on author defined methodology and prompt.).

SECTION A: STATISTICAL ANALYSIS OF SURVEY

The survey involved 169 women in Maharashtra to evaluate neurological, behavioral, and cognitive symptoms associated with early neurodegenerative diseases and MCI. Participants responded to an extensive questionnaire containing over 20 inquiries.

Key questions and answers are summarized in Table 2 and Figure 2. To assess the relationship between symptoms, we calculated Spearman's rank correlation coefficient (ρ) by assigning ordinal rank values (1–4) to the survey responses of the 169 participants. The analysis showed moderate relationships among key cognitive symptoms (ρ ≈ 0.42–0.51). Memory loss correlated with task difficulty (ρ ≈ 0.46) and repetition (ρ ≈ 0.51), while confusion was linked to concentration issues (ρ ≈ 0.42), indicating co-occurring early cognitive impairments. These findings support an early screening approach and provide insight into the cognitive difficulties women experience at an early age.
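
The correlation step can be illustrated with a minimal sketch. It assumes the four response options are coded Never = 1, Rarely = 2, Sometimes = 3, Frequently = 4 and computes ρ as the Pearson correlation of tie-averaged ranks; the two response lists are hypothetical and are not the survey data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on tie-averaged ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical ordinal responses for two survey items (not the study's data).
memory_loss = [1, 2, 3, 3, 2, 4, 1, 3]
repetition = [1, 3, 3, 2, 2, 4, 2, 3]
rho = spearman_rho(memory_loss, repetition)
print(f"rho = {rho:.2f}")
```

Because the survey has only four response levels, ties are frequent, which is why the ranks are averaged rather than assigned arbitrarily.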

Figure 2.

Responses from the participants.

Table 1.

Related work.

Author, year	Aim	Method used	Results
Jyoti Islam et al. (2018)⁴	Early AD Diagnosis with Deep Learning Architectures	CNN, Inception-v4 and ResNet	Inception-v4: Precision 0.81, Recall 0.91, F1-score 0.86. ResNet: Precision 0.82, Recall 1.00, F1-score 0.90. CNN: Precision 0.99, Recall 0.99, F1-score 0.99.
Chiyu Feng et al. (2019)⁵	AD Diagnosis using 3D-CNN and FSBi-LSTM	3D-CNN combined with FSBi-LSTM	Accuracies: 94.82% (AD vs NC), 86.36% (pMCI vs NC), 65.35% (sMCI vs NC).
Julian Fritsch et al. (2019)⁶	Automatic AD Diagnosis using Language Models	Neural Network Language Models (NNLMs) with LSTM	Achieved 85.6% accuracy with equal error rate.
Francisco J. Martinez-Murcia et al. (2020)⁷	Multi-level AD analysis using Autoencoders	CNN (Convolutional Autoencoders)	Achieved >80% classification accuracy based on MMSE or ADAS11 scores.
Italo A. D. de Oliveira et al. (2020)⁸	AD Classification using Hippocampal Asymmetry	MRI classification using hippocampal asymmetry features	Accuracy: 71.23% (CN × MCI), 80.43% (CN × AD); F1-scores: 0.67 and 0.75 respectively.
Ali Nawaz et al. (2021)⁹	Deep CNN with MRI for AD Classification	2D-CNN	Achieved 99.89% accuracy for AD, MCI, and normal control classification.
Yusera Farooq Khan et al. (2021)¹⁰	AD Identification from Audio Transcripts	Hybrid CNN + BiLSTM (Stacked Deep Dense Network – SDDNN)	Achieved 93.31% accuracy using GloVe embedding with hyperparameter tuning.
Wenyong Zhu et al. (2021)¹¹	AD Diagnosis using Dual Attention MIL	Dual Attention Multi-Instance Deep Learning Network (DA-MIDL)	Demonstrated superior classification performance and generalizability by identifying key pathological regions.
Klingenberg et al. (2023)¹²	Gender-based performance comparison in AD detection	MRI Deep Learning classifier	AUC: 0.91 (Women), 0.85 (Men).
Naveen and Cholli (2024)¹³	Transfer Learning for Early AD Detection	Pretrained CNN	∼95% accuracy (binary classification).
Morris et al. (2024)¹⁴	CNN + Explainable AI for Dementia	CNN with Explainable AI (XAI)	Accuracy: ∼90–93%; AUC: ∼0.94.

Table 2.

Responses of the survey.

Survey metric	Options / findings	% response
Total Respondents	169
Awareness about rising Parkinson's/AD cases	Yes / No	79.3% Yes, 20.7% No
Memory loss affecting daily life	Never / Rarely / Sometimes / Frequently	24.3%, 35.5%, 37.9%, ∼2.3% respectively
Difficulty completing familiar tasks	Never / Rarely / Sometimes / Frequently	20.1%, 24.3%, 53.8%, ∼1.8% respectively
Confusion with time or place	Never / Rarely / Sometimes / Frequently	17.2%, 32.5%, 48.5%, ∼1.8% respectively
Trouble with visual/spatial understanding	Never / Rarely / Sometimes / Frequently	7.7%, 17.8%, 73.4%, ∼1.1% respectively
Speech or writing issues	Never / Rarely / Sometimes / Frequently	10.7%, 20.1%, 68%, ∼1.2% respectively
Misplacing items / retracing difficulty	Never / Rarely / Sometimes / Frequently	29%, 27.2%, 37.9%, ∼5.9% respectively
Poor judgment	Never / Rarely / Sometimes / Frequently	17.8%, 35.5%, 46.2%, ∼0.5% respectively
Mood or personality changes	Never / Rarely / Sometimes / Frequently	7.7%, 27.2%, 39.1%, ∼26% respectively
Difficulty finding words	Never / Rarely / Sometimes / Frequently	29.6%, 34.9%, 31.4%, ∼4.1% respectively
Repeating questions/stories	Never / Rarely / Sometimes / Frequently	14.2%, 27.2%, 55.6%, ∼3% respectively
Trouble managing finances	Never / Rarely / Sometimes / Frequently	13.6%, 19.5%, 63.3%, ∼3.6% respectively
Difficulty planning or solving problems	Never / Rarely / Sometimes / Frequently	14.8%, 24.3%, 58.6%, ∼2.3% respectively
Recognizing familiar people	Never / Rarely / Sometimes / Frequently	15.4%, 29%, 51.5%, ∼4.1% respectively
Feeling overwhelmed by daily tasks	Never / Rarely / Sometimes / Frequently	21.3%, 33.1%, 40.2%, ∼5.4% respectively
Trouble concentrating for long periods	Never / Rarely / Sometimes / Frequently	24.9%, 27.2%, 42.6%, ∼5.3% respectively
Level of concern	Not concerned at all / Slightly / Moderately / Very concerned	13%, 45%, 37.3%, 13% respectively

Table 3.

Dataset class distribution (train/test splits).

Class	Train count	Test count	Total images	Description
Mild Cognitive Impairment (MCI)	2560	179	2739	MRI scans showing early-stage cognitive impairment.
Moderate Impairment (MI)	2560	12	2572	MRI scans representing mid-stage Alzheimer's impairment.
No Impairment/CN	2560	640	3200	MRI scans of cognitively healthy individuals.

Respondents were also asked to cross off any symptoms that applied to them from a list that included:

The SAGE scoring method also helps quantify survey responses and supports early identification of cognitive impairment.³ It assesses key domains such as memory, orientation, language, reasoning, executive function, and visuospatial ability. A score of 16 or below suggests potential cognitive decline and indicates the need for further evaluation. This scoring framework helps guide decisions on whether MRI screening should be recommended, enabling timely detection and early intervention to slow progression toward dementia.
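
The referral rule described above can be sketched as a small triage function. The cutoff of 16 comes from the text; the respondent identifiers and scores are hypothetical, and the function is only an illustration of the decision rule, not the authors' scoring implementation.

```python
SAGE_REFERRAL_CUTOFF = 16  # from the text: a score of 16 or below warrants follow-up


def needs_mri_referral(sage_score, cutoff=SAGE_REFERRAL_CUTOFF):
    """Return True when the SAGE score suggests potential cognitive decline."""
    if sage_score < 0:
        raise ValueError("SAGE scores cannot be negative")
    return sage_score <= cutoff


def triage(scores_by_respondent):
    """Split respondents into those referred for MRI and those only monitored."""
    referred = [rid for rid, s in scores_by_respondent.items() if needs_mri_referral(s)]
    monitored = [rid for rid, s in scores_by_respondent.items() if not needs_mri_referral(s)]
    return referred, monitored


# Hypothetical respondent IDs and scores, for illustration only.
scores = {"R001": 20, "R002": 16, "R003": 12, "R004": 18}
referred, monitored = triage(scores)
print("refer for MRI:", referred)
print("periodic monitoring:", monitored)
```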

SECTION B: METHODOLOGY

As shown in Figure 3, the proposed framework uses a multistep pipeline for accessible and accurate early Alzheimer's screening. MRI data are preprocessed to improve quality, followed by CNN based classification to detect disease related patterns. The resulting models can be deployed through web or mobile platforms using ONNX compatibility. This approach shows potential as a clinically useful screening tool pending further validation. Figure 4 depicts the MRI based CNN ensemble and stacking pipeline, showing how multiple deep learning models collaborate to produce a robust final classification output.

Figure 3.

Deep learning for the proposed study.

Figure 4.

MRI based CNN ensemble and stacking pipeline (figure conceptualized and designed by the authors; generated using OpenAI based on author defined methodology and prompt.).

All experiments were conducted on Google Colab using an NVIDIA Tesla T4 GPU (16 GB) with TensorFlow/Keras in a Python environment. NumPy, scikit-learn, and Matplotlib were used for data preprocessing, evaluation, and visualization. Reproducibility was ensured by fixing random seeds (NumPy and TensorFlow seed = 42), disabling Python hash randomization (PYTHONHASHSEED = 0), and enabling deterministic TensorFlow operations. The exact library versions were recorded at runtime, enabling reliable comparison of AlexNet, ResNet50, DenseNet121, and their ensemble models.
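
A minimal sketch of the seeding steps listed above is given below. The seed values follow the text; seeding Python's own `random` module and guarding the NumPy and TensorFlow imports are additions made here so the snippet runs in any environment, and the determinism call is only attempted when the installed TensorFlow version provides it.

```python
import os
import random

SEED = 42  # NumPy and TensorFlow seed stated in the text


def set_reproducible(seed=SEED):
    """Fix the random seeds and request deterministic TensorFlow ops."""
    # PYTHONHASHSEED only affects interpreters started after it is set, so in
    # practice it is exported before Python (or the notebook kernel) launches.
    os.environ["PYTHONHASHSEED"] = "0"
    random.seed(seed)  # assumption: not mentioned in the text, added for completeness
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed in this environment
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
        # Available in recent TensorFlow releases; skipped on older versions.
        if hasattr(tf.config.experimental, "enable_op_determinism"):
            tf.config.experimental.enable_op_determinism()
    except ImportError:
        pass  # TensorFlow not installed in this environment


set_reproducible()
```

Calling `set_reproducible()` once at the top of a notebook, before any model or data pipeline is built, is the usual pattern.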

3.1 Dataset & preprocessing

The research utilizes the Kaggle Alzheimer's axial MRI dataset,¹⁶ with each image representing a 2D axial slice from complete 3D MRI volumes. The axial plane provides a horizontal cross-sectional view of the brain. The dataset comprises slicewise 2D images derived from volumetric acquisitions and does not include demographic details of the patients or subject IDs. This limitation reflects a common challenge in medical imaging research where publicly available datasets lack detailed demographic metadata. Future work will focus on training and validating the model using clinically curated datasets with gender and age annotations to enable truly gender specific screening.

The data used in this research are a preprocessed version of the Kaggle Alzheimer's MRI data.¹⁶ Because the Moderate class initially contains only two subjects, experimental simulations with a WGAN-GP model were carried out in this study to generate synthetic MRIs and so address dataset insufficiency and class imbalance. The quality of the generated images was gauged using FID, SSIM, PSNR, sharpness difference (SD), and Seaborn's distplot. The synthetic images closely resemble real MRIs, with a mean FID of 0.13, SSIM of 0.97, PSNR of 32 dB, and SD of 0.04. Using them also improves classification, with an 11.77% gain in Balanced Accuracy, a 15% gain in Matthews Correlation Coefficient (MCC), and a 91.4% gain in minority class performance at the cost of a 1% loss in majority class performance. This comparative analysis further indicates that these models perform better than DC-GAN models. The WGAN-GP model generated additional synthetic MRI slices to balance all three categories. In addition, the SMOTE technique was employed to mitigate class imbalance by generating synthetic samples through linear interpolation between minority instances, defined as

x_{syn} = x_i + \lambda (x_j - x_i), \quad \lambda \in [0, 1]

(1)
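
Equation (1) can be sketched in a few lines. The version below assumes that x_j is drawn from the k nearest minority neighbours of x_i, as in the standard SMOTE formulation; the value of k, the seed, and the toy 2-D points are hypothetical (in the pipeline described here the vectors would be flattened image features).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def smote(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic samples by interpolating minority pairs."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # k nearest minority neighbours of x_i, excluding x_i itself.
        others = [x for idx, x in enumerate(minority) if idx != i]
        neighbours = sorted(others, key=lambda x: squared_distance(x_i, x))[:k]
        x_j = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor lambda
        # Equation (1): x_syn = x_i + lambda * (x_j - x_i)
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_j)])
    return synthetic


# Toy minority-class points (hypothetical).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=5)
print(new_points)
```

Each synthetic point lies on the line segment between two real minority samples, so SMOTE never leaves the convex hull of the minority class.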

3.1.1 Dataset class distribution (train/test splits)

In total, 8511 MRI images were divided into training and testing sets for this study. There are 3200 images classified as Cognitively Normal (CN), 2572 images categorized as Moderately Impaired (MI), and 2739 images labeled as MCI across the sets (with samples displayed in Figure 5(a), (b), & (c) below). The key features detected by the model are depicted in the visual samples.

Figure 5.

Sample 2D axial slice images per class (a) MCI, (b) MI, & (c) CN.

To ensure stable evaluation and assess the effects of different data splits, the dataset was divided at the image level, as there were no subject IDs available (Table 3).

For this project, each image undergoes preprocessing, which includes converting it to grayscale, resizing it to 64 × 64 pixels, and flattening it into a 4096 dimensional vector. Data augmentation was applied to the training set to improve generalization and reduce overfitting. Random rotations (±15°), horizontal flipping, and zooming (0.9–1.1) were used to introduce controlled geometric variability, while brightness adjustment (0.9–1.1) modeled illumination changes. Each training image was augmented twice using randomly sampled transformations. Augmentation was not applied to validation or test data to ensure unbiased evaluation.
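
The augmentation policy above can be summarised as a parameter sampler. This sketch only draws the transformation parameters in the stated ranges, twice per training image; it does not apply them to pixels. The flip probability of 0.5, the seed, and all names are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AugmentationParams:
    rotation_deg: float    # random rotation within +/-15 degrees
    horizontal_flip: bool  # whether to mirror the slice left-right
    zoom: float            # zoom factor in [0.9, 1.1]
    brightness: float      # brightness factor in [0.9, 1.1]


def sample_params(rng):
    """Draw one random transformation within the ranges given in the text."""
    return AugmentationParams(
        rotation_deg=rng.uniform(-15.0, 15.0),
        horizontal_flip=rng.random() < 0.5,  # assumption: flips applied half the time
        zoom=rng.uniform(0.9, 1.1),
        brightness=rng.uniform(0.9, 1.1),
    )


def augmentation_plan(n_train_images, copies_per_image=2, seed=42):
    """Map each training image index to its list of sampled transformations."""
    rng = random.Random(seed)
    return {
        idx: [sample_params(rng) for _ in range(copies_per_image)]
        for idx in range(n_train_images)
    }


plan = augmentation_plan(n_train_images=3)
for idx, params in plan.items():
    print(idx, params)
```

Validation and test images would simply be left out of the plan, matching the statement that augmentation is applied to the training set only.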

3.1.2 Input resolution ablation study

We performed an ablation study across multiple input resolutions (64 × 64, 128 × 128, and 224 × 224) to evaluate the trade off between computational efficiency and anatomical detail preservation. The study was conducted on the same dataset¹⁶ using an ImageNet pretrained DenseNet121 model with frozen backbone layers and a custom classification head. The model was trained with the Adam optimizer (batch size = 16, categorical cross-entropy loss) for 3 epochs in a GPU enabled environment, and all hyperparameters were kept constant across the 64 × 64 and 128 × 128 resolutions for a fair comparison.

Increasing the input resolution from 64 × 64 to 128 × 128 led to only marginal improvements in accuracy, while noticeably increasing the training time. As summarized in Table 4 and illustrated in Figure 6(a) and (b), the higher resolution offers limited performance gain at a substantially greater computational cost. Therefore, 64 × 64 was selected as it provides a more practical balance between classification performance and computational efficiency.

Figure 6.

(a) Resolution vs validation accuracy, (b) resolution vs training time per epoch.

Table 4.

Per epoch comparison of training accuracy, validation accuracy, and computational time for 64 × 64 and 128 × 128 input resolutions.

Epoch	Resolution	Train accuracy (%)	Validation accuracy (%)	Time (s)
1	64 × 64	74.1	88.9	172
1	128 × 128	85.5	85.2	200
2	64 × 64	94.2	96.3	15
2	128 × 128	95.4	98.4	29
3	64 × 64	97.1	97.6	15
3	128 × 128	97.9	97.2	41

3.1.3 Impact of data augmentation and SMOTE

To investigate the effect of data balancing and augmentation strategies, four experimental configurations were tested using ResNet50, DenseNet121, and AlexNet architectures:

Baseline (original training data)

Augmentation only

SMOTE only

SMOTE followed by augmentation

The baseline experiments demonstrated strong overall performance, with AlexNet achieving the highest validation accuracy (0.9512), followed by DenseNet121 (0.9115) and ResNet50 (0.7949) (Table 5). These results indicate that the dataset preserves meaningful discriminative features despite the presence of class imbalance. However, applying geometric and photometric augmentation alone led to a substantial drop in performance across all models (≈0.33–0.37), suggesting that such spatial transformations may distort subtle anatomical biomarkers that are critical for MRI based diagnosis.

Table 5.
Validation accuracy under different data balancing strategies.

Model	Baseline	Augmentation only	SMOTE only	SMOTE + Augmentation
ResNet50	0.7949	0.3307	0.8027	0.3861
DenseNet121	0.9115	0.3704	0.9141	0.3548
AlexNet	0.9512	0.3548	0.9863	0.3548

In contrast, SMOTE based oversampling consistently improved validation accuracy across all architectures, most notably for AlexNet (0.9863). This highlights the impact of class imbalance on model learning and demonstrates that synthetic minority sampling can enhance class separability and decision boundaries. When SMOTE was combined with augmentation, performance again declined, likely due to the introduction of amplified noise and altered feature distributions. These findings suggest that conventional augmentation strategies may not always be appropriate for small scale medical imaging datasets where fine structural details are essential.

3.1.4 2-D t-SNE visualization of the training images

Figure 7 displays the two dimensional t-SNE visualization of MRI feature embeddings subsequent to dimensionality reduction through Principal Component Analysis (PCA). Initially, the original 4096 dimensional feature vectors representing the preprocessed MRI images are reduced to 50 dimensions using PCA to minimize any kind of noise and ensure numerical stability. Finally, the reduced dimensions are further transformed into a 2-D space using t-SNE to evaluate the class separation of various classes for CN, MCI, and MI. The visualization highlights clear clustering patterns, with CN samples forming relatively tight and distinctly separated areas, while MCI and MI samples show some overlap, which reflects the gradual and progressive nature of Alzheimer's disease. This overlap carries clinical significance, as initial stage cognitive impairment often resembles structural features associated with normal ageing and moderate impairment.

Figure 7.

2-D t-SNE visualization of the training images.

Figure 8.

Architecture of AlexNet.

Figure 9.

Architecture of ResNet50.

Figure 10.

Architecture of DenseNet121.

3.2 Feature extraction using CNN

In this study, features are extracted from three CNN architectures (AlexNet, ResNet50, and DenseNet121), since CNNs are well regarded in medical imaging for their ability to learn hierarchical spatial features.

3.2.1 CNN based feature extraction

Recent studies have demonstrated that CNNs are highly effective for medical image classification across diverse disease domains, particularly when combined with preprocessing, transfer learning, and ensemble strategies. Moreover, the integration of explainable CNN models is increasingly emphasized to enhance clinical trust and interpretability in medical image–based diagnosis.¹⁷ CNNs generate hierarchical feature representations through consecutive convolution, nonlinear activation, and pooling operations. For an input image X, the convolutional transformation at the l-th layer can be expressed as¹⁸:

X^{(l + 1)} = f (W^{(l)} * X^{(l)} + b^{(l)})

(2)

where W^(l) and b^(l) are the learnable kernel weights and biases, respectively, and f(⋅) is a nonlinear activation function like the Rectified Linear Unit (ReLU). This process enables the network to learn localized discriminative features.

To achieve spatial down-sampling and improve translation invariance, CNNs utilize pooling operations, which can be formulated as¹⁸,¹⁹:

X^{(l + 1)} = Pool (X^{(l)})

(3)

where Pool(⋅) represents either max-pooling or average-pooling.
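
Equations (2) and (3) can be made concrete with a small pure-Python sketch of one convolution, ReLU, and max-pooling stage. As in most deep learning frameworks, the "convolution" here is computed without flipping the kernel; the toy image and edge-detecting kernel are hypothetical.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Equation (2) before activation: slide the kernel over the image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            bias + sum(kernel[u][v] * image[r + u][c + v]
                       for u in range(kh) for v in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Nonlinear activation f(.) applied element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Equation (3): non-overlapping max pooling with a size x size window."""
    return [
        [
            max(feature_map[r + u][c + v] for u in range(size) for v in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Toy 5 x 5 image with a vertical edge, and a kernel that responds to it.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]

activated = relu(conv2d_valid(image, kernel))  # 4 x 4 feature map
pooled = max_pool(activated)                   # 2 x 2 after pooling
print(pooled)
```

The pooled map keeps the strong response at the edge location while halving each spatial dimension, which is the down-sampling behaviour described above.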

After the convolutional and pooling stages, the resulting feature maps are flattened and passed through one or more fully connected (FC) layers. The output of an FC layer is given by:

Z = r (W_{fc} X_{\text{flattened}} + b_{fc})

(4)

where W_fc and b_fc are the weights and biases of the dense layer, and r(⋅) is an activation function applied to the final feature vector. This step combines the learned features for downstream tasks like classification or regression.
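
Equation (4) can be sketched as a flatten step followed by a dense layer. Softmax is used as the activation r(⋅) here because it is the activation of the final classification layer; the feature map, weights, and biases are hypothetical values chosen only to make the arithmetic visible.

```python
import math


def flatten(feature_map):
    """Turn a 2-D feature map into a 1-D vector (row-major order)."""
    return [v for row in feature_map for v in row]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, bias, activation):
    """Equation (4): Z = r(W_fc x + b_fc), with one weight row per output unit."""
    z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return activation(z)


# Hypothetical 2 x 2 feature map and a 3-unit output layer (one unit per class).
feature_map = [[2.0, 0.0], [2.0, 0.0]]
w_fc = [
    [0.5, 0.1, 0.5, 0.1],
    [-0.2, 0.3, -0.2, 0.3],
    [0.1, 0.1, 0.1, 0.1],
]
b_fc = [0.0, 0.1, -0.1]

probabilities = dense(flatten(feature_map), w_fc, b_fc, softmax)
print(probabilities)
```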

The proposed research uses three CNN Architectures: AlexNet, ResNet50, and DenseNet121. Their predictions were combined using simple ensemble averaging and a stacking meta-learner.

AlexNet: AlexNet¹⁸ is a groundbreaking deep convolutional neural network that popularized ReLU activations and GPU training while extracting complex hierarchical features for large scale image classification. In this work, a lightweight AlexNet inspired architecture serves as an efficient baseline model. It retains the core design of AlexNet but has been modified to accommodate smaller 64 × 64 input images. It begins with a large kernel convolutional layer (96 filters, 11 × 11, stride 4) for swift spatial reduction and initial feature extraction, followed by max pooling to decrease dimensionality (Figure 8).

The subsequent convolutional layers comprise a 256-filter 5 × 5 layer with batch normalization, and three additional convolutional layers (384, 384, and 256 filters, each with 3 × 3 kernels), all normalized to enhance learning stability. A concluding max pooling layer organizes the acquired feature maps for the fully connected classifier. The classification section contains two high capacity Dense layers with 4096 neurons each, reflecting the original AlexNet structure. Dropout regularization (0.5) is applied after each Dense layer to mitigate overfitting. The output layer employs softmax activation to classify three categories of Alzheimer's (Mild, Moderate, No Impairment).

ResNet-50: ResNet-50¹⁹ is a deep convolutional neural network that applies residual learning, enabling very deep models to be trained effectively without encountering vanishing gradients. It is selected here for its deep residual learning capability and strong transfer learning performance. The architecture employed in this research is based on a ResNet50 network pre-trained on ImageNet with include_top = False. The convolutional backbone begins with a 7 × 7 convolution, followed by max-pooling, and consists of four residual stages (Conv2_x, Conv3_x, Conv4_x, Conv5_x) housing a total of 16 bottleneck blocks that serve as a fixed feature extractor. The input MRIs are resized to 64 × 64 × 3 and processed through the frozen convolutional layers to yield a 2 × 2 × 2048 feature map (the backbone downsamples by a factor of 32; the familiar 7 × 7 × 2048 map corresponds to 224 × 224 inputs), which is subsequently converted into a 2048-dimensional vector using Global Average Pooling. A custom classification head is added, featuring Dense(256)-BatchNorm-Dropout(0.5), Dense(128)-BatchNorm-Dropout(0.3), and a Dense layer with Softmax activation for three output classes (Mild, Moderate, No Impairment), as shown in Figure 9.
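
As a check on the spatial size of the backbone output, the standard output-size formula can be applied layer by layer. The sketch below assumes the usual ResNet50 layout (a 7 × 7 stride-2 stem with padding 3, a 3 × 3 stride-2 max pool with padding 1, and a stride-2 reduction at the start of Conv3_x, Conv4_x, and Conv5_x); under that layout a 224 × 224 input gives a 7 × 7 map and a 64 × 64 input gives a 2 × 2 map, each with 2048 channels, so Global Average Pooling yields a 2048-dimensional vector in both cases.

```python
def conv_out(n, kernel, stride, padding):
    """Output side length of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


def resnet50_feature_map_side(input_side):
    """Side length of the final ResNet50 feature map for a square input."""
    side = conv_out(input_side, kernel=7, stride=2, padding=3)  # 7 x 7 stem convolution
    side = conv_out(side, kernel=3, stride=2, padding=1)        # 3 x 3 max pooling
    for stage_stride in (1, 2, 2, 2):                           # Conv2_x .. Conv5_x
        # Spatial reduction at the start of each stage, modelled as a 1 x 1 strided conv.
        side = conv_out(side, kernel=1, stride=stage_stride, padding=0)
    return side


for input_side in (64, 128, 224):
    side = resnet50_feature_map_side(input_side)
    print(f"{input_side} x {input_side} input -> {side} x {side} x 2048 feature map")
```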

The final model was compiled with the Adam optimizer (learning rate 0.0001) and categorical cross-entropy loss. Training used a batch size of 8 for up to 40 epochs, with early stopping based on validation loss to avoid overfitting. Only the newly introduced classifier layers were trained, allowing for efficient transfer learning in the context of MRI based Alzheimer's classification. This architecture merges stable pre-trained feature extraction with a streamlined classifier designed specifically for Alzheimer's MRI classification.

DenseNet-121: DenseNet-121²⁰ implements dense connectivity, in which each layer accesses the feature maps of all preceding layers, promoting feature reuse and efficient gradient flow. This makes it particularly suitable for smaller and imbalanced datasets, such as the Moderate Alzheimer's class. In this study, a DenseNet121 model pre-trained on ImageNet (include_top = False) is used as a deep convolutional feature extractor for Alzheimer's MRI classification, as shown in Figure 10.

During the initial training phase, the DenseNet121 backbone was frozen to prevent overfitting while retaining the general feature representations learned from large scale datasets. The input MRI scans were resized to 64 × 64 × 3 and passed through the network, followed by a Global Average Pooling layer to obtain a compact feature vector. This representation was then processed by a custom classification head consisting of fully connected layers with ReLU activation, batch normalization, dropout regularization, and a final softmax layer for three class prediction.
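
Once the three networks each output class probabilities, the simple ensemble averaging (soft voting) mentioned in this subsection can be sketched as follows. The probability vectors are hypothetical and equal model weights are assumed. A stacking meta-learner would typically be trained on the concatenated probability vectors instead of averaging them; the choice of meta-learner is not specified in this section and is not shown here.

```python
CLASSES = ("MCI", "MI", "CN")


def soft_vote(prob_sets, weights=None):
    """Weighted average of per-model class probabilities (equal weights by default)."""
    n_models = len(prob_sets)
    weights = weights or [1.0 / n_models] * n_models
    n_classes = len(prob_sets[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets))
        for c in range(n_classes)
    ]


# Hypothetical softmax outputs for one MRI slice, in the order of CLASSES.
alexnet_probs = [0.70, 0.10, 0.20]
resnet50_probs = [0.40, 0.15, 0.45]
densenet121_probs = [0.55, 0.05, 0.40]

averaged = soft_vote([alexnet_probs, resnet50_probs, densenet121_probs])
predicted = CLASSES[max(range(len(averaged)), key=averaged.__getitem__)]
print(averaged, predicted)

# Input a stacking meta-learner would see for this slice: a 9-dimensional vector.
stacking_features = alexnet_probs + resnet50_probs + densenet121_probs
```

In this toy case ResNet50 alone would predict CN, but the averaged probabilities favour MCI, which illustrates how soft voting can override a single weaker model.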

3.3 Optimizer & training configuration

3.3.1 Optimizer

We opted for the Adam optimizer due to its adaptive learning rate mechanism, which combines the advantages of both AdaGrad and RMSProp. Adam handles sparse gradients well and converges quickly, making it a strong option for training deep neural networks. The models were trained for 50 epochs using the Adam optimizer (learning rate = 1 × 10⁻⁴) and categorical cross-entropy loss, with a batch size of 8. This batch size was chosen empirically and worked well for our use case. Early stopping based on validation loss was used to restore the optimal weights. This arrangement facilitated the efficient extraction of hierarchical spatial features from MRI slices while maintaining computational efficiency.
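
The adaptive behaviour of Adam can be seen in a single-parameter sketch of its update rule. The learning rate matches the value stated above; the decay rates and epsilon are the defaults from the original Adam formulation and are assumed here, since the text does not state them. The quadratic objective is a toy stand-in for the network loss.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # momentum-like first moment
    v = beta2 * v + (1 - beta2) * grad * grad  # RMSProp-like second moment
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t)
print(f"theta after 1000 steps: {theta:.4f}")
```

Because the update is normalised by the running gradient magnitude, each step moves the parameter by roughly the learning rate regardless of how large the raw gradient is, which is why a small rate such as 1 × 10⁻⁴ gives slow but stable progress.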

3.3.2 Hyperparameter tuning

A grid search strategy was employed to select the dropout rate for each architecture. Two dropout values (0.3 and 0.5) were evaluated with the learning rate fixed at 0.0001. Across all models, a dropout rate of 0.3 yielded superior validation accuracy. The best configurations achieved validation accuracies of 80.27% for ResNet50, 93.23% for DenseNet121, and 97.20% for AlexNet. These settings were used for final model training and evaluation.

3.3.3 Cross-validation

To assess robustness against random partition effects, the models were additionally evaluated across five independent stratified random image level splits. Mean and standard deviation of performance metrics were computed to quantify stability. DenseNet121 exhibited the most consistent behavior, indicating strong robustness across different partitions. AlexNet achieved the highest mean accuracy but showed larger variability, suggesting increased sensitivity to data splits. ResNet50 demonstrated moderate performance variability, reflecting its comparatively weaker class separation on this dataset. These findings support the use of ensemble and stacking strategies to reduce variance and improve reliability. The statistical significance results (McNemar test) are shown in Table 6 and discussed in Section 4.6.

Table 6.
McNemar test results for pairwise model comparisons (Bonferroni-adjusted α = 0.025).

Model comparison	n01	n10	χ² statistic	p-value	Significant
AlexNet vs DenseNet121	93	33	27.627	1.471 × 10⁻⁷	Yes
Stacking vs ResNet50	—	—	106.667	<0.0001	Yes
Stacking vs Ensemble	—	—	0.083	0.7728	No

3.4 Performance of individual models

Three pretrained CNN architectures (ResNet50, DenseNet121, and AlexNet) were assessed on the three class test set (Mild, Moderate, Non-Impaired), which included 831 images; the findings are shown in Table 7 below.

Table 7.
Accuracy & Macro-F1 for ResNet50, DenseNet121, and AlexNet.

Model	Accuracy	Macro-F1	Observation
ResNet50	0.69	0.52	Weak separation of Mild & Moderate classes
DenseNet121	0.87	0.84	Strong performance, especially on Moderate
AlexNet	0.94	0.89	Best single model

Table 8.

Feature space separability analysis (t-SNE representation).

Metric	Value
Silhouette Score	0.2531
Average Intra-class Distance	50.9914
Average Inter-class Distance	86.8929
Overlap Indicator (Intra/Inter)	0.5868

Table 9.

Accuracy & Macro-F1 for ensemble vs stacking.

Strategy	Accuracy	Macro-F1	AUC	Result
1. Simple Ensemble (average logits)	0.95	0.95	0.98	Improvement over all individual models.
2. Stacking Meta-Learner (Logistic Regression)	0.92 (Validation)	0.97	0.99	Best overall performance.

The relatively lower accuracy of ResNet50 (0.69) indicates variability in base model performance, which may be attributed to differences in architecture depth and feature extraction capability. This highlights the importance of selecting models best suited to the dataset characteristics. AlexNet produced the clearest feature clusters and had the strongest performance as a single model.

3.5 Visualization of feature separability using t-SNE

t-Distributed Stochastic Neighbor Embedding (t-SNE)²¹ is a nonlinear technique used for dimensionality reduction. It is frequently utilized to visualize high dimensional data in two or three dimensions. To evaluate the discriminative power of extracted MRI features, t-SNE was used to project high dimensional CNN embeddings into two dimensional space on the test set. Visualizations were generated before and after feature selection to assess improvements in class separability. AlexNet showed well defined and distinct clusters across classes, indicating strong feature discrimination. DenseNet121 demonstrated moderate separation with some overlap, particularly for Mild cases. ResNet50 exhibited considerable class overlap, suggesting less distinct feature boundaries for this dataset. This analysis was conducted on the test sets to confirm that the observed structure is applicable beyond the training data (Figure 11(a), (b), and (c)).

Figure 11.

(a) AlexNet embeddings, (b) DenseNet embeddings, and (c) ResNet embeddings.

3.5.1 Feature space analysis

Feature space separability was evaluated using silhouette score and inter/intra-class distance analysis as shown in Table 8. The silhouette score of 0.2531 indicates moderate clustering structure, suggesting partial but meaningful separation between cognitive classes.

The average intra-class distance (50.99) was substantially lower than the inter-class distance (86.89), yielding an overlap ratio of 0.5868. This confirms that while classes are distinguishable, noticeable feature overlap exists, which justifies the need for ensemble learning to improve decision boundary refinement.
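As an illustration of how such an overlap indicator can be obtained, the sketch below averages pairwise Euclidean distances within and between classes and takes their ratio. The pairwise-average definition, the function name `mean_intra_inter`, and the toy points are assumptions made for this example; the source does not state exactly how the distances were computed. The last lines simply reproduce the reported ratio from the two values in Table 8.

```python
import math
from itertools import combinations

def mean_intra_inter(points, labels):
    """Mean pairwise distance within classes and between classes (assumed definition)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical 2-D embeddings for two well separated classes.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
toy_labels = [0, 0, 1, 1]
toy_intra, toy_inter = mean_intra_inter(toy_points, toy_labels)
toy_overlap = toy_intra / toy_inter  # closer to 0 means less overlap

# Ratio of the values reported in Table 8.
reported_overlap = 50.9914 / 86.8929
print(round(reported_overlap, 4))
```

A ratio near 0 corresponds to tight, well separated clusters, while a ratio approaching 1 means that within-class spread is comparable to between-class separation.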

3.6 Ensemble learning module

Soft-voting ensemble is applied to reduce prediction variance by averaging outputs from multiple models. To reduce any model specific bias, this soft-voting ensemble was formed using the three top performing models selected following hyper parameter tuning: (1) ResNet50, (2) DenseNet121, and (3) AlexNet.

The soft-voting ensemble computes the final class probability as the average of the classwise probabilities predicted by individual models:

P_{\mathrm{ensemble}}(c) = \frac{1}{N} \sum_{i=1}^{N} P_{i}(c)

(5)

where N = 3 is the number of models and $P_{i}(c)$ is the probability that model i assigns to class c.
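The averaging in equation (5) can be sketched in a few lines. The probability vectors below are hypothetical softmax outputs for a single MRI slice, and the class order (Mild, Moderate, Non-Impaired) and the helper name `soft_vote` are assumptions for illustration only.

```python
def soft_vote(prob_sets):
    """Average class probabilities across models, as in equation (5)."""
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    return [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]

# Hypothetical outputs, ordered (Mild, Moderate, Non-Impaired).
p_resnet = [0.50, 0.10, 0.40]
p_densenet = [0.30, 0.05, 0.65]
p_alexnet = [0.20, 0.05, 0.75]

p_ensemble = soft_vote([p_resnet, p_densenet, p_alexnet])
# The predicted class is the index with the highest averaged probability.
predicted_class = max(range(len(p_ensemble)), key=p_ensemble.__getitem__)
```

In this toy case ResNet50 alone would predict Mild, but the averaged probabilities favor Non-Impaired, which shows how soft voting can override a single dissenting model.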

The ensemble was evaluated on the validation set using:

Input: X_val

Ground truth: y_val

Metric: Accuracy

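The validation accuracy of the ensemble was computed as:

\mathrm{Accuracy}_{\mathrm{ensemble}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_{i} = y_{i})

(6)

where N here denotes the number of validation images, $\hat{y}_{i}$ is the predicted label, and $y_{i}$ is the ground-truth label of image i.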

3.6.1 Stacked ensemble learning

Stacking based ensemble learning integrates predictions from various CNN architectures to enhance accuracy based on MRI data.²² Stacking (Logistic Regression) is implemented to learn the optimal weighted combination of base model predictions. To advance classification further, we combined the 3 refined deep learning models: ResNet50, DenseNet121, and AlexNet, as foundational learners. In contrast to soft voting, stacking involves training a meta-learner to optimize the combination of the output probabilities from the base models.

Each foundational model generates a probability distribution for the three classes (Mild, Moderate, Non-Impaired). For each MRI sample, these outputs are merged to create a 9-dimensional meta-feature vector:

X_{\mathrm{meta}} = [p_{1}^{(\mathrm{ResNet})}, p_{2}^{(\mathrm{ResNet})}, p_{3}^{(\mathrm{ResNet})} \mid p_{1}^{(\mathrm{DenseNet})}, p_{2}^{(\mathrm{DenseNet})}, p_{3}^{(\mathrm{DenseNet})} \mid p_{1}^{(\mathrm{AlexNet})}, p_{2}^{(\mathrm{AlexNet})}, p_{3}^{(\mathrm{AlexNet})}]

(7)

A Logistic Regression classifier (max_iter = 1000, one-vs-rest mode) was used as the meta model. The final class probability for stacking is computed as:

P_{\mathrm{stacking}}(c) = \sigma(W \cdot [P_{\mathrm{ResNet}}(c) + P_{\mathrm{DenseNet}}(c) + P_{\mathrm{AlexNet}}(c)] + b)

(8)

where W and b are the learned weights and bias of the logistic regression model. The sigmoid function $\sigma(\cdot)$ is used for binary classification, while the softmax function is applied for multiclass classification.

The final predicted label is obtained as:

\hat{y} = \arg\max_{c} \, g(X_{\mathrm{meta}})_{c}

(9)

where g(⋅) represents the logistic regression function.
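A minimal sketch of the stacking forward pass is given below, assuming the softmax (multiclass) form of logistic regression applied to the 9-dimensional meta-feature vector of equation (7). The weight matrix `W`, bias `b`, the per-model `trust` values, and the input probabilities are hypothetical; in the study these parameters are learned by the meta-learner, and a one-vs-rest fit would apply a sigmoid per class instead of a softmax.

```python
import math

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def stack_predict(p_resnet, p_densenet, p_alexnet, W, b):
    # List concatenation (not element-wise addition) builds the 9-value meta vector.
    x_meta = p_resnet + p_densenet + p_alexnet
    logits = [sum(w * x for w, x in zip(row, x_meta)) + bias
              for row, bias in zip(W, b)]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)  # equation (9)
    return x_meta, probs, label

# Hypothetical meta-learner: class c looks only at each model's probability for c,
# weighting (ResNet50, DenseNet121, AlexNet) by illustrative trust values.
trust = [0.5, 1.0, 2.0]
W = [[trust[m] if k == c else 0.0 for m in range(3) for k in range(3)]
     for c in range(3)]
b = [0.0, 0.0, 0.0]

x_meta, probs, label = stack_predict(
    [0.50, 0.10, 0.40], [0.30, 0.05, 0.65], [0.20, 0.05, 0.75], W, b)
```

Unlike soft voting, the learned weights allow the meta-learner to rely more heavily on the base models that are more reliable for a given class.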

The stacked model was trained on the meta features generated from the training set and evaluated on the validation set. Its validation accuracy was computed as:

\mathrm{Accuracy}_{\mathrm{stacking}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_{i} = y_{i})

(10)

By this measure, stacking improved on some of the individual models.

3.6.2 Feature diversity analysis of ensemble learning and stacking

Each of the three architectures generates distinct feature embeddings (as shown by the t-SNE embeddings): 1. AlexNet identifies distinct structural patterns. 2. DenseNet121 yields compact representations boosted by feature reuse. 3. ResNet50 captures broader abstractions, contributing to diversity.

Relying on a single network restricts feature extraction, often leading to lower accuracy, higher validation loss, and poor F1-scores, especially for minority classes. Stacking and ensemble learning combine complementary feature representations, improving robustness and achieving superior classification performance compared to any single backbone model. We assessed two fusion methods: Ensemble versus Stacking, detailed as in Table 9.

The stacking model successfully integrated discriminative low-level features from AlexNet, the dense feature reuse from DenseNet121, and the abstract representations from ResNet50, achieving improved robustness and a balanced class distribution.

3.7 Explainable artificial intelligence (XAI) using integrated gradients

To assess the interpretability of the proposed deep learning framework, Integrated Gradients (IG) was employed, which is a gradient based technique that measures how individual pixels affect the predicted result.²³ The pipeline successfully generated Integrated Gradients maps for 20 test samples using the tuned_alexnet_model. This helps visualize which pixels in the input images are most crucial for AlexNet's predictions.
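To make the mechanics of Integrated Gradients concrete, the sketch below applies it to a tiny logistic model with an analytic gradient, using a midpoint Riemann sum along the straight path from a baseline to the input. The model, weights, all-zero baseline, and step count are assumptions chosen for illustration; they are not the tuned AlexNet or the settings used in this study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy differentiable model: logistic regression on the input values."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad(x, w, b):
    """Analytic gradient of the toy model with respect to its input."""
    f = model(x, w, b)
    return [f * (1.0 - f) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of each sub-interval of the path
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point, w, b))]
    # Scale the averaged gradient by the input difference.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]

w, b = [2.0, -1.0, 0.0], -0.5   # hypothetical weights
x = [0.8, 0.1, 0.5]             # hypothetical "pixel" values
baseline = [0.0, 0.0, 0.0]      # assumed black-image baseline
attr = integrated_gradients(x, baseline, w, b)
```

The attributions sum (up to discretization error) to the difference between the model output at the input and at the baseline, which is the completeness property that makes IG maps interpretable as per-pixel contributions.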

Figure 12 illustrates the IG saliency maps for the three architectures (ResNet50, DenseNet121, and AlexNet) on samples from the Mild (MCI) class. The heatmaps highlight key brain regions such as cortical gray matter, periventricular areas, and sulcal-gyral contours, which are known to show subtle early stage disease changes. Warmer colors (red/yellow) indicate regions strongly influencing predictions, while cooler areas have minimal impact. Consistent patterns across architectures indicate that the models focus on clinically meaningful features rather than artifacts.

Figure 12.

IG saliency maps for AlexNet, DenseNet121, and ResNet50 for Mild Class.

Deletion Test (Faithfulness): The plot in Figure 13 illustrates the faithfulness of AlexNet's attributions. It shows how the model's prediction confidence decreases as the most important pixels (identified by IG) are progressively removed from the image. This demonstrates that AlexNet relies on these highlighted regions for its predictions.

Figure 13.

Faithfulness of AlexNet's attributions.

The final plot (Figure 14) displays an average attribution map for AlexNet across all 20 samples. This highlights features or regions that are consistently important for AlexNet's predictions across a representative set of images in the test data. These results provide insights into how the tuned_alexnet_model makes its predictions, indicating which parts of the MRI scans it focuses on.

Figure 14.

Population-level mean attribution map.

4 Results and performance analysis

This section presents a detailed performance analysis of the proposed CNN and ensemble models for multi-class classification. The evaluation includes comparative metrics, cross-validation results, statistical significance testing, uncertainty analysis, robustness assessment, and inference efficiency. To ensure robustness and reduce overfitting, multiple validation strategies were applied including five-fold stratified cross-validation, early stopping based on validation loss, bootstrap confidence intervals, and statistical significance testing using McNemar's test.

4.1 Class wise comparative performance of CNN and ensemble models

Table 10 shows that the ensemble and stacking models outperformed the individual CNN models on most class-wise evaluation metrics. The class-wise analysis demonstrates balanced diagnostic capability across all dementia stages. Notably, the stacking model achieves high sensitivity for Mild (MCI) cases (0.8939) and perfect detection of Moderate (MI) cases (1.0000), indicating strong suitability for early-stage clinical screening where minimizing false negatives is critical. In addition to overall accuracy, class-wise sensitivity, specificity, and ROC-AUC were computed to ensure balanced evaluation across dementia categories and enhance clinical screening relevance.

Table 10.
Class wise comparative performance of CNN and ensemble models.

Metric	ResNet50	DenseNet121	AlexNet	Ensemble	Stacking
Sensitivity
MCI	0.5475	0.7207	0.9944	0.9665	0.8939
MI	0.9167	0.9167	1.0000	1.0000	1.0000
CN	0.7344	0.9172	0.6391	0.8203	0.9531
Specificity
MCI	0.7868	0.9202	0.6457	0.8267	0.9540
MI	0.9267	0.9963	1.0000	0.9976	1.0000
CN	0.7225	0.7435	0.9948	0.9686	0.9005
ROC-AUC
MCI	0.7750	0.9323	0.9853	0.9645	0.9838
MI	0.9753	0.9989	1.0000	1.0000	1.0000
CN	0.8220	0.9342	0.9853	0.9649	0.9845
Macro ROC-AUC	0.8574	0.9551	0.9902	0.9764	0.9894

Table 11.

Clinically weighted error analysis highlighting the prioritization of false negative reduction in Alzheimer's MRI classification.

Model	False negatives (FN)(missed AD / pathology)	Clinical implication of FN	False positives (FP) (over-diagnosis)	Clinical implication of FP	Overall clinical suitability
ResNet50	High (185 AD cases misclassified as CN/MCI)	Risk of delayed diagnosis and missed early intervention; clinically undesirable	Moderate	Additional follow-up imaging or cognitive tests	Limited due to high FN burden
DenseNet121	Moderate (57 AD cases misclassified as CN)	Improved sensitivity compared to ResNet50, but some missed pathology remains	Moderate (CN → AD)	Acceptable clinical cost	Moderate suitability
AlexNet	Very Low for advanced AD (0 missed AD cases)	Strong detection of severe pathology	Higher FP for CN	May increase unnecessary clinical evaluations	Suitable for late-stage detection
Ensemble	Very Low (8 AD cases misclassified as CN)	Minimal missed diagnoses; clinically favorable	Low	Acceptable over-screening	High clinical suitability
Stacking	Minimal (2 AD cases misclassified as CN)	Best clinical behavior; near-optimal sensitivity	Very Low	Minimal additional burden	Highest clinical suitability

Table 12.

Classification accuracy & class-wise evaluation metrics for all the 5 models on the TEST SET.

Model	Accuracy	Macro precision	Macro recall	Macro F1-Score	Weighted precision	Weighted recall	Weighted F1-Score
ResNet50	0.69	0.50	0.75	0.52	0.79	0.69	0.72
DenseNet121	0.87	0.84	0.84	0.84	0.87	0.87	0.87
AlexNet	0.94	0.97	0.83	0.89	0.94	0.94	0.93
Ensemble (Soft Voting)	0.95	0.97	0.94	0.95	0.95	0.95	0.95
Stacking Ensemble	0.97	0.98	0.96	0.97	0.97	0.97	0.97

4.2 3 × 3 confusion matrix analysis

To gain deeper insights into how the proposed models perform classification, we generated confusion matrices for ResNet50, DenseNet121, and AlexNet, as well as for the ensemble and stacking approaches, on the test dataset; these are illustrated in Figure 15. The confusion matrix of ResNet50 reveals frequent confusion between Mild (MCI) and Non-Impaired (CN) cases, suggesting difficulty in capturing subtle early Alzheimer's changes. DenseNet121 performs better, especially in recognizing Moderate (MI) cases, though some overlap between Mild (MCI) and Non-Impaired (CN) stages remains. AlexNet further improves classification by reducing false positives and correctly identifying most Non-Impaired subjects, with only a small number of Mild (MCI) cases misclassified. The ensemble model strengthens overall reliability by combining multiple learners, leading to fewer inter-class errors and accurate identification of Moderate (MI) cases. The stacking model achieves the most consistent performance, with predictions concentrated along the diagonal, indicating stable and dependable classification across all disease stages.

Figure 15.

Confusion matrix of ResNet50, DenseNet121, AlexNet, ensemble & stacking.

Overall, the confusion matrix results show that both the ensemble based models reflect clinical priorities in Alzheimer's diagnosis better by focusing on higher sensitivity and reducing missed disease cases. Most misclassifications occur between neighboring disease stages, which mirrors the gradual and progressive course of Alzheimer's rather than sharply defined categories. This clinically meaningful error pattern as seen in the Table 11 suggests that the proposed ensemble and stacking models are well suited for supporting real world Alzheimer's MRI diagnosis and clinical decision making.
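For reference, class-wise sensitivity and specificity can be derived from a 3 × 3 confusion matrix by treating each class in turn as the positive class (one-vs-rest). The sketch below uses a hypothetical matrix with rows as true classes and columns as predicted classes, ordered (MCI, MI, CN); the counts are illustrative and are not those of Figure 15.

```python
def per_class_sens_spec(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        tn = total - tp - fn - fp
        results.append((tp / (tp + fn), tn / (tn + fp)))
    return results

# Hypothetical counts, ordered (MCI, MI, CN).
cm = [
    [40, 0, 10],
    [0, 10, 0],
    [5, 0, 35],
]
sens_spec = per_class_sens_spec(cm)
```

In a screening context, the sensitivity of the impaired classes (the first element of each pair for MCI and MI) is the quantity that reflects missed disease cases.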

4.3 Classification accuracy & class-wise precision, recall, and F1-scores as given in the figure below

Among all evaluated models (Table 12 & Figure 16), the stacking based ensemble attained the highest accuracy at 97% and showed the best macro-averaged F1-score, confirming its effectiveness in classifying Alzheimer's disease. For each model, we calculated class probabilities for every test MRI scan. The final predictions were derived from the argmax of the class probabilities for the individual models, mean probability aggregation for the ensemble, and the logistic regression based meta-learner for stacking.

Figure 16.

Training evaluation metrics chart for all the 5 models.

Let the true test labels be $y^{test}$ and the predicted labels be $\hat{y}$. We computed test accuracy as follows:

\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(y^{test}_{i} = \hat{y}_{i})

(12)

For all the 5 models, we generated: Training vs. validation accuracy curves.

Figure 17 presents the training and validation accuracy curves for ResNet50, DenseNet121, and AlexNet over the training epochs. ResNet50 trained for 46 epochs before early stopping was triggered, and the best weights were restored from epoch 39, which achieved the lowest validation loss of 0.1775. DenseNet121 displays quicker convergence with a stable learning progression, as both training and validation accuracies show steady increases and remain closely aligned. This model trained for the full 50 epochs, and the best weights were restored from epoch 49, with a validation loss of 0.12833. This indicates effective feature reuse and better generalization capability. AlexNet, by contrast, stopped after 10 epochs, with the best weights restored from epoch 9, as val_loss did not improve from 0.36869. It exhibits rapid overfitting, as evidenced by near-perfect training accuracy coupled with highly unstable validation performance. This underscores the necessity of ensemble and stacking strategies to bolster robustness and enhance classification accuracy.

Figure 17.

Training and validation accuracy curves for ResNet50, DenseNet121, and AlexNet models.

Thus, the overfitting was controlled using early stopping with a patience of 7 epochs, monitored on validation loss. Additional regularization was achieved through dropout (rate = 0.5) in the classifier layers and batch normalization, with the final model selected based on the epoch yielding the lowest validation loss.
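The early stopping rule described above can be sketched as follows: training halts once the validation loss has failed to improve for 7 consecutive epochs, and the weights from the best epoch are kept. The loss sequence is synthetic and was chosen so that the best epoch is 39, which under a patience of 7 gives a stop at epoch 46, in line with the ResNet50 run reported above; the function name and values are hypothetical.

```python
def early_stopping(val_losses, patience=7):
    """Return (epoch at which training stops, epoch whose weights are restored)."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # patience exhausted
    return len(val_losses), best_epoch  # ran to the epoch limit

# Synthetic validation losses: improve up to epoch 39, then plateau at a worse value.
val_losses = [1.0 - 0.02 * e for e in range(1, 40)] + [0.5] * 11
stopped_epoch, restored_epoch = early_stopping(val_losses, patience=7)
```

If the loss keeps improving until close to the epoch limit, as in the DenseNet121 run, the patience counter never expires and training simply ends at the final epoch.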

4.4 Performance evaluation metrics of stacking model for 8 epochs

As detailed in Table 13, optimal performance of stacking model was attained at epoch 8, after which there were no further enhancements in validation metrics. Therefore, the results presented correspond to the best performing epoch determined by early stopping. The evaluation metrics offer a thorough overview of the model's learning behavior, generalization capability, and classification quality.

Table 13.
Performance evaluation metrics of stacking model for 8 epochs.

Epoch	Accuracy	val_accuracy	Loss	val_loss	AUC	val_auc	f1_score	val_f1_score
1	0.591882467	0.970392158	0.669380486	0.490162462	0.613981783	0.496069312	0.693930864	0.990098894
2	0.681150377	0.926776946	0.570583165	0.390428364	0.750966787	0.835437059	0.735952079	0.961729288
3	0.710719645	0.602941155	0.544238389	0.578326344	0.782503247	0.866852999	0.75701952	0.746379614
4	0.71576196	0.978553951	0.542538583	0.190506577	0.785709858	0.860280633	0.761292279	0.98912704
5	0.753797293	0.867034316	0.492984265	0.45325467	0.828711748	0.882531822	0.789100409	0.927642524
6	0.812873483	0.556985319	0.413537413	1.882233739	0.886735141	0.967041254	0.840750158	0.708114624
7	0.858067751	0.641544104	0.34315148	1.870511174	0.923748374	0.961699367	0.879377782	0.776290536
8	0.879668832	0.920955896	0.304451168	0.204930186	0.940827131	0.996626019	0.898460805	0.957994103

The stacking model's training metrics improve steadily across epochs: training accuracy increases from 59.19% to 87.97% and AUC from 0.61 to 0.94, while training loss decreases consistently. Validation metrics fluctuate considerably (for example, at epoch 6 validation accuracy drops to 55.70% and validation loss rises to 1.88), but the final epoch achieves strong validation performance (accuracy: 92.10%, AUC: 0.9966), indicating effective learning and high classification capability.

4.5 Cross-validation

To assess the robustness of the models against partition dependent variability, performance stability was analyzed across five independent stratified image level splits (see Table 14). These observations reinforce the motivation for ensemble and stacking approaches to enhance reliability.

Table 14.
Mean ± std of 3 base models.

Model	Accuracy (mean ± std)	Macro-F1 (mean ± std)	AUC (mean ± std)
ResNet50	0.8845 ± 0.0207	0.8832 ± 0.0223	0.9759 ± 0.0052
DenseNet121	0.9316 ± 0.0028	0.9315 ± 0.0028	0.9883 ± 0.0009
AlexNet	0.9551 ± 0.0280	0.9546 ± 0.0287	0.9976 ± 0.0014

Table 15.

Bootstrap based performance estimates (95% CI).

Model	Accuracy (95% CI)	Macro-F1 (95% CI)
ResNet50	0.6567 (0.6233–0.6907)	0.5316 (0.4698–0.5946)
DenseNet121	0.8442 (0.8195–0.8688)	0.7555 (0.6854–0.8175)
AlexNet	0.9504 (0.9350–0.9651)	0.8811 (0.7872–0.9437)
Ensemble	0.9296 (0.9122–0.9471)	0.8893 (0.8154–0.9370)
Stacking	0.9566 (0.9422–0.9699)	0.9253 (0.8692–0.9660)

4.6 Statistical significance (McNemar test)

To determine whether the observed performance differences were statistically meaningful, McNemar's test was conducted on paired predictions from the same independent test set. Since multiple pairwise comparisons were performed, Bonferroni correction was applied, resulting in an adjusted significance threshold of α = 0.025.

As shown in Table 6, AlexNet significantly outperforms DenseNet121 (χ² = 27.627, p = 1.471 × 10⁻⁷), with substantially more disagreement cases favoring AlexNet (93 vs 33). Similarly, the proposed stacking framework demonstrates a statistically significant improvement over ResNet50 (χ² = 106.667, p < 0.0001). However, the comparison between the stacking model and the ensemble approach did not yield a statistically significant difference (χ² = 0.083, p = 0.7728), indicating comparable performance between these two strategies. Overall, these results confirm that the reported improvements over individual backbone architectures are statistically supported and not attributable to random variation.
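As a worked check of the first row of Table 6, the sketch below computes McNemar's statistic from the two discordant counts. It assumes the continuity-corrected form of the test; the source does not state which variant was used, but this form reproduces the reported χ² and p-value for AlexNet vs DenseNet121. The p-value uses the identity that a chi-square variable with one degree of freedom is the square of a standard normal variable.

```python
import math

def mcnemar_cc(n01, n10):
    """Continuity-corrected McNemar statistic and p-value (1 degree of freedom)."""
    chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Survival function of chi-square with 1 dof: P(Z^2 > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Discordant counts from Table 6 (AlexNet vs DenseNet121).
chi2, p_value = mcnemar_cc(93, 33)
significant = p_value < 0.025  # adjusted threshold stated in the text
print(f"chi2 = {chi2:.3f}, p = {p_value:.3e}, significant = {significant}")
```

Only the discordant pairs (cases where exactly one of the two models is correct) enter the statistic, which is why the test is suited to comparing classifiers on the same test set.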

4.7 Robustness, uncertainty and calibration analysis

4.7.1 Bootstrap confidence intervals

To quantify statistical uncertainty, bootstrap resampling (1000 iterations) was applied on the test set and 95% confidence intervals (CI) were computed for overall performance metrics as seen in Table 15.

The stacking model achieved the highest performance with consistently narrow confidence intervals, indicating stable generalization.
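A percentile bootstrap of the kind described here can be sketched as below. The resampling unit (individual test images), the percentile method, the fixed seed, and the toy labels are assumptions for illustration; the source specifies only the number of iterations (1000) and the 95% level.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Point accuracy plus a percentile bootstrap confidence interval."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    stats = []
    for _ in range(n_boot):
        # Resample test cases with replacement and recompute accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return sum(correct) / n, lower, upper

# Hypothetical test set: 200 images, 190 classified correctly.
y_true = [0] * 200
y_pred = [0] * 190 + [1] * 10
acc, ci_low, ci_high = bootstrap_accuracy_ci(y_true, y_pred)
```

The width of the interval shrinks as the number of test images grows, which is why small classes yield wider class-wise intervals than the overall metrics.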

4.7.2 Class-wise uncertainty

Class-wise bootstrap sensitivity analysis revealed wider confidence intervals for the Moderate Impairment (MI) class (n = 12), reflecting uncertainty due to limited sample size rather than model instability. Larger classes (Mild (MCI) and Non-Impaired (CN)) showed narrow intervals, indicating stable class-wise performance.

4.7.3 Calibration and reliability

Calibration quality was assessed using reliability diagrams (Figure 18) and Expected Calibration Error (ECE). The obtained ECE value of 0.0135 indicates excellent probability calibration, demonstrating strong agreement between predicted confidence and observed outcomes.

Figure 18.

Reliability diagram.
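Expected Calibration Error can be computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below assumes 10 equal-width bins and uses hypothetical confidences; the binning scheme used in this study is not stated.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece

# Hypothetical predictions: 10 at confidence 0.95 (9 correct), 10 at 0.65 (7 correct).
confidences = [0.95] * 10 + [0.65] * 10
correct = [1] * 9 + [0] + [1] * 7 + [0] * 3
ece = expected_calibration_error(confidences, correct)
```

Each bin in this toy example is off by 0.05, so the weighted gap is 0.05; a perfectly calibrated model would give a value of 0.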

4.7.4 Screening oriented decision threshold

Since the framework targets early screening, decision thresholds were optimized to reduce false negatives. A screening threshold of 0.10 achieved a sensitivity of 0.953 for Alzheimer positive cases (Mild + Moderate), prioritizing early detection at the expense of acceptable false positives.
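One way such a screening threshold could operate is sketched below, under the assumption that the Alzheimer-positive score is the sum of the predicted Mild and Moderate probabilities and that a case is flagged when this score reaches the threshold. This scoring rule, the function names, and the example probabilities are hypothetical; the source reports only the threshold (0.10) and the resulting sensitivity.

```python
def screen(probs, threshold=0.10):
    """probs = (P_mild, P_moderate, P_non_impaired); flag if positive score >= threshold."""
    return (probs[0] + probs[1]) >= threshold

def sensitivity(prob_rows, is_positive, threshold=0.10):
    flagged = [screen(p, threshold) for p in prob_rows]
    tp = sum(1 for f, pos in zip(flagged, is_positive) if f and pos)
    fn = sum(1 for f, pos in zip(flagged, is_positive) if not f and pos)
    return tp / (tp + fn)

# Hypothetical predictions; the first three cases are truly Mild or Moderate.
prob_rows = [
    (0.30, 0.05, 0.65),
    (0.08, 0.04, 0.88),
    (0.03, 0.02, 0.95),
    (0.02, 0.01, 0.97),
]
is_positive = [True, True, True, False]

sens_screening = sensitivity(prob_rows, is_positive, threshold=0.10)
sens_default = sensitivity(prob_rows, is_positive, threshold=0.50)
```

Lowering the threshold flags borderline cases that a 0.50 cut-off would miss, raising sensitivity at the cost of more false positives.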

These analyses demonstrate that the proposed framework is not only accurate but also statistically reliable, well-calibrated, and clinically aligned for screening oriented deployment.

4.8 Inference efficiency

Inference efficiency was evaluated in terms of average inference time per image, total batch inference time, and overall model complexity measured by the number of parameters. Each architecture was tested on 100 randomly selected test images to compute per image latency. AlexNet achieved the lowest average inference time of 0.0134 s per image, followed by ResNet50 at 0.0877 s, while DenseNet121 recorded the highest latency at 0.1819 s per image. Batch inference on the complete test set (batch size = 32) required 0.2682 s (AlexNet), 0.7937 s (ResNet50), and 0.6728 s (DenseNet121). In terms of model complexity, ResNet50 had 24,147,075 parameters, AlexNet 21,598,595 parameters, and DenseNet121 7,334,723 parameters, highlighting the trade-off between architectural depth, parameter size, and computational efficiency. A warm-up run was performed prior to timing to eliminate initialization overhead and ensure fair latency measurement.
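The per-image latency measurement with a warm-up run can be sketched as follows. The `dummy_predict` function is a hypothetical stand-in for a model's forward pass, and the synthetic inputs only mimic the shape of a flattened 64 × 64 slice; absolute timings from this sketch are not comparable to those reported above.

```python
import time

def mean_latency(predict_fn, inputs, warmup=1):
    """Average seconds per input, excluding an initial warm-up pass."""
    for x in inputs[:warmup]:
        predict_fn(x)  # warm-up call, not timed
    start = time.perf_counter()
    for x in inputs:
        predict_fn(x)
    return (time.perf_counter() - start) / len(inputs)

def dummy_predict(x):
    # Hypothetical stand-in for a model forward pass.
    return sum(v * v for v in x)

# 100 synthetic "images", matching the number of test images used for timing.
images = [[0.1] * (64 * 64) for _ in range(100)]
latency = mean_latency(dummy_predict, images)
```

Using a monotonic high-resolution clock and discarding the first call avoids counting one-time costs such as graph construction or memory allocation.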

SECTION C: DEMENTIA/ALZHEIMER'S PREVENTION PLAN FOR HIGH-RISK WOMEN

Recent advancements in dementia research have identified several lifestyle, dietary, medical, and technological strategies that can help prevent or delay the onset of Alzheimer's disease, especially in women at high risk due to genetic, hormonal, or environmental factors. One of the most effective strategies is following the MIND diet, which combines elements of the Mediterranean and DASH diets. This dietary pattern emphasizes eating brain protective foods like leafy greens, berries, nuts, whole grains, legumes, olive oil, and fatty fish. A recent long-term study involving over 92,000 participants found that women who significantly improved their adherence to the MIND diet over ten years had up to a 25% lower risk of developing dementia, even if they made dietary changes later in life.^24,25

In addition to nutrition, regular physical activity is crucial for brain health. Moderate aerobic exercise and resistance training improve blood flow to the brain and lower the risk of vascular issues such as high blood pressure and type 2 diabetes, which contribute to cognitive decline.²⁶ Mental stimulation and social engagement also help maintain cognitive function. Activities like reading, learning new skills, and participating in community programs can enhance cognitive reserve and reduce isolation, a significant risk factor for dementia progression.²⁷

Managing cardiovascular risk factors is also essential. The 2024 update to the Lancet Commission added untreated vision loss and high LDL cholesterol to the list of modifiable dementia risk factors. Effectively managing blood pressure, cholesterol, obesity, and diabetes has been shown to lower dementia incidence, especially in mid-life.^1,2 When medication is needed, recently approved drugs like benzgalantamine (Zunveyl), a cholinesterase inhibitor, provide symptom management for mild to moderate Alzheimer's disease. Approved by the FDA in July 2024, benzgalantamine improves neurotransmitter function and may slightly delay cognitive decline.

Technological advances are helping with early and accurate diagnosis. AI based tools, including convolutional neural networks with explainable AI features, have shown promise in predicting dementia progression from MRI scans.^5,28 Blood biomarker tests are approaching clinical readiness and offer over 90% diagnostic accuracy for early Alzheimer's detection without the invasiveness of Positron Emission Tomography (PET) scans or lumbar punctures.^29,30 Combined with lifestyle interventions, these innovations create a comprehensive prevention plan aimed at preserving brain health and delaying dementia onset in younger women.

Additionally, maintaining nutritional balance is key. Omega-3 and Omega-6 fatty acids should be consumed to keep the body fat ratio of Omega 3 to Omega 6 at 1:1, with about 1.5 to 2 grams of Omega-3 and 30 grams of protein daily. I also recommend the clinically formulated compound ‘Neuroban Fort’, a neuroprotective supplement containing vitamin B12, which supports cognitive health and helps manage early-stage dementia. To further protect against nerve damage, 500 milligrams of curcumin should be taken during treatment, along with a probiotic containing 50 billion CFU (Colony Forming Units) to support gut health.³¹

5 Discussions & limitations

Although the symptom awareness survey was conducted with women from Maharashtra, the proposed framework is not limited to this region. The survey is based on standard cognitive and functional indicators in line with the SAGE framework. These indicators work regardless of geographical, gender and cultural differences. The MRI based deep learning models were trained using multi-institutional, publicly available datasets from Kaggle repositories.⁴ This approach included diverse imaging characteristics from different populations.

However, certain limitations should be acknowledged for a balanced understanding of the findings. First, the lack of subject-level identifiers in the aggregated MRI dataset meant that data was split at the image level rather than the subject level. This could introduce a risk of data leakage. Nonetheless, stratified splitting, augmentation restricted to training data, and careful separation of training, validation, and test sets were used to reduce potential information leakage under image level data constraints. In addition, because the MRI dataset does not include demographic metadata such as sex or age, the proposed framework currently focuses on women through the symptom based screening stage rather than through gender specific model training. Second, the study used two dimensional axial MRI slices, which might not fully capture the three dimensional structural changes associated with Alzheimer's disease progression. Third, the survey component involved a relatively small sample of 169 participants from Maharashtra, which may limit generalizability to broader populations. Fourth, while Integrated Gradients provided valuable qualitative insights into model decision making, quantitative validation using specific anatomical regions of interest was not part of this study.

From a practical standpoint, the proposed framework is meant to serve as an early screening and decision support tool for perimenopausal women rather than a standalone diagnostic system. Combining symptom-aware assessment with explainable MRI analysis offers a scalable way to identify early risks, especially in resource-limited clinical settings. Future clinical validation across different populations and healthcare environments will further enhance the potential of this approach.

6 Future work

In the next phase of this research, we plan to validate the proposed model on multicenter, longitudinal, subject-level MRI datasets from repositories such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and OASIS, covering different scanners, protocols, and patient populations, to ensure stronger generalizability and real-world reliability. The framework can also be expanded to include multimodal data integration, combining MRI with PET imaging, blood-based biomarkers, hormonal profiles, and genetic risk factors such as Apolipoprotein E (APOE) status for a more comprehensive assessment. More advanced architectures, including 3D CNNs, attention-based mechanisms, and longitudinal models, can be explored to capture subtle spatiotemporal patterns. To improve clinical trust, explainability techniques like Grad-CAM will be applied, along with quantitative XAI measures to objectively evaluate explanation consistency and reliability. Finally, lightweight deployment strategies such as pruning and quantization can be investigated to enable scalable and practical implementation in real-world clinical environments.
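As a rough illustration of the two deployment strategies named above, the following sketch applies unstructured magnitude pruning and symmetric 8-bit quantization to a small list of weights. The weight values and the function names `magnitude_prune`, `quantize_int8`, and `dequantize` are assumptions made for illustration only; the sketch does not describe a planned implementation and does not use any deep learning library.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    n_prune = int(sparsity * len(weights))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


# Hypothetical weights standing in for one layer of a trained network.
weights = [0.02, -0.75, 0.31, -0.04, 1.27, 0.009, -0.56, 0.11]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(pruned, restored))
print(pruned, q, max_error)
```

In this toy case half of the weights are set to zero and the remainder are stored as small integers plus a single scale factor. The reconstruction error per weight is bounded by half the scale, which is the trade-off such strategies would need to balance against classification accuracy.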

7 Conclusion

This study presents a comprehensive, women-focused framework for early Alzheimer's disease detection that merges real-world symptom awareness with MRI-based deep learning and ensemble modeling. By targeting women aged 45–60, a group at higher risk due to hormonal and biological factors, this survey-driven deep learning approach fills an important gap in current Alzheimer's research. The combination of three complementary CNN architectures using a stacking-based ensemble improved classification accuracy to 0.97, compared with 0.94 for the best individual model (AlexNet), alongside high F1-score and AUC. Ablation studies and t-SNE visualizations confirmed the advantages of feature diversity across models. Moreover, using Integrated Gradients for explainability indicated that the system attends to brain regions linked to early neurodegeneration. The inclusion of survey-derived SAGE scoring creates a practical, noninvasive screening method to facilitate timely MRI referrals and preventive measures.

Footnotes

List of abbreviations

Ethical approval and consent to participate

This study did not involve human or animal experimentation directly. Instead, it utilized publicly available datasets from Kaggle that are anonymized and ethically cleared for academic research purposes. No additional ethical approval was required.

Informed consent

The study included a questionnaire-based survey of 169 women. Participation was voluntary, and no identifiable data were collected. No medical interventions or diagnoses were made. The survey was treated as general awareness feedback, not clinical research. Hence, no institutional ethical approval or IRB approval was required.

Authors' contributions

Mrs. Snehal Rohit Shinde conceptualized the research idea, conducted the literature review, performed the dataset preprocessing, implemented the deep learning models, and carried out the result analysis and interpretation. She was also responsible for drafting and revising the manuscript.

Dr Swati V. Sankpal provided expert supervision throughout the research, guided the methodology refinement, assisted in experimental design and validation, and critically reviewed the manuscript to enhance its technical and scientific quality. She also contributed to the final approval of the version to be submitted.

Both authors have read and approved the final manuscript and agree to be accountable for all aspects of the work.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Availability of data and material

The dataset used in this study is publicly accessible from the Kaggle data repository cited in the references.⁴ All data used in the research were anonymized and accessed in accordance with the terms and conditions of the respective data repositories. No identifiable personal information was collected or processed.

Use of AI technology

AI-based language assistance tools were used solely for grammatical correction and improvement of readability. No AI system was used to generate original research content or scientific claims. Two figures were conceptualized and designed by the authors but generated using OpenAI, based on an author-defined methodology and prompt.

ORCID iDs

Snehal Rohit Shinde

Swati Vijay Sankpal

References

1. Ferretti, Iulita, Cavedo, et al. Sex differences in Alzheimer disease — the gateway to precision medicine. Nat Rev Neurol 2018; 14: 457–469.

2. Livingston, Huntley, Liu, et al. Dementia prevention, intervention, and care: 2024 report of the Lancet Standing Commission. Lancet 2024; 404: 572–628.

3. Scharre, Chang, Nagaraja, et al. Self-administered gerocognitive examination: longitudinal cohort testing for the early detection of dementia conversion. Alzheimers Res Ther 2021; 13: 192.

4. Islam, Zhang. Early diagnosis of Alzheimer’s disease: a neuroimaging study with deep learning architectures. Brain Inf 2018; 5: 1–11.

5. Feng, Elazab, Yang, et al. Deep learning framework for Alzheimer’s disease diagnosis via 3D-CNN and FSBi-LSTM. IEEE Access 2019; 7: 63605–63618.

6. Martinez-Murcia, Ortiz, Gorriz, et al. Studying the manifold structure of Alzheimer’s disease: a deep learning approach using convolutional autoencoders. IEEE J Biomed Health Inform 2019; 24: 17–26.

7. Fritsch, Wankerl, Nöth. Automatic diagnosis of Alzheimer’s disease using neural network language models. In: Proceedings of ICASSP 2019 – IEEE international conference on acoustics, speech and signal processing (ICASSP), 2019, pp.5841–5845.

8. Nawaz, Anwar, Liaqat, et al. Deep convolutional neural network based classification of Alzheimer’s disease using MRI data. In: Proceedings of 2020 IEEE 23rd international multi-topic conference (INMIC), 2020, pp.1–6. DOI: 10.1109/INMIC50486.2020.9318172.

9. de Oliveira IAD, et al. Exploring hippocampal asymmetrical features from MRI for Alzheimer’s disease classification. Int J Imaging Syst Technol 2020; 30: 393–406.

10. Zhu, Sun, Huang, et al. Dual attention multi-instance deep learning for Alzheimer’s disease diagnosis with structural MRI. IEEE Trans Med Imaging 2021; 40: 2354–2366.

11. Khan, Kaushik, Rahmani MKI, et al. Stacked deep dense neural network model to predict Alzheimer’s dementia using audio transcript data. IEEE Access 2022; 10: 32750–32765.

12. Klingenberg, Stark, Eitel, et al. Higher performance for women than men in MRI-based Alzheimer’s disease detection. Alzheimers Res Ther 2023; 15: 1–13.

13. Naveen, Cholli. Predicting Alzheimer’s onset: leveraging pretrained deep neural networks and transfer learning for early detection. Int J Intell Syst Appl Eng 2024; 12: 1735.

14. Morris, Liu, et al. Using a convolutional neural network and explainable AI to diagnose dementia based on MRI scans. arXiv preprint arXiv:2406.18555, 2024.

15. Amilo, Sadri, Hincal. A hybrid approach to heart disease prediction using a fractional-order mathematical model and machine learning algorithm. Comput Methods Biomech Biomed Eng 2025: 1–30. DOI: 10.1080/10255842.2025.2523313.

16. Chugh. Best Alzheimer MRI dataset (99% accuracy) [data set]. Kaggle, https://www.kaggle.com/datasets/lukechugh/best-alzheimer-mri-dataset-99-accuracy (2023).

17. Chen, Isa NAM, Liu. A review of convolutional neural network based methods for medical image classification. Comput Biol Med 2025; 185: 109507, ISSN 0010-4825.

18. Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.

19. Zhang, Ren, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016.

20. Huang, Liu, van der Maaten, et al. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2017.

21. van der Maaten, Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research 2008; 9(86): 2579–2605.

22. Wolpert. Stacked generalization. Neural Netw 1992; 5: 241–259.

23. Sundararajan, Taly, Yan. Axiomatic attribution for deep networks. In: Proceedings of the 34th international conference on machine learning (ICML), 2017, pp.3319–3328.

24. Morris, Tangney, Wang, et al. MIND Diet associated with reduced incidence of Alzheimer’s disease. Alzheimer’s Dement 2015; 11: 1007–1014.

25. Agarwal, Dhana, Barnes, et al. Changes in MIND diet adherence and risk of dementia in a large prospective cohort. Neurology 2023; 100: e198–e207.

26. Erickson, Hillman, Stillman, et al. Physical activity, cognition, and brain outcomes: a review of the 2018 Physical Activity Guidelines. Med Sci Sports Exerc 2019; 51: 1242–1251.

27. Stern. Cognitive reserve in ageing and Alzheimer’s disease. Lancet Neurol 2012; 11: 1006–1012.

28. Samek, Wiegand, Müller K-R. Explainable artificial intelligence: understanding, visualizing, and interpreting deep learning models. ITU J. 2017; 1: 39–48.

29. Grande, et al. Blood biomarkers of Alzheimer’s disease and progression across stages of cognitive decline. Nat Med 2025.

30. Varesi, et al. Blood-based biomarkers for Alzheimer’s disease diagnosis. J Alzheimer's Dis 2022; 88: 861–887.

31. Desai. Dementia/Alzheimer’s prevention plan in early age of women (Validated expert report). Faculty of Science, Somaiya Vidyavihar University, 2025.