Abstract
Women are more severely impacted by Alzheimer's disease (AD), a progressive neurodegenerative illness, particularly in the age range of 45 to 60 years, when early signs are often misdiagnosed. Detecting the condition at this early stage is crucial to improve quality of life and slow disease progression. This research proposes a hybrid screening framework that integrates symptom level awareness, captured through a structured survey, with deep learning based Magnetic Resonance Imaging (MRI) assessment. A curated dataset of 8511 preprocessed axial brain MRI slices sourced from Kaggle's OASIS Database was used, categorized into Cognitively Normal, Moderate Impairment, and Mild Cognitive Impairment classes. Features were extracted with three pretrained convolutional neural networks (AlexNet, ResNet50, and DenseNet121), and performance was further improved using ensemble and stacking learning methods. Among the evaluated models, AlexNet, DenseNet121, and ResNet50 achieved accuracies of 0.94, 0.87, and 0.69, respectively, while soft voting improved accuracy to 0.95 and a stacking ensemble improved it further to 0.97. Integrated Gradients was used to generate saliency heatmaps that highlight clinically relevant neuroanatomical regions. Furthermore, a symptomatic survey of 169 women from Maharashtra was conducted and analyzed to facilitate scalable early screening. Thus, this study presents a comprehensive, interpretable, and feasible framework for the early detection of Alzheimer's.
Keywords
Introduction
Alzheimer's disease (AD) is a neurological condition that gradually deteriorates memory, cognitive abilities, and daily functioning. While it primarily affects older adults, studies reveal that women are disproportionately impacted, especially those aged between 45 and 60. Dr Richard Isaacson, director of Florida Atlantic University's Alzheimer's Prevention Clinic, indicates that two out of three individuals with Alzheimer's disease are women.1,2 This important finding, combined with previous research, shows that tailored lifestyle changes, like diet improvement, physical activity, stress management, and sleep hygiene can not only lower risk factors for both genders but also prove particularly effective for women.
This highlights the necessity for gender specific approaches to Alzheimer's prevention and diagnosis, especially for women who frequently exhibit early signs of Mild Cognitive Impairment (MCI). Unidentified MCI can lead to irreversible progression to full blown Alzheimer's disease. Thus, early diagnosis and timely intervention are vital for slowing disease development, improving life quality, and mitigating the overall societal and economic burden of the condition. 2 Despite considerable progress in neuroimaging and diagnostic technologies, accurately diagnosing Alzheimer's disease early remains particularly difficult, especially in its initial phases. The current investigation introduces a deep learning based screening model specifically aimed at women aged 45 to 60. This model seeks to identify subtle brain changes related to MCI and early Alzheimer's using Magnetic Resonance Imaging (MRI) images coupled with sophisticated Convolutional Neural Networks (CNNs).
The proposed system integrates a Self-Administered Gerocognitive Examination (SAGE) questionnaire based cognitive health score with MRI image analysis to enable comprehensive Alzheimer's risk assessment. Individuals with scores above the defined threshold undergo MRI preprocessing and CNN based feature extraction, followed by classification into Alzheimer's, MCI, or Normal categories. Based on the predicted outcome and risk level, the system provides personalized prevention strategies, medical guidance, and periodic monitoring recommendations.
A survey involving 169 women from various regions of Maharashtra supported the development of the proposed framework by capturing real world cognitive and behavioral trends. The questionnaire, adapted from the SAGE, 3 included over 20 questions related to early neurodegenerative and menopausal symptoms. Responses were analyzed using graphs and tables to identify significant patterns, and a cognitive health score was derived to indicate whether MRI evaluation may be needed.
The system's performance was evaluated using standard metrics such as accuracy, precision, recall, and F1-score to ensure a balanced and reliable assessment. Based on insights from both the survey responses and model predictions, preventive recommendations were proposed specifically for the target group. Although Alzheimer's disease has been widely studied, limited research focuses on women, despite their increased vulnerability due to hormonal, neurological, and psychological changes during this life stage. The proposed model adopts convolutional architectures commonly used in state-of-the-art (SOTA) Alzheimer's MRI studies; the novelty lies in strengthening the evaluation framework through survey analysis, followed by screening oriented feasibility using CNN models. Although the deep learning models used in this work are not gender specific and are not trained on gender specific MRI datasets, the proposed system adopts a two stage framework comprising a women focused, questionnaire based screening followed by MRI based analysis. The first stage employs the SAGE questionnaire to identify women who may be at risk of cognitive decline at an early stage, after which selected participants undergo MRI scanning. The MRI data are then analyzed using a non-gender specific deep learning model to detect potential abnormalities associated with Alzheimer's progression. Previous longitudinal studies have shown that the SAGE questionnaire can facilitate earlier detection of cognitive decline, supporting its role as an initial screening component. 3 This two stage approach enables targeted screening of women aged 45 to 60 years while applying general deep learning models for MRI classification.
The remainder of this paper is organized as follows: Sections 2 and 3 present the literature review and the methods and materials employed in the research. Section 4 presents the results concerning the CNN architectures and their validation. Section 5 details the discussion and limitations, while Sections 6 and 7 conclude the paper with future work and a summary of the primary conclusions, highlighting the significance of detecting early cognitive impairment.
Literature survey
The literature review highlights recent deep learning approaches for Alzheimer's detection, focusing on MRI based CNNs, recurrent models, attention mechanisms, and ensemble techniques. While these methods show strong performance, most studies are based on general populations and do not consider sex specific risks or symptom progression, particularly in women (Table 1).
Recent advances in biomedical modeling have explored hybrid approaches that integrate mathematical disease modeling with machine learning techniques. For instance, A Hybrid Approach to Heart Disease Prediction Using a Fractional-Order Mathematical Model and Machine Learning Algorithm 15 uses fractional order differential equations to model the temporal evolution of physiological parameters and combines it with a decision tree classifier on the UCI Cleveland dataset for accurate prediction. This highlights the effectiveness of integrating dynamic modeling with interpretable machine learning for improved clinical decision making.
However, most approaches overlook gender-specific biological factors and rarely focus on women, despite evidence of differing disease progression. Many studies rely primarily on imaging data, with limited integration of symptom based or clinical information, restricting holistic risk assessment. Explainability is also often lacking, with few studies using interpretability methods such as Integrated Gradients. These gaps motivate the proposed symptom aware hybrid screening framework, which combines symptom based assessment with MRI based deep learning for early detection of Alzheimer's disease. Unlike purely physiological models, this approach integrates neuroimaging with patient reported cognitive indicators, providing a more comprehensive and practical screening system.
Methods and materials
Figure 1 illustrates the complete proposed framework for early Alzheimer's detection, beginning with SAGE based cognitive screening and progressing to AI driven risk prediction and prevention. The details of this framework are divided into Sections A, B, and C as follows.

Work flow of the proposed framework for early Alzheimer's detection (figure conceptualized and designed by the authors; generated using OpenAI based on author defined methodology and prompt.).
SECTION A: STATISTICAL ANALYSIS OF SURVEY
The survey involved 169 women in Maharashtra to evaluate neurological, behavioral, and cognitive symptoms associated with early neurodegenerative diseases and MCI. Participants responded to an extensive questionnaire containing over 20 inquiries.
Key questions and answers are illustrated in the table below. Table 2 and Figure 2 present the analysis of the data. To assess the relationship between symptoms, we calculated Spearman's rank correlation coefficient (ρ).
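As an illustration of the rank correlation mentioned above, the following is a minimal sketch in plain Python. It assumes ordinal (Likert-style) answers and handles ties by average ranks; the two answer lists are hypothetical and are not taken from the survey.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical answers (1 = never ... 5 = always) for two symptoms.
forgetfulness = [1, 2, 2, 3, 4, 5, 5]
sleep_trouble = [1, 1, 2, 3, 3, 4, 5]
rho = spearman_rho(forgetfulness, sleep_trouble)
```

Values of ρ close to +1 or −1 indicate a strong monotonic association between two symptoms, while values near 0 indicate little association.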

Responses from the participants.
Related work.
Responses of the survey.
Dataset class distribution (train/test splits).
Respondents were also asked to cross off any symptoms that applied to them from a list that included:
The SAGE scoring method also helps quantify survey responses and supports early identification of cognitive impairment. 3 It assesses key domains such as memory, orientation, language, reasoning, executive function, and visuospatial ability. A score of 16 or below suggests potential cognitive decline and indicates the need for further evaluation. This scoring framework helps guide decisions on whether MRI screening should be recommended, enabling timely detection and early intervention to slow progression toward dementia.
SECTION B: METHODOLOGY
As shown in Figure 3, the proposed framework uses a multistep pipeline for accessible and accurate early Alzheimer's screening. MRI data are preprocessed to improve quality, followed by CNN based classification to detect disease related patterns. The resulting models can be deployed through web or mobile platforms using ONNX compatibility. This approach shows potential as a clinically useful screening tool pending further validation. Figure 4 shows the MRI based CNN ensemble and stacking pipeline, in which multiple deep learning models collaborate to produce a robust final classification output.

Deep learning for the proposed study.

MRI based CNN ensemble and stacking pipeline (figure conceptualized and designed by the authors; generated using OpenAI based on author defined methodology and prompt.).
All experiments were conducted on Google Colab using an NVIDIA Tesla T4 GPU (16 GB) with TensorFlow/Keras in a Python environment. NumPy, scikit-learn, and Matplotlib were used for data preprocessing, evaluation, and visualization. Reproducibility was ensured by fixing random seeds (NumPy and TensorFlow seed = 42), disabling Python hash randomization (PYTHONHASHSEED = 0), and enabling deterministic TensorFlow operations. The exact library versions were recorded at runtime, enabling reliable comparison of AlexNet, ResNet50, DenseNet121, and their ensemble models.
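The reproducibility settings described above can be sketched as follows. This is an assumption-laden illustration rather than the authors' script: the helper name `set_global_seed` is hypothetical, seeding Python's own `random` module is an addition, and the NumPy and TensorFlow calls are skipped when those libraries are not installed.

```python
import os
import random

SEED = 42  # seed value reported in the text

def set_global_seed(seed=SEED):
    """Fix the random seeds named in the text (hypothetical helper)."""
    # This only affects interpreters started afterwards; for the current
    # process PYTHONHASHSEED has to be set before Python launches.
    os.environ["PYTHONHASHSEED"] = "0"
    random.seed(seed)  # addition: the text mentions NumPy and TensorFlow only
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
        # Requests deterministic kernels (available in recent TF 2.x releases).
        tf.config.experimental.enable_op_determinism()
    except (ImportError, AttributeError):
        pass  # TensorFlow missing or too old for op determinism

set_global_seed()
```

Calling the helper once at the top of a notebook, before any model is built, keeps weight initialization and data shuffling comparable across runs.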
The research utilizes the Kaggle Alzheimer's axial MRI dataset, 16 with each image representing a 2D axial slice from complete 3D MRI volumes. The axial plane provides a horizontal cross-sectional view of the brain. The dataset comprises slicewise 2D images derived from volumetric acquisitions and does not include demographic details of the patients or subject IDs. This limitation reflects a common challenge in medical imaging research where publicly available datasets lack detailed demographic metadata. Future work will focus on training and validating the model using clinically curated datasets with gender and age annotations to enable truly gender specific screening.
The data employed for this research project is a preprocessed version of the Kaggle Alzheimer's MRI data. 16
To address dataset insufficiency and class imbalance (the Moderate class initially contains only two subjects), a WGAN-GP model was used to generate additional synthetic MRI slices and balance all three categories. The quality of the generated images was assessed using FID, SSIM, PSNR, sharpness difference (SD), and Seaborn's distplot. The synthetic images closely resemble real MRIs, with a mean FID of 0.13, SSIM of 0.97, PSNR of 32 dB, and SD of 0.04. Classification also improved when these images were used, with an 11.77% gain in Balanced Accuracy, a 15% gain in Matthews Correlation Coefficient (MCC), and a 91.4% gain in minority class performance, at the cost of a 1% loss in majority class performance. This comparative analysis indicates that the WGAN-GP model performs better than DC-GAN models. The SMOTE technique, employed to mitigate class imbalance by generating synthetic samples through linear interpolation between minority instances, is defined as
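A minimal sketch of SMOTE-style interpolation, assuming the standard formulation x_new = x_i + λ (x_nn − x_i), where x_nn is one of the k nearest minority neighbours of x_i and λ is drawn uniformly from [0, 1). The symbols, the function name `smote_sample`, and the values of k and the seed are assumptions, not details taken from the study.

```python
import numpy as np

def smote_sample(minority, k=5, n_new=10, seed=42):
    """Generate n_new synthetic rows by interpolating minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))
        # Euclidean distance from sample i to every minority sample.
        dist = np.linalg.norm(minority - minority[i], axis=1)
        # Skip the closest entry, which is normally the sample itself.
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic[row] = minority[i] + lam * (minority[j] - minority[i])
    return synthetic
```

Because each synthetic vector lies on the line segment between two real minority vectors, every feature stays within the range already observed for that class.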
In total, 8511 MRI images were divided into training and testing sets for this study. There are 3200 images classified as Cognitively Normal (CN), 2572 images categorized as Moderately Impaired (MI), and 2739 images labeled as MCI across the sets (with samples displayed in Figure 5(a), (b), & (c) below). The key features detected by the model are depicted in the visual samples.

Sample 2D axial slice images per class (a) MCI, (b) MI, & (c) CN.
To ensure stable evaluation and assess the effects of different data splits, the dataset was divided at the image level, as there were no subject IDs available (Table 3).
For this project, each image undergoes preprocessing, which includes converting it to grayscale, resizing it to 64 × 64 pixels, and flattening it into a 4096 dimensional vector. Data augmentation was applied to the training set to improve generalization and reduce overfitting. Random rotations (±15°), horizontal flipping, and zooming (0.9–1.1) were used to introduce controlled geometric variability, while brightness adjustment (0.9–1.1) modeled illumination changes. Each training image was augmented twice using randomly sampled transformations. Augmentation was not applied to validation or test data to ensure unbiased evaluation.
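The preprocessing steps above (grayscale conversion, resizing to 64 × 64, flattening to 4096 values) can be sketched with NumPy alone. The luma weights and the nearest-neighbour resizing are assumptions made to keep the example dependency free; the study does not state which interpolation was used.

```python
import numpy as np

TARGET = 64  # side length used in the study

def preprocess_slice(image):
    """Grayscale -> 64 x 64 -> flat 4096-dimensional vector."""
    image = np.asarray(image, dtype=np.float32)
    if image.ndim == 3:
        # Assumed RGB input: collapse channels with ITU-R BT.601 luma weights.
        weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
        image = image[..., :3] @ weights
    h, w = image.shape
    # Nearest-neighbour resize: pick evenly spaced source rows and columns.
    rows = np.arange(TARGET) * h // TARGET
    cols = np.arange(TARGET) * w // TARGET
    resized = image[np.ix_(rows, cols)]
    return resized.reshape(-1)

# Hypothetical slice size, used only to exercise the function.
fake_slice = np.random.default_rng(42).integers(0, 256, size=(176, 208, 3))
vector = preprocess_slice(fake_slice)
```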
We performed an ablation study across multiple input resolutions (64 × 64, 128 × 128, and 224 × 224) to evaluate the trade off between computational efficiency and anatomical detail preservation. The ablation study was conducted on the said dataset 16 using an ImageNet pretrained DenseNet121 model with frozen backbone layers and a custom classification head, trained using the Adam optimizer (batch size = 16, categorical crossentropy loss) for 3 epochs on a GPU enabled environment, while keeping all hyperparameters constant across 64 × 64 and 128 × 128 resolutions for fair comparison.
Increasing the input resolution from 64 × 64 to 128 × 128 led to only marginal improvements in accuracy, while noticeably increasing the training time. As summarized in Table 4 and illustrated in Figure 6(a) and (b), the higher resolution offers limited performance gain at a substantially greater computational cost. Therefore, 64 × 64 was selected as it provides a more practical balance between classification performance and computational efficiency.

(a) Resolution vs validation accuracy, (b) resolution vs training time per epoch.
Per epoch comparison of training accuracy, validation accuracy, and computational time for 64 × 64 and 128 × 128 input resolutions.
To investigate the effect of data balancing and augmentation strategies, four experimental configurations were tested using ResNet50, DenseNet121, and AlexNet architectures:
1. Baseline (original training data)
2. Augmentation only
3. SMOTE only
4. SMOTE followed by augmentation
The baseline experiments demonstrated strong overall performance, with AlexNet achieving the highest validation accuracy (0.9512), followed by DenseNet121 (0.9115) and ResNet50 (0.7949) (Table 5). These results indicate that the dataset preserves meaningful discriminative features despite the presence of class imbalance. However, applying geometric and photometric augmentation alone led to a substantial drop in performance across all models (≈0.33–0.37), suggesting that such spatial transformations may distort subtle anatomical biomarkers that are critical for MRI based diagnosis.
Validation accuracy under different data balancing strategies.
In contrast, SMOTE based oversampling consistently improved validation accuracy across all architectures, most notably for AlexNet (0.9863). This highlights the impact of class imbalance on model learning and demonstrates that synthetic minority sampling can enhance class separability and decision boundaries. When SMOTE was combined with augmentation, performance again declined, likely due to the introduction of amplified noise and altered feature distributions. These findings suggest that conventional augmentation strategies may not always be appropriate for small scale medical imaging datasets where fine structural details are essential.
Figure 7 displays the two dimensional t-SNE visualization of MRI feature embeddings subsequent to dimensionality reduction through Principal Component Analysis (PCA). Initially, the original 4096 dimensional feature vectors representing the preprocessed MRI images are reduced to 50 dimensions using PCA to minimize any kind of noise and ensure numerical stability. Finally, the reduced dimensions are further transformed into a 2-D space using t-SNE to evaluate the class separation of various classes for CN, MCI, and MI. The visualization highlights clear clustering patterns, with CN samples forming relatively tight and distinctly separated areas, while MCI and MI samples show some overlap, which reflects the gradual and progressive nature of Alzheimer's disease. This overlap carries clinical significance, as initial stage cognitive impairment often resembles structural features associated with normal ageing and moderate impairment.
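The two step reduction described above (4096 → 50 dimensions with PCA, then 50 → 2 with t-SNE) can be sketched as follows. PCA is written with a NumPy SVD so the example runs without extra packages; the t-SNE step assumes scikit-learn is installed and is therefore imported lazily. The random matrix is a stand-in for the flattened MRI vectors, and default t-SNE settings are an assumption.

```python
import numpy as np

def pca_reduce(features, n_components=50):
    """Project rows of `features` onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def tsne_embed(reduced, seed=42):
    """2-D t-SNE embedding of the PCA output (requires scikit-learn)."""
    from sklearn.manifold import TSNE
    return TSNE(n_components=2, random_state=seed).fit_transform(reduced)

rng = np.random.default_rng(42)
fake_features = rng.normal(size=(200, 4096))  # stand-in for flattened slices
reduced = pca_reduce(fake_features)           # shape (200, 50)
# embedding = tsne_embed(reduced)             # shape (200, 2) when scikit-learn is available
```

Running PCA first reduces noise and makes the pairwise distance computation inside t-SNE far cheaper than it would be on the raw 4096-dimensional vectors.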

2-D t-SNE visualization of the training images.

Architecture of AlexNet.

Architecture of ResNet50.

Architecture of DenseNet121.
In this study, features are extracted from three CNN architectures (AlexNet, ResNet50, and DenseNet121), since CNNs are well regarded in medical imaging for their ability to learn hierarchical spatial features.
CNN based feature extraction
Recent studies have demonstrated that CNNs are highly effective for medical image classification across diverse disease domains, particularly when combined with preprocessing, transfer learning, and ensemble strategies. Moreover, the integration of explainable CNN models is increasingly emphasized to enhance clinical trust and interpretability in medical image-based diagnosis. 17
CNNs generate hierarchical feature representations through consecutive convolution, nonlinear activation, and pooling operations. For an input image X, the convolutional transformation at the l-th layer can be expressed as 18:
To achieve spatial down-sampling and improve translation invariance, CNNs utilize pooling operations, which can be formulated as 18,19:
After the convolutional and pooling stages, the resulting feature maps are flattened and passed through one or more fully connected (FC) layers. The output of an FC layer is given by:
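In one common notation, the convolution, pooling, and fully connected operations described above can be written as shown below. The symbols are assumptions and may differ from those of the original equations: A^(l) denotes the feature maps of layer l, W and b the learned weights and biases, σ a nonlinear activation such as ReLU, and f the flattened feature vector. Pooling is shown for the max-pooling case.

```latex
% Convolution at layer l, starting from the input image X
A^{(l)} = \sigma\!\left( W^{(l)} * A^{(l-1)} + b^{(l)} \right), \qquad A^{(0)} = X

% Max pooling over a window \mathcal{R}_{i,j} around position (i, j)
P^{(l)}_{i,j} = \max_{(p,q) \in \mathcal{R}_{i,j}} A^{(l)}_{p,q}

% Fully connected layer applied to the flattened feature vector f
y = \sigma\!\left( W_{fc}\, f + b_{fc} \right)
```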
The proposed research uses three CNN Architectures: AlexNet, ResNet50, and DenseNet121. Their predictions were combined using simple ensemble averaging and a stacking meta-learner.
The subsequent convolutional layers comprise a 256-filter 5 × 5 layer with batch normalization, and three additional convolutional layers (384, 384, and 256 filters, each with 3 × 3 kernels), all normalized to enhance learning stability. A concluding max pooling layer organizes the acquired feature maps for the fully connected classifier. The classification section contains two high capacity Dense layers with 4096 neurons each, reflecting the original AlexNet structure. Dropout regularization (0.5) is applied after each Dense layer to mitigate overfitting. The output layer employs softmax activation to classify three categories of Alzheimer's (Mild, Moderate, No Impairment).
The final model was compiled with the Adam optimizer (learning rate 0.0001) and categorical cross-entropy loss. Training used a batch size of 8 for up to 40 epochs, with early stopping based on validation loss to avoid overfitting. Only the newly introduced classifier layers were trained, allowing for efficient transfer learning in the context of MRI based Alzheimer's classification. This architecture merges stable pre-trained feature extraction with a streamlined classifier designed specifically for Alzheimer's MRI classification.
During the initial training phase, the DenseNet121 backbone was frozen to prevent overfitting while retaining the general feature representations learned from large scale datasets. The input MRI scans were resized to 64 × 64 × 3 and passed through the network, followed by a Global Average Pooling layer to obtain a compact feature vector. This representation was then processed by a custom classification head consisting of fully connected layers with ReLU activation, batch normalization, dropout regularization, and a final softmax layer for three class prediction.
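A hedged Keras sketch of the frozen-backbone setup described above. Values marked "from the text" are reported in this paper; the head width, the early stopping patience, and the use of ImageNet weights for this model are assumptions, and the function name is hypothetical. TensorFlow is imported inside the function so that the configuration can be inspected without it.

```python
TRAIN_CONFIG = {
    "input_shape": (64, 64, 3),  # from the text
    "num_classes": 3,            # from the text
    "learning_rate": 1e-4,       # from the text
    "batch_size": 8,             # from the text
    "dropout": 0.3,              # best value reported by the grid search
    "head_units": 256,           # hypothetical: width not stated in the text
    "patience": 5,               # hypothetical: patience not stated in the text
}

def build_densenet_classifier(cfg=TRAIN_CONFIG, weights="imagenet"):
    """Frozen DenseNet121 backbone followed by a small classification head."""
    import tensorflow as tf
    from tensorflow.keras import layers

    backbone = tf.keras.applications.DenseNet121(
        include_top=False, weights=weights, input_shape=cfg["input_shape"])
    backbone.trainable = False  # only the new head is trained

    model = tf.keras.Sequential([
        backbone,
        layers.GlobalAveragePooling2D(),
        layers.Dense(cfg["head_units"], activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(cfg["dropout"]),
        layers.Dense(cfg["num_classes"], activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=cfg["learning_rate"]),
        loss="categorical_crossentropy",
        metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=cfg["patience"], restore_best_weights=True)
    return model, early_stop
```

The returned callback would be passed to `model.fit(..., callbacks=[early_stop])` so that the weights with the lowest validation loss are restored at the end of training.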
Optimizer
We opted for the Adam optimizer due to its adaptive learning rate mechanism, which combines the advantages of both AdaGrad and RMSProp. Adam handles sparse gradients well and converges quickly, making it a suitable option for training deep neural networks. The models were trained for 50 epochs using the Adam optimizer (learning rate = 1 × 10⁻⁴) and categorical cross-entropy loss, with a batch size of 8. The batch size was chosen experimentally and worked well for our use case. Early stopping based on validation loss was used to restore the optimal weights. This arrangement facilitated the efficient extraction of hierarchical spatial features from MRI slices while maintaining computational efficiency.
Hyperparameter tuning
A grid search strategy was employed to determine the optimal dropout rate and learning rate for each architecture. Two dropout values (0.3 and 0.5) were evaluated with a fixed learning rate of 0.0001. Across all models, a dropout rate of 0.3 yielded superior validation accuracy. The best configurations achieved validation accuracies of
Cross-validation
To assess robustness against random partition effects, the models were additionally evaluated across five independent stratified random image level splits. The mean and standard deviation of the performance metrics were computed to quantify stability. DenseNet121 exhibited the most consistent behavior, indicating strong robustness across different partitions. AlexNet achieved the highest mean accuracy but showed larger variability, suggesting increased sensitivity to data splits. ResNet50 demonstrated moderate performance variability, reflecting its comparatively weaker class separation on this dataset. These findings support the use of ensemble and stacking strategies to reduce variance and improve reliability. The results of the statistical significance analysis (McNemar test) are shown in Table 6 and described in Results Section 4.6.
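For reference, a pairwise McNemar comparison can be sketched in plain Python as below. The exact binomial form is used here as an assumption; the paper does not state whether the exact or the chi-square variant was applied. The significance level is the Bonferroni-adjusted value quoted with Table 6.

```python
from math import comb

ALPHA = 0.025  # Bonferroni-adjusted level quoted with Table 6

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions of models A and B."""
    # b: A correct and B wrong; c: A wrong and B correct (discordant pairs).
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    # Under H0 each discordant pair is equally likely to favour A or B.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Only the discordant pairs enter the test, so two models with similar accuracy can still differ significantly if their errors fall on different images.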
McNemar test results for pairwise model comparisons (Bonferroni-adjusted α = 0.025).
The three pretrained CNN architectures (ResNet50, DenseNet121, and AlexNet) were assessed on the three class test set (Mild, Moderate, Non-Impaired), which included 831 images; the findings are shown in Table 7.
Accuracy & Macro-F1 for ResNet50, DenseNet121, and AlexNet.
Feature space separability analysis (t-SNE representation).
Accuracy & Macro-F1 for ensemble vs stacking.
The relatively lower accuracy of ResNet50 (0.69) indicates variability in base model performance, which may be attributed to differences in architecture depth and feature extraction capability. This highlights the importance of selecting models best suited to the dataset characteristics. AlexNet produced the clearest feature clusters and had the strongest performance as a single model.
t-Distributed Stochastic Neighbor Embedding (t-SNE) 21 is a nonlinear technique used for dimensionality reduction. It is frequently utilized to visualize high dimensional data in two or three dimensions. To evaluate the discriminative power of extracted MRI features, t-SNE was used to project high dimensional CNN embeddings into two dimensional space on the test set. Visualizations were generated before and after feature selection to assess improvements in class separability. AlexNet showed well defined and distinct clusters across classes, indicating strong feature discrimination. DenseNet121 demonstrated moderate separation with some overlap, particularly for Mild cases. ResNet50 exhibited considerable class overlap, suggesting less distinct feature boundaries for this dataset. This analysis was conducted on the test sets to confirm that the observed structure is applicable beyond the training data (Figure 11(a), (b), and (c)).

(a) AlexNet embeddings, (b) DenseNet embeddings, and (c) ResNet embeddings.
Feature space separability was evaluated using silhouette score and inter/intra-class distance analysis as shown in Table 8. The silhouette score of 0.2531 indicates moderate clustering structure, suggesting partial but meaningful separation between cognitive classes.
The average intra-class distance (50.99) was substantially lower than the inter-class distance (86.89), yielding an overlap ratio of 0.5868. This confirms that while classes are distinguishable, noticeable feature overlap exists, which justifies the need for ensemble learning to improve decision boundary refinement.
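The overlap ratio quoted above is the intra-class distance divided by the inter-class distance. The sketch below reproduces that arithmetic and shows one way such distances could be computed from embeddings, assuming mean pairwise Euclidean distances; the paper does not state whether pairwise or centroid based distances were used.

```python
import numpy as np

def class_distances(embeddings, labels):
    """Mean pairwise Euclidean distance within and between classes."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # ignore self-distances
    intra = dist[same & off_diag].mean()
    inter = dist[~same].mean()
    return intra, inter, intra / inter

# Arithmetic check with the values reported in the text.
reported_ratio = 50.99 / 86.89  # about 0.5868
```

A ratio well below 1 means samples sit closer to their own class than to other classes; a ratio approaching 1 means the classes largely overlap.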
Ensemble learning module
Soft-voting ensemble is applied to reduce prediction variance by averaging outputs from multiple models. To reduce any model specific bias, this soft-voting ensemble was formed using the three top performing models selected following hyper parameter tuning: (1) ResNet50, (2) DenseNet121, and (3) AlexNet.
The soft-voting ensemble computes the final class probability as the average of the classwise probabilities predicted by individual models:
where N = 3 is the number of models and Pi(c) is the probability that model i assigns to class c.
The ensemble was evaluated on the validation set using:
The ensemble achieved a validation accuracy of:
Where N is the number of validation images.
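A minimal sketch of soft voting and the accuracy computation, assuming each model returns an (images × classes) array of softmax probabilities. The arrays are made-up values for two images and are not results from the study.

```python
import numpy as np

def soft_vote(prob_list):
    """Average class probabilities over models: P(c) = (1/N) * sum_i P_i(c)."""
    return np.mean(np.stack(prob_list, axis=0), axis=0)

def accuracy(y_true, y_pred):
    """Fraction of images whose predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Made-up softmax outputs: rows are images, columns are the three classes.
p_resnet = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3]])
p_densenet = np.array([[0.6, 0.2, 0.2], [0.1, 0.3, 0.6]])
p_alexnet = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])

p_ensemble = soft_vote([p_resnet, p_densenet, p_alexnet])
y_pred = p_ensemble.argmax(axis=1)  # class with the highest averaged probability
```

In the second image ResNet50 alone would pick the middle class, but the averaged probabilities follow the two more confident models, which is the variance reduction effect soft voting is meant to provide.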
Stacking based ensemble learning integrates predictions from various CNN architectures to enhance accuracy based on MRI data. 22
Each foundational model generates a probability distribution for the three classes (Mild, Moderate, Non-Impaired). For each MRI sample, these outputs are merged to create a 9-dimensional meta-feature vector:
A Logistic Regression classifier (max_iter = 1000, one-vs-rest mode) was used as the meta model. The final class probability for stacking is computed as:
The final predicted label is obtained as:
The stacked model was trained on the meta features generated from the training set and evaluated on the validation set. The stacking ensemble achieved a validation accuracy of:
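The stacking step can be sketched as follows, assuming scikit-learn is available for the meta-learner. The concatenation order of the three probability vectors and the function names are assumptions; `OneVsRestClassifier` is used here to express the one-vs-rest mode mentioned in the text.

```python
import numpy as np

def build_meta_features(p_alexnet, p_densenet, p_resnet):
    """Concatenate three (n, 3) probability arrays into an (n, 9) matrix."""
    return np.hstack([p_alexnet, p_densenet, p_resnet])

def fit_meta_learner(meta_train, y_train):
    """One-vs-rest logistic regression on the stacked probabilities."""
    # Imported lazily so that the feature construction runs without scikit-learn.
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    meta_model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    return meta_model.fit(meta_train, y_train)

# Stand-in probabilities for four images (uniform over three classes).
uniform = np.full((4, 3), 1.0 / 3.0)
meta = build_meta_features(uniform, uniform, uniform)  # shape (4, 9)
```

After fitting, `meta_model.predict(meta_val)` returns the class with the highest meta-model score for each validation image, which corresponds to the final predicted label.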
Each of the three architectures generates unique feature embeddings (as shown by the t-SNE embeddings): 1. AlexNet identifies distinct structural patterns. 2. DenseNet121 yields compact representations boosted by feature reuse. 3. ResNet50 captures broader abstractions, contributing to diversity.
Relying on a single network restricts feature extraction, often leading to lower accuracy, higher validation loss, and poor F1-scores, especially for minority classes. Stacking and ensemble learning combine complementary feature representations, improving robustness and achieving superior classification performance compared to any single backbone model. We assessed two fusion methods: Ensemble versus Stacking, detailed as in Table 9.
The stacking model successfully integrated discriminative low-level features from AlexNet, the dense feature reuse from DenseNet121, and the abstract representations from ResNet50, achieving improved robustness and a balanced class distribution.
Explainable artificial intelligence (XAI) using integrated gradients
To assess the interpretability of the proposed deep learning framework, Integrated Gradients (IG) was employed, which is a gradient based technique that measures how individual pixels affect the predicted result. 23 The pipeline successfully generated Integrated Gradients maps for 20 test samples using the tuned_alexnet_model. This helps visualize which pixels in the input images are most crucial for AlexNet's predictions.
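To make the attribution idea concrete, the sketch below applies Integrated Gradients to a toy logistic model with an analytic gradient instead of the tuned AlexNet. The all-zero baseline, the number of integration steps, and the toy model itself are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w):
    """Toy differentiable classifier: probability = sigmoid(w . x)."""
    return sigmoid(w @ x)

def grad_model(x, w):
    """Analytic gradient of the toy model with respect to the input x."""
    p = model(x, w)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, steps=200):
    """Midpoint Riemann approximation of the IG path integral."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_model(baseline + a * (x - baseline), w) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

rng = np.random.default_rng(42)
w = rng.normal(size=16)   # toy weights
x = rng.random(16)        # toy "image" with 16 pixels
baseline = np.zeros(16)   # assumed all-black baseline
attributions = integrated_gradients(x, baseline, w)

# Completeness: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(attributions.sum() - (model(x, w) - model(baseline, w)))
```

The completeness check is a useful sanity test in practice: if the attributions do not sum to the change in model output, the number of integration steps is too small.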
Figure 12 illustrates the IG saliency maps for the three architectures (ResNet50, DenseNet121, and AlexNet) on samples from the Mild (MCI) class. The heatmaps highlight key brain regions such as cortical gray matter, periventricular areas, and sulcal and gyral contours, which are known to show subtle early stage disease changes. Warmer colors (red/yellow) indicate regions strongly influencing predictions, while cooler areas have minimal impact. Consistent patterns across architectures suggest that the models focus on clinically meaningful features rather than artifacts.

IG saliency maps for AlexNet, DenseNet121, and ResNet50 for Mild Class.
Deletion Test (Faithfulness): The plot in Figure 13 illustrates the faithfulness of AlexNet's attributions. It shows how the model's prediction confidence decreases as the most important pixels (identified by IG) are progressively removed from the image. This demonstrates that AlexNet relies on these highlighted regions for its predictions.
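The deletion procedure can be sketched generically as below. The toy predictor, the replacement value of zero, and the step size are assumptions; in the study the predictor would be the tuned AlexNet and the attributions would come from IG.

```python
import numpy as np

def deletion_curve(predict, x, attributions, baseline_value=0.0, step=4):
    """Confidence after removing features in order of decreasing attribution."""
    order = np.argsort(-attributions)  # most important feature first
    x_del = x.astype(float).copy()
    scores = [predict(x_del)]          # confidence on the intact input
    for start in range(0, len(order), step):
        x_del[order[start:start + step]] = baseline_value
        scores.append(predict(x_del))
    return np.array(scores)

# Toy stand-in for a classifier: sigmoid of a positively weighted sum.
w = np.linspace(0.1, 1.0, 16)

def predict(v):
    return 1.0 / (1.0 + np.exp(-(w @ v)))

x = np.ones(16)
attributions = w * x  # proportional to IG for this toy model with a zero baseline
scores = deletion_curve(predict, x, attributions)
```

A steep early drop in the curve indicates faithful attributions, because removing the highest ranked pixels hurts the prediction most.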

Faithfulness of AlexNet's attributions.
The final plot (Figure 14) displays an average attribution map for AlexNet across all 20 samples. This highlights features or regions that are consistently important for AlexNet's predictions across a representative set of images in the test data. These results provide insights into how the tuned_alexnet_model makes its predictions, indicating which parts of the MRI scans it focuses on.

Population-level mean attribution map.
This section presents a detailed performance analysis of the proposed CNN and ensemble models for multi-class classification. The evaluation includes comparative metrics, cross-validation results, statistical significance testing, uncertainty analysis, robustness assessment, and inference efficiency. To ensure robustness and reduce overfitting, multiple validation strategies were applied including five-fold stratified cross-validation, early stopping based on validation loss, bootstrap confidence intervals, and statistical significance testing using McNemar's test.
Class wise comparative performance of CNN and ensemble models
Table 10 shows that the ensemble and stacking models outperformed the individual CNN models across most class-wise evaluation metrics. The class-wise analysis demonstrates balanced diagnostic capability across all dementia stages. Notably, the stacking model achieves high sensitivity for Mild (MCI) cases (0.8939) and perfect detection of Moderate (MI) cases (1.0000), indicating strong suitability for early-stage clinical screening, where minimizing false negatives is critical. In addition to overall accuracy, class-wise sensitivity, specificity, and ROC-AUC were computed to ensure balanced evaluation across dementia categories and enhance clinical screening relevance.
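Class-wise sensitivity and specificity follow directly from a confusion matrix, as the sketch below shows. The matrix is made up for illustration and does not reproduce the values in Table 10.

```python
import numpy as np

def classwise_metrics(cm):
    """Per-class sensitivity and specificity from a confusion matrix.

    cm[i, j] is the number of samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp      # missed cases of each class
    fp = cm.sum(axis=0) - tp      # other classes predicted as this class
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)  # recall of the class
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Made-up 3-class confusion matrix (rows: true class, columns: predicted class).
cm = [[50, 5, 0],
      [4, 40, 1],
      [0, 0, 10]]
sensitivity, specificity = classwise_metrics(cm)
```

In a screening setting sensitivity is the metric to watch first, since a false negative means a person with impairment is not referred for further evaluation.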
Class wise comparative performance of CNN and ensemble models.
Clinically weighted error analysis highlighting the prioritization of false negative reduction in Alzheimer's MRI classification.
Classification accuracy & class-wise evaluation metrics for all the 5 models on the TEST SET.
To gain deeper insight into how the proposed models perform classification, we generated confusion matrices on the test dataset for ResNet50, DenseNet121, and AlexNet, as well as for the ensemble learning and stacking based approaches; these are illustrated in Figure 15. The confusion matrix of ResNet50 reveals frequent confusion between Mild (MCI) and Non-Impaired (CN) cases, suggesting difficulty in capturing subtle early Alzheimer's changes. DenseNet121 performs better, especially in recognizing Moderate (MI) cases, though some overlap between the Mild (MCI) and Non-Impaired (CN) stages remains. AlexNet further improves classification by reducing false positives and correctly identifying most Non-Impaired subjects, with only a small number of Mild (MCI) cases misclassified. The ensemble model strengthens overall reliability by combining multiple learners, leading to fewer inter-class errors and accurate identification of Moderate (MI) cases. The stacking model achieves the most consistent performance, with predictions concentrated along the diagonal, indicating stable and dependable classification across all disease stages.

Confusion matrix of ResNet50, DenseNet121, AlexNet, ensemble & stacking.
Overall, the confusion matrix results show that both ensemble-based models better reflect clinical priorities in Alzheimer's diagnosis by favoring higher sensitivity and reducing missed disease cases. Most misclassifications occur between neighboring disease stages, which mirrors the gradual and progressive course of Alzheimer's rather than sharply defined categories. This clinically meaningful error pattern, as seen in Table 11, suggests that the proposed ensemble and stacking models are well suited for supporting real-world Alzheimer's MRI diagnosis and clinical decision making.
Among all evaluated models (Table 12 and Figure 16), the stacking-based ensemble attained the highest accuracy at 97% and the best macro-averaged F1-score, confirming its effectiveness in classifying Alzheimer's disease. For each model, we calculated class probabilities for every test MRI scan. The final predictions were derived from the argmax of the probabilities for the individual models, from mean probability aggregation for the ensemble, and from the logistic regression based meta-learner for stacking.

Training evaluation metrics chart for all the 5 models.
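The three prediction rules can be sketched for a single scan as follows. The probabilities and the class order [CN, MCI, MI] are assumptions made for the example. The sketch only builds the input vector a stacking meta-learner would receive; it does not train the logistic regression itself.

```python
def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda k: probs[k])


def soft_vote(model_probs):
    """Average the class probabilities of all base models for one sample."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[k] for p in model_probs) / n_models for k in range(n_classes)]


def stacked_features(model_probs):
    """Concatenate per-model probabilities into one meta-learner input vector."""
    return [p for probs in model_probs for p in probs]


# Made-up outputs of three hypothetical base models for one scan,
# assumed class order: [CN, MCI, MI].
sample = [
    [0.50, 0.45, 0.05],
    [0.30, 0.60, 0.10],
    [0.40, 0.55, 0.05],
]

individual_preds = [argmax(p) for p in sample]  # one label per base model
ensemble_probs = soft_vote(sample)              # mean probability aggregation
ensemble_pred = argmax(ensemble_probs)
meta_input = stacked_features(sample)           # 3 models x 3 classes = 9 features
```

In this toy case the first model alone would predict CN, while the averaged probabilities favor MCI, which is the kind of disagreement soft voting is intended to resolve.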
Let the true test labels be
For all five models, we generated training vs. validation accuracy curves.
Figure 17 presents the training and validation accuracy curves for ResNet50, DenseNet121, and AlexNet over the training epochs. ResNet50 trained for 46 epochs, but due to early stopping the best weights were restored from epoch 39, which achieved the lowest validation loss of 0.1775. DenseNet121 displays quicker convergence with a stable learning progression, as both training and validation accuracies increase steadily and remain closely aligned. This model trained for 50 epochs, with the best weights restored from epoch 49 at a validation loss of 0.12833, indicating effective feature reuse and better generalization capability. AlexNet completed training in 10 epochs and restored the best weights from epoch 9, since val_loss did not improve beyond 0.36869. It exhibits rapid overfitting, as evidenced by near-perfect training accuracy coupled with highly unstable validation performance. This underscores the need for ensemble and stacking strategies to bolster robustness and enhance classification accuracy.

Training and validation accuracy curves for ResNet50, DenseNet121, and AlexNet models.
Overfitting was controlled using early stopping with a patience of 7 epochs, monitored on validation loss. Additional regularization was achieved through dropout (rate = 0.5) in the classifier layers and batch normalization, with the final model selected based on the epoch yielding the lowest validation loss.
As detailed in Table 13, the optimal performance of the stacking model was attained at epoch 8, after which there were no further enhancements in validation metrics. Therefore, the results presented correspond to the best performing epoch determined by early stopping. The evaluation metrics offer a thorough overview of the model's learning behavior, generalization capability, and classification quality.
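The early-stopping rule can be sketched in a framework-independent way. The loss values below are synthetic and are not the study's training logs; they are chosen so that the best epoch is 39, which with a patience of 7 stops training at epoch 46, the same pattern reported for ResNet50.

```python
def early_stopping(val_losses, patience=7):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without a new
    lowest validation loss; the weights of `best_epoch` would be restored.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Synthetic curve: the loss falls until epoch 39, then stays at a higher plateau.
synthetic_losses = [1.0 - 0.02 * e for e in range(1, 40)] + [0.5] * 20
best_epoch, stop_epoch = early_stopping(synthetic_losses, patience=7)
```

Under this rule the number of epochs run equals the best epoch plus the patience, unless the maximum epoch budget is reached first.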
Performance evaluation metrics of stacking model for 8 epochs.
The stacking model shows steady improvement across epochs, with training accuracy increasing from 59.18% to 87.96% and AUC from 0.61 to 0.94, while loss decreases consistently. Despite minor fluctuations, the final epoch achieves strong validation performance (accuracy: 92.09%, AUC: 0.9966), indicating effective learning and high classification capability.
To assess the robustness of the models against partition-dependent variability, performance stability was analyzed across five independent stratified image-level splits (see Table 14). These observations reinforce the motivation for ensemble and stacking approaches to enhance reliability.
Mean ± std of 3 base models.
Bootstrap based performance estimates (95% CI).
To determine whether the observed performance differences were statistically meaningful, McNemar's test was conducted on paired predictions from the same independent test set. Since multiple pairwise comparisons were performed, Bonferroni correction was applied, resulting in an adjusted significance threshold of α = 0.025.
As shown in Table 6, AlexNet significantly outperforms DenseNet121 (χ² = 27.627, p = 1.471 × 10⁻⁷), with substantially more disagreement cases favoring AlexNet (93 vs 33). Similarly, the proposed stacking framework demonstrates a statistically significant improvement over ResNet50 (χ² = 106.667, p < 0.0001). However, the comparison between the stacking model and the ensemble approach did not yield a statistically significant difference (χ² = 0.083, p = 0.7728), indicating comparable performance between these two strategies. Overall, these results confirm that the reported improvements over individual backbone architectures are statistically supported and not attributable to random variation.
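For reference, the AlexNet vs DenseNet121 statistic can be reproduced from the two disagreement counts alone. The sketch below uses the continuity-corrected form of McNemar's test, which is an inference from the reported numbers rather than a stated detail of the study: with 93 and 33 discordant pairs it gives (|93 − 33| − 1)² / (93 + 33) ≈ 27.627.

```python
import math


def mcnemar_continuity_corrected(b, c):
    """Continuity-corrected McNemar chi-square statistic and p-value (1 dof).

    b and c are the two discordant counts: cases one model classified
    correctly and the other did not, and vice versa.
    """
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = mcnemar_continuity_corrected(93, 33)
```

Only the discordant pairs enter the statistic; scans that both models classify correctly, or both misclassify, carry no information about which model is better.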
Robustness, uncertainty and calibration analysis
Bootstrap confidence intervals
To quantify statistical uncertainty, bootstrap resampling (1000 iterations) was applied on the test set and 95% confidence intervals (CI) were computed for overall performance metrics as seen in Table 15.
The stacking model achieved the highest performance with consistently narrow confidence intervals, indicating stable generalization.
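A percentile bootstrap for accuracy can be sketched as below. The labels are synthetic, the percentile method is an assumption (the study does not state which interval construction was used), and the function name is illustrative.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile bootstrap CI for accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    stats = []
    for _ in range(n_boot):
        # Resample test cases with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, lower, upper


# Synthetic example: 200 test cases, 10 of them misclassified.
y_true = [0] * 100 + [1] * 100
y_pred = [0] * 95 + [1] * 5 + [1] * 95 + [0] * 5
accuracy, ci_low, ci_high = bootstrap_accuracy_ci(y_true, y_pred)
```

The same resampling loop applies to any other metric by replacing the accuracy computation inside the loop. With very small classes the resampled estimates vary more, which widens the interval.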
Class-wise uncertainty
Class-wise bootstrap sensitivity analysis revealed wider confidence intervals for the Moderate Impairment (MI) class (n = 12), reflecting uncertainty due to limited sample size rather than model instability. The larger classes, Mild (MCI) and Non-Impaired (CN), showed narrow intervals, indicating stable class-wise performance.
Calibration and reliability
Calibration quality was assessed using reliability diagrams (Figure 18) and Expected Calibration Error (ECE). The obtained ECE value of 0.0135 indicates excellent probability calibration, demonstrating strong agreement between predicted confidence and observed outcomes.

Reliability diagram.
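Expected Calibration Error can be computed as the sample-weighted average gap between confidence and accuracy over confidence bins. The sketch below assumes 10 equal-width bins, which the study does not specify, and uses invented predictions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Invented example: top-class confidence and whether the prediction was right.
confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
```

Each bin of the reliability diagram contributes one term to this sum, so a curve lying on the diagonal corresponds to an ECE of zero.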
Since the framework targets early screening, decision thresholds were optimized to reduce false negatives. A screening threshold of 0.10 achieved a sensitivity of 0.953 for Alzheimer-positive cases (Mild + Moderate), prioritizing early detection at the cost of an acceptable number of false positives.
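The effect of a low screening threshold can be illustrated with a small sketch. It assumes that the screening score is the summed probability of the Mild and Moderate classes and that a scan is flagged when this score reaches the threshold; the study does not spell out this rule, and the probabilities below are invented.

```python
def screen_positive(probs, threshold=0.10):
    """probs = [P(CN), P(MCI), P(MI)]; flag when the impaired mass reaches the threshold."""
    return (probs[1] + probs[2]) >= threshold


def sensitivity(is_positive, flags):
    """Share of truly positive cases that were flagged."""
    tp = sum(1 for t, f in zip(is_positive, flags) if t and f)
    fn = sum(1 for t, f in zip(is_positive, flags) if t and not f)
    return tp / (tp + fn)


# Invented predictions; is_positive marks true Mild or Moderate cases.
probs_list = [
    [0.95, 0.04, 0.01],
    [0.80, 0.18, 0.02],
    [0.40, 0.55, 0.05],
    [0.85, 0.10, 0.05],
    [0.30, 0.10, 0.60],
]
is_positive = [False, True, True, False, True]

flags_low = [screen_positive(p, threshold=0.10) for p in probs_list]
flags_high = [screen_positive(p, threshold=0.50) for p in probs_list]
sens_low = sensitivity(is_positive, flags_low)
sens_high = sensitivity(is_positive, flags_high)
false_pos_low = sum(1 for t, f in zip(is_positive, flags_low) if f and not t)
```

In this toy set the lower threshold recovers a positive case that the 0.50 threshold misses, at the price of one additional false positive, which is the trade-off a screening setting is designed to accept.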
These analyses demonstrate that the proposed framework is not only accurate but also statistically reliable, well-calibrated, and clinically aligned for screening oriented deployment.
Inference efficiency
Inference efficiency was evaluated in terms of average inference time per image, total batch inference time, and overall model complexity measured by the number of parameters. Each architecture was tested on 100 randomly selected test images to compute per image latency. AlexNet achieved the lowest average inference time of
SECTION C: DEMENTIA/ALZHEIMER'S PREVENTION PLAN FOR HIGH-RISK WOMEN
Recent advancements in dementia research have identified several lifestyle, dietary, medical, and technological strategies that can help prevent or delay the onset of Alzheimer's disease, especially in women at high risk due to genetic, hormonal, or environmental factors. One of the most effective strategies is following the MIND diet, which combines elements of the Mediterranean and DASH diets. This dietary pattern emphasizes eating brain protective foods like leafy greens, berries, nuts, whole grains, legumes, olive oil, and fatty fish. A recent long-term study involving over 92,000 participants found that women who significantly improved their adherence to the MIND diet over ten years had up to a 25% lower risk of developing dementia, even if they made dietary changes later in life.24,25
In addition to nutrition, regular physical activity is crucial for brain health. Moderate aerobic exercise and resistance training improve blood flow to the brain and lower the risk of vascular issues such as high blood pressure and type 2 diabetes, which contribute to cognitive decline.26 Mental stimulation and social engagement also help maintain cognitive function. Activities like reading, learning new skills, and participating in community programs can enhance cognitive reserve and reduce isolation, a significant risk factor for dementia progression.27
Managing cardiovascular risk factors is also essential. The 2024 update to the Lancet Commission added untreated vision loss and high LDL cholesterol to the list of modifiable dementia risk factors. Effectively managing blood pressure, cholesterol, obesity, and diabetes has been shown to lower dementia incidence, especially in mid-life.1,2 When medication is needed, recently approved drugs like benzgalantamine (Zunveyl), a cholinesterase inhibitor, provide symptom management for mild to moderate Alzheimer's disease. Approved by the FDA in July 2024, benzgalantamine improves neurotransmitter function and may slightly delay cognitive decline.
Technological advances are helping with early and accurate diagnosis. AI based tools, including convolutional neural networks with explainable AI features, have shown promise in predicting dementia progression from MRI scans.5,28 Blood biomarker tests are approaching clinical readiness and offer over 90% diagnostic accuracy for early Alzheimer's detection without the invasiveness of Positron Emission Tomography (PET) scans or lumbar punctures.29,30 Combined with lifestyle interventions, these innovations create a comprehensive prevention plan aimed at preserving brain health and delaying dementia onset in younger women.
Additionally, maintaining nutritional balance is key. Omega-3 and Omega-6 fatty acids should be consumed so as to keep the body's ratio of Omega-3 to Omega-6 at 1:1, with about 1.5 to 2 grams of Omega-3 and 30 grams of protein daily. We also recommend the clinically formulated compound ‘Neuroban Fort’, a neuroprotective supplement containing vitamin B12, which supports cognitive health and helps manage early-stage dementia. To further protect against nerve damage, 500 milligrams of curcumin should be taken during treatment, along with a probiotic containing 50 billion CFU (Colony Forming Units) to support gut health.31
Discussions & limitations
Although the symptom awareness survey was conducted with women from Maharashtra, the proposed framework is not limited to this region. The survey is based on standard cognitive and functional indicators in line with the SAGE framework. These indicators apply regardless of geographical, gender, and cultural differences. The MRI based deep learning models were trained using multi-institutional, publicly available datasets from Kaggle repositories.4 This approach incorporated diverse imaging characteristics from different populations.
However, certain limitations should be acknowledged for a balanced understanding of the findings. First, the lack of subject-level identifiers in the aggregated MRI dataset meant that data was split at the image level rather than the subject level. This could introduce a risk of data leakage. Nonetheless, stratified splitting, augmentation restricted to training data, and careful separation of training, validation, and test sets were used to reduce potential information leakage under image-level data constraints. In addition, because the MRI dataset does not include demographic metadata such as sex or age, the proposed framework currently focuses on women through the symptom-based screening stage rather than through gender-specific model training. Second, the study used two-dimensional axial MRI slices, which might not fully capture the three-dimensional structural changes associated with Alzheimer's disease progression. Third, the survey component involved a relatively small sample of 169 participants from Maharashtra, which may limit generalizability to broader populations. Fourth, while Integrated Gradients provided valuable qualitative insights into model decision making, quantitative validation using specific anatomical regions of interest was not part of this study.
From a practical standpoint, the proposed framework is meant to serve as an early screening and decision support tool for perimenopausal women rather than as a standalone diagnostic system. Combining symptom-aware assessment with explainable MRI analysis offers a scalable way to identify early risks, especially in resource-limited clinical settings. Future clinical validation across different populations and healthcare environments will further enhance the potential of this approach.
Future work
In the next phase of this research, we plan to validate the proposed model on multicenter, longitudinal, subject-level MRI datasets from repositories such as the Alzheimer's Disease Neuroimaging Initiative (ADNI) and OASIS, covering different scanners, protocols, and patient populations, to ensure stronger generalizability and real-world reliability. The framework can also be expanded to include multimodal data integration, combining MRI with PET imaging, blood based biomarkers, hormonal profiles, and genetic risk factors such as Apolipoprotein E (APOE) status for a more comprehensive assessment. More advanced architectures, including 3D CNNs, attention-based mechanisms, and longitudinal models, can be explored to capture subtle spatiotemporal patterns. To improve clinical trust, explainability techniques like Grad-CAM will be applied, along with quantitative XAI measures to objectively evaluate explanation consistency and reliability. Finally, lightweight deployment strategies such as pruning and quantization can be investigated to enable scalable and practical implementation in real-world clinical environments.
Conclusion
This study presents a comprehensive, women-focused framework for early Alzheimer's disease detection that merges real-world symptom awareness with MRI-based deep learning and ensemble modeling. By targeting women aged 45–60, a group at higher risk due to hormonal and biological factors, this survey-driven deep learning approach fills an important gap in current Alzheimer's research. The combination of three complementary CNN architectures using a stacking based ensemble greatly improved classification performance, achieving high accuracy, F1-score, and AUC. Ablation studies and t-SNE visualizations confirmed the advantages of feature diversity across models. Moreover, using Integrated Gradients for explainability showed that the system learns important brain regions linked to early neurodegeneration. The inclusion of survey-derived SAGE scoring creates a practical, noninvasive screening method to facilitate timely MRI referrals and preventive measures.
Footnotes
List of abbreviations
Ethical approval and consent to participate
This study did not involve human or animal experimentation directly. Instead, it utilized publicly available datasets from Kaggle that are anonymized and ethically cleared for academic research purposes. No additional ethical approval was required.
Informed consent
The study included a questionnaire-based survey of 169 women. Participation was voluntary and no identifiable data was collected. No medical interventions or diagnoses were made. The survey was treated as general awareness feedback, not clinical research. Hence, no institutional ethical approval or IRB approval was required.
Author's contribution
Both authors have read and approved the final manuscript and agree to be accountable for all aspects of the work.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Availability of data and material
The dataset used can be accessed publicly from the Kaggle data repositories listed in the references.4 All data used in the research were anonymized and accessed in accordance with the terms and conditions of the respective data repositories. No identifiable personal information was collected or processed.
Use of AI technology
AI-based language assistance tools were used solely for grammatical correction and improvement of readability. No AI system was used to generate original research content or scientific claims. Two figures were conceptualized and designed by the authors but generated using OpenAI, based on the author-defined methodology and prompt.
