Abstract
Background
With the advent of anti-amyloid-β monoclonal antibody therapies and the growing societal burden of dementia, early identification of Alzheimer's disease and related dementias has become a clinical priority.
Objective
To evaluate the diagnostic accuracy of a machine learning model that uses a neuropsychological battery to classify individuals as Healthy controls or as having mild cognitive impairment (MCI) or Dementia, and to identify the neuropsychological tests and cognitive domains that contribute most to classification accuracy, thereby determining the optimal tests for dementia screening.
Methods
In this retrospective cross-sectional single-center study, we analyzed 590 participants evaluated for suspected dementia. The final sample comprised 74 Healthy controls, 190 individuals with MCI, and 326 with Dementia (including 269 with Alzheimer's disease). Scores from the Mini-Mental State Examination (MMSE), Montreal Cognitive Assessment Japanese version (MoCA-J), Rivermead Behavioural Memory Test (RBMT), Japanese Adult Reading Test (JART), and Wechsler Adult Intelligence Scale-III were input into a random forest machine learning model. Model performance was assessed using the area under the ROC curve (AUC). A variable importance analysis determined each test's relative contribution to classification.
Results
The multiclass model achieved an AUC of 0.898. RBMT was the strongest contributor, exceeding MMSE and MoCA-J. In borderline MMSE/MoCA-J subsets, adding RBMT improved classification performance for both Healthy versus MCI and MCI versus Dementia decisions.
Conclusions
RBMT provides substantial incremental value for dementia-related diagnostic discrimination, particularly as a second-line assessment when brief screening results are borderline. However, its administration time may limit its role as a universal first-line screening tool.
Introduction
Dementia is a major global health challenge, and its prevalence is rising as populations age. 1 This trend raises concerns about the escalating demands on healthcare systems and social welfare resources. Early detection of dementia is crucial for addressing this issue, as early intervention has the potential to reduce costs and enhance quality of life. 2 New anti-amyloid-β antibodies such as lecanemab are indicated for patients with mild cognitive impairment (MCI) or mild Alzheimer's disease (AD), which further highlights the need for early detection. 3 Common screening tests for MCI and dementia include the Mini-Mental State Examination (MMSE), 4 the Japanese version of the Montreal Cognitive Assessment (MoCA-J), 5 and the Addenbrooke's Cognitive Examination-III. 6 These tests have contributed to early identification of cognitive impairment. 7 Machine learning can now combine information from multiple data sources to support accurate diagnosis. Studies show that it can facilitate early detection of MCI or dementia using clinical findings and neuroimaging. 8 In this study, we used machine learning to classify participants as Healthy, MCI, or Dementia using multiple neuropsychological tests. We investigated which tests best supported diagnostic classification, with particular attention to the importance of memory assessment using the Rivermead Behavioural Memory Test (RBMT) for identifying MCI and dementia.
Methods
Study design, setting, and participants
This retrospective, single-center study was conducted at a university hospital in Japan from March 2015 to February 2020. Participants were patients evaluated for suspected dementia during this period. Consent was obtained from participants using the opt-out method. Participants received physician interviews and underwent brain magnetic resonance imaging (MRI), cerebral blood flow single-photon emission computed tomography, MMSE, 4 MoCA-J, 5 RBMT, 9 Japanese Adult Reading Test (JART), 10 and Wechsler Adult Intelligence Scale-III (WAIS-III). 11 All neuropsychological tests were administered in validated Japanese versions.4,5,12–14 Dementia or MCI was diagnosed according to DSM-5 criteria by psychiatrists with at least three years of psychiatric experience. 15 Individuals with normal cognitive function who did not meet MCI criteria were assigned to the Healthy group. Patients who met the criteria for AD, vascular dementia (VaD), dementia with Lewy bodies (DLB), or frontotemporal lobar degeneration (FTLD) were placed in the Dementia group. When a patient met criteria for more than one dementia type, the primary DSM-5 diagnosis was used. Exclusion criteria were: (1) missing values in core evaluations (MMSE, MoCA-J, RBMT, JART, and WAIS-III); and (2) cognitive decline from causes other than dementia (e.g., normal pressure hydrocephalus and depression). Initially, 868 patients were enrolled. Of these, 233 had missing neuropsychological test data, leaving 635. An additional 45 patients were excluded because their cognitive decline was attributable to causes other than dementia. The final sample of 590 participants comprised 74 Healthy, 190 with MCI, and 326 with Dementia (AD = 269, DLB = 30, VaD = 20, FTLD = 7) (Figure 1). Clinical diagnosis, age, and sex had no missing data.

Figure 1. Selection criteria and process for research participants.
All machine learning analyses were performed in a single complete-case cohort (N = 590). Therefore, the sample size was constant across all primary and supplementary machine learning analyses. Subset analyses were defined within this fixed cohort and did not involve any additional exclusions beyond the prespecified subset definitions.
Data, including demographics, DSM-5 clinical diagnoses, and neuropsychological test results, were extracted from medical records. Study size was determined by eligible patients with complete neuropsychological data during this period.
Sampling bias
To minimize selection bias, we clearly defined inclusion and exclusion criteria. However, the exclusion of patients with missing data could potentially introduce bias, a limitation discussed later.
Statistical analysis of participant characteristics
We investigated whether sex ratio, age, MMSE total score, MoCA-J total score, RBMT total standard profile score (PS), JART-predicted full IQ (FIQ), and WAIS-III FIQ differed among the Healthy, MCI, and Dementia groups. Sex ratio differences were analyzed using chi-square tests. If the overall test was significant, pairwise chi-square tests with Bonferroni correction identified group differences. For continuous variables (age, MMSE, MoCA-J, RBMT, JART FIQ, WAIS-III FIQ), we used the Kruskal-Wallis test, followed by Dunn's post-hoc test where appropriate.
WAIS-III handling
Since premorbid values of the WAIS-III were generally unavailable, we tested three methods to incorporate the JART-predicted IQ:
Plain: Raw WAIS-III scores (FIQ, VIQ, PIQ, VCI, WMI, PSI, POI).
Gap: Difference between WAIS-III and JART (e.g., WAIS-III FIQ − JART FIQ).
Ratio: The ratio of WAIS-III to JART (e.g., WAIS-III FIQ/JART FIQ).
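The three encodings can be illustrated for a single WAIS-III index as follows. This is a minimal sketch rather than the study's code; the function name `encode_wais` and the example scores are hypothetical.

```python
def encode_wais(wais_fiq: float, jart_fiq: float) -> dict[str, float]:
    """Return the Plain, Gap, and Ratio encodings of one WAIS-III index."""
    return {
        "plain": wais_fiq,             # raw WAIS-III score
        "gap": wais_fiq - jart_fiq,    # measured score minus premorbid estimate
        "ratio": wais_fiq / jart_fiq,  # measured score relative to premorbid estimate
    }


# Hypothetical participant: measured FIQ of 85, JART-predicted FIQ of 100.
features = encode_wais(85.0, 100.0)
print(features)
```

A negative Gap (or a Ratio below 1) indicates current performance below the JART-based premorbid estimate.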
We used 1000 bootstrap resamplings to compare these approaches for three-group classification (Healthy, MCI, Dementia), controlling for sex, age, total MMSE score, total MoCA-J score, and RBMT total score. Weighted resampling was applied to reflect the Japanese population prevalence (60.2% Healthy, 23.4% MCI, 16.4% Dementia). 16
The sample proportions for each group were denoted as S_Healthy for the Healthy group, S_MCI for the MCI group, and S_Dementia for the Dementia group. The weights (W) were calculated as follows:
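The weight formula itself is not reproduced in this text. A common choice for prevalence reweighting, assumed here purely for illustration, is the ratio of the target population proportion to the sample proportion. The sketch below applies that assumption to the group sizes and prevalence figures reported above; the variable names are hypothetical.

```python
# Target (population) prevalence reported in the text.
population = {"Healthy": 0.602, "MCI": 0.234, "Dementia": 0.164}
# Group sizes in the complete-case cohort (N = 590).
counts = {"Healthy": 74, "MCI": 190, "Dementia": 326}

n_total = sum(counts.values())
# Sample proportion S_g for each group g.
sample = {g: n / n_total for g, n in counts.items()}
# ASSUMPTION: W_g = P_g / S_g (the formula used in the study is not shown here).
weights = {g: population[g] / sample[g] for g in counts}

for g, w in weights.items():
    print(f"{g}: S = {sample[g]:.3f}, W = {w:.3f}")
```

Under this assumption the Healthy group is up-weighted and the Dementia group is down-weighted, because the clinic sample contains far more Dementia cases than the general population would.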
Statistical comparison of items contributing to diagnostic classification
We used a random forest model with total MMSE, MoCA-J, RBMT total PS, selected WAIS-III approach (Plain, Gap, or Ratio), and JART (FIQ, VIQ, PIQ) as predictors. Random forest variable importance identified the tests that best distinguished Healthy, MCI, and Dementia.
Contribution of RBMT sub-items to diagnostic classification
To examine RBMT components’ role, we conducted another random forest analysis using RBMT total score, prospective memory, visual memory, verbal memory, and sub-item scores to classify participants as Healthy, MCI, or Dementia. We determined which subtest contributed most to classification accuracy. For multiclass classification, AUC was computed using a one-versus-rest strategy with macro-averaging across classes.
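As an illustration of one-versus-rest macro-averaging, the sketch below computes a binary AUC for each class against the other two and takes the unweighted mean. It is a plain-Python rendering of the definition, not the study's code; the function names and the toy probabilities are hypothetical. In scikit-learn, `roc_auc_score` with `multi_class="ovr"` and `average="macro"` computes this kind of quantity.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, proba, classes):
    """Unweighted mean of the one-versus-rest AUCs over all classes."""
    aucs = []
    for k, cls in enumerate(classes):
        labels = [1 if y == cls else 0 for y in y_true]  # this class versus the rest
        scores = [row[k] for row in proba]               # predicted probability of this class
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)


# Toy data (hypothetical): six participants, predicted class probabilities per row.
classes = ["Healthy", "MCI", "Dementia"]
y_true = ["Healthy", "MCI", "Dementia", "MCI", "Dementia", "Healthy"]
proba = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
]
auc = macro_ovr_auc(y_true, proba, classes)
print(f"macro one-versus-rest AUC = {auc:.4f}")
```

Macro-averaging gives each class equal weight regardless of its size, which matters here because the Healthy group is much smaller than the Dementia group.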
Statistical comparison of items contributing to diagnostic classification of each dementia group
As a supplementary analysis, we investigated how neuropsychological tests helped distinguish between four dementia subtypes (AD, DLB, VaD, and FTLD). We built another random forest model using MMSE, MoCA-J, RBMT PS, the selected WAIS-III approach, and JART IQ measures. We performed pairwise diagnostic classification to examine the variables differentiating each pair of subtypes.
Secondary screening analyses in borderline MMSE/MoCA-J cases
To evaluate whether RBMT-PS adds value when brief cognitive screening is equivocal, we compared a baseline random forest model using MMSE and MoCA-J (Model A) with an otherwise identical model additionally including RBMT-PS (Model B) for two binary decisions: Healthy versus MCI and MCI versus Dementia.
Borderline cases were defined using stratified 5-fold cross-validated out-of-fold predicted probabilities from Model A to avoid information leakage. For each participant, Model A produced three predicted probabilities (for Healthy, MCI, and Dementia) that summed to 1. For each clinical decision (Healthy versus MCI; MCI versus Dementia), we retained only participants whose two highest predicted probabilities corresponded to the two diagnoses being compared (Healthy and MCI for the Healthy–MCI decision; MCI and Dementia for the MCI–Dementia decision). We then computed a two-class probability for the positive (more impaired) class within that diagnostic pair by ignoring the third class and rescaling the remaining two probabilities to sum to 1. Specifically, for the Healthy–MCI decision, the positive class was MCI and we calculated P(MCI)/[P(Healthy) + P(MCI)]; for the MCI–Dementia decision, the positive class was Dementia and we calculated P(Dementia)/[P(MCI) + P(Dementia)]. Participants with conditional probabilities between 0.35 and 0.65 were classified as borderline.
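The borderline rule described above can be sketched as follows. This is an illustrative rendering, not the study's code; the function names are hypothetical, and treating the band limits as inclusive is an assumption, since the text says only "between 0.35 and 0.65".

```python
def conditional_probability(proba, pair):
    """Return P(positive | pair) if the top-two classes match `pair`, else None.

    `proba` maps class name to out-of-fold predicted probability (sums to 1).
    `pair` is (negative_class, positive_class), e.g. ("Healthy", "MCI").
    """
    top_two = set(sorted(proba, key=proba.get, reverse=True)[:2])
    if top_two != set(pair):
        return None  # participant is not a candidate for this decision
    negative, positive = pair
    # Ignore the third class and rescale the remaining two to sum to 1.
    return proba[positive] / (proba[negative] + proba[positive])


def is_borderline(proba, pair, low=0.35, high=0.65):
    """True when the rescaled probability lies in [low, high] (inclusive: an assumption)."""
    p = conditional_probability(proba, pair)
    return p is not None and low <= p <= high


# Hypothetical Model A output for one participant.
example = {"Healthy": 0.40, "MCI": 0.45, "Dementia": 0.15}
p_mci = conditional_probability(example, ("Healthy", "MCI"))
print(p_mci, is_borderline(example, ("Healthy", "MCI")))
```

For this participant the two leading classes are Healthy and MCI, so they are a candidate for the Healthy–MCI decision only; the same probabilities return `None` for the MCI–Dementia decision.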
Performance was evaluated within each borderline set among participants with clinical diagnoses in the corresponding pair, using out-of-fold predictions for both models. We report accuracy, balanced accuracy, macro-F1, positive-class F1, and AUC. Incremental performance (Model B − Model A) was summarized using stratified bootstrap resampling (5000 iterations) to obtain 95% confidence intervals. Sensitivity analyses used alternative borderline bands (0.40–0.60 and 0.30–0.70).
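A stratified bootstrap of the Model B − Model A difference can be sketched as below. The example uses accuracy only, a percentile interval, and toy predictions; all of these, together with the function names and the seed, are assumptions made for illustration and are not taken from the study.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def stratified_bootstrap_diff(y_true, pred_a, pred_b, n_iter=5000, seed=0):
    """Percentile 95% CI for accuracy(Model B) - accuracy(Model A).

    Indices are resampled with replacement within each true class, so every
    replicate keeps the original class sizes.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(y_true):
        by_class.setdefault(y, []).append(i)
    diffs = []
    for _rep in range(n_iter):
        idx = [rng.choice(members) for members in by_class.values() for _ in members]
        yt = [y_true[i] for i in idx]
        acc_a = accuracy(yt, [pred_a[i] for i in idx])
        acc_b = accuracy(yt, [pred_b[i] for i in idx])
        diffs.append(acc_b - acc_a)
    diffs.sort()
    lower = diffs[round(0.025 * n_iter)]
    upper = diffs[round(0.975 * n_iter) - 1]
    return lower, upper


# Toy out-of-fold predictions (hypothetical) for an MCI-versus-Dementia subset.
y_true = ["MCI"] * 6 + ["Dementia"] * 6
pred_a = ["MCI", "MCI", "MCI", "Dementia", "Dementia", "MCI",
          "Dementia", "Dementia", "MCI", "MCI", "Dementia", "Dementia"]
pred_b = list(y_true)  # Model B is perfect in this toy example

point = accuracy(y_true, pred_b) - accuracy(y_true, pred_a)
lower, upper = stratified_bootstrap_diff(y_true, pred_a, pred_b)
print(f"difference = {point:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Both models are scored on the same resampled participants in each replicate, so the interval reflects the paired difference rather than two independent estimates.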
Methods for statistical analysis and machine learning model building
All statistical analyses used Python (3.10.12), pandas (2.2.2), numpy (1.26.4), tableone (0.9.1), scipy (1.13.1), scikit-posthocs (0.10.0), statsmodels (0.14.4), and scikit-learn (1.4.1.post1). We trained random forest models with 100 decision trees on a 70/30 split of training and test data. Predictor variables were standardized using z-score normalization (StandardScaler; mean = 0, SD = 1) before model fitting. The standardized predictors were age, sex (binary-encoded), MMSE total score, MoCA-J total score, RBMT profile score (PS), and the WAIS-III gap variables relative to JART-predicted IQ (FIQ–JART FIQ, VIQ–JART VIQ, PIQ–JART PIQ, VCI–JART VIQ, POI–JART PIQ, WMI–JART VIQ, and PSI–JART PIQ). The outcome variable (clinical diagnosis) was not scaled. Models incorporated demographic variables (age, sex) alongside neuropsychological test scores as predictors. Including age and sex helped account for confounding effects when assessing test importance. The study focused on predictive classification accuracy and key features, rather than estimating adjusted effect sizes.
Reporting guidelines
This observational study followed Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines.
Ethical considerations
This study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Review Board of Osaka Medical and Pharmaceutical University (Ethics Committee Approval Number 2022-169). Participants consented using an opt-out method. All data were anonymized before analysis.
Results
Patient profile
A total of 590 individuals were included in this study: 74 Healthy, 190 with MCI, and 326 with Dementia (269 AD, 30 DLB, 20 VaD, and 7 FTLD). Table 1 shows the sex distribution and the mean and standard deviation of age, MMSE, MoCA-J, RBMT PS, JART FIQ, and WAIS-III FIQ for each group. A chi-square test indicated that the sex ratio differed among the groups (p = 0.034). Pairwise chi-square tests with Bonferroni correction found a significant difference between the Dementia and Healthy groups (p = 0.048), while differences between Dementia and MCI (p = 1.000) and MCI and Healthy (p = 0.101) were not significant.
Table 1. Demographic characteristics and neuropsychological test scores of each group.
For the Healthy (n = 74), MCI (n = 190), and Dementia (n = 326) groups, means ± SD were calculated for age (Healthy < MCI, Dementia; MCI versus Dementia not significant), MMSE (Dementia < MCI < Healthy), MoCA-J (Dementia < MCI < Healthy), RBMT PS (Dementia < MCI < Healthy), JART FIQ (Dementia < Healthy; MCI ≈ Healthy), and WAIS-III FIQ (Dementia < MCI, Healthy; MCI versus Healthy not significant). The chi-square test showed a significant sex difference between the groups.
F: female; JART FIQ: Japanese Adult Reading Test Full-Scale IQ; MCI: mild cognitive impairment; MMSE: Mini-Mental State Examination; MoCA-J: Montreal Cognitive Assessment Japanese version; RBMT PS: Rivermead Behavioural Memory Test profile score; SD: standard deviation; WAIS-III FIQ: Wechsler Adult Intelligence Scale-III Full-Scale IQ.
The Kruskal-Wallis test revealed significant differences among the three groups for age, MMSE, MoCA-J, RBMT total PS, JART FIQ, and WAIS-III FIQ scores (p < 0.001 for most comparisons). Dunn's post-hoc test found that the Dementia group was significantly older than the Healthy group (p < 0.001), and the MCI group was older than the Healthy group (p = 0.037); however, the Dementia versus MCI comparison did not show a significant age difference (p = 0.223). For the MMSE, there were significant differences among all three groups (Dementia < MCI < Healthy). For the MoCA-J, the same pattern emerged (Dementia < MCI < Healthy). The RBMT total PS showed significant differences among all three groups (Dementia < MCI < Healthy, p < 0.001). The JART-predicted FIQ differed significantly between Dementia and Healthy participants (p < 0.001), although the comparisons with MCI were near significance (Dementia versus MCI, p = 0.057; MCI versus Healthy, p = 0.055). For the WAIS-III FIQ, the Dementia group had significantly lower scores than the MCI (p < 0.001) and Healthy (p < 0.001) groups, but the MCI and Healthy groups did not differ (p = 0.445).
Handling WAIS-III scores
To identify the best approach for incorporating WAIS-III data, we compared three-group classification models (Healthy, MCI, Dementia) that used Plain, Gap, or Ratio for WAIS-III indices (FIQ, VIQ, PIQ, VCI, WMI, PSI, POI). Controlling for sex, age, MMSE, MoCA-J, and RBMT PS, the resulting AUCs were 0.772 (95% CI: 0.740–0.801) for Plain, 0.801 (95% CI: 0.776–0.823) for Gap, and 0.798 (95% CI: 0.773–0.820) for Ratio. When comparing the median AUC of these methods (via bootstrap), Gap produced the highest AUC (p < 0.001 versus Plain and p < 0.001 versus Ratio). Consequently, we adopted the Gap approach (i.e., subtracting the JART-predicted IQ from the WAIS-III score) throughout the subsequent analyses.
Statistical comparison of items contributing to diagnostic classification
Using random forest analysis with sex, age, MMSE total, MoCA-J total, RBMT PS, and the Gap versions of the WAIS-III indices, we classified the Healthy, MCI, and Dementia groups. The model achieved an AUC of 0.898, Kappa of 0.633, and Matthews Correlation Coefficient (MCC) of 0.633. Figure 2 shows the top 10 items ranked by variable importance. RBMT total PS contributed the most to the classification, followed by MMSE, MoCA-J, WAIS-III PSI–JART PIQ, and WAIS-III PIQ–JART PIQ. Age was also ranked among the top 10 contributors, whereas sex was ranked lower, reflecting smaller overall influence of sex on cognitive test performance in this sample.

Figure 2. The 10 neuropsychological test items that contributed most to diagnostic classification of the Healthy, MCI, and Dementia groups. Notably, the RBMT total PS contributed more to the diagnostic classification than the MMSE and MoCA-J scores. JART FIQ: Japanese Adult Reading Test Full-Scale IQ; MMSE: Mini-Mental State Examination; MoCA-J: Montreal Cognitive Assessment Japanese version; RBMT PS: Rivermead Behavioural Memory Test profile score; WAIS-III FIQ: Wechsler Adult Intelligence Scale-III Full-Scale IQ.
These findings indicate that memory assessment (as captured by the RBMT) is particularly valuable for distinguishing Dementia from MCI and Healthy individuals. Although the MMSE and MoCA-J are widely used screening tools, the RBMT provides a more detailed measurement of memory skills that appears highly sensitive to early cognitive changes.
Evaluation of the contribution of RBMT sub-items to diagnostic classification
We further examined the RBMT components that best discriminated among the three groups. A random forest model incorporating the RBMT total score, prospective memory, visual memory, verbal memory, and sub-item scores resulted in an AUC of 0.885, Kappa of 0.566, and MCC of 0.567. RBMT total PS again emerged as the strongest contributor to the classification (Figure 3). Although certain sub-items (e.g., visual memory tasks) also had moderate importance, the comprehensive nature of the RBMT total score was more predictive than that of any single memory domain.

Figure 3. The 10 items that contributed most to the diagnostic classification among the total score and sub-items of the RBMT. Among all items, including the RBMT sub-items, the total PS contributed the most to the diagnostic classification. Overall memory ability was more influential in diagnostic classification than any specific memory domain. RBMT PS: Rivermead Behavioural Memory Test profile score.
This suggests that evaluating memory across multiple domains—rather than focusing only on a single aspect—offers enhanced sensitivity to cognitive decline. However, the lengthy administration time (approximately one hour) of RBMT may limit its practicality as a routine first-line test.
Statistical comparison of items contributing to diagnostic classification of each dementia group
Next, we constructed a random forest model to differentiate AD, DLB, VaD, and FTLD using age, sex, MMSE, MoCA-J, RBMT total PS, and the Gap variables for WAIS-III. This four-group classification yielded an AUC of 0.986, Kappa of 0.872, and MCC of 0.875, indicating very high discriminative power.
The top 10 contributors (in descending order) were WAIS-III VCI–JART VIQ, WAIS-III POI–JART PIQ, age, MoCA-J, RBMT total PS, WAIS-III VIQ–JART VIQ, WAIS-III PIQ–JART PIQ, MMSE, sex, and WAIS-III PSI–JART PIQ (Figure 4). Overall, the differences in WAIS-III sub-scores relative to predicted intelligence, along with age, global cognition (MoCA-J, MMSE), and memory function (RBMT), helped distinguish each subtype.

Figure 4. Top 10 Features Contributing to Random Forest–Based Four-Group and Pairwise Dementia Classifications. This figure presents the top 10 features identified by random forest analyses in two contexts: a four-group classification among Alzheimer's disease (AD), dementia with Lewy bodies (DLB), vascular dementia (VaD), and frontotemporal lobar degeneration (FTLD), and pairwise classifications among these same four dementia subtypes. Variables considered include age, sex, MMSE, MoCA-J, RBMT total score, and the gap between WAIS-III indices and predicted intelligence (JART). The first panel highlights features important for distinguishing all four subtypes simultaneously, whereas the other panels show those most influential in differentiating each pair of subtypes. JART FIQ: Japanese Adult Reading Test Full-Scale IQ; MMSE: Mini-Mental State Examination; MoCA-J: Montreal Cognitive Assessment Japanese version; RBMT PS: Rivermead Behavioural Memory Test profile score; WAIS-III FIQ: Wechsler Adult Intelligence Scale-III Full-Scale IQ.
We also performed pairwise diagnostic comparisons as summarized below. The top 10 contributors to each diagnostic classification are shown (Figure 4).
AD versus DLB
AUC = 0.952, Kappa = 0.778, MCC = 0.780. The top contributors were RBMT total PS (AD < DLB), age (AD > DLB), and WAIS-III PIQ–JART PIQ (AD > DLB).
AD versus FTLD
AUC = 1.000, Kappa = 0.988, MCC = 0.988. The leading contributors were the WAIS-III VCI–JART VIQ (AD > FTLD), MoCA-J (AD > FTLD), and age (AD > FTLD).
AD versus VaD
AUC = 0.980, Kappa = 0.852, MCC = 0.858. The key factors were WAIS-III PIQ–JART PIQ, WAIS-III FIQ–JART FIQ, and WAIS-III POI–JART PIQ (all AD > VaD).
DLB versus FTLD
AUC = 1.000, Kappa = 0.778, MCC = 0.798. The distinguishing factors were the WAIS-III POI–JART PIQ (DLB < FTLD), age (DLB > FTLD), and WAIS-III VCI–JART VIQ (DLB > FTLD).
DLB versus VaD
AUC = 0.624, Kappa = -0.111, MCC = -0.114. The main contributors were the MMSE (DLB > VaD), WAIS-III PIQ–JART PIQ (DLB > VaD), and WAIS-III PSI–JART PIQ (DLB > VaD).
FTLD versus VaD
AUC = 0.889, Kappa = 0.667, MCC = 0.667. The top factors were WAIS-III POI–JART PIQ (FTLD > VaD), WAIS-III VCI–JART VIQ (FTLD < VaD), and WAIS-III PIQ–JART PIQ (FTLD > VaD).
Utility of the RBMT profile score in borderline MMSE/MoCA-J cases
In borderline MMSE/MoCA-J cases defined using leakage-free out-of-fold predictions from the baseline model (Model A: MMSE + MoCA-J), adding RBMT-PS (Model B: Model A + RBMT-PS) improved classification performance for both clinical decisions (Healthy versus MCI and MCI versus Dementia). These conclusions were consistent across the prespecified borderline definition and sensitivity analyses using narrower and wider borderline probability bands. Detailed performance metrics for Model A and Model B, incremental gains (Model B − Model A) with bootstrap 95% confidence intervals, and subset-specific RBMT-PS cut-offs for interpretability are summarized in Table 2.
Table 2. Incremental utility of RBMT-PS as a second-line assessment in borderline MMSE/MoCA-J cases.
Borderline cases were identified using leakage-free out-of-fold predicted probabilities from a baseline random forest model using MMSE and MoCA-J (Model A) in the complete-case cohort (N = 590), obtained via stratified 5-fold cross-validation. For each clinical decision (Healthy versus MCI; MCI versus Dementia), candidates were participants whose top-two Model A predicted classes matched the diagnostic pair, and the corresponding two-class probabilities were renormalized to yield a conditional probability for the positive class (MCI for Healthy versus MCI; Dementia for MCI versus Dementia). Borderline was defined as a conditional probability within 0.50 ± p (primary: p = 0.15; sensitivity analyses: p = 0.20 and p = 0.10). Performance within each borderline subset was evaluated using out-of-fold predictions for Model A and for the corresponding model additionally including RBMT-PS (Model B). Incremental performance is reported as Model B − Model A with 95% confidence intervals derived from stratified bootstrap resampling (5000 iterations). RBMT-PS cut-offs (Youden's J) and RBMT-PS-alone AUCs are provided for interpretability within the borderline subsets and are not intended as universal screening thresholds.
AUC: area under the receiver operating characteristic curve; CI: confidence interval; IQR: interquartile range; MMSE: Mini-Mental State Examination; MoCA-J: Montreal Cognitive Assessment Japanese version; RBMT: Rivermead Behavioural Memory Test; RBMT-PS: RBMT profile score; MCI: mild cognitive impairment.
Discussion
Usefulness of RBMT as a dementia screening test
Our findings underscore that RBMT is a strong predictor of dementia, surpassing both the MMSE and MoCA-J in its ability to distinguish between Healthy individuals, those with MCI, and those with Dementia. This may reflect RBMT's comprehensive assessment of multiple memory domains, which are particularly sensitive to early cognitive changes. Previous research has also indicated that detailed memory evaluations often outperform brief global tests in differentiating MCI from Dementia. 17
Nevertheless, RBMT requires approximately one hour to administer, limiting its utility for large-scale or rapid screening in many clinical settings. In contrast, the MMSE can be completed in less than 10 min, and the MoCA-J can be completed in approximately 15 min.18,19 In our study, attempts to significantly shorten the RBMT (by administering only selected subtests) appeared to reduce its accuracy. Consequently, although the RBMT is highly effective for in-depth memory evaluation—particularly for detecting subtle deficits that simpler tests might overlook—its routine use as a first-line screening tool remains challenging.
Given the relatively long administration time of the RBMT, our findings do not support replacing brief global screening instruments (MMSE or MoCA-J) with RBMT as a universal first-line test. Instead, the secondary screening analyses support a pragmatic two-step workflow: MMSE and MoCA-J for initial triage, with RBMT-PS reserved for cases in which screening results are borderline. When borderline subsets were defined using leakage-free out-of-fold predictions from the MMSE + MoCA-J model, adding RBMT-PS generally improved discrimination for both the Healthy–MCI and MCI–Dementia decisions. These conclusions were consistent across alternative borderline definitions, and the incremental gains tended to be larger in narrower, more ambiguous borderline subsets, suggesting that RBMT-PS is particularly helpful when initial screening is least decisive (Table 2).
Looking ahead, rapidly advancing digital screening tools offer faster administration with high accuracy, sometimes exceeding 90%. 20 Such tools may ultimately complement or replace more time-intensive assessments for initial triage. By incorporating a comprehensive memory function assessment similar to the RBMT into dementia screening tests, it may be possible to achieve highly accurate diagnostic outcomes and enhance early detection.
Accuracy of screening by neuropsychological tests
We achieved an AUC of 0.898 for the three-group classification of Healthy, MCI, and Dementia using only neuropsychological tests. This is comparable to the accuracy of prior studies that have employed a combination of tests or imaging data to distinguish Healthy controls, MCI, and AD.21,22 Studies focused on classifying MCI of the Alzheimer type can sometimes attain very high accuracy, but approximately half of all MCI cases in general may progress to non-AD dementia. 23 Hence, a more general three-group model may be especially useful in real-world clinical scenarios, where AD is not the only concern.
These findings suggest that a well-designed battery of neuropsychological tests can provide a robust screening performance. However, additional data, such as neuroimaging or biomarkers, could further enhance classification. In practice, combining cognitive testing, imaging, and clinical observations may yield a higher accuracy and earlier detection.
The usefulness of neuropsychological tests in diagnosing dementia
Our four-group classification (AD, DLB, VaD, FTLD) using only neuropsychological test data reached an AUC of 0.986, matching or exceeding some reports that also incorporate MRI or other objective measures.24,25 These data underscore how a combination of traditional tests (MMSE, MoCA-J, RBMT) plus more comprehensive measures of cognitive function (WAIS-III versus premorbid estimates) can effectively characterize different dementia profiles. Machine learning models that integrate these traditional and more comprehensive measures may therefore improve diagnostic precision.
Possibility of differentiating dementia subtypes using neuropsychological tests
In this study, we conducted pairwise comparisons of AD, DLB, FTLD, and VaD to determine which neuropsychological tests can aid in diagnostic classification.
AD versus DLB
RBMT scores were lower in AD, while age and performance IQ were lower in DLB. These findings align with the tendency for AD to feature pronounced memory deficits (including delayed recall) from early stages, whereas DLB primarily presents with attention disturbances, executive dysfunction, and visuospatial impairments, usually with relatively mild memory impairment. 26 Regarding age, some studies noted a younger mean age for DLB than for AD (p = 0.08), consistent with the results of the present study. 27
AD versus FTLD
Compared to AD, FTLD showed lower language comprehension, lower general cognitive function, and younger age. FTLD often involves behavioral and social cognitive disorders, with some subtypes showing severe language deficits. 28 Earlier onset in FTLD relative to AD is consistent with previous reports. 29
AD versus VaD
In differentiating AD from VaD, the VaD group had lower performance IQ, perceptual integration, and general intelligence. VaD presents heterogeneous deficits based on vascular lesion sites but commonly impairs executive and visuospatial functions, whereas AD is characterized by significant episodic memory deficits but comparatively preserved perceptual integration and executive function. 30
DLB versus FTLD
DLB was associated with lower perceptual integration, whereas FTLD was associated with lower language comprehension and younger age. DLB typically involves visuospatial and attention deficits, 31 whereas FTLD often presents with substantial language and social cognitive impairments. 32 FTLD is also more common in younger patients than DLB, which generally has a later onset. 33
DLB versus VaD
The classification accuracy between DLB and VaD is notably low, likely because both can present with attention and executive dysfunction, 34 and visuospatial impairments can appear in VaD depending on the lesion location. 35 In addition, this study did not include items focused on visual hallucinations and cognitive fluctuations, which are hallmark features of DLB.36,37
FTLD versus VaD
Although VaD was characterized by lower perceptual integration and performance IQ and FTLD by lower language comprehension, the classification accuracy for FTLD versus VaD remained relatively low. Language and executive functions can distinguish between these conditions, 38 but VaD manifests variably, 39 and FTLD itself has multiple subtypes with diverse behavioral and language presentations. 40
When differentiating AD from other subtypes, assessing not only memory, but also WAIS-III language comprehension, perceptual integration, performance IQ, and full-scale IQ appears useful. For conditions such as DLB, FTLD, and VaD, where clinical features often overlap, specialized assessments of visuospatial function, visual hallucinations, sleep disturbances, and social cognition may be necessary. Despite the potential to classify multiple dementia subtypes via combined neuropsychological tests, challenges persist in addressing the heterogeneous vascular lesions of VaD and the hallucinations and cognitive fluctuations that are characteristic of DLB. Future research should incorporate imaging and biomarker data for a more precise evaluation. In clinical practice, the development of treatment strategies and care plans tailored to each pathology is increasingly important.
Limitations
This study analyzed only AD, DLB, VaD, and FTLD cases in a single university hospital setting, excluding the less common causes of dementia. Consequently, the disease distribution in our sample may not represent a broader population. Notably, subtype classification results should be interpreted cautiously because some dementia subtypes were represented by small sample sizes (particularly FTLD, n = 7). Such class imbalance can yield optimistic performance estimates and unstable variable importance rankings. We also did not integrate biomarker or neuroimaging data, which may have limited diagnostic specificity. Additionally, the exclusion of participants with any missing data (complete-case analysis) might have introduced selection bias, potentially limiting the generalizability of our findings to populations with incomplete data, although this approach ensured data integrity for the machine learning models. We did not analyze the characteristics of the excluded participants in detail to assess the potential magnitude of this bias. Finally, apart from the alternative borderline bands, we did not perform sensitivity analyses of the modeling approach itself, such as evaluating alternative machine learning algorithms or handling missing data differently (e.g., through imputation), which could have provided further insights into the robustness of our results. In addition, the RBMT-PS cut-offs reported in Table 2 were derived post hoc within model-defined borderline subsets to aid clinical interpretability; they are workflow- and subset-specific, have not been externally validated, and should not be interpreted as population-level screening thresholds. Future studies should prospectively validate these findings in larger, multi-center cohorts, ideally incorporating biomarkers, longitudinal follow-up, and more diverse populations.
Conclusion
Using a random forest approach, we demonstrated that the RBMT contributed more to classifying Healthy, MCI, and Dementia than commonly used screening measures, such as the MMSE and the MoCA-J. However, the length of RBMT poses practical barriers for routine screening. Additionally, our neuropsychological test–based models achieved high classification accuracy, both for Healthy/MCI/Dementia detection and for differentiating among dementia subtypes (AD, DLB, VaD, FTLD). Further improvements may be realized by integrating imaging, biomarkers, or brief digital assessments, facilitating earlier and more accurate diagnosis of dementia and its subtypes.
Footnotes
Acknowledgements
The authors utilized the generative AI model Gemini 2.5 Pro (Google LLC) to assist with English language editing. The authors reviewed and edited the AI-generated text and take full responsibility for the final content of the manuscript.
Consent to participate
Informed consent to participate was obtained from participants using an opt-out method.
Consent for publication
Not applicable
Author contribution(s)
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
