Abstract
Background:
For community-dwelling elderly individuals without enough clinical data, it is important to develop a method to predict their dementia risk and identify risk factors for the formulation of reasonable public health policies to prevent dementia.
Objective:
Community survey data on elderly residents were used to build machine learning prediction models for dementia and to analyze its risk factors.
Methods:
In a cluster-sample community survey of 9,387 elderly people in 5 subdistricts of Wuxi City, data on sociodemographics and neuropsychological self-rating scales for depression, anxiety, and cognition evaluation were collected. Machine learning models were developed to predict their dementia risk and identify risk factors.
Results:
The random forest model (AUC = 0.686) had slightly better dementia prediction performance than the logistic regression model (AUC = 0.677) and the neural network model (AUC = 0.664). Analysis of the sociodemographic data and psychological evaluation revealed that depression (OR = 3.933, 95% CI = 2.995–5.166); anxiety (OR = 2.352, 95% CI = 1.577–3.509); multiple physical diseases (OR = 2.486, 95% CI = 1.882–3.284 for three or above); “disability, poverty or no family member” (OR = 1.859, 95% CI = 1.337–2.585) and “empty nester” (OR = 1.339, 95% CI = 1.125–1.595) in special family status; “no spouse now” (OR = 1.567, 95% CI = 1.118–2.197); age older than 80 years (OR = 1.645, 95% CI = 1.335–2.026); and female sex (OR = 1.214, 95% CI = 1.048–1.405) were risk factors for suspected dementia, while a higher education level (OR = 0.365, 95% CI = 0.245–0.546 for college or above) was a protective factor.
Conclusion:
The machine learning models using sociodemographic and psychological evaluation data from community surveys can be used as references for the prevention and control of dementia in large-scale community populations and the formulation of public health policies.
INTRODUCTION
With the aging of the global population and the increasing prevalence of dementia, the disease burden on society and families has increased substantially. The prevalence of Alzheimer’s disease (AD) increases with age: 3% among people aged 65–74 years, 17% among people aged 75–84 years, and 32% among people aged 85 years and older [1]. Dementia has become the fifth leading cause of death among people aged 65 years and over [2]. By the end of 2019, people 60 years or older made up approximately 18.1% of China’s population, accounting for approximately 254 million people [3], and this figure will exceed 400 million by 2030 [4]. The prevalence of dementia in China was 4.9% during 1985–2018 and 7.4% from 2015 to 2018 [5]. Dementia patients need more supervision and personal care, and their family members have to interrupt their employment and pay more medical or other service fees. This leads to an increase in economic burden and emotional stress among caregivers [6].
Therefore, exploring the risk factors for dementia is of great significance for the early identification and prevention of dementia in elderly individuals. The risk factors for dementia in the elderly population include genetic, physiological, biochemical, psychological, and social factors. A large amount of literature reports multifaceted risk factors for senile dementia, including demographics, such as age, sex [7], and low years of education [8]; heredity factors, such as genetics [9] and family history [10]; lifestyle factors, such as a lack of social and mental activity [11], lack of physical exercise, smoking, and drinking alcohol [12]; and physical diseases, such as hypertension, diabetes [12], traumatic brain injury [13], and cardiovascular disease risk factors [14].
Prediction models, as tools to predict the prevalence of disease, are widely used in dementia risk prediction. Currently, most of the research on prediction models is based on neuroimaging data and genetic data, which is not a very economical and feasible method for screening large-scale community populations. Li et al. used the deep-learning time-to-event model to analyze the magnetic resonance data of 2,164 subjects to predict the risk of mild cognitive impairment (MCI) developing into AD. The model predicted the progression to AD with a concordance index of 0.762 from 6 to 78 months [15]. Ding et al. conducted in-depth analysis of 1,002 patients with 18F-FDG PET brain data and found that the diagnostic specificity and sensitivity of predicting AD were 82% and 100%, respectively [16]. Hall et al. developed a supervised machine learning model and found that the overall areas under the curves (AUCs) were 0.73 for dementia and 0.64–0.68 for AD. They proposed that age, sex, and vascular and lifestyle factors were not predictive, but the APOE genotype was most consistently present across all models [17].
However, Ansart et al. conducted a systematic review of existing models for predicting MCI and found that the use of magnetic resonance imaging did not improve the accuracy of prediction [18]. Cognitive variables and clinical and behavioral indicators also have a good prediction effect [19–21]. A recent Chinese study built a logistic regression prediction model for cognitive impairment with 3-year follow-up data of 6,718 community-dwelling elderly people (2008–2011). The model contained four factors (age, instrumental activities of daily living, marital status, and baseline cognitive function) and had a concordance index of 0.814 (95% CI: 0.781–0.846); it could identify community-dwelling elderly people at the greatest 3-year risk for cognitive impairment and help community nurses in the early identification of dementia [22].
This paper aimed to build a prediction model for dementia based on machine learning with prevalence survey data in Chinese community-dwelling elderly people; the prediction model might establish an economical and feasible method for screening large-scale community-dwelling elderly populations. To focus on the accuracy of prediction and analyze the importance of variables, the supervised classification machine learning algorithms (logistic regression, random forest, and neural network models) were applied.
MATERIALS AND METHODS
Participants
This study was based on survey data on the mental health status of the elderly population in Wuxi City; the survey was conducted as part of the “Psychological Care Project for the Elderly” by the Aging Health Division, National Health Commission of China. The sample was recruited through random cluster sampling, and this survey covered 4 subdistricts and 1 town, including 46 communities and 13 villages, distributed in all 5 districts of Wuxi City, involving a total of 9,387 elderly people over 60 years old.
First, informed consent was obtained from the elderly individuals through telephone consultation and then an appointment was made for the elderly individuals to go to the Community Health Service Center to complete the questionnaire survey. For the elderly individuals for whom it was inconvenient to show up, an appointment was made for a home site survey. Before the face-to-face survey, written informed consent was obtained. Elderly individuals who did not provide informed consent did not participate in the survey.
Composition of the questionnaire
The “mental health questionnaire for the elderly”, which was designed by the Aging Health Division of the National Health Commission of China, contained 18 sub-questionnaires, including questionnaires on demographic characteristics, family support, social participation, and chronic diseases; the Patient Health Questionnaire-9 (PHQ-9) to screen for depression; the Generalized Anxiety Disorder scale-7 (GAD-7) to screen for anxiety; and the Alzheimer Disease 8 scale (AD-8) to screen for all-cause (undifferentiated) dementia. The PHQ-9 and GAD-7 were self-rated, and the AD-8 was scored on the basis of information from a family member or caregiver. Participants with PHQ-9 scores ≥5 were classified into the possible depression group, those with GAD-7 scores ≥5 into the possible anxiety group, and those with AD-8 scores ≥2 into the possible dementia group. To ease interpretation and subsequent statistical analysis, all variables were converted to categorical variables, and categories with few responses were combined. All variables were then entered simultaneously into a logistic regression analysis; stepwise regression based on the minimum Akaike information criterion (AIC) retained the 9 variables with significant regression coefficients, and the remaining nonsignificant variables were excluded. For example, survey date and survey location were excluded, and some categories of family status and social support were combined to increase the number of responses in those categories.
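The screening cut-offs described above can be written as a small rule. The following Python sketch is illustrative only: the study itself used R, and the names `Screening` and `classify` are hypothetical. The score ranges checked (PHQ-9 0–27, GAD-7 0–21, AD-8 0–8) are the standard total-score ranges of these instruments.

```python
from dataclasses import dataclass

# Cut-offs as stated in the questionnaire description.
PHQ9_CUTOFF = 5   # possible depression
GAD7_CUTOFF = 5   # possible anxiety
AD8_CUTOFF = 2    # possible dementia


@dataclass
class Screening:
    possible_depression: bool
    possible_anxiety: bool
    possible_dementia: bool


def classify(phq9: int, gad7: int, ad8: int) -> Screening:
    """Turn raw scale totals into the three binary screening groups."""
    # Reject totals outside the valid range of each instrument.
    if not (0 <= phq9 <= 27 and 0 <= gad7 <= 21 and 0 <= ad8 <= 8):
        raise ValueError("score outside the valid range of its scale")
    return Screening(
        possible_depression=phq9 >= PHQ9_CUTOFF,
        possible_anxiety=gad7 >= GAD7_CUTOFF,
        possible_dementia=ad8 >= AD8_CUTOFF,
    )


# Hypothetical participant, not study data.
example = classify(phq9=6, gad7=3, ad8=2)
print(example)
```

Keeping the cut-offs as named constants makes it easy to rerun the analysis with alternative thresholds as a sensitivity check.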
Statistical analysis
All the variables were included in logistic regression analysis to analyze their influence on dementia risk (two categories were created according to the AD-8 score). In total, 9 variables were included as risk factors: age, education level, physical disease, sex, special family status, marital status, source of life support, PHQ9 score, and GAD7 score.
Then, logistic regression, random forest, and neural network models were built with the 9 variables above to predict dementia risk among the community-dwelling elderly individuals, using R version 4.0.5 with the glm function and the randomForest and nnet packages. The original data were divided into a training set (80%) and a test set (20%) by simple random sampling without replacement in R. The training set was used for model development, and the test set was used to estimate the generalizability of the models. After parameter tuning, we set ntree = 4000 and mtry = 6 for the random forest model, and hidden = 1, threshold = 0.01, and learningrate = 0.1 for the neural network model. The performance of the three prediction models was assessed with the AUC, specificity, sensitivity, accuracy, and Youden index. DeLong’s test was used for pairwise comparison of the 3 ROC curves.
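As a rough illustration of the split and the evaluation indexes named above, the following Python sketch uses only the standard library. It is not the authors' R code: the function names, the random seed, and the toy labels and scores are assumptions made for the example.

```python
import random


def split_indices(n, train_fraction=0.8, seed=0):
    """Simple random sampling without replacement into train/test index lists."""
    rng = random.Random(seed)
    train = set(rng.sample(range(n), round(n * train_fraction)))
    return sorted(train), [i for i in range(n) if i not in train]


def confusion_metrics(y_true, y_score, threshold):
    """Sensitivity, specificity, accuracy, and Youden index at one cut-off."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (tp + tn) / len(y_true),
        "youden": sens + spec - 1,
    }


def auc(y_true, y_score):
    """AUC as the chance that a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# 80/20 split of index positions for a sample the size of this survey.
train_idx, test_idx = split_indices(9387)
print(len(train_idx), len(test_idx))

# Toy labels and predicted probabilities (hypothetical, not study data).
y = [1, 1, 1, 0, 0, 0, 0, 1]
p = [0.9, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.8]
metrics = confusion_metrics(y, p, threshold=0.5)
print(metrics)
print(auc(y, p))
```

The pairwise AUC computation is quadratic in sample size, which is acceptable for a test set of this scale; a rank-based formula would be the usual choice for much larger data.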
RESULTS
Demographic data and psychological evaluation results
A total of 9,387 elderly people over 60 years old in 5 districts of Wuxi City were included in this study. Their ages ranged from 60 to 101 years old, with an average age of 72.84±6.16 years old. The average number of years of education was 7.09±3.70 years, and the average number of physical diseases they suffered from was 1.59±1.17. Their demographic data and psychological evaluation results are shown in Fig. 1.

Demographic data and psychological evaluation results. The age, education level, physical diseases, sex, special family status, marital status, source of life support, PHQ9 score, GAD7 score, and AD8 score of 9,387 elderly people over 60 years old in 5 districts of Wuxi City were collected over half a year. Participants with PHQ9 scores ≥5 were classified into the suspected depression group, those with GAD7 scores ≥5 into the suspected anxiety group, and those with AD8 scores ≥2 into the suspected dementia group.
Risk factors for possible dementia according to logistic regression analysis
The logistic regression analysis of the demographic data and psychological evaluation results indicated that PHQ9 scores ≥5 (OR = 3.933, 95% CI = 2.995–5.166); GAD7 scores ≥5 (OR = 2.352, 95% CI = 1.577–3.509); multiple physical diseases (OR = 2.486, 95% CI = 1.882–3.284 for three or above); the “disability, poverty or no family member” (OR = 1.859, 95% CI = 1.337–2.585) and “empty nester” (OR = 1.339, 95% CI = 1.125–1.595) categories of special family status; “no spouse now” marital status (OR = 1.567, 95% CI = 1.118–2.197); age >80 (OR = 1.645, 95% CI = 1.335–2.026); and female sex (OR = 1.214, 95% CI = 1.048–1.405) were risk factors for possible dementia (AD8 score ≥2), and a higher education level (OR = 0.365, 95% CI = 0.245–0.546 for college or above) was a protective factor against possible dementia.
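To make the link between a logistic regression coefficient and the odds ratios reported above explicit, the following Python sketch converts a coefficient and its standard error into an OR with a 95% confidence interval, and the reverse. It assumes Wald-type intervals, which the source does not state, and the function names are hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def odds_ratio_ci(beta, se, z=Z_95):
    """Odds ratio and Wald confidence interval from a logit coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coefficient_from_report(or_value, ci_low, ci_high, z=Z_95):
    """Back out (beta, se) from a published OR and its interval (assumes a Wald CI)."""
    beta = math.log(or_value)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


# Reported estimate for PHQ9 score >= 5: OR = 3.933, 95% CI = 2.995-5.166.
beta, se = coefficient_from_report(3.933, 2.995, 5.166)
or_value, low, high = odds_ratio_ci(beta, se)
print(f"beta={beta:.3f} se={se:.3f} OR={or_value:.3f} CI=({low:.3f}, {high:.3f})")
```

Under the Wald assumption the point estimate is the geometric mean of the interval bounds, so a round trip that reproduces the published bounds to within rounding error is a quick consistency check on a reported OR.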

Forest plot of dementia risk factors in community-dwelling elderly people by logistic regression analysis. The logistic regression analysis of the demographic data and psychological evaluation results indicated that “suspected depression (PHQ9 score ≥5)”, “suspected anxiety (GAD7 score ≥5)”, “disability, poverty or no family member” and “empty nester” in special family status, “with no spouse now”, physical disease, age, and female sex were risk factors for suspected dementia (AD8 score ≥2), with OR values ranging from 1.214 to 3.933. Education level was a protective factor, with OR values ranging from 0.365 to 0.711.
Evaluation and comparison of three machine learning prediction models
A logistic regression model, random forest model, and neural network model were used to predict dementia risk among the community-dwelling elderly individuals. The evaluation indexes are summarized in Table 1, and the ROC curves for the test set are shown in Fig. 3. The logistic regression model (AUC = 0.677), random forest model (AUC = 0.686), and neural network model (AUC = 0.664) all had some level of predictive ability. DeLong’s test showed no significant differences between the ROC curves of the logistic regression and random forest models (Z = –0.515, p = 0.607), the random forest and neural network models (Z = 1.198, p = 0.231), or the logistic regression and neural network models (Z = 1.560, p = 0.119). The random forest model therefore had only a slightly higher AUC than the other two models.
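The reported p-values can be checked against the reported Z statistics, because DeLong’s test refers Z to a standard normal distribution. The short Python sketch below recomputes the two-sided p-values; it does not reimplement the DeLong variance estimate itself, and the function name is hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# (comparison, Z, reported p) as given in the Results.
reported = [
    ("logistic regression vs random forest", -0.515, 0.607),
    ("random forest vs neural network", 1.198, 0.231),
    ("logistic regression vs neural network", 1.560, 0.119),
]
for name, z, p in reported:
    print(f"{name}: Z={z:+.3f} recomputed p={two_sided_p(z):.3f} reported p={p:.3f}")
```

The recomputed values agree with the reported ones to within 0.001, so the Z statistics and p-values in the text are mutually consistent.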
Evaluation indexes summary of the three algorithm models

ROC curves of the three machine learning prediction models in the test set. All three models had some predictive value. DeLong’s test showed no significant differences between the ROC curves of the logistic regression and random forest models (Z = –0.515, p = 0.607), the random forest and neural network models (Z = 1.198, p = 0.231), or the logistic regression and neural network models (Z = 1.560, p = 0.119). The random forest model had a slightly higher AUC. The numbers in the plot are the AUC values.
In the random forest model for the prediction of dementia risk, the PHQ9 score (mean decrease accuracy, MDA = 40.83), GAD7 score (MDA = 34.00), sources of life support (MDA = 31.92), education level (MDA = 25.67), special family status (MDA = 23.37), physical diseases (MDA = 22.51), marital status (MDA = 22.31), age (MDA = 17.28), and gender (MDA = 4.65) were very important, as shown in Fig. 4.
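The idea behind mean decrease accuracy can be illustrated with a permutation sketch: shuffle one predictor at a time and record how much accuracy drops. The Python example below is a simplified stand-in, not the randomForest package's procedure (which permutes out-of-bag samples per tree and scales the result); the toy data, the stand-in model, and the function names are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced.
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: column 0 drives the label, column 1 is noise.
rows = [(i % 2, (i // 2) % 2) for i in range(200)]
labels = [r[0] for r in rows]


def toy_model(row):
    """Stand-in for a fitted classifier: predicts from column 0 only."""
    return row[0]


importance = permutation_importance(toy_model, rows, labels)
print(importance)
```

A predictor the model never uses yields a drop of exactly zero, while an informative predictor yields a clearly positive drop, which is the ordering that a variable importance plot displays.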

Variable importance plot of the random forest model. The PHQ9 score, GAD7 score, source of life support, education level, special family status, physical diseases, marital status, age, and gender were important to the random forest model for the prediction of dementia risk.
DISCUSSION
The aging of the population and the increasing prevalence rate of dementia are serious public health problems worldwide. For a large number of elderly individuals without enough clinical data, it is very important to identify risk factors for dementia and develop a model to predict the population at high risk of dementia to guide the establishment of reasonable public health policies to prevent dementia. At present, studies of machine learning prediction models pay more attention to a series of complex and expensive biological data, such as neuroimaging and genetic data, which are not affordable for the screening of large numbers of community-dwelling elderly individuals. Moreover, a systematic review found that the use of magnetic resonance imaging did not improve the accuracy of the prediction model [18]. Clinical data also contain nonimaging data, such as demographic information and neuropsychological tests. It is of great medical interest to reveal hidden patterns that may help clinicians gain new insights [23].
Therefore, this cluster-sampling community survey collected data on sociodemographics and neuropsychological self-rating scales for depression, anxiety, and cognition evaluation among 9,387 elderly people in 5 subdistricts of Wuxi City. We developed machine learning models to identify the risk factors for dementia and predict the population at high risk of dementia. When faced with a variety of machine learning algorithms, even a senior data scientist may not be able to identify the best one before evaluating them. During evaluation, we followed specific principles and procedures. First, supervised learning was chosen because the outcome (dependent variable) was labeled according to preset criteria, and all samples were divided into a training set and a test set. Second, we were predicting a binary result, not a numerical result; thus, we selected classification algorithms. Finally, because we focused on the accuracy of prediction as well as the importance of the included variables, we selected the logistic regression, random forest, and neural network algorithms.
Machine learning models help to predict dementia risk
This paper found that the logistic regression model, random forest model, and neural network model all had some level of predictive ability, and that the random forest model performed better than the other two in both the training and test datasets. The AUC of the random forest model differed substantially between the training set and the test set, which may reflect overfitting arising from the characteristics of the random forest algorithm. Nevertheless, its AUC in the test set was 0.686, the highest of the three models.
Several previous papers have made attempts to establish machine learning models to predict existing and possible dementia and obtained some good results. Joshi et al. used 7 machine learning models, including decision tree, bagging, BF tree, random forest tree, RBF networks, neural networks, and multilayer perceptron, for the classification of AD, vascular disease, and Parkinson’s disease in 746 patients. It was reported that the Multilayer Perceptron model performed best, but they did not use AUC as the evaluation index. It was also reported that the increase in vascular risk factors increases the risk of AD, and the APOE gene, diabetes, age, and smoking were the strongest risk factors for AD [24]. Pekkala et al. developed a late-life dementia prediction model using a supervised machine learning method, the Disease State Index (DSI), including predictors such as cognition, vascular factors, age, subjective memory complaints, and APOE genotype, in the Finnish population-based CAIDE study. They reported that AUCs for DSI were 0.79 and 0.75 for the main and extended populations, respectively [25]. Hall et al. used the same DSI model to predict dementia in the Vantaa 85 + cohort study and found that the overall AUCs were 0.73 for dementia and 0.64–0.68 for AD. Cognition is the most important predictor of dementia, followed by function (daily activity ability), sociodemographics (education) and other factors, but age, sex, and vascular and lifestyle factors were not predictive [17]. Cleret et al. used unsupervised machine learning classification (hierarchical clustering on principal components) to identify participants with a high likelihood of dementia in population-based surveys using data from two large surveys in Europe. 
They reported that a cluster of both functional and walking/climbing limitations indicated a higher likelihood of dementia (probability of dementia >0.95; AUC = 0.91) and proposed that machine learning could identify a high likelihood of dementia in population-based surveys, even without cognitive and behavioral measures [26]. Ford et al. also used 5 machine learning classifiers to identify extant and potential dementia with data from general practice patient records of 93,120 patients aged >65 years and reported that logistic regression, support vector machine, neural network, and random forest performed very similarly, with an AUC of 0.74. The top features retained in the logistic regression model were disorientation and wandering, behavior change, schizophrenia, self-neglect, and difficulty managing [27]. A recent Chinese study built a logistic regression prediction model for cognitive impairment with 3-year follow-up data of 6,718 community-dwelling elderly people (2008–2011). The model included four factors (age, instrumental activities of daily living, marital status, and baseline cognitive function) and had a concordance index of 0.814 (95% CI 0.781–0.846); it could identify community-dwelling elderly people at the greatest 3-year risk for cognitive impairment and help community nurses in the early identification of dementia [22].
These studies all showed that clinical data also contain nonimaging and nongenetic data, such as demographic information and neuropsychological tests, which are of great medical interest because they may reveal hidden patterns that help clinicians gain new insights. The studies by Hall et al. [17], Joshi et al. [24], and Pekkala et al. [25] all utilized the APOE genotype, clinical data, sociodemographic data, and relatively small sample sizes (range: 245–709), and the high cost of genetic testing may limit the application of their prediction models. The AUC values of their models ranged from 0.73 to 0.79. Ford et al. [27] and Cleret et al. [26] developed prediction models with detailed clinical data from large, specific projects, with AUC values in the range of 0.74–0.91. Only the study of Hu et al. [22] involved community-dwelling elderly individuals without detailed clinical data or diagnosis; their AUC value was 0.814. Comparisons of the predictions of machine learning models for dementia can be found in Supplementary Table 1. The AUC value of our prediction model for all-cause (undifferentiated) dementia was slightly lower than the above research results (0.686, less than 0.7). The main explanation may be that we built these machine learning models from the results of community surveys performed by nonclinicians, and most of the tools used were self-assessment or informant-assessment scales applicable to epidemiological surveys, which may have slightly reduced the quality and reliability of the data. However, our results are still a reference for the prevention and control of dementia in large-scale community populations and the formulation of public health policies.
Machine learning model helps to identify the risk factors for dementia
The logistic regression and random forest models both revealed that sociodemographic data and psychological evaluation, such as PHQ9 scores, GAD7 scores, physical diseases, special family status, marital status, age, sex, and education level, were very important for the prediction of dementia risk. PHQ9 scores ≥5; GAD7 scores ≥5; multiple physical diseases; “disability, poverty or no family member” and “empty nester” in special family status; “no spouse now”; age older than 80 years; and female sex were risk factors for possible dementia, while a higher education level was a protective factor.
This study revealed that the odds of suspected dementia in elderly people with suspected depression and suspected anxiety were 3.933 times and 2.352 times, respectively, those in people who screened negative, which is consistent with previous studies. Livingston et al. also proposed that depression may be a precursor to dementia, and elderly people with depressive symptoms have a higher risk of developing dementia. It is also believed that if the depression risk factor is completely eliminated, the number of new dementia cases can be reduced by 3.9% (the population attributable fraction is 3.9%) [28]. Mossaheb et al. developed an AD dementia prediction model that included depressive symptoms and found that loss of interest among the eight symptoms of depression (depression, loss of interest, change in appetite, sleep disorder, psychomotor change, energy loss, uselessness, and difficulty in focusing) was significantly related to the occurrence and development of AD [29]. Santabárbara et al. conducted a meta-analysis of anxiety and dementia that included six prospective cohorts and 10,394 participants and found that there was a significant correlation between anxiety and dementia, with an RR of 1.29 (95% CI: 1.01–1.66) [30]. These results strengthen the view that depression and anxiety are risk factors for dementia. The pathological mechanism may be related to the effect of depression and anxiety on stress hormones, neuronal growth factor, and hippocampal volume [31].
The presence of three or more kinds of physical diseases was associated with 2.486 times the odds of suspected dementia in elderly individuals, two kinds with 1.812 times the odds, and one kind with 1.543 times the odds. Age-related physical health problems and dementia are frequently comorbid. This co-occurrence arises because a number of physical problems, such as diabetes and hypertension, increase the risk of AD and vascular dementia and make mixed dementia more likely to occur [12–14]. The more physical diseases a person has, the more likely he or she is to develop dementia, which may be related to a lack of resilience and repair ability that leads to these problems [32, 33].
Regarding special family status, “disability, poverty or no family member” and “empty nester” were risk factors for suspected dementia; the odds for individuals in these categories were 1.859 times and 1.339 times, respectively, those of the reference category. Regarding physical disability, visual and hearing impairment may predict or accelerate cognitive deterioration, and alterations of vision and hearing may manifest as complex cognitive and behavioral symptoms relevant to the differential diagnosis of dementias [34, 35]. Additionally, Chen and Cao reported that economic hardship impacts cognitive functioning and highlighted the negative health risks that economically disadvantaged individuals may experience [36]. A meta-analysis of the relationship between social activities and incident dementia found that people who rarely participated in social activities (RR: 1.41, 95% CI: 1.13–1.75), had less frequent social activities (RR: 1.57, 95% CI: 1.32–1.85), and experienced more loneliness (RR: 1.58, 95% CI: 1.19–2.09) had a higher risk of dementia [37].
Having no spouse (living alone, never married, divorced, or widowed) was a risk factor for suspected dementia, with odds 1.567 times those of individuals with a spouse. In a Swedish national register-based study, Sundström et al. also found that elderly people who live alone as nonmarried individuals may be at a significantly higher risk of dementia than married individuals, with the highest risk observed among people in the young-old age group (50–64 years old), especially among those who were divorced or single (HRs 1.79 versus 1.71) [38]. Liu et al. conducted a study of 15,379 individuals aged 52 years and older and found that all unmarried individuals (including the cohabiting, divorced/separated, widowed, and never married) had significantly higher odds of developing dementia over the study period than their married counterparts [39]. These findings are broadly consistent with our results.
Regarding age and sex, an age of 80–101 years was a risk factor for suspected dementia, with odds 1.645 times those of individuals aged 60–69 years. Dementia usually occurs in elderly individuals, and the prevalence rate increases exponentially at the age of 65 years and above. Overall, approximately 80% of dementia patients are aged 75 years or older. There may be an interaction between age, neuropathology, comorbidity, and clinical manifestations, especially with the continuous increase in life expectancy [40]. Female sex was also a risk factor for suspected dementia, with odds 1.214 times those of men. Epidemiological investigations have shown that the prevalence of AD and other forms of dementia is higher in women (1188.9 per 100,000) than in men (669.3 per 100,000) [41].
Education level is a protective factor against dementia. Compared with illiterate individuals, the OR for suspected dementia was 0.711 in the primary school-level population, 0.517 in the junior middle school-level population, 0.558 in the senior high school-level population, and 0.365 in the population with a university-level education or above. Kremen et al. found that a low level of education or a short duration of education can affect the cognitive reserve of the brain, so the protective effect of a high level of education on cognitive function may not impact the pathological changes in the brain [42], but it might increase the clinical threshold of cognitive changes [43]. Low levels of education are thought to lead to a decline in cognitive ability because they reduce the cognitive reserve that enables people to maintain function in the event of brain lesions [44].
Conclusion
This study was a cluster-sampling community survey that collected sociodemographic data and neuropsychological self-rating scales for depression, anxiety, and cognition evaluation among 9,387 elderly people in 5 subdistricts of Wuxi City. We developed machine learning models to identify the risk factors for all-cause (undifferentiated) dementia and predict the population at high risk of dementia. We found that the logistic regression model, random forest model, and neural network model all had some predictive ability, and that the random forest model had slightly better performance than the other two models in both the training and test datasets. The sociodemographic data and psychological evaluation revealed that depression; anxiety; multiple physical diseases; “disability, poverty or no family member” and “empty nester” in special family status; “no spouse now”; age older than 80 years; and female sex are risk factors for suspected dementia, while a higher education level is a protective factor.
These results on the prediction model and risk factors for all-cause (undifferentiated) dementia suggest that the priority community population for dementia prevention is elderly women who live alone, have a low level of education, and have multiple physical and mental health problems. When public health policies to prevent dementia are formulated, more attention should be given to treating their physical diseases, reducing depression and anxiety, increasing care and attention from family or caregivers, and organizing social activities in which they can participate.
This research also had some limitations. We built machine learning models for all-cause (undifferentiated) dementia from sociodemographic data and psychological evaluations (nonimaging and nongenetic data) collected in community surveys carried out by nonclinicians, and most of the tools used were self-assessment and informant-assessment scales designed for epidemiological surveys, which may have slightly reduced the quality and reliability of the data. However, our results are still a reference for the prevention and control of dementia in large-scale community populations and the formulation of public health policies.
ACKNOWLEDGMENTS
This work was supported by the Top Talent Support Program for young and middle-aged people (No. BJ2020084), the General Program of the Wuxi Health Committee (No. M202013), and the Major Program of East-West Cooperation Science and Technology Support Project of Haidong Science and Technology Bureau (No. 2021-HDKJ-Z1) to LL.
