An application of the Bayesian network model based on the EN-ESL-GA algorithm: Exploring the predictors of heart disease in middle-aged and elderly people in China

Abstract

BACKGROUND:

The morbidity and mortality of heart disease are increasing in middle-aged and elderly people in China. It is necessary to explore relationships and interactive associations between heart disease and its risk factors in order to prevent heart disease.

OBJECTIVE:

To establish a Bayesian network model of heart disease and its influencing factors in middle-aged and elderly people in China, and explore the applicability of the elite-based structure learner using genetic algorithm based on ensemble learning (EN-ESL-GA) algorithm in etiology analysis and disease prediction.

METHODS:

Based on the 2013 national tracking survey data from China Health and Retirement Longitudinal Study (CHARLS) database, EN-ESL-GA algorithm was used to learn the Bayesian network structure. Then we input the data and the learned network structure into the Netica software for parameter learning and inference analysis.

RESULTS:

The Bayesian network model based on the EN-ESL-GAalgorithm can effectively excavate the complex network relationships and interactive associations between heart disease and its risk factors in middle-aged and elderly people in China.

CONCLUSIONS:

The Bayesian network model based on the EN-ESL-GA algorithm has good applicability and application prospect in the prediction of diseases prevalence risk.

Keywords

Bayesian networks heart disease influence factors middle-aged and elderly people

1. Introduction

Cardiovascular diseases are the leading cause of morbidity and mortality worldwide [1]. Studies [2] have indicated that aging is the major risk factor for cardiovascular diseases. Due to the large population and the very fast aging rate, China is facing the great challenge of the rapid increase in the number of cardiovascular diseases and deaths. As the main type of cardiovascular diseases, heart disease will seriously affect the labor force of patients. Therefore, it is absolutely necessary to explore the influencing factors of heart disease and the interactions between those influencing factors in middle-aged and elderly people in China so as to provide countermeasures for the prevention and management of heart disease in middle-aged and elderly people in China and better realize “healthy aging”.

The relevant studies [3, 4] showed that risk factors associated with heart disease included hypertension, diabetes, dyslipidemia, obesity, lack of physical activity, smoking and poor diet habits. However, these risk factors often are not independent, but influence each other and work together in the occurrence and development of heart disease. In this case, traditional statistical methods such as logistic regression with the independence between variables as the premise are difficult to identify the interactions of independent variables on prevalence of heart disease. The Bayesian network is a very effective tool for dealing with uncertain knowledge expression and reasoning. By organically combining the knowledge of probability theory and graph theory, it can not only identify the complex interactions between variables, but also reveal the strengths of the dependency relationships between variables from a quantitative perspective [5]. It has been widely applied in the prediction and diagnosis of diseases [6, 7, 8, 9]. Constructing the Bayesian network model mainly includes three parts: structure learning, parameter learning and inference. Among them, it is the first and key task to obtain the Bayesian network structure which can accurately express the correlation between variables. In this study, an improved hybrid structure learning strategy for Bayesian networks based on ensemble learning named EN-ESL-GA was used to learn the Bayesian network structure. The EN-ESL-GA algorithm is a kind of Bagging ensemble method based on an elite-based structure learner using genetic algorithm (ESL-GA) [10], and the Bayesian network structure established by it is more accurate and reliable. Parameter learning and reasoning analysis were completed in Netica software.

2. Materials and methods

2.1 Data materials

2.1.1 Data sources

CHARLS which contains the basic data of families and individuals of middle-aged and elderly people aged 45 and over in China provides good data support for analyzing the aging problem of China’s population [11]. In order to understand the complex relationships between heart disease and its risk factors in middle-aged and elderly people in China so as to carry out appropriate intervention on the modifiable risk factors of heart disease and better realize “healthy aging”, this study selected 12 relevant variables to form the research data set variables based on the 2013 national tracking survey data in CHARLS database.

2.1.2 Data processing and variable definition

After processing the outliers and missing values of the research data set and discretizing the continuous variables, 11091 cases of qualified data were obtained as the data set for establishing Bayesian network model.

The data set involved 12 variables including gender, age, registered residence, education level, smoking, drinking, BMI, depression, hypertension, diabetes, dyslipidemia and heart disease (Heart disease here is a general term for any of various structural or functional abnormalities or diseases of the human heart including myocardial infarction, coronary heart disease, angina pectoris, congestive heart failure or other heart problems.). The CESD-10 scale was used in the CHARLS survey to investigate the risk of depression in middle-aged and elderly people, and the depression score ranged from 0 to 30 points. This study defined that if the depression score was greater than or equal to 10, it was depression, otherwise it was no depression. The classification and assignment of all variables are shown in Table 1.

Table 1
Description of node variables of Bayesian network

	Variable code	Variable meaning	Variable value
1	Gender	Gender	Male $=$ 1, female $=$ 2
2	Age	Age	Middle age (45 ${\leqslant}$ Age $<$ 60) $=$ 1, old age (Age ${\geqslant}$ 60) $=$ 2
3	Registered residence	Registered residence	Agriculture $=$ 1, non-agriculture $=$ 2
4	Education	Education	Primary education and below $=$ 1, secondary education $=$ 2, higher education $=$ 3
5	Smoking	Have you ever smokedï¼Ÿ	Yes $=$ 1, no $=$ 2
6	Drinking	Have you had alcohol in the past year?	Yes $=$ 1, no $=$ 2
7	BMI	BMI (kg/m²)	Underweight (BMI $<$ 18.5) $=$ 1, normal (18.5 $\leqslant$ BMI $<$ 25) $=$ 2, overweight (25 $\leqslant$ BMI $<$ 30) $=$ 3, obese (BMI $\geqslant$ 30) $=$ 4
8	Depression	Depression	Yes $=$ 1, no $=$ 2
9	Hypertension	Self-report whether you have hypertension or not	Yes $=$ 1, no $=$ 2
10	Diabetes	Self-report whether you have medically diagnosed diabetes or not	Yes $=$ 1, no $=$ 2
11	Dyslipidemia	Self-report whether you have medically diagnosed dyslipidemia or not	Yes $=$ 1, no $=$ 2
12	Heart disease	Self-report whether you have medically diagnosed heart disease or not	Yes $=$ 1, no $=$ 2

2.2 Bayesian network modeling

Firstly, by virtue of the Bayesian network toolbox BNT in MATLAB, this study used the elite-based structure learner using genetic algorithm based on ensemble learning (EN-ESL-GA) algorithm to learn the structure of the Bayesian network for data related to heart disease and its possible risk factors in middle-aged and elderly people in China. Then we input the data and the learned network structure into the Netica software for parameter learning and inference analysis.

The EN-ESL-GA algorithm is a hybrid Bayesian network structure learning method, featuring strong optimization search ability and more accurate and reliable optimization results. The EN-ESL-GA algorithm and its performance analysis have been described in detail in the manuscript [12]. The EN-ESL-GA algorithm is a kind of bagging ensemble method based on the ESL-GA algorithm, in which the number of sample sets is set to 20 and the weighted average method is selected as the fusion strategy. The algorithm equates the process of Bayesian network structure learning to learning the adjacency matrix of the directed acyclic graph. Firstly, 20 adjacency matrices learned by the ESL-GA algorithm are weighted and averaged and the weight is the normalized BDeu score [13] of the Bayesian network represented by each adjacency matrix. Secondly, a fusion matrix is obtained by filtering out the edges with weak dependence according to a given threshold. Finally, the fusion matrix is revised by the operations of removing loops and the maximum number control of parent nodes to obtain the final Bayesian network structure.ã€€Figure 1 shows the flow chart of the EN-ESL-GA algorithm.

Figure 1.

The flow chart of the EN-ESL-GA algorithm.

Figure 2.

The Bayesian network structure.

3. Results

3.1 Structure learning

When the EN-ESL-GA algorithm was used to learn the Bayesian network structure of the data related to heart disease and its risk factors of middle-aged and elderly people in China, the maximum number of parent nodes of the network was not known in advance. Here, the maximum node degree 4 of the Bayesian network structure obtained by the Maximum Weight Spanning Tree (MWST) algorithm [14] was set to the maximum number of parent nodes during the network construction and other parameters were adjusted appropriately. Specifically, the maximum CPU time was set as 600 s, the population size N was set as 200, the maximum iteration number M was set as 50 and the weight threshold of the integration result was set as 0.1. In addition, considering the actual causality, the learning results after each application of single ESL-GA algorithm were modified and the directed edges between gender, age and registered residence node variables and other node variables could only be pointed to other node variables by gender, age and registered residence node variables. Only education pointed to other node variables except gender, age and registered residence. If there were directed edges between smoking (drinking) and variables including gender, age, registered residence and education, the directions of directed edges could only be directed to smoking (drinking) by these variables, while the directions of the relationships between smoking (drinking) and other variables except drinking (smoking) were opposite. Only BMI pointed to other node variables except gender, age, registered residence, education, smoking and drinking.

As shown in Fig. 2, a Bayesian network structure with 12 nodes and 27 directed edges was finally constructed. It can be seen that gender, BMI, diabetes and dyslipidemia were directly related to heart disease while other factors were indirectly related to heart disease through these direct factors. Smoking and drinking indirectly affected the occurrence of heart disease through the intermediary node BMI. Registered residence was indirectly linked to heart disease through BMI and diabetes. Hypertension and depression indirectly affected heart disease by directly affecting dyslipidemia. At the same time, it is observed that gender, age and registered residence were directly correlated with education, and education further affected smoking, drinking and depression, thus indirectly correlated with heart disease.

3.2 Parameter learning and reasoning

The above Bayesian network structure and data set were input into Netica software for parameter learning. Table 2 lists the conditional probability distribution table of heart disease after the rank of heart disease prevalence from large to small. In order to more intuitively describe the difference in the incidence of heart disease in different gender, the line chart of conditional probability distribution of heart disease as shown in Fig. 3 was drawn according to Table 2.

Table 2
The conditional probability distribution table of heart diseases

Parent nodes of heart disease				Prevalence of
BMI	Diabetes	Dyslipidemia	Gender	heart disease (%)
Underweight	Yes	Yes	Nale	66.7
Obese	Yes	Yes	Male	53.6
Obese	Yes	Yes	female	47.4
Obese	Yes	No	Male	45.5
Overweight	Yes	Yes	Female	44.2
Underweight	No	Yes	Male	43.8
Overweight	Yes	Yes	Male	38.0
Obese	No	Yes	Female	37.8
Overweight	No	Yes	Female	37.2
Obese	Yes	No	Female	34.3
Normal	Yes	Yes	Female	33.8
Underweight	Yes	Yes	Female	33.3
Normal	Yes	Yes	Male	32.1
Obese	No	Yes	Male	32.1
Underweight	Yes	No	Female	28.6
Normal	No	Yes	Female	25.8
Normal	No	Yes	Male	25.7
Overweight	No	Yes	Male	25.6
Overweight	Yes	No	Male	22.7
Normal	Yes	No	Female	21.5
Overweight	Yes	No	Female	20.9
Obese	No	No	Female	19.0
Underweight	No	Yes	Female	18.2
Underweight	Yes	No	Male	16.7
Overweight	No	No	Female	15.1
Underweight	No	No	Female	13.1
Obese	No	No	Male	12.6
Normal	Yes	No	Male	11.9
Normal	No	No	Female	11.5
Overweight	No	No	Male	11.4
Underweight	No	No	Male	9.1
Normal	No	No	Male	7.5

Figure 3.

Line charts of conditional probability distribution of heart disease by gender (Note: The legend shows the status of the parent nodes of heart disease including diabetes, dyslipidemia and gender in order. For example, the broken line of ‘yes-yes-male’ reflects the change in the prevalence of heart disease in men with diabetes and dyslipidemia with the change of BMI.)

As can be seen from Fig. 3, the prevalence of heart disease among the middle-aged and elderly in China generally showed the following characteristics: the more obese BMI tended to be, the more likely it was to suffer from heart disease; People with diabetes and dyslipidemia were more likely to develop heart disease. However, the rates of heart disease when BMI was underweight in ‘y-y-m’ and ‘n-y-m’ were as high as 66.7% and 43.8% respectively, both exceeding the rates of heart disease when BMI was obese (53.6% and 32.1%). This makes us naturally want to carry out the diagnostic reasoning to see what factors have a clear gender difference dominating this situation.

Set Diabetes $=$ yes, Dyslipidemia $=$ yes, Sex $=$ male, and changes of smoking and drinking (direct influencing factors of BMI) at the state of BMI of underweight compared with that at the state of BMI of normal were as follows: the smoking rate increased from 80.7 percent to 87.6 percent; the drinking rate decreased from 56.9 percent to 43.3 percent. Similarly, set Diabetes $=$ no, Dyslipidemia $=$ yes, Sex $=$ male. The changes of smoking and drinking at the state of BMI of underweight compared with that at the state of BMI of normal were as follows: the smoking rate increased from 80.9 percent to 88.6 percent; the drinking rate decreased from 56.7 percent to 44.1 percent. It was thus clear that smoking was the main reason for the higher risk of heart disease in men who were underweight.

4. Discussion

The Bayesian network model constructed in this study was used to reveal the complex network relationships between heart disease and its influencing factors. The results of the study showed that the prevalence of heart disease in middle-aged and elderly people in China was 13.9%. Gender, BMI, diabetes and dyslipidemia were directly related to heart disease, while other factors were indirectly related to heart disease through these direct influencing factors. The results of this study indicated that there was a gender difference in the prevalence of heart disease, women (16.4%) were more likely to suffer from heart disease than men (11.3%). The more obese a person tended to be, the more likely it was to suffer from heart disease, and the rate of heart disease among obese people was 24.5%. Li et al. [4] also pointed out that high BMI was a key risk factor for cardiovascular diseases when studying the potential impacts of time trends of lifestyle factors on the burden of cardiovascular diseases in China. Patients with diabetes had a higher risk of heart disease (28.2%). The relevant study [15] showed that the risk of the coronary heart disease and its death would increase by 1.38 and 1.86 times respectively with each 10-year increase in the course of diabetes. The prevalence rate of heart disease in patients with dyslipidemia was 32.1%, indicating that patients with dyslipidemia were more vulnerable to heart disease, which was consistent with the research results in literature [16, 17].

There was also the certain interaction between gender and BMI which affected heart disease. Both the proportions of overweight and obesity in females (33.3% and 7.9%) were higher than that in males (26.9% and 3.9%), which might be one of the reasons for the higher prevalence of heart disease in females. The smoking rate of males was extremely high (79.2%). In the previous inference analysis of the Bayesian network model, we had found that smoking was very likely to lead to a higher risk of heart disease in men with low weight. This was consistent with the results of a previous study [18] which pointed out that smoking was a key risk factor of the coronary heart disease, and the weight change caused by smoking had no obvious mediating effect on the risk of the coronary heart disease. In addition, BMI, diabetes and dyslipidemia interacted with each other on heart disease. BMI directly affected diabetes and dyslipidemia which in turn affected diabetes. In the case of obese patients with diabetes and dyslipidemia, the rates of heart disease in men and women were 53.6% and 47.4% respectively. This was consistent with the conclusions of previous studies. For example, the meta-analysis [19] found that obese adults at the onset of diabetes had a higher risk of morbidity and mortality of the cardiovascular disease. Rao et al. [20] found that there was a dose-response non-linear relationship between BMI and hyperlipidemia when studying the cross-sectional relationship between BMI and hyperlipidemia in adults in the northeast China, and the odds ratio tended to increase significantly as per 1 kg/m² increased in BMI. A cohort study [21] found that dyslipidemia was the most important factor affecting type 2 diabetes.

This study also found that registered residence was indirectly related to heart disease through BMI and diabetes. It can be inferred from the network reasoning that people with non-agricultural registered residence tended to be more obese and had a greater risk of diabetes than those with agricultural registered residence, which thus increased the probability of heart disease by 1.0%. Literature [22] that studied the difference in the prevalence of diabetes between rural and urban areas in Tanzania also showed that the prevalence of diabetes in urban areas was higher than that in rural areas in Tanzania and this trend could be explained by the difference in obesity and overweight rates. Depression indirectly affected the occurrence of heart disease by directly affecting the occurrence of dyslipidemia. Literature [23] also pointed out that patients with dyslipidemia with depression had an increased risk of the cardiovascular disease. The two direct factors that affected hypertension included age and BMI, and the older the age was and the greater the BMI was, the more likely they were to suffer from hypertension. It can be inferred by the inference learning that the probability of suffering from hypertension in the elderly with obesity was 67.3%, which had exceeded twice of the general prevalence of hypertension (30.1%). In combination with the Bayesian network structure, inference and the analysis mentioned above, it could be found that BMI was a common direct influencing factor of hypertension, diabetes, dyslipidemia and heart disease, and obesity, hypertension, diabetes and dyslipidemia all increased the risk of heart disease and there were significant interactions between them. Therefore, the comorbidity of heart disease and obesity, hypertension, diabetes and dyslipidemia deserved high attention [24]. In addition, it was observed that gender, age and registered residence were directly correlated with education, and education further affected smoking, drinking and depression, thus indirectly correlated with heart disease.

5. Conclusions

It is thus clear that the Bayesian network model constructed in this paper which can not only discover the key influencing factors of heart disease and the interaction mechanisms among these key influencing factors but also deeply dig out certain intermediate links in the occurrence and development of heart disease has good applicability in the risk prediction of diseases. However, due to the complexity of practical problems, it is necessary to adjust the directions of some edges of the network structure according to some professional knowledge when using the EN-ESL-GA algorithm to learn the structure of Bayesian networks, which further increases the number and difficulty of network learning. This is the deficiency of the EN-ESL-GA algorithm in practical application, which needs to be further discussed.

Data availability statement

The datasets used in the experiments are publicly available at http://charls.pku.edu.cn/.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Ethics approval

As the data sets included in the study were downloaded from public databases, the study did not need the approval of an ethics committee.

Footnotes

Acknowledgments

The authors thank all participants of this study and the publicly available data set from http://charls.pku. edu.cn/.

Conflict of interest

All the authors declared no conflict of interest.

References

Sanchez-Rodriguez

Egea-Zorrilla

Plaza-Díaz

Aragón-Vela

Muñoz-Quezada

Tercedor-Sánchez

, et al. The gut microbiota and its implication in the development of atherosclerosis and related cardiovascular diseases. Nutrients. 2020 Feb 26; 12(3): 605.

Liberale

Badimon

Montecucco

Lüscher

Libby

Camici

. Inflammation, aging, and cardiovascular disease: JACC review topic of the week. Journal of the American College of Cardiology. 2022 Mar 1; 79(8): 837-47.

The Writing Committee of the Report on Cardiovascular Health and Diseases in China. Report on Cardiovascular Health and Diseases in China 2021: An Updated Summary. Biomedical and Environmental Sciences. 2022; 35(7): 573-603.

Klein

Shenasa

Baranchuk

. Social risk factors and atrial fibrillation. Cardiac Electrophysiology Clinics. 2021 Mar 1; 13(1): 165-72.

Dai

Ren

Shikhin

. An improved evolutionary approach-based hybrid algorithm for Bayesian network structure learning in dynamic constrained search space. Neural Computing and Applications. 2020 Mar; 32: 1413-34.

Ahmed

Dera

Hassan

Bouaynaya

Rasool

. Failure detection in deep neural networks for medical imaging. Frontiers in Medical Technology. 2022 Jul 22; 4: 919046.

Ning

Sun

Wang

Fan

Jia

Cui

. Use of machine learning-based integration to develop a monocyte differentiation-related signature for improving prognosis in patients with sepsis. Molecular Medicine. 2023 Mar 20; 29(1): 37.

Delucchi

Spinner

Scutari

Bijlenga

Morel

Friedrich

Furrer

Hirsch

. Bayesian network analysis reveals the interplay of intracranial aneurysm rupture risk factors. Computers in Biology and Medicine. 2022 Aug 1; 147: 105740.

Waddell

Namburete

Duckworth

Fichera

Telford

Thomaides-Brears

Cuthbertson

Brady

. Poor glycaemic control and ectopic fat deposition mediates the increased risk of non-alcoholic steatohepatitis in high-risk populations with type 2 diabetes: Insights from Bayesian-network modelling. Frontiers in endocrinology. 2023 Feb 22; 14: 1063882.

10.

Contaldi

Vafaee

Nelson

. Bayesian network hybrid learning using an elite-guided genetic algorithm. Artificial Intelligence Review. 2019; 52(1): 245-272.

11.

Chen

Wang

Strauss

Zhao

. China Health and Retirement Longitudinal Study (CHARLS). Encyclopedia of Gerontology and Population Aging. 2021; 948-956.

12.

Gao

Zeng

Zhi

. An improved hybrid structure learning strategy for Bayesian networks based on ensemble learning. Intelligent Data Analysis. 2023; 27(4): 1103-1120.

13.

Zhou

Wang

. An ensemble learning approach for XSS attack detection with domain knowledge and threat intelligence. Computers & Security. 2019; 82: 261-269.

14.

Bouchaala

Masmoudi

Gargouri

Rebai

. Improving algorithms for structure learning in Bayesian Networks using a new implicit score. Expert Systems with Applications. 2010; 37(7): 5470-5475.

15.

Fox

Sullivan

D’Agostino

Wilson

PWF

. The significant effect of diabetes duration on coronary heart disease morality – The Framingham Heart Study. Diabetes Care. 2004; 27(3): 704-708.

16.

Kane

Pullinger

Goldfine

Malloy

. Dyslipidemia and diabetes mellitus: Role of lipoprotein species and interrelated pathways of lipid metabolism in diabetes mellitus. Current Opinion in Pharmacology. 2021; 61: 21-27.

17.

Xing

Jing

Tian

, et al. Epidemiology of dyslipidemia and associated cardiovascular risk factors in northeast China: A cross-sectional study. Nutrition Metabolism and Cardiovascular Diseases. 2020; 30(12): 2262-2270.

18.

Mokhayeri

Nazemipour

Mansournia

Naimi

Kaufman

. Does weight mediate the effect of smoking on coronary heart disease? Parametric mediational g-formula analysis. Plos One. 2022; 17(1): e0262403.

19.

Zhao

Qie

Han

, et al. Association of BMI with cardiovascular disease incidence and mortality in patients with type 2 diabetes mellitus: A systematic review and dose-response meta-analysis of cohort studies. Nutrition Metabolism and Cardiovascular Diseases. 2021; 31(7): 1976-1984.

20.

Rao

Yang

, et al. Cross-Sectional Associations between Body Mass Index and Hyperlipidemia among Adults in Northeastern China. International Journal of Environmental Research and Public Health. 2016; 13(5): 10.

21.

Wang

Dai

Zheng

, et al. The Combined Effect of Dyslipidemia on the Incidence of Type 2 Diabetes: A Prospective Cohort Study in Northwest of China. Biomedical and Environmental Sciences. 2021; 34(10): 814-818.

22.

Aspray

Mugusi

Rashid

, et al. Rural and urban differences in diabetes prevalence in Tanzania: the role of obesity, physical inactivity and urban living. Transactions of the Royal Society of Tropical Medicine and Hygiene. 2000; 94(6): 637-644.

23.

Kim

Choi

Park

. Pre-Existing Depression among Newly Diagnosed Dyslipidemia Patients and Cardiovascular Disease Risk. Diabetes & Metabolism Journal. 2020; 44(2): 307-+.

24.

Kotsis

Jordan

Micic

, et al. Obesity and cardiovascular risk: a call for action from the European Society of Hypertension Working Group of Obesity, Diabetes and the High-risk Patient and European Association for the Study of Obesity: part A: mechanisms of obesity induced hypertension, diabetes and dyslipidemia and practice guidelines for treatment. Journal of Hypertension. 2018; 36(7): 1427-1440.

An application of the Bayesian network model based on the EN-ESL-GA algorithm: Exploring the predictors of heart disease in middle-aged and elderly people in China

Abstract

BACKGROUND:

OBJECTIVE:

METHODS:

RESULTS:

CONCLUSIONS:

Keywords

1. Introduction

2. Materials and methods

2.1 Data materials

2.1.1 Data sources

2.1.2 Data processing and variable definition

Table 1 Description of node variables of Bayesian network

3.1 Structure learning

3.2 Parameter learning and reasoning

Table 2 The conditional probability distribution table of heart diseases

5. Conclusions

Data availability statement

Funding

Ethics approval

Footnotes

Acknowledgments

Conflict of interest

References

Table 1
Description of node variables of Bayesian network

Table 2
The conditional probability distribution table of heart diseases