Development and external validation of a machine learning model for cardiovascular risk prediction in individuals with chronic lung disease: Evidence from CHARLS and ELSA

Abstract

Background

Patients with chronic lung disease (CLD) are at a significantly increased risk of developing cardiovascular disease (CVD); however, specific risk assessment tools tailored for this high-risk population are currently lacking. This study aimed to develop, validate, and interpret a machine learning model specifically designed to predict the risk of concurrent CVD in patients with CLD.

Methods

Based on the China Health and Retirement Longitudinal Study (CHARLS) cohort, 2,639 patients with CLD were included. Core features were selected using univariate and multivariate logistic regression. Seven machine learning algorithms were systematically compared. After identifying the optimal model, external validation was conducted using the English Longitudinal Study of Ageing (ELSA) cohort (n = 1,303). The SHapley Additive exPlanations (SHAP) framework was employed to interpret the model’s predictive mechanisms, and an interactive web application was developed based on the optimal model.

Results

The study ultimately identified 8 core predictors: age, body mass index (BMI), depression score, hypertension, dyslipidemia, impaired instrumental activities of daily living (IADL), and medication history for lung diseases and lipid-lowering drugs. The XGBoost model performed best, achieving area under the curve (AUC) values of 0.838, 0.797, and 0.695 in the training, testing, and external validation sets, respectively, with good calibration and a positive clinical net benefit. SHAP analysis identified hypertension, depression score, and age as the primary contributing variables and indicated interaction effects among metabolic, psychological, and functional indicators.

Conclusion

The XGBoost-based model showed good discrimination for CVD risk in patients with CLD in internal validation (AUC 0.797) and moderate discrimination in external validation (AUC 0.695). Coupled with SHAP interpretability analysis and the online prediction tool, it offers digital decision support for CVD risk stratification, early identification, and personalized intervention among patients with CLD in primary care settings.

Keywords

chronic lung disease, cardiovascular disease, neural network, SHAP interpretation

1. Introduction

Chronic lung diseases (CLD), including chronic obstructive pulmonary disease (COPD), asthma, and interstitial lung disease, represent a group of common chronic conditions characterized by persistent airflow limitation or lung parenchyma damage. In recent years, CLD has emerged as a major global public health challenge due to its increasing contribution to the overall disease burden and mortality rates.^1,2 According to the Global Burden of Disease (GBD) 2021 study, COPD is the fourth leading cause of death worldwide, accounting for approximately 3.5 million deaths in 2021, which represents 5% of total global deaths. It is projected that by 2050, the global prevalence of COPD will reach 600 million cases, reflecting a 23% increase compared to 2020.^3–5

Beyond respiratory symptoms, patients with CLD frequently suffer from multisystem comorbidities, particularly a significantly elevated risk of cardiovascular disease (CVD). Studies have demonstrated that chronic hypoxia, systemic inflammation, vascular dysfunction, and medication-related adverse effects in patients with CLD synergistically promote the development of atherosclerosis, arrhythmias, and heart failure.^6,7 CVD is not only the leading cause of hospitalization and readmission among patients with CLD but also one of the primary causes of non-respiratory-related mortality.⁸ Therefore, developing CVD risk prediction models specifically tailored to the CLD population is crucial for the early identification of high-risk individuals and the improvement of long-term prognosis through timely intervention.

Current evidence suggests that various demographic, behavioral, and health-related factors are closely associated with CVD risk. Age and sex serve as fundamental predictors, while lifestyle behaviors such as smoking, alcohol consumption, and sleep disorders are intrinsically linked to cardiovascular outcomes.⁹ Furthermore, comorbidities and psychosocial factors—including renal disease, depression, functional impairment, and low life satisfaction—are significantly correlated with adverse cardiovascular events.^10–12 Given the high prevalence of these overlapping risk factors in patients with CLD, assessing CVD risk in this population is inherently more complex. Consequently, there is an urgent need for comprehensive prediction models that integrate multidimensional variables to achieve more precise risk stratification and targeted interventions.

Although various CVD risk assessment tools (such as the Framingham Risk Score and QRISK3) have been widely applied in the general population, they do not adequately account for the unique pathophysiological characteristics and comorbidity burdens of patients with CLD, limiting their applicability to this specific group. Some studies have attempted to develop COPD-specific models, such as nomograms based on multivariate logistic regression (integrating traditional risk factors like age, sex, smoking, and hypertension), which have shown some clinical utility.¹³ However, these models primarily rely on traditional statistical methods, making it difficult to capture non-linear relationships and complex interactions among variables. Moreover, they frequently lack external validation across different populations, which restricts their generalizability. Thus, more flexible and generalizable modeling strategies are required to improve the accuracy of CVD risk assessment in the CLD population.

Against this backdrop, machine learning (ML) technologies have shown great potential in predicting chronic disease risks due to their strengths in automatic feature selection, modeling complex non-linear relationships, and robust generalization. Recent work has explored novel approaches to CVD risk prediction, providing valuable insights for applying ML in high-risk populations.^14–16 In particular, ML algorithms such as XGBoost, Random Forest, and LightGBM have been reported to outperform traditional methods in predicting CVD within the general population.¹⁷ Nevertheless, their application in high-risk subgroups, such as patients with CLD, remains limited. Therefore, more targeted prediction models are needed to enhance early risk identification and support clinical decision-making.

In this study, we utilized large-scale data from the China Health and Retirement Longitudinal Study (CHARLS) to develop machine learning-based CVD risk prediction models for the CLD population, integrating multidimensional variables including demographics, lifestyle, and health status. We selected key predictors based on clinical relevance and statistical analysis, comparatively evaluated multiple algorithms, and ultimately identified Extreme Gradient Boosting (XGBoost) as the optimal prediction model. To enhance the model’s robustness and external applicability, we introduced data from the English Longitudinal Study of Ageing (ELSA) for external validation, further confirming the model’s generalizability across diverse populations. Additionally, we employed SHAP analysis to elucidate the impact of key features on the prediction outcomes and developed an interactive online assessment tool using the Shiny framework, demonstrating its practical application potential in primary care and telemedicine settings.

2. Methods

2.1. Data sources and study population

This study utilized harmonized data from Wave 3 (2015) of the China Health and Retirement Longitudinal Study (CHARLS) as the model development cohort. The CHARLS data were accessed from the official CHARLS website after user registration and approval for research use. CHARLS is a nationally representative cohort covering 28 provinces in China and targets adults aged 45 years and older.¹⁸ It systematically collects multidimensional information on demographics, lifestyles, health status, and blood biomarkers.

The inclusion criteria for the model development cohort were as follows: (1) aged 45 years or older; (2) self-reported physician-diagnosed chronic lung disease or asthma; and (3) complete information on cardiovascular disease status. Participants with missing key variables were excluded. The eligible CHARLS participants were randomly partitioned into a training set and a testing set at a ratio of 7:3 for model training and internal validation, respectively.

2.2. Definitions of CLD and CVD

CLD was defined based on self-reported physician diagnoses, which included chronic bronchitis, emphysema, cor pulmonale, or asthma. CVD, serving as the primary outcome variable, was similarly determined based on self-reported physician diagnoses, encompassing heart diseases (e.g., myocardial infarction, angina, heart failure) and stroke.¹⁹

Additionally, the “community_id” variable was utilized to calculate the provincial-level prevalence of CLD and CVD within the CHARLS cohort, and geographic distribution maps were generated using the sf and geojsonio packages in R.

2.3. Covariates

Multidimensional covariates were included to comprehensively assess the risk of concurrent CVD in patients with CLD. Specifically, these variables comprised:

Sociodemographics and lifestyle: age (continuous variable), sex, educational level, marital status, residence type (urban/rural), annual per capita household expenditure, smoking history, drinking history, and average daily sleep duration.

Physiological, functional, and psychological status: body mass index (BMI), dominant hand grip strength, systolic and diastolic blood pressure, impaired activities of daily living (ADL), impaired instrumental activities of daily living (IADL), hearing/vision impairment, chronic pain, life satisfaction, and depression score.

Comorbidities and medication history: diabetes, hypertension, dyslipidemia (core cardiometabolic comorbidities), as well as the medication history of lung diseases, antihypertensives, lipid-lowering agents, and antidiabetic drugs (binary variables).

Blood biomarkers: 11 core indicators related to cardiometabolic health and systemic inflammation, such as white blood cell count, platelet count, and glycated hemoglobin (HbA1c).

All covariate data were derived from CHARLS questionnaires, physical examinations, or laboratory assays.

2.4. Data cleaning and preprocessing

Standardized data cleaning was performed on the CHARLS dataset. Extreme values of continuous variables were handled using the Winsorization method to minimize the influence of outliers. Missing covariate values were imputed via multiple imputation using predictive mean matching with the mice package (50 iterations, generating 5 datasets that were subsequently pooled). Cross-cohort variables were uniformly recoded, and the consistency of baseline distributions was evaluated using the tableone package.
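The Winsorization step can be illustrated with a minimal Python sketch (the original analysis was run in R). The paper does not report the percentile cutoffs, so the 1st and 99th percentiles used here are an assumption, as are the function names and the synthetic input values.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 1]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("empty input")
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def winsorize(values, lower=0.01, upper=0.99):
    """Clamp values outside the [lower, upper] percentiles to those percentiles.

    The 1%/99% defaults are an assumption; the paper does not state its cutoffs.
    """
    ordered = sorted(values)
    lo_cut = percentile(ordered, lower)
    hi_cut = percentile(ordered, upper)
    return [min(max(v, lo_cut), hi_cut) for v in values]


# Synthetic values (not CHARLS data): 0..99 plus one extreme outlier.
values = list(range(100)) + [1000]
clipped = winsorize(values)
print(min(clipped), max(clipped))
```

Unlike trimming, this keeps every participant in the dataset and only pulls extreme values back to the chosen bounds, which is why it is often preferred before imputation and model fitting.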

2.5. Feature selection, model construction, and evaluation

Feature selection was performed in the training set using a logistic-regression-based feature-reduction approach. This approach was adopted to obtain a parsimonious and clinically interpretable predictor set composed of variables that were consistently available across the CHARLS and ELSA cohorts. First, univariable logistic regression analysis was conducted, and variables significantly associated with the CVD outcome (P < 0.05) were included in a multivariable logistic regression model. Variables that remained statistically significant after multivariable adjustment (P < 0.05) were retained as the final feature subset for subsequent machine learning model construction.

Based on this feature subset, seven machine learning models (GLM, SVM, GBM, NNET, RF, XGBoost, and AdaBoost) were constructed using the caret package in R to predict CVD risk in patients with CLD. All models were trained and optimized using 10-fold cross-validation, and the optimal hyperparameter configurations are detailed in Supplementary Table S1. Model performance was comprehensively evaluated using multidimensional metrics, independently assessed across the training, testing, and external validation sets. Specifically, the Area Under the Curve (AUC) and its 95% confidence interval (CI) were calculated using the pROC package to evaluate discrimination. The Brier score (defined as the mean squared difference between predicted probabilities and actual outcomes, ranging from 0 to 1, with values closer to 0 indicating better agreement and higher prediction accuracy) was calculated, and calibration curves were plotted to assess model calibration. Precision-Recall (PR) curves were generated to address potential data class imbalance issues. Finally, Decision Curve Analysis (DCA) was performed using the dcurves package, incorporating 50 bootstrap resamples to calculate the clinical net benefit across different risk thresholds, thereby evaluating the clinical utility of the models.
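As a concrete reference for the two headline metrics, the following Python sketch computes the Brier score as defined above and the AUC in its rank-based (Mann-Whitney) form. It is an illustration only: the study used the pROC package in R, and the function names and toy data below are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob):
        raise ValueError("length mismatch")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC as the probability that a random case outranks a random non-case.

    Ties count as one half. The pairwise loop is O(n_pos * n_neg), which is
    fine for a sketch but slow for large cohorts.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Synthetic toy data, not taken from CHARLS or ELSA.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(brier_score(y, p))  # approximately 0.158
print(roc_auc(y, p))      # 0.75
```

The two metrics answer different questions: the AUC depends only on how cases are ranked relative to non-cases, whereas the Brier score also penalizes probabilities that are systematically too high or too low.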

2.6. SHAP analysis

The SHapley Additive exPlanations (SHAP) method was employed to enhance model interpretability.²⁰ SHAP values were calculated using the fastshap package and visualized using the shapviz package. First, the global feature importance was demonstrated using a mean absolute SHAP value plot. Second, a SHAP beeswarm plot was utilized to present the distribution of contributions and the direction of impact for each feature at the individual level. Finally, SHAP dependence plots and waterfall plots were generated to elucidate the non-linear effects, interaction effects of core features, and the predictive composition of individual cases, achieving both global and local model interpretation.

2.7. Subgroup analysis

Subgroup analyses were conducted in the CHARLS training and testing sets, stratified by sex (male/female) and age tertiles (with cutpoints at the 33rd and 66th percentiles of age in the training set). The AUC was calculated within each subgroup to evaluate model discrimination and assess the model’s robustness across different populations, providing a reference for personalized risk stratification.
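The age stratification can be sketched as follows. The key point is that the cutpoints are derived once from the training set and then reused unchanged for other data. This is a hypothetical Python illustration; the interpolation rule, the handling of ages equal to a cutpoint, and the synthetic ages are all assumptions rather than details reported in the paper.

```python
import bisect


def tertile_cutpoints(train_ages):
    """Return the 33rd and 66th percentile ages (linear interpolation)."""
    ordered = sorted(train_ages)

    def pct(q):
        pos = q * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    return pct(0.33), pct(0.66)


def age_group(age, cuts):
    """Map an age to 'Low', 'Medium', or 'High' using fixed cutpoints.

    Ages equal to a cutpoint fall into the lower group (an assumption).
    """
    return ("Low", "Medium", "High")[bisect.bisect_left(cuts, age)]


# Synthetic training ages (45..95), not CHARLS data.
cuts = tertile_cutpoints(list(range(45, 96)))
print(cuts)
print([age_group(a, cuts) for a in (50, 70, 85)])  # ['Low', 'Medium', 'High']
```

Fixing the cutpoints on the training data avoids redefining the subgroups when the same stratification is applied to the testing set.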

2.8. External validation cohort

Data from Wave 7 (2014–2015) of ELSA were used as the external validation cohort. The ELSA data were accessed through the UK Data Service after registration and acceptance of the applicable data use agreement. This cohort, led by University College London, covers the population of England aged 50 years and older.²¹ Its design and measures are broadly comparable to those of CHARLS. ELSA received ethical approval from relevant NHS Research Ethics Committees, and all participants in the original survey provided informed consent. Inclusion criteria consistent with the model development cohort were applied, including valid CLD and CVD status information. The ELSA dataset was used solely for validating the CHARLS-trained model and was not internally partitioned. Data cleaning and preprocessing strictly followed the protocols established for the CHARLS cohort.²²

2.9. Development and deployment of the web-based tool

An interactive online prediction tool was developed using the Shiny framework in R and deployed on the shinyapps.io platform. By inputting individual feature variables, this tool can generate real-time CVD predicted risk probabilities based on the optimal machine learning model. It also integrates core feature visualization functions, providing decision support for individualized risk assessment for clinicians and public health professionals, which is particularly applicable in primary care and resource-limited settings.

2.10. Statistical analysis

Among baseline characteristics, continuous variables were expressed as weighted means ± standard deviations (SD), and categorical variables were presented as weighted frequencies and percentages. Group comparisons were performed using Student’s t-test for continuous variables and the Chi-square test or Fisher’s exact test for categorical variables. Logistic regression results were reported as odds ratios (OR) with their 95% CI. All statistical tests were two-sided, and a P-value < 0.05 was considered statistically significant. All analyses were conducted using R software (version 4.4.0).

3. Results

3.1. Characteristics of the study population and construction of model datasets

Based on the standardized 2015 CHARLS dataset, this study had an initial sample size of 25,586 individuals. After strict screening (excluding those aged under 45 years and those with missing key variables), a final total of 2,639 patients with CLD were included. Among the patients with CLD, there were 2,337 cases of chronic lung diseases and 976 cases of asthma (including 674 overlapping cases). Concurrently, 903 patients with concurrent cardiovascular disease were identified, comprising 816 cases of heart-related diseases and 141 cases of stroke, with 54 overlapping cases (see Figure 1).
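The case counts above follow the inclusion-exclusion rule: participants reporting both conditions are counted once. A short Python check (the function name is hypothetical) confirms that the reported subtotals reproduce the cohort totals.

```python
def union_size(n_a, n_b, n_both):
    """Number of people with condition A or B, counting overlaps once."""
    if n_both > min(n_a, n_b):
        raise ValueError("overlap cannot exceed either group")
    return n_a + n_b - n_both


# Counts as reported for the CHARLS development cohort.
cld_total = union_size(2337, 976, 674)  # chronic lung disease, asthma, both
cvd_total = union_size(816, 141, 54)    # heart disease, stroke, both
print(cld_total, cvd_total)  # 2639 903
```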

Figure 1.

Flowchart of participant selection from the CHARLS 2015 dataset.

Geographic distribution analysis showed that CLD prevalence was highest in central and western regions such as Xinjiang and Chongqing, whereas CVD was more common in economically developed or high-latitude regions such as Shanghai and Heilongjiang. This pattern suggests regional disparities that may reflect environmental and socioeconomic factors (Figure 2, Supplementary Table S2). In the overall CHARLS 2015 population (n = 25,586), there were 2,749 cases of CLD and 6,660 cases of CVD.

Figure 2.

Geographic distribution of disease prevalence across Chinese provinces. (a) Prevalence of chronic lung disease (CLD) among CHARLS participants by province. (b) Prevalence of cardiovascular disease (CVD) across the same regions.

After randomly dividing the enrolled patients into a training set (n = 1,849) and a testing set (n = 790) at a 7:3 ratio, analysis showed that the two groups were highly balanced in the distribution of demographic characteristics, physiological indicators, and biomarkers (all variables P > 0.05), providing a robust foundation for model development (Supplementary Table S3).
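The reported set sizes are not exactly 70% and 30% of 2,639 (which would be 1,847.3 and 791.7). They do, however, match what an outcome-stratified split yields if each outcome class contributes the ceiling of 70% of its members. The paper does not state that the split was stratified, so this is an inference, and the Python function below is an illustrative sketch rather than the authors' code.

```python
import random


def stratified_split(labels, num=7, den=10, seed=0):
    """Split indices into train/test, sampling ceil(n * num / den) per class.

    Integer arithmetic is used for the ceiling to avoid floating-point edge cases.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train = []
    for members in by_class.values():
        n_train = -(-len(members) * num // den)  # ceiling division
        train.extend(rng.sample(members, n_train))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


# Outcome labels with the class sizes from Table 1: 903 with CVD, 1,736 without.
labels = [1] * 903 + [0] * 1736
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 1849 790
```

Stratifying on the outcome keeps the CVD proportion nearly identical in both sets, which is consistent with the balanced baseline characteristics reported in Supplementary Table S3.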

Baseline comparisons showed that patients with CLD and concurrent CVD had a clustering of risk characteristics in demographics (older age, higher proportion of women), physiological function (higher BMI, weaker grip strength), and comorbidity burden (hypertension, diabetes, functional impairment, etc.) (Table 1). In addition, the CVD group exhibited higher levels of HbA1c. These distinguishing variables served as candidate features for the subsequent prediction models.

Table 1.

Baseline characteristics of chronic lung disease patients according to concurrent cardiovascular disease.

Characteristic	Overall (n=2639)	No CVD (n=1736)	CVD (n=903)	P-value
Age_continuous, mean (SD)	64.3 (9.8)	63.5 (9.8)	65.8 (9.6)	<0.001
Sleeping_time, mean (SD)	6.0 (2.1)	6.2 (2.0)	5.7 (2.1)	<0.001
Annual_Household_Expenditure, mean (SD)	13226.5 (15513.4)	13024.1 (15673.6)	13634.3 (15190.8)	0.442
BMI, mean (SD)	23.4 (4.0)	23.0 (3.8)	24.1 (4.4)	<0.001
Hand_grip_strength, mean (SD)	27.9 (10.7)	28.9 (10.6)	26.0 (10.6)	<0.001
Systolic_blood_pressure (mmHg), mean (SD)	128.0 (20.7)	127.1 (20.2)	129.8 (21.5)	0.005
Diastolic_blood_pressure (mmHg), mean (SD)	74.4 (12.0)	74.4 (12.0)	74.6 (12.2)	0.669
Depression, mean (SD)	9.8 (6.9)	9.0 (6.5)	11.3 (7.4)	<0.001
White_blood_cell, mean (SD)	6.1 (1.8)	6.1 (1.8)	6.2 (1.8)	0.621
Platelets, mean (SD)	197.9 (69.5)	196.9 (70.5)	199.6 (67.5)	0.437
HbA1c, mean (SD)	6.0 (0.8)	5.9 (0.7)	6.1 (0.8)	0.001
Haemoglobin, mean (SD)	13.8 (1.8)	13.8 (1.8)	13.8 (1.9)	0.656
Glucose, mean (SD)	101.2 (28.4)	101.0 (28.0)	101.7 (29.0)	0.643
TC, mean (SD)	181.9 (34.1)	181.7 (33.3)	182.2 (35.4)	0.758
TG, mean (SD)	136.8 (85.6)	135.9 (84.7)	138.4 (87.2)	0.564
HDL_C, mean (SD)	51.8 (11.8)	52.0 (11.6)	51.4 (12.1)	0.248
LDL_C, mean (SD)	100.7 (27.1)	100.8 (27.1)	100.5 (27.2)	0.824
CRP, mean (SD)	3.3 (5.4)	3.3 (5.4)	3.3 (5.6)	0.861
BUN, mean (SD)	15.7 (4.6)	15.6 (4.5)	15.8 (4.7)	0.517
Sex, n (%)				<0.001
Female	1183 (44.8)	704 (40.6)	479 (53.0)
Male	1456 (55.2)	1032 (59.4)	424 (47.0)
Education_level, n (%)				0.342
Illiterate	1342 (50.9)	870 (50.1)	472 (52.3)
Elementary	596 (22.6)	399 (23.0)	197 (21.8)
Middle	461 (17.5)	316 (18.2)	145 (16.1)
High school or above	240 (9.1)	151 (8.7)	89 (9.9)
Marital_status, n (%)				0.039
No	637 (24.1)	397 (22.9)	240 (26.6)
Yes	2002 (75.9)	1339 (77.1)	663 (73.4)
Residence, n (%)				0.006
Rural	1734 (65.7)	1173 (67.6)	561 (62.1)
Urban	905 (34.3)	563 (32.4)	342 (37.9)
Smoking_Status, n (%)				0.023
No	1170 (44.4)	742 (42.7)	428 (47.5)
Yes	1468 (55.6)	994 (57.3)	474 (52.5)
Drinking_Status, n (%)				0.001
No	1321 (50.1)	831 (47.9)	490 (54.5)
Yes	1314 (49.9)	905 (52.1)	409 (45.5)
ADL_limited, n (%)				<0.001
No	1805 (68.5)	1261 (72.8)	544 (60.3)
Yes	830 (31.5)	472 (27.2)	358 (39.7)
IADL_limited, n (%)				<0.001
No	1708 (64.7)	1215 (70.0)	493 (54.7)
Yes	930 (35.3)	521 (30.0)	409 (45.3)
Hearing_impairment, n (%)				<0.001
No	2015 (86.5)	1352 (88.5)	663 (82.6)
Yes	315 (13.5)	175 (11.5)	140 (17.4)
Vision_impairment, n (%)				0.083
No	2125 (91.2)	1405 (92.0)	720 (89.8)
Yes	204 (8.8)	122 (8.0)	82 (10.2)
Pain, n (%)				<0.001
No	1405 (56.7)	1007 (61.2)	398 (47.8)
Yes	1072 (43.3)	638 (38.8)	434 (52.2)
Sat_level, n (%)				0.072
Fair	1240 (50.9)	824 (50.8)	416 (51.2)
Good	913 (37.5)	626 (38.6)	287 (35.3)
Poor	281 (11.5)	172 (10.6)	109 (13.4)
Diabetes, n (%)				<0.001
No	2296 (88.7)	1572 (91.9)	724 (82.5)
Yes	292 (11.3)	138 (8.1)	154 (17.5)
Hypertension, n (%)				<0.001
No	1572 (60.7)	1171 (68.9)	401 (45.1)
Yes	1018 (39.3)	529 (31.1)	489 (54.9)
Dyslipidaemia, n (%)				<0.001
No	1970 (78.1)	1410 (84.8)	560 (65.0)
Yes	554 (21.9)	252 (15.2)	302 (35.0)
Med_Lung, n (%)				<0.001
No	982 (48.7)	685 (52.4)	297 (41.9)
Yes	1035 (51.3)	623 (47.6)	412 (58.1)
Med_BP, n (%)				<0.001
No	1835 (69.5)	1348 (77.6)	487 (53.9)
Yes	804 (30.5)	388 (22.4)	416 (46.1)
Med_Lipid, n (%)				<0.001
No	2392 (90.6)	1648 (94.9)	744 (82.4)
Yes	247 (9.4)	88 (5.1)	159 (17.6)
Med_Diabetes, n (%)				<0.001
No	2455 (93.0)	1657 (95.4)	798 (88.4)
Yes	184 (7.0)	79 (4.6)	105 (11.6)

3.2. Feature selection results

In the training set, an initial screening of all candidate covariates was first performed using univariate logistic regression; detailed results of the univariate analysis are provided in Supplementary Table S4. Subsequently, variables demonstrating statistical significance (P < 0.05) in the univariate analysis were incorporated into a multivariate logistic regression model for further evaluation. Ultimately, a total of 8 features were identified as key predictors independently associated with the risk of concurrent CVD in patients with CLD (Table 2).

Table 2.

Multivariate logistic regression analysis results.

Variable	Multivariable OR (95% CI)	P-value
Age_continuous	1.02 (1.01–1.03)	0.004
BMI	1.05 (1.02–1.08)	0.002
Depression	1.02 (1.01–1.04)	0.032
IADL_limited (Yes)	1.43 (1.11–1.84)	0.005
Hypertension (Yes)	1.50 (1.16–2.01)	0.022
Dyslipidaemia (Yes)	1.95 (1.64–2.44)	<0.001
Med_Lung (Yes)	1.35 (1.09–1.68)	0.006
Med_Lipid (Yes)	1.56 (1.22–2.08)	0.042

These core features encompass multiple clinical dimensions, including: age (Age_continuous) and BMI as continuous variables; IADL_limited as an indicator of functional status; prevalent hypertension (Hypertension), dyslipidemia (Dyslipidaemia), medication history for lung disease (Med_Lung), and history of lipid-lowering medication (Med_Lipid) within the domain of clinical comorbidities and treatments; and depression score (Depression) representing mental health.
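For the continuous predictors, the odds ratios in Table 2 are expressed per one-unit increase, so their clinical magnitude is easier to judge over a larger increment. Assuming the effect is linear on the log-odds scale (as in the logistic model) and taking the rounded odds ratios at face value, the odds ratio for a k-unit increase is the per-unit value raised to the power k. The function name below is hypothetical.

```python
def scaled_odds_ratio(or_per_unit, units):
    """Odds ratio for a `units`-sized increase, assuming a log-linear effect."""
    return or_per_unit ** units


# Per-unit odds ratios taken from Table 2 (already rounded to two decimals).
print(round(scaled_odds_ratio(1.02, 10), 3))  # age, 10 years older: 1.219
print(round(scaled_odds_ratio(1.05, 5), 3))   # BMI, 5 units higher: 1.276
```

Because the published per-unit values are rounded, these scaled figures are approximate and should be read as orders of magnitude rather than exact estimates.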

The selected features delineate the cardiovascular risk profile of patients with CLD from diverse perspectives, including biological aging, physical functioning, metabolic burden, mental health, and clinical interventions. Such a combination of multidimensional features not only enhances the biological plausibility of the model but also establishes a robust data foundation for the subsequent construction and optimization of various machine learning algorithms.

3.3. Model performance

Based on the 8 selected core features, this study constructed and compared 7 machine learning models to evaluate their performance in predicting the risk of concurrent CVD in patients with CLD.

In terms of model discrimination, the XGBoost model achieved the highest testing set AUC among the seven algorithms. In the training and testing sets, its AUC reached 0.838 and 0.797, respectively (Figure 3(a) and (b)), with a testing set sensitivity of 0.704 and specificity of 0.825. PR curve analysis was consistent with this ranking, with PR-AUC values of 0.837 and 0.789 in the training and testing sets, respectively (Figure 4(a) and (b)).

Figure 3.

ROC curves of the seven machine learning models in the training set (a) and testing set (b).

Figure 4.

PR curves of the seven machine learning models in the training set (a) and testing set (b).

By comparison, other models showed acceptable AUC performance in the testing set (e.g., GBM at 0.723, GLM at 0.718). However, while the Random Forest model performed perfectly in the training set (AUC 1.000), it dropped significantly to 0.675 in the testing set, indicating severe overfitting. AdaBoost exhibited the lowest testing set AUC at 0.664 (Table 3).

Table 3.

Comparison of performance metrics across machine learning models.

Model	AUC_Train (95% CI)	AUC_Test (95% CI)	Sensitivity	Specificity
GLM	0.721 (0.702–0.741)	0.718 (0.688–0.748)	0.692	0.631
SVM	0.722 (0.703–0.742)	0.717 (0.687–0.747)	0.657	0.675
GBM	0.817 (0.801–0.833)	0.723 (0.693–0.752)	0.697	0.784
NeuralNetwork	0.740 (0.721–0.759)	0.701 (0.670–0.732)	0.587	0.770
RandomForest	1.000 (1.000–1.000)	0.675 (0.643–0.707)	1.000	1.000
XGBoost	0.838 (0.823–0.853)	0.797 (0.767–0.821)	0.704	0.825
AdaBoost	0.685 (0.665–0.705)	0.664 (0.633–0.695)	0.806	0.483
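One simple way to quantify the overfitting described above is the gap between training and testing AUC for each model. The following Python snippet computes that gap from the point estimates in Table 3; the choice of this gap as a summary measure is an illustration and is not an analysis reported by the authors.

```python
# (training AUC, testing AUC) point estimates copied from Table 3.
auc = {
    "GLM": (0.721, 0.718),
    "SVM": (0.722, 0.717),
    "GBM": (0.817, 0.723),
    "NeuralNetwork": (0.740, 0.701),
    "RandomForest": (1.000, 0.675),
    "XGBoost": (0.838, 0.797),
    "AdaBoost": (0.685, 0.664),
}

# A larger gap means a bigger loss of discrimination on unseen data.
gaps = {name: round(train - test, 3) for name, (train, test) in auc.items()}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:14s} {gap:.3f}")
```

Random Forest shows by far the largest gap (0.325), followed by GBM (0.094), while XGBoost loses 0.041 and the linear GLM and SVM models lose less than 0.01.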

Regarding model calibration, the calibration curves indicated that XGBoost had an extremely high degree of overlap with the ideal curve in both datasets (Figure 5(a) and (b)).

Figure 5.

Calibration curves of the seven machine learning models in the training set (a) and testing set (b).

Brier scores were consistent with the calibration curves; XGBoost achieved the lowest score in the testing set (0.1103), indicating the closest agreement between predicted probabilities and observed outcomes among all models. GBM (0.1117) and GLM (0.1138) followed closely, showing good calibration capacity. Random Forest (0.1341) and, in particular, AdaBoost (0.188) had higher scores, indicating poorer agreement than XGBoost (Table 4).

Table 4.

Comparison of Brier scores across machine learning models.

Model	Train	Test
GLM	0.1135	0.1138
SVM	0.1215	0.1221
GBM	0.0787	0.1117
NeuralNetwork	0.1067	0.1201
RandomForest	0.0061	0.1341
XGBoost	0.0709	0.1103
AdaBoost	0.1694	0.1880

In terms of clinical decision value evaluation, this study performed DCA on the training set (Figure 6(a)) and the testing set (Figure 6(b)).

Figure 6.

DCA curves of the seven machine learning models in the training set (a) and testing set (b).

The results showed that the standardized net benefit of the XGBoost model was higher than that of both the “treat-all” and “treat-none” strategies across most threshold probabilities. Specifically, in both the training and testing sets, when the threshold probability was between 0.2 and 0.8, the XGBoost model maintained a consistently high and stable net benefit curve. This indicates that the model can provide reliable decision support across a wide range of risk thresholds. In contrast, models such as GLM, SVM, and AdaBoost yielded lower net benefits within the same threshold range, suggesting relatively limited clinical applicability.
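For readers less familiar with DCA, the quantity plotted is the net benefit at a threshold probability p_t: the proportion of true positives minus the proportion of false positives weighted by the odds p_t / (1 - p_t). The standardized version divides this by the outcome prevalence. The Python sketch below uses hypothetical function names and synthetic data; the study itself used the dcurves package in R.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the 'treat-all' reference strategy."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Synthetic toy data, not taken from CHARLS or ELSA.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
nb_model = net_benefit(y, p, threshold=0.3)
nb_all = net_benefit_treat_all(y, threshold=0.3)
print(round(nb_model, 4), round(nb_all, 4))  # 0.3929 0.2857
```

The "treat-none" strategy has a net benefit of zero by definition, so a model is useful at a given threshold only if its curve lies above both zero and the "treat-all" curve.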

In summary, the XGBoost model demonstrated the best balance and generalization ability across discrimination, calibration, and clinical net benefit, exhibiting excellent clinical utility. Therefore, this study ultimately identified XGBoost as the core model for predicting CVD risk in patients with CLD.

3.4. Subgroup analysis

To further evaluate the applicability and robustness of the XGBoost model across different populations, stratified analyses were conducted in the training and testing sets according to sex and age tertiles (Figure 7).

Figure 7.

AUC results of the XGBoost model stratified by sex (a–b) and age tertiles (c–d) in the training and testing sets.

In the sex subgroups (Figure 7(a) and (b)), the model demonstrated excellent stability. In the training set, the AUCs for males and females were 0.812 and 0.835, respectively; in the testing set, the AUCs were 0.792 and 0.775, respectively. The minor difference in discrimination between the two groups suggests that the model possesses good predictive fairness across different sexes.

In the age subgroups (Figure 7(c) and (d)), the model similarly maintained robust predictive capability. In the training set, the AUCs for the Low, Medium, and High groups, defined by age tertiles, were 0.801, 0.825, and 0.811, respectively. In the testing set, the corresponding AUCs for each group were 0.761, 0.785, and 0.790, respectively. The stratification results indicate that the model performs highly consistently across populations with different age distributions, without apparent predictive bias.

In summary, the XGBoost model exhibited stable and reliable predictive performance across all subgroups, further validating its broad applicability within the chronic lung disease population.

3.5. External validation

This study performed external validation of the XGBoost model using data from Wave 7 of ELSA (n = 19,802). Following the same inclusion criteria as the CHARLS cohort, 1,303 patients with CLD aged 45 years and older were included, with 1,023 cases of asthma, 455 cases of lung disease, and 175 overlapping cases. For CVD, 80 patients had stroke, 308 had heart disease, and 32 had both (Figure 1; baseline characteristics are shown in Supplementary Table S5).

The validation results showed that the model retained moderate discrimination in a heterogeneous population, achieving an AUC of 0.695 and a PR-AUC of 0.704, a decline from the testing set AUC of 0.797. The calibration curve showed good concordance between the predicted risk and observed outcomes. Furthermore, DCA indicated that the model still yielded a clinical net benefit in the validation set (Supplementary Figure S1).

Taken together, the external validation results indicate that the XGBoost model retains useful, although reduced, discrimination across geographic and population settings, supporting its potential for cardiovascular risk assessment in patients with CLD from diverse backgrounds.

3.6. SHAP-based model interpretation

To enhance the transparency of the XGBoost model in clinical applications, this study introduced the SHAP framework to deeply analyze the model’s predictive mechanisms from three dimensions: global contribution, feature interaction, and individual decision-making.

The SHAP beeswarm plot in Figure 8(a) shows that all 8 core variables are positively associated with CVD risk; that is, higher feature values or the presence of the corresponding condition correspond to a greater predicted risk. Ranked by mean absolute SHAP values (Figure 8(b)), hypertension, depression score, and age are the three largest contributors, followed by BMI, dyslipidemia, lung disease medication history, impaired IADL, and lipid-lowering medication history. This indicates that, for patients with chronic lung disease, blood pressure status and psychological status carry the most weight in the model’s estimate of cardiovascular risk.

Figure 8.

Model interpretation using SHAP framework, including feature importance (a–b), dependence plots (c–e), and individual prediction waterfall plot (f).

Meanwhile, SHAP dependence plots further reveal the synergistic mechanisms among variables. A strong interaction was found between BMI and dyslipidemia, reflecting the cumulative damage of metabolic burden on the cardiovascular system. The depression score and IADL also exhibited potential interaction effects, suggesting that the dual decline of physical and psychological functions accelerates risk progression. Furthermore, hypertension and dyslipidemia demonstrated a significant synergistic risk amplification effect (Figure 8(c)–(e)), meaning that when both are present, an individual’s predicted risk is significantly higher than the linear superposition of either single risk factor.
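The notion of a joint effect exceeding the linear superposition of single factors can be made concrete with a two-factor contrast. The sketch below uses an invented additive score with an interaction term; the coefficients are hypothetical and the contrast is a simplified stand-in for SHAP interaction values, which are obtained by averaging such contrasts over feature coalitions.

```python
def toy_score(hypertension: int, dyslipidemia: int) -> float:
    """Hypothetical risk score with a positive interaction term (invented coefficients)."""
    return (
        0.10
        + 0.15 * hypertension
        + 0.08 * dyslipidemia
        + 0.12 * hypertension * dyslipidemia
    )


def interaction_contrast(f) -> float:
    """Joint effect minus the sum of the two single-factor effects.

    Equals f(1,1) - f(1,0) - f(0,1) + f(0,0); zero for a purely additive model.
    """
    return f(1, 1) - f(1, 0) - f(0, 1) + f(0, 0)


excess = interaction_contrast(toy_score)
additive_only = interaction_contrast(lambda h, d: 0.10 + 0.15 * h + 0.08 * d)
print(round(excess, 3), round(additive_only, 3))
```

A positive contrast corresponds to synergistic amplification; a value near zero indicates that the two factors combine additively on the chosen scale.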

At the individual level, a representative SHAP waterfall plot in Figure 8(f) intuitively illustrates how features push an individual’s predicted value from the baseline (expected) value to the final outcome. In this sample, although lung disease medication history (Med_Lung = 1) and depression score (Depression = 16) contributed risk increments of +0.0865 and +0.122, respectively, the dominant role of protective factors—specifically the absence of hypertension (Hypertension = 0, contributing -0.219), a normal BMI (23.1, contributing -0.158), and no IADL limitation (contributing -0.156)—resulted in a final predicted probability f(x) close to 0, which is far below the population expected value of 0.494. This suggests that the individual is at an extremely low risk level.
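The waterfall plot relies on the additivity of SHAP values: the prediction equals the baseline plus the sum of all feature contributions. The sketch below applies this to the five contributions quoted above, assuming they are expressed on the same scale as the baseline of 0.494. The contributions of the remaining three features (age, dyslipidemia, and lipid-lowering medication history) are not quoted in the text, so only a partial sum can be reproduced; the dictionary keys are descriptive labels chosen here.

```python
# Baseline (expected) value reported for Figure 8(f).
base_value = 0.494

# Contributions quoted in the text; the other three features are not reported.
reported_contributions = {
    "Med_Lung = 1": +0.0865,
    "Depression = 16": +0.122,
    "Hypertension = 0": -0.219,
    "BMI = 23.1": -0.158,
    "no IADL limitation": -0.156,
}

# Additivity: f(x) = base_value + sum of all feature contributions.
partial_prediction = base_value + sum(reported_contributions.values())
print(round(partial_prediction, 4))
```

The gap between this partial sum and the final f(x) shown in the plot is attributable to the three features whose contributions are not listed.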

In conclusion, SHAP analysis not only validates the biological plausibility of the XGBoost model but also uncovers the interaction patterns among complex risk factors, providing clinicians with precise and actionable insights for identifying high-risk patients with CLD and tailoring individualized intervention strategies.

3.7. Preliminary model application

This study developed an interactive web-based prediction platform (https://zzaakk.shinyapps.io/make_web/) based on the Shiny framework, integrating the core XGBoost model to achieve clinical translation. Users simply need to input the 8 core features—age, BMI, depression score, hypertension, dyslipidemia, IADL impairment, and medication history for lung disease and lipid-lowering drugs—to obtain personalized, real-time CVD predicted risk probabilities.

The platform is compatible with multi-device access and employs a local computation mode to ensure data security and personal privacy. As a convenient and efficient digital tool, this platform provides robust support for risk stratification of patients with chronic lung disease in primary care screening and remote health management.
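To illustrate the kind of input record such a tool consumes, the following sketch defines the eight predictors and a basic plausibility check. It is a hypothetical illustration and does not reproduce the deployed application, whose code is not shown in the article: the class name, field names, and validation rules are assumptions, apart from the age threshold of 45 years, which follows the study's inclusion criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RiskInputs:
    """The 8 core predictors; field names are hypothetical."""
    age: float
    bmi: float
    depression: float      # depression score
    hypertension: int      # 1 = present, 0 = absent
    dyslipidemia: int
    iadl_impaired: int
    med_lung: int          # medication history for lung disease
    med_lipid: int         # lipid-lowering medication history

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record is usable."""
        errors = []
        if self.age < 45:
            errors.append("age must be 45 or older (study inclusion criterion)")
        if self.bmi <= 0:
            errors.append("bmi must be positive")
        if self.depression < 0:
            errors.append("depression score cannot be negative")
        for name in ("hypertension", "dyslipidemia", "iadl_impaired",
                     "med_lung", "med_lipid"):
            if getattr(self, name) not in (0, 1):
                errors.append(f"{name} must be 0 or 1")
        return errors


# Hypothetical patient record.
ok_record = RiskInputs(62, 23.1, 16, 0, 0, 0, 1, 0)
bad_record = RiskInputs(40, 23.1, 16, 2, 0, 0, 1, 0)
print(ok_record.validate(), bad_record.validate())
```

Checking inputs before prediction keeps the model from being applied outside the population in which it was developed and validated.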

4. Discussion

Based on large-scale population cohort data from CHARLS, this study developed and validated a risk prediction model specifically tailored for predicting concurrent CVD in patients with CLD. By integrating multidimensional clinical indicators—including demographic characteristics, clinical comorbidities, physical function, and mental health—this study identified 8 independent predictors using logistic regression and systematically compared the predictive performance of 7 machine learning algorithms. The results demonstrated that the XGBoost model exhibited the optimal comprehensive performance in terms of discrimination, calibration, and clinical decision benefit. Furthermore, its robustness and cross-population applicability were confirmed through external validation in the UK ELSA cohort. Additionally, SHAP analysis revealed that hypertension, depression score, and age were the most important contributing variables for predicting CVD risk, and it confirmed the presence of significant synergistic amplification effects among certain risk features (e.g., BMI and dyslipidemia, depression and impaired activities of daily living). Finally, based on the core XGBoost model, this study developed and deployed an online risk prediction tool, providing direct clinical decision support for primary care screening and remote health management.

The SHAP analysis in this study indicated that hypertension and age are the core clinical features for predicting CVD in patients with CLD. Chronic lung diseases (such as COPD and asthma) are frequently accompanied by systemic inflammation, oxidative stress, and chronic hypoxia. These pathological changes can trigger vascular endothelial dysfunction and increase arterial stiffness.^23–25 When patients with CLD present with comorbid hypertension, long-term hemodynamic abnormalities and systemic inflammation exert a superimposed effect, significantly accelerating the progression of atherosclerosis and ischemic heart disease.^26 Moreover, advancing age is intrinsically associated with the physiological degeneration of the cardiovascular system, which further amplifies the dual damage inflicted on target organs by hypertension and hypoxia.

Notably, the mental health dimension (depression score) and physical function status (IADL limitation) exhibited highly predictive contributions in the model. Due to chronic dyspnea and decreased exercise tolerance, the incidence of adverse psychological states such as depression is significantly higher in patients with CLD than in the general population. Current psychoneuroimmunological evidence suggests that depression can directly increase susceptibility to cardiovascular events by activating the hypothalamic-pituitary-adrenal (HPA) axis, causing sympathetic nervous system overactivation, and exacerbating systemic inflammatory responses.^27–29

Furthermore, SHAP interaction analysis confirmed a significant synergistic risk effect between depression and IADL limitation. The decline in physical function restricts patients’ daily activities, leading to decreased metabolic levels, and this reduction in physiological function further exacerbates depressive symptoms. These two factors intertwine to form a vicious cycle, jointly accelerating the occurrence of cardiovascular complications.

In the metabolic and clinical treatment dimensions, strong interactions were observed between BMI and dyslipidemia, as well as between hypertension and dyslipidemia, reflecting the cumulative damage of lipid metabolic disorders and metabolic syndrome on the cardiovascular system.^30,31 Meanwhile, medication history for lung disease (Med_Lung) and for lipid-lowering drugs (Med_Lipid) were identified as independent predictors. In epidemiological surveys, specific medication histories generally provide an objective reflection of the severity of the patient’s underlying disease or the frequency of acute exacerbations. More severe pulmonary lesions often correspond to a more pronounced increase in right ventricular overload and systemic inflammation, thereby resulting in a higher incidence of cardiovascular events. Integrating these multidimensional features into the XGBoost model effectively captured the complex pathophysiological states of patients with CLD.

Compared with existing clinical risk assessment tools, this study possesses several notable advantages. First, this study specifically modeled the high-risk subgroup of CLD, overcoming the inadequate applicability of traditional general-population models in this specific cohort. Second, the employed XGBoost algorithm effectively handles non-linear relationships and high-order interactions among multidimensional variables, yielding predictive performance significantly superior to traditional generalized linear models. Third, the study introduced the highly heterogeneous UK ELSA cohort for external validation, confirming the model’s excellent cross-geographic and cross-population generalizability. Fourth, the SHAP-based interpretation follows recent work highlighting the role of explainable AI in healthcare, which provides context for interpreting machine learning models.^32,33 Finally, the Shiny-based online tool further enhances accessibility, allowing users to obtain personalized risk assessments via a web browser without requiring specialized technical expertise.

However, this study has several limitations. First, disease diagnoses in the CHARLS and ELSA cohorts were based on self-reporting, which may introduce recall bias. Second, objective clinical measures, such as pulmonary function tests and echocardiography, were unavailable due to the limitations of public datasets. Detailed information on medication type and dosage was also lacking. Third, because of the cross-sectional design, the model assessed prevalent CVD rather than future CVD risk, and causality cannot be inferred. Fourth, the logistic-regression-based feature-reduction strategy may have excluded predictors with potential value in nonlinear or interaction-based models. Future studies should compare different feature-selection strategies in larger prospective datasets and incorporate multidimensional objective data to improve clinical applicability.

5. Conclusion

Based on nationally representative population data, this study successfully developed and validated a risk prediction model specifically tailored for predicting concurrent CVD in patients with CLD. Through logistic regression feature selection and the XGBoost machine learning algorithm, this study accurately identified 8 core predictors. The model demonstrated excellent predictive performance and robust generalizability across the training set, testing set, and the external validation cohort (ELSA). Furthermore, SHAP analysis enhanced the interpretability of the model, highlighting the critical roles of hypertension, depressive status, age, and lipid metabolic disorders in the development of CVD risk among the CLD population, while also revealing the synergistic interaction mechanisms among these complex risk factors. Finally, an interactive online risk prediction tool was developed to facilitate model visualization and clinical translation. This platform holds significant promise for supporting the early identification and personalized management of cardiovascular risk in CLD patients within primary care settings, thereby providing more precise risk stratification and targeted interventions for patients with chronic diseases.

Supplemental material

Supplemental material - Development and external validation of a machine learning model for cardiovascular risk prediction in individuals with chronic lung disease: Evidence from CHARLS and ELSA

Supplemental material for Development and external validation of a machine learning model for cardiovascular risk prediction in individuals with chronic lung disease: Evidence from CHARLS and ELSA by Ankang Zhu, Shuai Wei, Haobo Wang, Shaodong Liu, Yang Li, Xiaojie Pan, Xingcai Gao, and Xing Lin in Digital Health.

Footnotes

ORCID iDs

Ankang Zhu

Shuai Wei

Xing Lin

Ethical considerations

This study was a secondary analysis of publicly available, de-identified CHARLS and ELSA data. The original CHARLS study, including participant recruitment, informed consent, and data collection, was approved by the Institutional Review Board at Peking University (IRB00001052-11015), and ELSA received ethical approval from the relevant NHS Research Ethics Committees.

Consent to participate

All participants in the original surveys provided written informed consent. CHARLS data were obtained from the official CHARLS website after registration and approval for research use, and ELSA data were accessed through the UK Data Service under the applicable data use agreement. Therefore, no additional ethical approval or informed consent was required for the present secondary analysis.

Author contributions

Ankang Zhu, Shuai Wei, and Haobo Wang contributed equally to this work and should be considered co-first authors. Ankang Zhu was responsible for conceptualization, methodology, and supervision; Shuai Wei handled data analysis and writing – original draft; Haobo Wang performed data curation, software development, visualization, and writing – review & editing; Shaodong Liu conducted literature review and result interpretation; Yang Li performed investigation, data preprocessing, and statistical analysis; Xiaojie Pan, Xingcai Gao, and Xing Lin provided supervision and served as corresponding authors. All authors read and approved the final manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Declaration of generative AI and AI-assisted technologies in the writing process

ChatGPT (OpenAI) was used solely for language polishing. All content was reviewed and approved by the authors.

Supplemental material

Supplemental material for this article is available online.

Appendix

References

1. Seeger, Adir, Barberà, et al. Pulmonary hypertension in chronic lung diseases. J Am Coll Cardiol 2013; 62(25 Suppl): D109–D116. https://doi.org/10.1016/j.jacc.2013.10.036

2. Pauwels, Buist, Calverley, et al. Global strategy for the diagnosis, management, and prevention of chronic obstructive pulmonary disease. NHLBI/WHO Global Initiative for Chronic Obstructive Lung Disease (GOLD) Workshop summary. Am J Respir Crit Care Med 2001; 163(5): 1256–1276. https://doi.org/10.1164/ajrccm.163.5.2101039

3. Safiri, Carson-Chahhoud, Noori, et al. Burden of chronic obstructive pulmonary disease and its attributable risk factors in 204 countries and territories, 1990-2019: results from the Global Burden of Disease Study 2019. BMJ 2022; 378: e069679. https://doi.org/10.1136/bmj-2021-069679

4. Zhang, et al. Disease burden of COPD in the Chinese population: a systematic review. Ther Adv Respir Dis 2023; 17: 17534666231218899. https://doi.org/10.1177/17534666231218899

5. Naeem, Wang, Mubarak, et al. Mapping the Global distribution, risk factors, and temporal trends of COPD incidence and mortality (1990-2021): ecological analysis. BMC Med 2025; 23(1): 210. https://doi.org/10.1186/s12916-025-04014-0

6. Brassington, Selemidis, Bozinovski, et al. Chronic obstructive pulmonary disease and atherosclerosis: common mechanisms and novel therapeutics. Clin Sci (Lond) 2022; 136(6): 405–423. https://doi.org/10.1042/CS20210835

7. Papaporfyriou, Bartziokas, Gompelmann, et al. Cardiovascular Diseases in COPD: From Diagnosis and Prevalence to Therapy. Life (Basel) 2023; 13(6): 1299. https://doi.org/10.3390/life13061299

8. Ozgen Alpaydin, Ozuygur, Sahan, et al. 30-day Readmission After an Acute Exacerbation of Chronic Obstructive Pulmonary Disease is Associated with Cardiovascular Comorbidity. Turk Thorac J 2021; 22(5): 369–375. https://doi.org/10.5152/TurkThoracJ.2021.0189

9. Koene, Prizment, Blaes, et al. Shared Risk Factors in Cardiovascular Disease and Cancer. Circulation 2016; 133(11): 1104–1114. https://doi.org/10.1161/CIRCULATIONAHA.115.020406

10. Kittiskulnam, Sheshadri, Johansen. Consequences of CKD on Functioning. Semin Nephrol 2016; 36(4): 305–318. https://doi.org/10.1016/j.semnephrol.2016.05.007

11. Santosa, Rosengren, Ramasundarahettige, et al. Psychosocial Risk Factors and Cardiovascular Disease and Death in a Population-Based Cohort From 21 Low-Middle-and High-Income Countries. JAMA Netw Open 2021; 4(12): e2138920. https://doi.org/10.1001/jamanetworkopen.2021.38920

12. Huang, Wang, Fang, et al. Longitudinal association of chronic diseases with depressive symptoms in middle-aged and older adults in China: Mediation by functional limitations, social interaction, and life satisfaction. J Glob Health 2023; 13: 04119. https://doi.org/10.7189/jogh.13.04119

13. Zhu. A Nomogram for Predicting Cardiovascular Diseases in Chronic Obstructive Pulmonary Disease Patients. J Healthc Eng 2022; 2022: 6394290. https://doi.org/10.1155/2022/6394290

14. Lee, et al. Using Machine Learning to Identify Metabolomic Signatures of Pediatric Chronic Kidney Disease Etiology. J Am Soc Nephrol 2022; 33(2): 375–386. https://doi.org/10.1681/ASN.2021040538

15. Fitriyani, Syafrudin, Chamidah, et al. A Novel Approach Utilizing Bagging, Histogram Gradient Boosting, and Advanced Feature Selection for Predicting the Onset of Cardiovascular Diseases. Mathematics 2025; 13(13): 2194. https://doi.org/10.3390/math13132194

16. Zaidi SAJ, Ghafoor, Kim, et al. HeartEnsembleNet: An Innovative Hybrid Ensemble Learning Approach for Cardiovascular Risk Prediction. Healthcare (Basel) 2025; 13(5): 507. https://doi.org/10.3390/healthcare13050507

17. Shah, Shukla, Dholakia, et al. Predicting cardiovascular risk with hybrid ensemble learning and explainable AI. Sci Rep 2025; 15(1): 17927. https://doi.org/10.1038/s41598-025-01650-7

18. Zhao, Smith, et al. Cohort profile: the China Health and Retirement Longitudinal Study (CHARLS). Int J Epidemiol 2014; 43(1): 61–68. https://doi.org/10.1093/ije/dys203

19. Wang, et al. Changes in frailty and incident cardiovascular disease in three prospective cohorts. Eur Heart J 2024; 45(12): 1058–1068. https://doi.org/10.1093/eurheartj/ehad885

20. Wang, Fang, et al. Machine learning and SHAP value interpretation for predicting comorbidity of cardiovascular disease and cancer with dietary antioxidants. Redox Biol 2025; 79: 103470. https://doi.org/10.1016/j.redox.2024.103470

21. Steptoe, Breeze, Banks, et al. Cohort profile: the English longitudinal study of ageing. Int J Epidemiol 2013; 42(6): 1640–1648. https://doi.org/10.1093/ije/dys168

22. Rafnsson, Orrell, d’Orsi, et al. Loneliness, Social Integration, and Incident Dementia Over 6 Years: Prospective Findings From the English Longitudinal Study of Ageing. J Gerontol B Psychol Sci Soc Sci 2020; 75(1): 114–124.

23. Jiang, Fan, Wang, et al. Effects of hypoxia in cardiac metabolic remodeling and heart failure. Exp Cell Res 2023; 432(1): 113763. https://doi.org/10.1016/j.yexcr.2023.113763

24. Goswami, Ranjan, Dutta, et al. Management of inflammation in cardiovascular diseases. Pharmacol Res 2021; 173: 105912. https://doi.org/10.1016/j.phrs.2021.105912

25. Malic, Topic, Francuski, et al. Oxidative Stress and Genetic Variants of Xenobiotic-Metabolising Enzymes Associated with COPD Development and Severity in Serbian Adults. COPD 2017; 14(1): 95–104. https://doi.org/10.1080/15412555.2016.1199667

26. Suvila, Niiranen. Interrelations Between High Blood Pressure, Organ Damage, and Cardiovascular Disease: No More Room for Doubt. Hypertension 2022; 79(3): 516–517. https://doi.org/10.1161/HYPERTENSIONAHA.121.18786

27. Menke. The HPA Axis as Target for Depression. Curr Neuropharmacol 2024; 22(5): 904–915. https://doi.org/10.2174/1570159X21666230811141557

28. Zhou, Wang, et al. The etiology of poststroke-depression: a hypothesis involving HPA axis. Biomed Pharmacother 2022; 151: 113146. https://doi.org/10.1016/j.biopha.2022.113146

29. Krittanawong, Maitra, Qadeer, et al. Association of Depression and Cardiovascular Disease. Am J Med 2023; 136(9): 881–895. https://doi.org/10.1016/j.amjmed.2023.04.036

30. Sandesara, Virani, Fazio, et al. The Forgotten Lipids: Triglycerides, Remnant Cholesterol, and Atherosclerotic Cardiovascular Disease Risk. Endocr Rev 2019; 40(2): 537–557. https://doi.org/10.1210/er.2018-00184

31. Milaneschi, Simmons, van Rossum EFC, et al. Depression and obesity: evidence of shared biological mechanisms. Mol Psychiatry 2019; 24(1): 18–33. https://doi.org/10.1038/s41380-018-0017-5

32. Abbas, Seol, Abbas, et al. Exploring the Role of Artificial Intelligence in Smart Healthcare: A Capability and Function-Oriented Review. Healthcare (Basel) 2025; 13(14): 1642. https://doi.org/10.3390/healthcare13141642

33. Abbas, Jeong, Lee. Explainable AI in Clinical Decision Support Systems: A Meta-Analysis of Methods, Applications, and Usability Challenges. Healthcare (Basel) 2025; 13(17): 2154. https://doi.org/10.3390/healthcare13172154
