Predicting mortality risk of COVID-19 among chronic kidney disease patients using machine learning algorithms

Abstract

Objective: COVID-19 has a high prevalence and mortality rate among chronic kidney disease (CKD) patients. Previous studies have shown that machine learning (ML)-based prediction models are effective for developing preventive strategies. Despite several efforts to establish prediction models for COVID-19 mortality risk among different populations, few studies have focused on COVID-19 mortality among patients with CKD. The current study aimed to develop an effective, efficient preventive strategy to improve prognosis and survival among these patients by constructing an ML-based prediction model. Methods: The current retrospective study used single-center data from 556 hospitalized patients with CKD. All patients in the cohort had respiratory failure following COVID-19. We leveraged ensemble and non-ensemble algorithms to construct a prediction model and scored the chosen features for better interpretability. Results: XG-Boost achieved an AU-ROC of 0.921 (95% CI [0.906–0.941]) in training mode and 0.851 (95% CI [0.835–0.877]) in validation mode, yielding more favourable predictive performance than the other models. Conclusion: XG-Boost demonstrated predictive merit that can be leveraged as an auxiliary tool to aid clinicians in making more informed decisions in clinical settings.

Keywords

machine learning, predictive model, preventive strategy, chronic kidney disease, COVID-19, mortality risk, prognosis

Introduction

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first detected in Wuhan, China, in December 2019.^1–3 As an infectious disease caused by the third highly pathogenic human coronavirus, COVID-19 poses significant public health challenges due to its rapid spread across populations.^4–6 This disease is associated with respiratory tract infections in humans and animals, causing fever, cough, and cold, and sometimes leading to death among infected patients, often following acute respiratory distress syndrome or pneumonia.^7,8 COVID-19 has a high incidence and mortality rate and has placed an overwhelming burden on patients and healthcare workers in healthcare settings.^9,10 The disease has caused considerable morbidity and mortality in more than 200 countries and regions.¹¹ Although the World Health Organization (WHO) has announced the end of the COVID-19 pandemic, the disease is now considered endemic and continues to affect a significant number of people worldwide.¹² Previous studies have demonstrated that chronic diseases can increase the severity of COVID-19, worsen its prognosis, and raise its mortality, and they are considered significant risk factors for this disease.^11,13

COVID-19 places a heavy burden on patients with chronic diseases, in whom the risk of COVID-19 mortality is significantly increased.^14,15 Chronic Kidney Disease (CKD), an important chronic condition, is strongly associated with an increased risk of severe COVID-19 infection.^16,17 Hence, physicians should advise patients with CKD and closely monitor them to prevent direct exposure to this infection.^18,19 One cohort study found that the incidence of COVID-19 among CKD patients was 4.09%, compared with 0.46% in the general population. In addition, the crude death rate among CKD patients with COVID-19 was 44.6%, compared with 4.7% among CKD patients without COVID-19, indicating a markedly worse prognosis and higher mortality when the two conditions coexist.^20,21 Generally, hospitalized COVID-19 patients with CKD, especially in stages 3 to 5 of this disease, have a significantly higher death rate than non-CKD patients, so a comprehensive and effective management strategy is required for these patients.^22,23

Preventive strategies are essential to enhance clinical and therapeutic measures for CKD patients infected with COVID-19, aiming to improve prognosis and reduce mortality.^24,25 Also, early detection and precise evaluation of disease severity among CKD patients infected with COVID-19 enhance clinical decision-making.^26,27 Past biomedical studies have demonstrated that the machine learning (ML) approach can effectively establish predictive models in healthcare settings to achieve prevention goals for various health conditions.^28–30 Although the ML approach has been used as a preventive strategy in several previous studies to predict COVID-19 mortality risk among other populations using prognostic factors,^31–35 few efforts have been made to develop prediction models for this purpose among CKD patients. In one study, Luo et al. developed ML models for COVID-19 mortality risk among CKD patients using laboratory data.²⁶ In contrast to Luo et al.’s research, the current study aimed to predict COVID-19 mortality risk among these patients using clinical comorbidity and medication data to gain new insights into preventive strategies via an ML approach.

Methods

Study design

Figure 1 shows the roadmap for this study, including all steps taken to develop prediction models for COVID-19 death risk among CKD patients. As shown in this figure, we first established a database and described the input and output features, along with the frequency of samples in each feature. Secondly, to enhance the quality of the database for analysis and model development, we employed several preprocessing techniques, including identifying duplicate records, handling invalid values, and handling missing data. Thirdly, we employed feature selection (FS) techniques, including both univariate and multivariate regression analysis, to reduce data dimensionality and select more suitable factors for establishing prediction models of death risk. Fourthly, we used the selected ML algorithms to develop prediction models. To assess the validity of the established models, we employed a hold-out strategy, dividing the original dataset into training and validation sets. To mitigate bias in the performance measures arising from the data split, we evaluated the ML algorithms using a 10-fold cross-validation approach. To achieve optimal algorithm performance, hyperparameters were tuned via grid search, and the best-performing settings were selected and reported for each algorithm. Fifthly, the performance of the established prediction models was compared and analyzed to identify the model with the highest predictive performance. Sixthly, the best-performing model was used to predict the mortality risk of COVID-19 among CKD patients.

Figure 1.

The roadmap of this study.

Population characteristics

The current retrospective study used single-center data from 556 hospitalized patients with CKD at Shariati Hospital in Tehran who were treated for COVID-19 over 6 months, from 01 March 2020 to 30 August 2020. The data of these patients were stored in one integrated database in (.SAV) format. Among the 556 hospitalized patients, 216 died and 340 survived. A schema of the database in (.SAV) format, showing the variable view and the data view, is depicted in Figures 2 and 3, respectively. As shown in these figures, the data used for analysis and the development of prediction models are structured. Definitions of all variables and data on several risk factors, ranging from dementia to antidepressant drug use, for approximately 20 samples are presented in this data view. Moreover, the death status after COVID-19 infection among CKD patients is presented in the data view with two codes, 0 and 1, corresponding to alive and deceased cases, respectively.

Figure 2.

The database view in the variable view screen.

Figure 3.

The database view in the data view screen.

Inclusion and exclusion criteria

In this study, we used data from patients with CKD progression stages 3b-5, focusing on kidney insufficiency, and every patient in the cohort had respiratory failure following COVID-19. The study population consisted of hospitalized patients with chronic kidney disease who presented with COVID-19 symptoms, such as respiratory failure, and whose COVID-19 diagnosis was confirmed by diagnostic test results. Conversely, patients with acute kidney disease, patients with acute respiratory failure due to other diseases, patients with concurrent malignant tumors, and post-transplant patients were excluded from this study.

Outcome variable

The outcome variable in this study was the mortality status of CKD patients infected with COVID-19. Among deceased cases, death occurred 24 to 33 days after COVID-19 infection. In this data-driven study, we aimed to estimate the mortality risk of COVID-19 among CKD patients, using this variable as the target. In the current database, statuses were categorized as deceased and alive cases, assigned the codes 1 and 0, respectively. The assessment of COVID-19-related mortality among CKD patients was based on positive diagnostic test results. These patients were receiving medications for CKD, and this information was documented in their medical records.

Additionally, after contracting COVID-19, they also received COVID-19-related medications to reduce the severity of symptoms related to this disease. All patients had respiratory failure due to COVID-19, and all features extracted and recorded in the database resulted from assessing CKD patients in this condition.

Features

The features used in this study covered demographics, comorbidities, and medication use, including age, sex, kidney disease type, coronary artery disease, congestive heart failure, peripheral vascular disease, cerebrovascular disease, hemiplegia, dementia, chronic pulmonary disease, diabetes, hypertension, connective tissue disorder, liver disease, peptic ulcer disease, systemic corticosteroid use, other immunosuppressant drugs use, beta-blocker use, antiplatelet agent, oral antidiabetic drug use, insulin use, and antidepressant drug use. They were measured and assessed within 2 to 3 days after COVID-19 was confirmed by diagnostic test results.

Database preparation

We took several steps to check and prepare the database for data analysis. First, we checked it for duplicate data, that is, data belonging to a single person but stored in multiple rows, and removed the duplicates. Second, any abnormal values (i.e., values not defined in the database) were deleted and treated as missing data. Third, we addressed missing patient data, distinguishing two situations. If more than 5% of the feature values in a record were missing, we excluded that row from the database. Otherwise, we imputed the missing data: we used the K-means clustering algorithm to group similar records based on their features and then filled in each missing value using the values of the corresponding feature in similar records. If the missing value was in the outcome variable and the actual data could not be found, we deleted the corresponding record.
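The cluster-based imputation described above can be illustrated with a minimal sketch. It assumes numerically coded features with `None` marking a missing cell, a deterministic farthest-first initialisation, and filling each gap with the most common value in the nearest cluster; the number of clusters, the initialisation, and the fill rule are all assumptions made for illustration, since the study does not report them.

```python
from collections import Counter

def _dist(record, centroid):
    # Squared Euclidean distance over observed (non-missing) features only.
    return sum((v - c) ** 2 for v, c in zip(record, centroid) if v is not None)

def _nearest(record, centroids):
    return min(range(len(centroids)), key=lambda j: _dist(record, centroids[j]))

def kmeans(rows, k, iters=20):
    """Plain k-means on complete rows; returns (centroids, labels)."""
    # Farthest-first initialisation (an assumption, chosen to be deterministic).
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(_dist(r, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        labels = [_nearest(r, centroids) for r in rows]
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [_nearest(r, centroids) for r in rows]

def impute(records, k=2):
    """Fill None cells with the modal value of the nearest cluster."""
    complete = [r for r in records if None not in r]
    centroids, labels = kmeans(complete, k)
    filled = []
    for r in records:
        new = list(r)
        if None in r:
            j = _nearest(r, centroids)
            members = [c for c, lab in zip(complete, labels) if lab == j] or complete
            for i, v in enumerate(r):
                if v is None:
                    new[i] = Counter(m[i] for m in members).most_common(1)[0][0]
        filled.append(new)
    return filled

# Toy binary records (hypothetical data, not taken from the study).
records = [
    [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 0, 1],
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0],
    [1, 1, None, 1], [0, 0, 0, None],
]
filled = impute(records, k=2)
```

In this toy example the two incomplete records are completed from the cluster each one most resembles on its observed features.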

Feature selection

One way to select more relevant features for building predictive models in ML is to leverage feature selection (FS). This approach simplifies the database by eliminating irrelevant features, thereby enhancing classification accuracy, reducing computation time and training complexity, improving generalization, and facilitating data understanding.^36–38 To perform FS, we employed univariate and multivariate regression analysis with a significance level of p < 0.05.

ML development and assessment

We used the selected ML algorithms to build models predicting mortality risk among CKD patients. To this end, ensemble and non-ensemble algorithms were run in Weka 3.9. The ensemble algorithms were XG-Boost (an extension of gradient boosting), Random Forest (RF), and Ada-Boost. In addition, Artificial Neural Networks (ANNs), Support Vector Classifier (SVC), Logistic Regression (LR), K-Nearest Neighbour (K-NN), and Decision Tree (DT) algorithms were used as non-ensemble (base) algorithms.

Algorithms’ performance and hyperparameter adjustment

The current study employed a grid search to tune the algorithms’ hyperparameters during training. This approach evaluates every combination of candidate hyperparameter values for each algorithm and records the resulting performance. The combination with the highest performance was then selected as the representative configuration of each algorithm. The hyperparameters considered for adjustment to obtain the best-performing model for each algorithm are as follows:

Ada-Boost: classifier type and number of iterations.

ANN: number of neurons, learning rate, and maximum epoch.

DT: confidence factor and minimum number of objects in leaves.

K-NN: K (number of similar cases to be considered), prediction or range target, and distance computation.

LR: binomial procedure, confidence interval, maximum iterations, and scale.

RF: maximum features, maximum tree depth, minimum samples in leaves, and minimum samples for splitting.

SVC: tolerance parameter, C (control parameter), and kernel type.

XG-Boost: max depth of tree, Gamma, Eta, minimum child weight, and number of epochs.
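As an illustration of the grid search described above, the following library-agnostic sketch enumerates every combination of the RF ranges listed in Table 4 and keeps the best-scoring one. The study itself ran its algorithms in Weka; the function names here are hypothetical, the ranges "5–8" and "2–5" are read as consecutive integers (an assumption), and `toy_score` is a placeholder that peaks at an arbitrary point rather than a real model evaluation.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid; return (best, score, n_tried)."""
    names = list(param_grid)
    best_params, best_score, n_tried = None, float("-inf"), 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried

# RF ranges as listed in Table 4 (integer ranges expanded by assumption).
rf_grid = {
    "max_features": [5, 6, 7, 8],
    "max_depth": [8, 10, 12, 15],
    "min_samples_leaf": [2, 3, 4, 5],
    "min_samples_split": [2, 3],
}

# Arbitrary target used only to exercise the search; NOT a study result.
_TARGET = {"max_features": 7, "max_depth": 12,
           "min_samples_leaf": 3, "min_samples_split": 2}

def toy_score(params):
    # Placeholder for a cross-validated performance estimate.
    return -sum((params[name] - _TARGET[name]) ** 2 for name in params)

best_params, best_score, n_tried = grid_search(rf_grid, toy_score)
print(n_tried, best_params)
```

In practice `score_fn` would train the model with the given settings and return a cross-validated metric; the cost grows multiplicatively with each added hyperparameter range.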

To comprehensively evaluate ML model performance and identify the best model for predicting mortality risk, we used a range of performance metrics, including Positive Predictive Value (PPV), Negative Predictive Value (NPV), sensitivity, specificity, accuracy, and F-Score. Together, these criteria give insight into each model’s performance on both positive (deceased) and negative (alive) cases.
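All six metrics follow from the four cells of a confusion matrix, as the short sketch below shows. The counts in the example are not reported in the study; they are an assumption, chosen because they are consistent with 211 deceased and 327 alive cases and reproduce the XG-Boost row of Table 3 after rounding.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the six reported metrics from confusion-matrix counts."""
    ppv = tp / (tp + fp)                  # positive predictive value (precision)
    npv = tn / (tn + fn)                  # negative predictive value
    sensitivity = tp / (tp + fn)          # recall on deceased cases
    specificity = tn / (tn + fp)          # recall on alive cases
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "ppv": ppv, "npv": npv, "sensitivity": sensitivity,
        "specificity": specificity, "accuracy": accuracy, "f_score": f_score,
    }

# Assumed counts (back-calculated for illustration, not taken from the paper).
metrics = classification_metrics(tp=197, fp=20, tn=307, fn=14)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```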

Additionally, the Area Under the Receiver Operating Characteristic curve (AU-ROC) was used to compare the algorithms’ predictive ability for mortality risk and to assess their performance across different classification thresholds for positive and negative cases.
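The AU-ROC can be read as the probability that a randomly chosen deceased case receives a higher predicted risk than a randomly chosen alive case. The sketch below computes it directly from that pairwise definition (ties count as one half); the labels and scores are toy values, not study data, and the function name is hypothetical.

```python
def auc_roc(labels, scores):
    """AU-ROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))

# Toy example: 1 = deceased, 0 = alive; scores are predicted risks.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_roc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

This pairwise form is quadratic in the sample size, which is acceptable for a cohort of a few hundred patients; rank-based formulas give the same value more efficiently.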

Hold-out strategy

Because clinical data from other settings were not available for external validation, we used a hold-out strategy to split the data and evaluate validation performance. In this strategy, all data are randomly split into two sets: a training set and a validation set. The training set is used to build the models and estimate their parameters, and the validation set is used to evaluate the established ML models. Generally, 70% and 30% (or two-thirds and one-third) of the data are used in the hold-out strategy to train and validate ML models.^39,40 The current study used two-thirds of the data for training and one-third for validation. Moreover, to mitigate bias arising from the data distribution in the performance reports, we employed K-fold cross-validation to report the performance of the algorithms in training mode, taking the mean performance over the K = 10 training iterations.
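A minimal sketch of this splitting scheme is given below. It assumes a simple random (non-stratified) split, a fixed seed, and that the 10 folds are drawn from the training partition; the study does not state these details, so they are illustrative choices only.

```python
import random

def holdout_split(n, train_fraction=2 / 3, seed=42):
    """Randomly split indices 0..n-1 into training and validation sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_fraction)
    return indices[:n_train], indices[n_train:]

def kfold(indices, k=10):
    """Yield (fit, test) index lists; fold sizes differ by at most one."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]

# 538 records remained after database preparation.
train_idx, val_idx = holdout_split(538)
folds = list(kfold(train_idx, k=10))
print(len(train_idx), len(val_idx), [len(test) for _, test in folds])
```

Each record in the training partition appears in exactly one test fold, so averaging a metric over the 10 folds uses every training record once for evaluation.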

Feature importance

Feature importance (FI), also known as feature identification, feature attribution, or model explainability, enables researchers to look beyond the black box of ML algorithms and understand how they operate during training and how important each feature is for prediction. FI provides an output metric or score, allowing us to rank features from most to least significant for the output class. These scores are often obtained by systematically varying features to identify which produce the most significant change in predictive strength, thereby generating an importance score for each feature that enables ranking.^41,42 In this study, we used Permutation Feature Importance (PFI) to rank the critical predictors of COVID-19 mortality risk among CKD patients. PFI is a model inspection method that ranks the contribution of individual features to the statistical performance of a fitted model on a specified tabular dataset. PFI measures the importance of an individual feature to a model’s predictive ability by computing the change in model error when the values of that feature are shuffled (or permuted).
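The shuffling procedure behind PFI can be sketched as follows. The example uses accuracy as the score, a hand-written toy predictor in place of a fitted model, and synthetic binary data; the function names, the number of repeats, and the data are assumptions for illustration and do not reproduce the study's XG-Boost analysis.

```python
import random

def accuracy(predict, X, y):
    return sum(predict(row) == target for row, target in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Synthetic data: feature 0 determines the outcome, feature 1 is irrelevant.
X = [[i % 2, (i // 2) % 2] for i in range(100)]
y = [row[0] for row in X]

def toy_model(row):
    # Stand-in for a fitted classifier; it only looks at feature 0.
    return row[0]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling the informative feature roughly halves the accuracy, while shuffling the feature the model ignores leaves it unchanged, which is the contrast that the ranking relies on.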

Statistical analysis

In this study, we used univariate analysis (Chi-square test) and multivariate regression to investigate differences between deceased and alive cases and to determine the importance of each feature when selecting the best features for ML-based mortality prediction. A p-value < 0.05 was considered statistically significant. The statistical analysis was conducted using IBM SPSS Statistics version 25.
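For a binary feature, the univariate test reduces to a Chi-square test on a 2 × 2 table. The sketch below applies the Pearson statistic without continuity correction to the hypertension counts from Table 1; whether a continuity correction was applied in the study is not stated, so this is an assumption, and the function name is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypertension counts from Table 1: rows = deceased / alive, columns = yes / no.
chi2, p_value = chi_square_2x2(158, 53, 81, 246)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
```

The resulting p-value falls well below 0.001, in line with the value reported for hypertension in Table 1.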

Results

Database preparation

Upon reviewing the current database for duplicate cases, six records (one deceased and five alive) were excluded from the study. Five cells with invalid feature values, in five records, were deleted and treated as missing data for imputation. Ten rows (three deceased and seven alive) with more than 5% missing values in their features were removed from the study. For 15 cases with less than 5% missing data, we imputed the missing values using those of the most similar records. Two records (one deceased and one alive) were excluded due to missing outcome data. Finally, 538 rows were used for data analysis, comprising 211 deceased and 327 alive cases. The descriptive statistics for the data used in the study, comparing the two groups, are presented in Table 1.

Table 1.

The characteristics of the sample used among deceased and living CKD patients.

Feature	Value	Total CKD n = 538	CKD (%)	Deceased n = 211	Deceased (%)	Alive n = 327	Alive (%)	p-value (χ²)
Age	<45	68	12.64	16	7.58	52	15.90	0.03
	45–55	195	36.25	55	26.07	140	42.81
	>55	275	51.12	140	66.35	135	41.28
Sex	Male	316	58.74	125	59.24	191	58.41	0.1
Sex	Female	222	41.26	86	40.76	136	41.59	0.1
Type of kidney disease	Glomerulonephritis	52	9.67	18	8.53	34	10.40	0.03
	Diabetic kidney disease	124	23.05	52	24.64	72	22.02
	Hypertensive kidney disease	188	34.94	69	32.70	119	36.39
	Adult polycystic	28	5.20	18	8.53	10	3.06
	Pyelonephritis	31	5.76	15	7.11	16	4.89
	Other specified kidney disease	115	21.38	39	18.48	76	23.24
Coronary artery disease	No	390	72.49	120	56.87	270	82.57	<0.001
Coronary artery disease	Yes	148	27.51	91	43.13	57	17.43	<0.001
Congestive heart failure	No	411	76.39	126	59.72	285	87.16	<0.001
Congestive heart failure	Yes	127	23.61	85	40.28	42	12.84	<0.001
Peripheral vascular disease	No	448	83.27	152	72.04	296	90.52	<0.001
Peripheral vascular disease	Yes	90	16.73	59	27.96	31	9.48	<0.001
Cerebrovascular disease	No	400	74.35	115	54.50	285	87.16	<0.001
Cerebrovascular disease	Yes	138	25.65	96	45.50	42	12.84	<0.001
Hemiplegia	No	510	94.80	200	94.79	310	94.80	0.2
Hemiplegia	Yes	28	5.20	11	5.21	17	5.20	0.2
Dementia	No	497	92.38	194	91.94	303	92.66	0.17
Dementia	Yes	41	7.62	17	8.06	24	7.34	0.17
Chronic pulmonary disease	No	453	84.20	150	71.09	303	92.66	<0.001
Chronic pulmonary disease	Yes	85	15.80	61	28.91	24	7.34	<0.001
Diabetes	No	349	64.87	85	40.28	264	80.73	<0.001
Diabetes	Yes	189	35.13	126	59.72	63	19.27	<0.001
Hypertension	No	299	55.58	53	25.12	246	75.23	<0.001
Hypertension	Yes	239	44.42	158	74.88	81	24.77	<0.001
Connective tissue disorder	No	477	88.66	176	83.41	301	92.05	0.01
Connective tissue disorder	Yes	61	11.34	35	16.59	26	7.95	0.01
Liver disease	No	486	90.33	170	80.57	316	96.64	<0.001
Liver disease	Yes	52	9.67	41	19.43	11	3.36	<0.001
Peptic ulcer disease	No	503	93.49	196	92.89	307	93.88	0.1
Peptic ulcer disease	Yes	35	6.51	15	7.11	20	6.12	0.1
Systemic corticosteroid	No	434	80.67	155	73.46	279	85.32	<0.01
Systemic corticosteroid	Yes	104	19.33	56	26.54	48	14.68	<0.01
Other immunosuppressant drugs use	No	479	89.03	172	81.52	307	93.88	0.01
Other immunosuppressant drugs use	Yes	59	10.97	39	18.48	20	6.12	0.01
Beta-blocker use	No	220	40.89	64	30.33	156	47.71	<0.001
Beta-blocker use	Yes	318	59.11	147	69.67	171	52.29	<0.001
Antiplatelet agent	No	386	71.75	104	49.29	282	86.24	<0.001
Antiplatelet agent	Yes	152	28.25	107	50.71	45	13.76	<0.001
Oral antidiabetic drug use	No	378	70.26	92	43.60	286	87.46	<0.001
Oral antidiabetic drug use	Yes	160	29.74	119	56.40	41	12.54	<0.001
Insulin use	No	503	93.49	183	86.73	320	97.86	<0.001
Insulin use	Yes	35	6.51	28	13.27	7	2.14	<0.001
Antidepressant drug use	No	493	91.64	194	91.94	299	91.44	0.1
Antidepressant drug use	Yes	45	8.36	17	8.06	28	8.56	0.1

Values in bold indicate statistical significance at p < 0.05.

According to Table 1, age, kidney disease type, coronary artery disease, congestive heart failure, peripheral vascular disease, cerebrovascular disease, chronic pulmonary disease, diabetes, hypertension, connective tissue disorder, liver disease, systemic corticosteroid use, other immunosuppressant drugs use, beta-blocker use, antiplatelet agent, oral antidiabetic drug use, and insulin use differed significantly between deceased and alive cases (p < 0.05). In contrast, sex, hemiplegia, dementia, peptic ulcer disease, and antidepressant drug use did not differ significantly between the two groups.

Multivariate analysis

The results of FS using multivariate analysis of the CKD data on COVID-19 mortality risk are presented in Table 2.

Table 2.

The multivariate analysis of CKD patients.

Feature	β^a	OR^b	95% CI^c [OR]	p-value
Age	0.113	1.285	[1.131–1.426]	0.02
Sex	0.21	1.125	[0.854–1.454]	0.13
Type of kidney disease	0.05	1.09	[1.04–1.16]	0.04
Coronary artery disease	0.348	1.524	[1.415–1.672]	<0.001
Congestive heart failure	0.226	1.421	[1.313–1.572]	0.01
Peripheral vascular disease	0.128	1.219	[1.198–1.315]	0.01
Cerebrovascular disease	0.071	1.13	[1.05–1.2]	0.02
Hemiplegia	0.151	1.085	[0.825–1.432]	0.1
Dementia	0.321	1.272	[0.914–1.456]	0.1
Chronic pulmonary disease	0.461	1.664	[1.476–1.763]	<0.001
Diabetes	0.725	2.203	[1.979–2.501]	<0.001
Hypertension	0.917	2.578	[2.152–2.816]	<0.001
Connective tissue disorder	0.126	1.05	[0.795–1.199]	0.08
Liver disease	0.214	1.125	[0.893–1.274]	0.1
Peptic ulcer disease	0.09	1.031	[0.693–1.316]	0.1
Systemic corticosteroid	0.143	1.275	[1.269–1.419]	<0.001
Other immunosuppressant drugs use	0.198	1.356	[1.307–1.503]	<0.001
Beta-blocker use	0.251	1.485	[1.359–1.723]	<0.01
Antiplatelet agent	0.133	1.234	[1.208–1.301]	<0.01
Oral antidiabetic drug use	0.386	1.515	[1.452–1.613]	<0.001
Insulin use	0.354	1.493	[1.433–1.579]	<0.001
Antidepressant drug use	0.08	1.025	[0.636–1.274]	0.1

^aRegression coefficient.

^bOdds ratio.

^cConfidence Interval.

Values in bold indicate statistical significance at p < 0.05.

As shown in Table 2, 15 risk factors, including age (β = 0.113, OR = 1.285, 95% CI of OR = [1.131–1.426]), kidney disease type (β = 0.05, OR = 1.09, 95% CI of OR = [1.04–1.16]), coronary artery disease (β = 0.348, OR = 1.524, 95% CI of OR = [1.415–1.672]), congestive heart failure (β = 0.226, OR = 1.421, 95% CI of OR = [1.313–1.572]), peripheral vascular disease (β = 0.128, OR = 1.219, 95% CI of OR = [1.198–1.315]), cerebrovascular disease (β = 0.071, OR = 1.13, 95% CI of OR = [1.05–1.2]), chronic pulmonary disease (β = 0.461, OR = 1.664, 95% CI of OR = [1.476–1.763]), diabetes (β = 0.725, OR = 2.203, 95% CI of OR = [1.979–2.501]), hypertension (β = 0.917, OR = 2.578, 95% CI of OR = [2.152–2.816]), systemic corticosteroid (β = 0.143, OR = 1.275, 95% CI of OR = [1.269–1.419]), other immunosuppressant drugs use (β = 0.198, OR = 1.356, 95% CI of OR = [1.307–1.503]), beta-blocker use (β = 0.251, OR = 1.485, 95% CI of OR = [1.359–1.723]), antiplatelet agent (β = 0.133, OR = 1.234, 95% CI of OR = [1.208–1.301]), oral antidiabetic drug use (β = 0.386, OR = 1.515, 95% CI of OR = [1.452–1.613]), and insulin use (β = 0.354, OR = 1.493, 95% CI of OR = [1.433–1.579]) were considered the critical factors to predict COVID-19 mortality among CKD patients (p < 0.05). In contrast, the risk factors of sex, hemiplegia, dementia, connective tissue disorder, liver disease, peptic ulcer disease, and antidepressant drug use were excluded from the current study (p > 0.05).

Model construction and assessment

The results of evaluating the ML models’ performance in predicting COVID-19 mortality risk among CKD patients, along with the selected values of the most critical hyperparameters, are shown in Table 3. The ranges of hyperparameters explored by the grid search to find the settings with the highest performance in predicting COVID-19 mortality risk among CKD patients are presented in Table 4.

Table 3.

The performance evaluation of ML models.

Models	Selected hyperparameters	PPV (%)	NPV (%)	Sensitivity (%)	Specificity (%)	Accuracy (%)	F-score (%)
Ada-boost	Classifier = rep-tree, Number of iterations = 20.	68.14	81.73	72.99	77.98	76.02	70.48
ANN	Number of neurons = 10, learning rate = 0.5, maximum epoch = 100.	55.38	74.91	65.88	65.75	65.80	60.17
DT	Confidence factor = 0.3, minimum number of objects = 2, binary splitting = false.	60.87	83.59	79.62	66.97	71.93	68.99
K-NN	K = 3, Prediction or range target = mean value in nearest neighbour, Distance computation = Euclidean metric	52.59	74.25	67.30	60.86	63.38	59.04
LR	Binomial procedure = enter, confidence interval = 0.95, Maximum iterations = 20, Scale = Pearson	61.44	78.15	68.72	72.17	70.82	64.88
RF	Max-features = 6, Max depth = 10, Min-sample-leaf = 2, Min-sample-split = 2	76.86	91.55	88.15	82.87	84.94	82.12
SVC	Tolerance parameter = 0.001, C = 10, Kernel = RBF	68.18	84.46	78.20	76.45	77.14	72.85
XG-boost	Max depth = 10, Gamma = 1, Eta = 0.3, Min-child-weight = 1, epoch = 20	90.78	95.64	93.36	93.88	93.68	92.06

Table 4.

The range of hyperparameters using Grid search to tune ML algorithms.

Algorithm	Hyperparameter
Ada-boost	Classifier = rep-tree, DT, Number of iterations [10, 15, 20, 25, 30, 50]
ANN	Number of neurons [10, 15, 20, 30], Learning rate [0.1, 0.2, 0.3, 0.5, 0.7], Maximum epoch [30, 50, 70, 80, 100]
DT	Confidence factor [0.2, 0.25, 0.3, 0.35], Minimum number of objects [1–3], Binary splitting [false, true]
K-NN	K [3, 5, 7, 9]
LR	Binomial procedure [enter, forward, backward], Confidence interval [0.8, 0.9, 0.95], Maximum iterations [10, 15, 20, 30, 50]
RF	Max-features [5–8], Max depth [8, 10, 12, 15], Min-sample-leaf [2–5], Min-sample-split [2, 3]
SVC	C [10–15]
XG-boost	Max depth [6–20], Eta [0.2, 0.3, 0.4, 0.5], epoch [10, 15, 20, 25, 30, 50]

As Table 3 shows, the XG-Boost model, with a PPV of 90.78%, NPV of 95.64%, sensitivity of 93.36%, specificity of 93.88%, accuracy of 93.68%, and F-score of 92.06%, achieved superior performance compared to other models. RF, with a PPV of 76.86%, NPV of 91.55%, sensitivity of 88.15%, specificity of 82.87%, accuracy of 84.94%, and F-score of 82.12%, ranked second after XG-Boost with satisfactory performance. In contrast, the ANN and K-NN models exhibited the lowest predictive performance on most criteria, with values ranging from roughly 50% to 75%. By comparing the models’ performance metrics, SVC, Ada-Boost, LR, and DT ranked third to sixth, respectively, in terms of predictive efficiency for mortality risk.

The ROC curves for the ML models in training and validation modes are shown in Figures 4 and 5, respectively.

Figure 4.

The ML models’ performance in training mode.

Figure 5.

The ML models’ performance in validation mode.

In training mode (Figure 4), XG-Boost, with an AU-ROC of 0.921 and a 95% CI of [0.906–0.941], outperformed the other models. RF and SVC, with AU-ROC values of 0.843 (95% CI = [0.82–0.876]) and 0.813 (95% CI = [0.806–0.831]), respectively, obtained satisfactory performance in predicting mortality. Ada-Boost, LR, DT, and K-NN, with AU-ROC values between 0.6 and 0.8, ranked fourth to seventh in terms of performance. The lowest performance was observed for the ANN, with an AU-ROC of 0.567 and a 95% CI of [0.533–0.589]. According to Figure 5, in validation mode, the XG-Boost model, with an AU-ROC of 0.851 and a 95% CI of [0.835–0.877], demonstrated a notable predictive ability compared to other ML models. The Ada-Boost, SVC, and RF models, with AU-ROC values ranging from 0.7 to 0.8, achieved fairly favourable predictive performance for mortality risk. The other models, including DT, LR, K-NN, and ANN, had AU-ROC scores below 0.7. The K-NN model, with an AU-ROC of 0.525 and a 95% CI of [0.507–0.531], demonstrated the lowest predictive ability in this respect. Generally, comparing the ML models using various performance criteria in the current study demonstrated that XG-Boost has higher performance efficiency in predicting COVID-19 mortality risk among CKD patients.

Feature importance assessment

We used XG-Boost, the best-performing model, to identify the most predictive features for COVID-19 mortality among CKD patients and to enhance the model’s interpretability. The PFI of XG-Boost for the eight most influential risk factors is depicted in Figure 6. Based on this figure, hypertension, diabetes, chronic pulmonary disease, coronary artery disease, age, insulin use, beta-blocker use, and congestive heart failure were identified as significant predictors of COVID-19 mortality among CKD patients.

Figure 6.

The PFI of the XG-Boost model.

Discussion

This study aimed to construct an efficient ML model to assess and predict the COVID-19 mortality risk of CKD patients. We used eight ML models: Ada-Boost, ANN, DT, LR, K-NN, RF, SVC, and XG-Boost. We trained the ML algorithms on clinical and medication data to build the models. In brief, based on the current study’s results, XG-Boost, with a PPV of 90.78%, NPV of 95.64%, sensitivity of 93.36%, specificity of 93.88%, accuracy of 93.68%, F-score of 92.06%, and AU-ROC of 0.921 and 0.851 in training and validation modes, respectively, was recognized as the best model for predicting mortality. According to XG-Boost, hypertension, diabetes, chronic pulmonary disease, coronary artery disease, age, insulin use, beta-blocker use, and congestive heart failure had better predictive ability than the other features regarding COVID-19 mortality among CKD patients.

Despite numerous studies on leveraging ML to build predictive models of COVID-19 mortality risk, only one effort, by Luo et al., has focused on this topic among CKD patients. Their research developed a predictive model of COVID-19 mortality risk among patients with CKD using laboratory parameters. LightGBM, an ensemble model with an AU-ROC of 0.833, demonstrated superior performance compared to the other models in mortality prediction.²⁶ In the current study, the XG-Boost AU-ROC values of 0.921 and 0.851 exceeded the value reported by Luo et al., suggesting a substantial role for clinical data in predicting the mortality risk of COVID-19 among these patients. Ponce et al. developed ML models to predict the mortality risk of patients with acute kidney injury (AKI) using demographic, comorbidity, laboratory, and AKI characteristics. Their study revealed that the Elastic Net model, with an AU-ROC of 0.823 (95% CI 0.761–0.885) in validation mode, was the best-performing model for prediction purposes.⁴³ In the current study, XG-Boost achieved an AU-ROC of 0.851 with a 95% CI of [0.835–0.877] in validation mode, a satisfactory performance similar to that in Ponce et al.’s study. Additionally, age and hypertension were among the most significant features in that study and are also considered critical in the current research.

As stated, numerous studies have explored this topic in general populations, and we summarize some of them here. An et al. used ML models to predict mortality risk among COVID-19 patients. The SVM with an AU-ROC of 0.963 [0.946, 0.979] and 0.962 [0.945, 0.979] using Lasso and a linear kernel was the best model for mortality prediction.⁴⁴ Additionally, age was identified as a significant predictor of mortality, consistent with the current study’s findings. Banoe et al. reported that the statistically inspired modification of partial least squares (SIMPLS) model, with an AU-ROC >0.85, has favourable predictive ability.⁴⁵ Pourhomayoun et al. used ML techniques to predict mortality risk in patients with COVID-19, achieving an accuracy of 89.98%.³¹ Shanbehzadeh et al. applied various ANN architectures to predict mortality among COVID-19 patients, yielding an AU-ROC of 0.888 as their performance metric.⁴⁶ Generally, the present study demonstrated satisfactory predictive performance, similar to previous studies on this topic among general populations.

This study identified hypertension, diabetes, CPD, CAD, and age as the five top-ranking risk factors influencing mortality prediction among CKD patients. Consistent with our findings, diabetes and hypertension are recognized as significant risk factors in the general population. Albitar et al. found that advanced age, hypertension, and diabetes are three critical risk factors for COVID-19 death in the general population.⁴⁷ Wu et al., in their meta-analysis, indicated that diabetes increases the risk of mortality among COVID-19 patients.⁴⁸ Another systematic review and meta-analysis by Corona et al. showed that diabetes is the most important factor increasing the risk of COVID-19 mortality.⁴⁹ Comparing the findings of the current research on a CKD population with those of previous studies on general populations provides insight into the role of diabetes in COVID-19 mortality risk. Although diabetes is the most important risk factor in the general population, the current findings showed that hypertension is the top-ranking risk factor for COVID-19 mortality among CKD patients. Hypertension may therefore play a more essential role in clinical decision-making regarding COVID-19 mortality and in assessing clinical progression in CKD patients. Moreover, age is considered the most significant risk factor in studies examining COVID-19 mortality risk in the general population.⁵⁰ As with diabetes, the current research identified age as an important risk factor for COVID-19 mortality among CKD patients, but not as the top-ranking one. This differs from previous findings in general populations, where age is generally reported as the most critical risk factor.

CPD and CAD are two significant comorbidities for predicting death risk in CKD patients, which is consistent with other studies on COVID-19 risk prediction in general populations.⁵¹ Overall, comparing the importance of these factors between the CKD population in the current research and the general populations in previous studies shows that, although some risk factors are important in both for predicting COVID-19 mortality risk, their ranking differs. Identifying the most critical risk factors for the population under investigation can therefore be essential to clinical decision-making by healthcare providers.

Limitations and future implications

During this research, we encountered several limitations and constraints, which we outline here. First, we used a single-center database, which may limit the models’ generalizability. Using multi-center databases or large registries would help address this issue. Second, some risk factors, such as the laboratory parameters used in previous studies, were not available in the current database, potentially affecting the models’ performance. We recommend conducting a prospective study that includes these factors to achieve more accurate predictive insights.

Additionally, some data were missing and were imputed using statistical techniques, which can influence the models’ generalizability. A prospective study could reduce this limitation. Another limitation of the current study was the lack of external validation to assess the ML model’s generalizability. For future research, we recommend external validation to evaluate the predictive capabilities of ML models in other clinical settings.
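To make the imputation and external-validation points concrete, the following minimal sketch shows one common pattern: imputation values are learned from the development cohort only and then reused unchanged on an external cohort, so that no information from the external data leaks into preprocessing. Median imputation is an assumption made for illustration; the statistical technique used in this study is not specified here. The helper names `fit_medians` and `impute`, the variable names, and the toy records are hypothetical.

```python
from statistics import median


def fit_medians(rows, columns):
    """Learn one median per column from the development cohort, ignoring missing (None) values."""
    medians = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        medians[col] = median(observed)
    return medians


def impute(rows, medians):
    """Return copies of the rows with missing values replaced by the learned medians."""
    return [
        {col: (medians[col] if value is None and col in medians else value)
         for col, value in row.items()}
        for row in rows
    ]


# Hypothetical development cohort with missing entries (None).
development = [
    {"age": 60, "sbp": 140},
    {"age": None, "sbp": 150},
    {"age": 70, "sbp": None},
    {"age": 80, "sbp": 130},
]
# Hypothetical external cohort from another center.
external = [{"age": None, "sbp": 120}]

medians = fit_medians(development, ["age", "sbp"])
development_filled = impute(development, medians)
external_filled = impute(external, medians)  # reuses development medians, no refitting
print(medians, external_filled)
```

The same principle applies to any other preprocessing step and to the model itself: both are fixed on the development data before being evaluated on the external cohort.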

Conclusion

In this study, XG-Boost achieved AU-ROC scores of 0.921 and 0.851 in the training and validation sets, respectively. We conclude that ML models can provide favourable predictive insights into the mortality risk of COVID-19 among CKD patients. Hypertension, diabetes, chronic pulmonary disease, coronary artery disease, age, insulin use, beta-blocker use, and congestive heart failure were identified as the best predictors in this regard. According to the results, XG-Boost, an efficient model, can serve as a robust knowledge base for prediction systems. Physicians can use these systems to evaluate patients based on these risk factors, thereby enhancing diagnostic and therapeutic protocols and supporting individualized decision-making.

Footnotes

Acknowledgments

We thank all the people who assisted us in all steps of this study.

ORCID iD

Raoof Nopour

Ethical considerations

This study was approved by the ethics committee of Tehran University of Medical Sciences (Reg No: 96-11-582002) on 03 Apr 2024. All methods were carried out in accordance with relevant guidelines and regulations.

Consent to participate

Due to the retrospective nature of this study, informed consent was waived for this research.

Author contributions

R.N. conducted the writing, review, and editing of this manuscript.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The dataset generated and analyzed during the current study is not publicly available due to privacy concerns. However, de-identified data used in this study can be made available from the corresponding author upon reasonable request, subject to approval from the institutional review board and in compliance with data protection regulations.

References

1. Wang, Tang, Wei. Updated understanding of the outbreak of 2019 novel coronavirus (2019-nCoV) in Wuhan, China. J Med Virol 2020; 92(4): 441–447.

2. Lai C-C, Shih T-P, W-C, et al. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): the epidemic and the challenges. Int J Antimicrob Agents 2020; 55(3): 105924.

3. Al-Qahtani. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2): emergence, history, basic and clinical aspects. Saudi J Biol Sci 2020; 27(10): 2531–2538.

4. McArthur, Sakthivel, Ataide, et al. Review of burden, clinical definitions, and management of COVID-19 cases. Am J Trop Med Hyg 2020; 103(2): 625–638.

5. Tabari, Amini, Moghadami, et al. International public health responses to COVID-19 outbreak: a rapid review. Iran J Med Sci 2020; 45(3): 157–169.

6. Clemente-Suárez, Navarro-Jiménez, Moreno-Luna, et al. The impact of the COVID-19 pandemic on social, health, and economy. Sustainability 2021; 13(11): 6314.

7. Noor, Islam. Prevalence and associated risk factors of mortality among COVID-19 patients: a meta-analysis. J Community Health 2020; 45(6): 1270–1282.

8. Mohapatra, Pintilie, Kandi, et al. The recent challenges of highly contagious COVID-19, causing respiratory infections: symptoms, diagnosis, transmission, possible vaccines, animal models, and immunotherapy. Chem Biol Drug Des 2020; 96(5): 1187–1208.

9. Abate, Checkol, Mantefardo. Global prevalence and determinants of mortality among patients with COVID-19: a systematic review and meta-analysis. Ann Med Surg 2021; 64: 102204.

10. Singh, Gillies, Singh, et al. Prevalence of comorbidities and their association with mortality in patients with COVID-19: a systematic review and meta-analysis. Diabetes Obes Metabol 2020; 22(10): 1915–1924.

11. Liu, Chen, Liu, et al. Comorbid chronic diseases are strongly correlated with disease severity among COVID-19 patients: a systematic review and meta-analysis. Aging Dis 2020; 11(3): 668–678.

12. de Holanda, e Silva, de Carvalho César Sobrinho ÁA. Machine learning models for predicting hospitalization and mortality risks of COVID-19 patients. Expert Syst Appl 2024; 240: 122670.

13. Wang, Fang, Cai, et al. Comorbid chronic diseases and acute organ injuries are strongly correlated with disease severity and mortality among COVID-19 patients: a systemic review and meta-analysis. Research 2020; 2020: 1–17.

14. Hacker, Briss, Richardson, et al. COVID-19 and chronic disease: the impact now and in the future. Prev Chronic Dis 2021; 18: E62.

15. Smallwood, Harrex, Rees, et al. COVID-19 infection and the broader impacts of the pandemic on healthcare workers. Respirology 2022; 27(6): 411–426.

16. Carlson, Nelveg‐Kristensen, Freese Ballegaard, et al. Increased vulnerability to Covid‐19 in chronic kidney disease. J Intern Med 2021; 290(1): 166–178.

17. Zou, Qin, Yang, et al. Clinical characteristics, outcomes and risk factors for mortality in hospitalized diabetes and chronic kidney disease patients after COVID-19 infection following widespread vaccination. J Endocrinol Investig 2024; 47(3): 619–631.

18. Henry, Lippi. Chronic kidney disease is associated with severe coronavirus disease 2019 (COVID-19) infection. Int Urol Nephrol 2020; 52(6): 1193–1194.

19. S-X, Zhao, L-F, et al. Management recommendations for patients with chronic kidney disease during the novel coronavirus disease 2019 (COVID-19) epidemic. Chronic Dis Transl Med 2020; 06(02): 119–123.

20. Gibertoni, Reno, Rucci, et al. COVID-19 incidence and mortality in non-dialysis chronic kidney disease patients. PLoS One 2021; 16(7): e0254525.

21. Cheng, Luo, Wang, et al. Kidney disease is associated with in-hospital death of patients with COVID-19. Kidney Int 2020; 97(5): 829–838.

22. Cai, Zhang, Zhu, et al. Mortality in chronic kidney disease patients with COVID-19: a systematic review and meta-analysis. Int Urol Nephrol 2021; 53(8): 1623–1629.

23. Ozturk, Turgutalp, Arici, et al. Mortality analysis of COVID-19 infection in chronic kidney disease, haemodialysis and renal transplant patients compared with patients without kidney disease: a nationwide analysis from Turkey. Nephrol Dial Transplant 2020; 35(12): 2083–2095.

24. Martín CZ-S. Telenephrology: a resource for universalizing access to kidney care, perspectives from Latin America. In: Bezerra da Silva Junior, Nangaku (eds) Innovations in nephrology: breakthrough technologies in kidney disease care. Springer, 2022, pp. 321–341.

25. Elsayed. Preventive strategies and renal replacement therapies for patients with COVID-19. J Egypt Soc Nephrol Transplant 2020; 20(4): 211–223.

26. Luo, Gao, Yang, et al. Predictive modeling of COVID-19 mortality risk in chronic kidney disease patients using multiple machine learning algorithms. Sci Rep 2024; 14(1): 26979.

27. Pecly IMD, Azevedo, Muxfeldt, et al. COVID-19 and chronic kidney disease: a comprehensive review. Braz J Nephrol 2021; 43(3): 383–399.

28. Javaid, Haleem, Pratap Singh, et al. Significance of machine learning in healthcare: features, pillars and applications. Int J Intell Netw 2022; 3: 58–73.

29. Nithya, Ilango (eds). Predictive analytics in health care using machine learning tools and techniques. In: 2017 International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 15–16 June 2017.

30. Nopour. Design of risk prediction model for esophageal cancer based on machine learning approach. Heliyon 2024; 10(2): e24797.

31. Pourhomayoun, Shakibi. Predicting mortality risk in patients with COVID-19 using machine learning to help medical decision-making. Smart Health 2021; 20: 100178.

32. Chowdhury MEH, Rahman, Khandakar, et al. An early warning tool for predicting mortality risk of COVID-19 patients using machine learning. Cognit Comput 2024; 16(4): 1778–1793.

33. Booth, Abels, McCaffrey. Development of a prognostic model for mortality in COVID-19 infection using machine learning. Mod Pathol 2021; 34(3): 522–531.

34. Liu, Jiang, et al. Early prediction of mortality risk among patients with severe COVID-19, using machine learning. Int J Epidemiol 2020; 49(6): 1918–1929.

35. Yan, Zhang H-T, Goncalves, et al. An interpretable mortality prediction model for COVID-19 patients. Nat Mach Intell 2020; 2(5): 283–288.

36. Chandrashekar, Sahin. A survey on feature selection methods. Comput Electr Eng 2014; 40(1): 16–28.

37. Guyon, Elisseeff. An introduction to variable and feature selection. J Mach Learn Res 2003; 3: 1157–1182.

38. Pal, Foody. Feature selection for classification of hyperspectral data by SVM. IEEE Trans Geosci Rem Sens 2010; 48(5): 2297–2307.

39. Mezzadri, Laloë, Mathy, et al. Hold-out strategy for selecting learning models: application to categorization subjected to presentation orders. J Math Psychol 2022; 109: 102691.

40. Uyar, Bener, Ciray. Predictive modeling of implantation outcome in an in vitro fertilization setting: an application of machine learning methods. Med Decis Mak 2015; 35(6): 714–725.

41. Musolf, Holzinger, Malley, et al. What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics. Hum Genet 2022; 141(9): 1515–1528.

42. Hassija, Chamola, Mahapatra, et al. Interpreting black-box models: a review on explainable artificial intelligence. Cognit Comput 2024; 16(1): 45–74.

43. Ponce, de Andrade LGM, Claure-Del Granado, et al. Development of a prediction score for in-hospital mortality in COVID-19 patients with acute kidney injury: a machine learning approach. Sci Rep 2021; 11(1): 24439.

44. Lim, Kim D-W, et al. Machine learning prediction for mortality of patients diagnosed with COVID-19: a nationwide Korean cohort study. Sci Rep 2020; 10(1): 18716.

45. Banoei, Dinparastisaleh, Zadeh, et al. Machine-learning-based COVID-19 mortality prediction model and identification of patients at low and high risk of dying. Crit Care 2021; 25(1): 328.

46. Shanbehzadeh, Nopour, Kazemi-Arpanahi. Design of an artificial neural network to predict mortality among COVID-19 patients. Inform Med Unlocked 2022; 31: 100983.

47. Albitar, Ballouze, Ooi, et al. Risk factors for mortality among COVID-19 patients. Diabetes Res Clin Pract 2020; 166: 108293.

48. Z-H, Tang, Cheng. Diabetes increases the mortality of patients with COVID-19: a meta-analysis. Acta Diabetol 2021; 58(2): 139–144.

49. Corona, Pizzocaro, Vena, et al. Diabetes is most important cause for mortality in COVID-19 hospitalized patients: systematic review and meta-analysis. Rev Endocr Metab Disord 2021; 22(2): 275–296.

50. Caramelo, Ferreira, Oliveiros. Estimation of risk factors for COVID-19 mortality-preliminary results. medRxiv. 2020; 2020(02). doi:10.1101/2020.02.24.20027268.

51. Jawad, Alsuwaidi, Khan. Population risk factors for COVID-19 mortality in 93 countries. J Epidemiol Glob Health 2020; 10(3): 204–208.