Abstract
Heart failure, a condition in which the heart cannot circulate enough blood to meet the body's needs, is a prominent global cause of mortality. The rising expenses associated with conventional medical treatments for heart failure diagnosis have underscored the significance of developing diagnostic systems that utilize machine learning approaches. We employed various machine learning techniques on a heart failure dataset from the literature, specifically focusing on the survival status of heart failure patients. After employing the hold-out validation technique to prevent overfitting, the survival status of patients was evaluated by utilizing GridSearchCV hyperparameter optimization on classifiers such as Naive Bayes, Support Vector Machines, Decision Trees, Random Forests, K-Nearest Neighbor, Discriminant Analysis, and Extreme Gradient Boosting. Our performance measurements show that Decision Trees, Support Vector Machines, and Naive Bayes are the most prominent algorithms for the relevant data. While Decision Tree has the highest accuracy, Support Vector Machines gives the lowest false negative rate, a crucial metric in medical decisions, and Naive Bayes yields very similar results to Support Vector Machines. The Decision Tree model is straightforward, effective, and easy to use, since it needs minimal pre-processing and does not require input scaling, whereas Support Vector Machines and Naive Bayes, which have low false negative rates, may help decision-making by reducing the risk of misdiagnosis.
Keywords
Introduction
Heart Failure (HF) is a public health problem considered the most common and deadliest disease worldwide. In addition, the diagnosis of HF is a major challenge for less experienced physicians, as it is associated with many symptoms and many pathological features. In recent years, improvements in HF management provided by the development of decision support systems have increased survival by reducing the death rate. With the application of Machine Learning (ML) techniques as a key tool in decision support systems to detect HF through clinical data easily, it has been possible to improve the existing diagnosis and treatment methods while reducing costs.
Recent studies to predict HF through models developed via ML techniques are presented as follows 1 : developed a classification tree based on standard long-term heart rate variability for risk assessment in patients suffering from Congestive HF by using a dataset derived from two different Congestive Heart Failure databases with 85.4% accuracy. 2 developed a Clinical Decision Support system that generates outputs including a severity level for HF, a prediction for the type of HF, and an interface for management purposes. Some ML algorithms, such as Artificial Neural Network (ANN), Decision Tree (DT) and Random Forest (RF), were used to validate the data, and an accuracy of 87.6% was achieved. 3 constructed a model to perform a multi-level risk assessment and predicted developing HF using the C4.5 decision tree classifier with an accuracy of 86.5%. 4 evaluated the performance of the existing Seattle Heart Failure Model using electronic health records at Mayo Clinic and built an HF risk prediction model using ML techniques, incorporating new predictive markers. Logistic Regression (LR) is the best result for classifying, with an accuracy of 81%. 5 developed a novel prediction model for HF using Gated Recurrent Unit (GRU) deep learning techniques. In addition, the superiority of the proposed model was discussed by presenting a comparison with different ML methods such as Multi-Layer Perceptron (MLP), Recurrent Neural Network (RNN), Support Vector Machines (SVM). Area Under Curve (AUC) for the RNN model is calculated as 0.883. 6 used Long Short-Term Memory (LSTM)-based architecture with electronic health record (EHR) data to predict HF events. Their model outperformed LR, RF, and AdaBoost models, achieving an AUC of 68.27% and an F1-Score of 21.86% with 5-fold cross-validation. By integrating five different algorithms, 7 implemented a ML model that employed RapidMiner as a research tool and achieved a higher level of accuracy than MATLAB and Weka. 
SVM and DT models with performance levels above 85% are the best. Using various combinations of feature categories, 8 predicted patient survival and identified significant risk factors using ML classifiers. They compared the results with traditional biostatistics tests and found Serum Creatinine (SC) and Ejection Fraction (EF) to be the most important features. The authors then built ML models for survival prediction using only these two factors and the follow-up period. By using LR, they determined that the accuracy was 83.8%. 9 used RF, DT, KNN, SVM, Naive Bayes (NB), and ANN to predict the survival rate of patients suffering from HF. To tackle the issue of class imbalance in the data, various class balancing methods, such as resampling and the Synthetic Minority Oversampling Technique (SMOTE), were utilized. The study found that RF was the most successful algorithm, with 94.31% and 85.82% accuracy achieved with resampling and SMOTE, respectively. 10 aimed to increase the accuracy obtained in previous studies by performing Feature Selection (FS) and dealing with a class imbalance in the data set to predict the survival of HF patients. The best accuracy was achieved with RF at 77.72% without FS and 83.17% after FS. 11 studied the HF dataset to predict mortality using machine learning techniques such as One Rule, RF, SVM, MLP, and NB. The study used a correlation-based FS algorithm to select the most relevant features, which helped improve the predictions’ accuracy and reduce the cost. The results indicated that the MLP method was the most successful, achieving an accuracy rate of 78%. 12 used various ML algorithms to execute classification procedures, and success rates were displayed for assessing the mortality connected to heart failure. Several algorithms have been tested, with success rates ranging from 73% to 83%. The SVM algorithm offers the most effective classification method out of the tested algorithms. 
13 used ML techniques to diagnose HF in a dataset of 487 subjects from two institutions. They tested different classifiers and found an accuracy of 91.23%, a sensitivity (TPR) of 93.83%, and a specificity (TNR) of 89.62% for the analysis using all categories. 14 utilized various kernel functions of the SVM algorithm on the HF dataset, achieving the highest accuracy rate of 87%. 15 attempted to predict the survival of HF patients using classification algorithms such as DT, MLP, and SVM. The highest accuracy rate of 90% was achieved by MLP. 16 conducted a survival analysis for high-risk HF patients using the K-Nearest Neighbour (KNN), RF, and Extra Trees classifiers (ETC) while dealing with the class imbalance, and the ETC algorithm showed higher success with 84.58%. A study conducted by 17 used nine different classification models to predict the survival of HF patients according to the most relevant features. To address the issue of class imbalance, they applied SMOTE. The results showed that the ETC performed better than the other models, achieving an accuracy of 92.62% when used in combination with SMOTE. 18 selected only 3 of the 11 features in the HF data set using FS techniques and obtained an accuracy of 76.25% with the RF classifier on the 3 selected features. 19 used ML to predict patient survival from HF-related pathophysiological parameters and analyzed important risk factors using a correlation matrix. They tested various classifiers and found that Light GBM achieved the highest accuracy of 85% and an AUC of 93%. In 20, after the hyperparameters of the RF algorithm were determined, the binary particle swarm intelligence method was used, and a suitable model showing the risk of death with an accuracy of 79.66% was proposed. 21 utilized NB and DT classification algorithms on the data set that we used in our study and assessed their performance. 
The data were pre-processed, and the two features of the dataset, “age and gender”, were not included in the evaluation because of ignoring the disease detection. Of the 299 patients, 200 were used for training and 99 for testing. They achieved an accuracy of 86% with NB and an accuracy of 82% with DT. 22 has examined ANN, Extreme Gradient Boosting (XGBoost), and RF methods to predict the death or survival of patients with HF. The results showed that ANN was the most accurate method, with an accuracy of 86.67%. 23 proposed a SMOTE-based hybrid deep learning network with an accuracy of 95.52% for predicting the patient's survivors of HF. To extract the substantial features, Coati and Kepler optimization algorithms were used. 24 evaluated two HF survival prediction models using a dataset comprising 299 HF patients. While the first model employs survival analysis, using time and death features as targets, the second model employs a classification task to forecast death. The prediction models were the XGBoost model for the survival analysis and the RF for the classification approach, with a c-index of 0.714 and an accuracy of 0.74, respectively. 25 proposed an XGBoost model with a Brier score of 87.9% and an F1 score of 87.4% to diagnose Acute HF (AHF). They identified significant risk factors for the diagnosis using the Least Absolute Shrinkage and Selection Operator (LASSO) feature selection method. 26 applied a range of ML techniques and deep neural network (DNN) classifiers for congestive HF diagnosis. They obtained the best model, DNN, with an accuracy of 95.3%. They used the KNN method for the imputation of missing data and the C4.5 technique to identify relevant features and remove the outlier data from the dataset. 27 used popular ML techniques to build models that predict AHF patients’ in-hospital mortality. XGBoost had the top ML model with an accuracy of 89.7%. The LASSO method preprocessed the data to identify important features from the database. 
28 evaluated the performance of ten ML algorithms to predict the risk of mortality and rehospitalization in HF patients. The recursive feature elimination (RFE) method was used for feature selection, while the SMOTE method addressed the imbalanced data. The best ML algorithm, CatBoost, had the highest F1 score of 91%. 29 aimed to identify the best ML techniques for the prediction of mortality in patients suffering from HF using a database containing clinical information on 299 HF patients. They used the MinMax data normalization approach to rescale the feature values. The RF achieved the best results with an accuracy of 92%.
In this paper, we analyzed a dataset of medical records collected from 299 patients with heart failure, in which each patient profile contains 11 clinical features recorded during follow-up. We ranked the features corresponding to the most important risk factors by using feature importance. SC and EF were found to be significant risk factors for mortality among heart failure patients. Since the follow-up period of HF patients cannot be directly associated with survival status, the time parameter is removed from the model. After implementing hold-out validation to prevent overfitting, we applied NB, SVM, DT, RF, KNN, XGBoost, and LDA with hyperparameter optimization to predict the patient's survival. This study reports accuracy, TPR, and FNR because of their importance for medical datasets. The performance metrics suggest that DT, SVM, and NB are appropriate algorithms for the relevant data. DT has 81% accuracy with 10% FNR and 90% TPR. SVM has lower accuracy (69%) but higher TPR (98%) and lower FNR (2.5%). Like SVM, NB has a lower accuracy (71%) but a high TPR (95%) and a low FNR (5%). Based on the analysis of the relevant data, it was concluded that DT, SVM, and NB algorithms may help physicians make the proper decisions.
Finally, the motivations and contributions of this paper can be presented as follows:
This paper specifically addresses the enhancement of the efficiency of frequently employed machine learning algorithms through the utilization of hyperparameter tuning. Foreseeing that all variables that have the potential to affect the output can contribute to the predictive power, we aimed to obtain a more informative prediction model by using all features except the time feature. Utilizing data on HF from existing literature, this paper highlighted the significance of excluding the time feature in the development of the prediction model. Based on the analysis of the relevant data, it is determined that DT, SVM, and NB algorithms have the potential to assist physicians in making accurate decisions.
The remaining sections of this paper are organized as follows: Section 2 contains a detailed description of the proposed method. Section 3 includes results and discussions. The conclusion is written in Section 4.
Methodology
In this section, data preprocessing, hyperparameter optimization, and validation techniques used to classify the HF dataset are presented.
Data
Analysis of a data set is done using a set of techniques to understand the data and detect patterns. Analysis of datasets can help determine what the data means, how accurate it is, and what results it can produce. To understand the data, it is necessary to know what each variable in the data set means, how the data is collected, the special cases in the data set, and the structure of the variables. To clean the data, it is necessary to detect and correct or remove missing data, incorrect data, and outliers in the data set. This helps to increase the accuracy of the dataset and makes the analysis results more reliable. Data visualization can visually show the distribution and patterns of variables in the data set. It is also essential to understand the relationship between different variables in the data set. Different statistical models can be used to understand the relationship between variables in the data set and to predict future behavior. This stage is important to finalize the analysis of the data.
The dataset used in this paper includes medical records of 299 patients with heart failure retrieved from Allied Hospital in Faisalabad (Punjab, Pakistan) in 2015. 30 The dataset consists of 11 clinical features, the duration of follow-up, and the death event. It includes demographic characteristics such as age, gender, and smoking, along with clinical data obtained from laboratory test results such as blood enzymes, sodium (Na), Creatinine Phosphokinase (CPK), High Blood Pressure (HBP), and platelet levels. The features mentioned in the data set and a visualization of the numerical features are presented in Table 1 and Figure 1, respectively.
Dataset features and descriptions.

Histogram distributions and density functions of numerical features in the data set.
When Table 1 is examined, it is seen that the follow-up time of the patients ranges from 4 to 285 days. While the target feature indicates whether the patient died during the follow-up period, the time feature represents the number of days until death or, in the case of survival, the period during which the patient was followed. In the correlation matrix, a negative correlation of −0.53 is observed between time and DEATH_EVENT: the longer the follow-up time, the lower the death rate tends to be. Since the time feature refers to the duration of patient follow-up, it encodes information about the outcome itself and therefore biases the evaluation. In addition, a patient's time value cannot be known in advance in the real world; it can only be known retrospectively. Using features that contain future information during model training may be misleading with respect to the actual use of the model and may also cause overfitting, so it is important to exclude features that could cause such data leakage. To evaluate the real-life performance of the model accurately, it is therefore appropriate not to use the time feature, which explains the target variable quite well and may cause data leakage, in the machine learning model.16,20,24 Consequently, we decided to exclude the time feature as an input.
Well-distributed features can have a positive impact on the performance of models, as they help the model generalize better to unseen data. Figure 1 displays the distributions of the numerical features. The distribution curves of the platelet, serum sodium, and serum creatinine variables in Figure 1 closely resemble a normal distribution.
The data preprocessing phase includes modules such as scaling and missing-data imputation. Scaling adjusts the variables in a dataset to a common scale, which can lead to better model performance and more consistent results. The most commonly used scaling methods are standardization and normalization. Standardization rescales each feature using its mean and variance. Normalization transforms each feature vector in the data set, usually into the range [0, 1] or to a particular norm. The most common normalization approach is min-max normalization. 31 For the data handled in this paper, both normalization and standardization were investigated, but it was determined that the data distribution does not necessitate any scaling. Only the platelet feature in Table 1 is divided by 1000 so that its values comply with the units (cells/mL) given in the feature description. Moreover, no data imputation was required as there were no missing values in the data set.
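The two scaling methods described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the sodium readings are hypothetical and are not taken from the dataset or from the authors' code.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical serum sodium readings, for illustration only.
sodium = [130.0, 134.0, 137.0, 140.0, 144.0]
z_scores = standardize(sodium)
scaled = min_max_normalize(sodium)
```

Tree-based models such as DT and RF are insensitive to monotonic rescaling of the inputs, whereas distance-based or margin-based models such as KNN and SVM generally are not.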
To see how the features are correlated, it is useful to look at the heatmap, a matrix that shows the Pearson correlation coefficient of each pair of features. As seen from the heatmap given in Figure 2, the highest correlation with the target variable is that of SC, with a value of 0.29. EF is negatively correlated with the target variable, with a value of −0.27. These are the strongest linear associations between the input variables and the target variable, meaning that the target tends to change together with these inputs, either in the same direction or in the opposite direction. Although a high correlation can often increase a model's predictive ability, correlation does not imply causation; that is, a high correlation does not require a causal relationship between the correlated variables. Therefore, other analyses must confirm the findings if the correlation is high. With this approach, we estimated the target variable using all the features in our models.

Heatmap.
Upon reviewing Figure 2, the correlations between the features are weak at best. While the categorical features have less correlation with the target variable, the numerical features age, EF, and serum sodium are moderately correlated with the target variable, with correlation coefficients of 0.25, −0.27, and −0.20, respectively, and serum creatinine has the highest at 0.29. What can be inferred here is which features may have higher predictive power: SC, EF, age, and serum sodium appear to be the strongest candidates. We provide an explanation for this inference in the upcoming section on important features.
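Each cell of such a heatmap is a Pearson correlation coefficient. A minimal sketch of the computation is given below; the ejection fraction values and outcomes are made up for illustration and do not reproduce the coefficients reported in Figure 2.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical ejection fraction (%) and death event (1 = died) values.
ejection_fraction = [20.0, 25.0, 30.0, 38.0, 45.0, 60.0]
death_event = [1, 1, 1, 0, 0, 0]

r = pearson(ejection_fraction, death_event)  # negative: lower EF, more deaths
```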
ML is a subfield of artificial intelligence that enables computer systems to automatically analyze and learn from data and make better decisions for the future. This allows computers to be trained to learn a task and then perform similar tasks more accurately.
ML uses different learning methods, such as supervised learning, unsupervised learning, and reinforcement learning. Supervised learning works with labeled data: the model learns from the labels and uses what it has learned to reach a conclusion when analyzing new data. Unsupervised learning works by identifying patterns among data without labels. Reinforcement learning involves using trial-and-error methods to learn how best to perform a task.
Model selection is the choice of an ML model according to the characteristics, size, and nature of the data. In this step, the most suitable of many different model options, such as classification, regression, clustering, and dimensionality reduction, is selected. After the model selection, the inputs and targets are defined, and the training phase, which can include the adjustment of hyperparameters, is carried out. After the training, the validation step is performed to assess the model's performance. ML can be used in many application areas, such as image processing, natural language processing, spam filtering, voice recognition, recommendation systems, medical diagnostics, and financial analysis.
We analyzed our dataset and conducted initial evaluations to choose the most acceptable model relative to the nature of the problem. We identified the most commonly employed ML methods in healthcare risk prediction that have extensive literature.32,33 Consequently, this study focused on the NB, SVM, DT, RF, KNN, LDA, and XGBoost algorithms. We then applied data preprocessing and feature engineering techniques appropriate to these algorithms.
Linear discriminant analysis
Linear Discriminant Analysis (LDA) is a supervised classification technique used to identify a linear combination of features that can effectively separate two or more classes of data. The goal of LDA is to reduce the dimensionality of the data while preserving the discriminatory information between the classes.
In LDA, the data is projected onto a lower-dimensional space in such a way that the distance between the means of the different classes is maximized while the variance within each class is minimized. This is achieved by computing the eigenvectors of the scatter matrix, which captures the relationships between the features and the classes.
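For the two-class case, the projection described above is commonly written as the Fisher criterion. The notation below is the standard textbook form and is an assumption here, since the source gives no formulas for LDA: $w$ is the projection direction, $\mu_0$ and $\mu_1$ are the class means, and $S_B$ and $S_W$ are the between-class and within-class scatter matrices.

```latex
J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w},
\qquad
S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^{\top},
\qquad
S_W = \sum_{c \in \{0, 1\}} \sum_{x \in \mathcal{D}_c} (x - \mu_c)(x - \mu_c)^{\top}

% Maximizing J(w) leads to the generalized eigenvalue problem
% S_W^{-1} S_B \, w = \lambda \, w, whose solution in the two-class case is
w^{*} \propto S_W^{-1} (\mu_1 - \mu_0)
```

The numerator grows with the distance between the projected class means and the denominator with the projected within-class variance, which matches the verbal description of the objective.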
Naïve Bayes
Naive Bayes (NB) is a classification algorithm that is based on Bayes’ theorem, which describes the probability of an event occurring given some prior knowledge or evidence. NB assumes that the features (i.e., predictor variables) used for classification are independent of each other, which is a simplifying assumption that allows for fast and efficient classification.
In NB, the probability of a particular class given a set of features is calculated by multiplying the prior probability of the class (i.e., how likely it is to belong to that class) by the likelihood of the features given that class (i.e., how strongly the features are associated with that class). The algorithm then chooses the class with the highest probability as the predicted class for the given set of features. NB is widely used in text classification tasks, such as spam filtering and sentiment analysis, because it is fast and requires relatively little training data.
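The prior-times-likelihood rule above can be sketched with a Gaussian likelihood for each feature. The Gaussian choice, the function names, and the toy values are assumptions made for illustration; the source does not state which NB variant was used.

```python
from math import exp, pi, sqrt
from statistics import mean, pvariance


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def fit(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        prior = len(rows) / len(X)
        # The small constant avoids division by zero for constant features.
        stats = [(mean(col), pvariance(col) + 1e-9) for col in zip(*rows)]
        model[label] = (prior, stats)
    return model


def predict(model, x):
    """Return the class with the largest prior * product of likelihoods."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = prior
        for value, (mu, var) in zip(x, stats):
            score *= gaussian_pdf(value, mu, var)  # independence assumption
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical [serum creatinine, ejection fraction] rows; 1 = died.
X = [[1.0, 60.0], [1.1, 55.0], [0.9, 50.0], [2.5, 20.0], [3.0, 25.0], [2.8, 30.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit(X, y)
```

In practice the product is usually replaced by a sum of log-probabilities to avoid numerical underflow when there are many features.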
Decision tree
Decision Trees (DTs) are a popular and widely used supervised learning method in ML that can be used for classification and regression problems. DTs are constructed through a recursive partitioning of the feature space that maps input features to predicted target labels. The resulting tree structure is a sequence of decision nodes, where each node tests the value of a particular feature and branches into different subtrees based on the outcome of the test. The leaf nodes of the tree represent the final predicted target labels. DTs can handle both categorical and continuous input features and are relatively easy to interpret, making them popular in various domains. They are also useful for FS and can be combined with ensemble methods such as random forests to improve accuracy and reduce overfitting. One of the most important advantages of this method is that the results can be used easily, even by users who do not have any software knowledge, since it can be visualized through flowcharts.
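There are several DT classification (DTC) algorithms for determining the best splitting condition at each node. These algorithms aim to yield the best overall performance by maximizing the information gain at each node of the tree. The most commonly used approach utilizes the entropy of the set in (1) and selects the splitting criterion for which the information gain in (2) is maximal: 34
H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i    (1)
IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v)    (2)
Here S is the set of samples at the node, c is the number of classes, p_i is the proportion of samples in S that belong to class i, A is the candidate feature, and S_v is the subset of S for which A takes the value v.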
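The entropy and information gain computations referred to above can be sketched as follows, using the standard definitions for a categorical split. The helper names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the parent set minus the weighted entropy of its subsets."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = [1, 1, 0, 0]
perfect_split = ["low", "low", "high", "high"]  # separates the classes fully
useless_split = ["a", "b", "a", "b"]            # leaves both subsets mixed

gain_perfect = information_gain(perfect_split, labels)
gain_useless = information_gain(useless_split, labels)
```

A tree-growing algorithm evaluates this gain for every candidate split at a node and keeps the split with the largest value.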
The maximum depth of a DT (
Random forest
To address classification and regression, Random Forest (RF) employs ensemble learning, a method that integrates numerous classifiers to offer answers to challenging problems. The RF algorithm uses many decision trees and trains the “forest” it creates via bagging (bootstrap aggregating). Bagging is an ensemble meta-algorithm that increases the precision of ML techniques and reduces overfitting. The RF algorithm determines its outcome from the predictions of the individual DTs: it makes estimations by averaging the results of the different trees and becomes more precise as the number of trees grows. The main distinction between the RF and DT algorithms is that RF forms root nodes and splits nodes randomly. 35 RF yields highly effective results when applied to data sets with many variables, class labels, categorical variables, missing data, or an imbalanced distribution. 36
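The two ingredients of bagging, bootstrap resampling and aggregation of the per-tree predictions, can be sketched in a few lines. The function names are hypothetical, the tree training itself is omitted, and majority voting is used as the classification counterpart of averaging.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the resampling step of bagging)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine the class predictions of the individual trees into one label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)                       # fixed seed for reproducibility
patients = list(range(10))                   # stand-in for training rows
sample = bootstrap_sample(patients, rng)     # training set for one tree

tree_votes = [1, 0, 1, 1, 0]                 # hypothetical per-tree outputs
forest_prediction = majority_vote(tree_votes)
```

Because each tree sees a different resample, and typically a random subset of features at each split, the trees make partly independent errors that the vote tends to cancel out.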
Extreme gradient boosting
Extreme Gradient Boosting (XGBoost) is an important algorithm in the field of supervised learning that provides high performance in classification, regression, and ranking tasks. It applies a process called boosting to produce accurate models. 37 Boosting trees in the XGBoost algorithm are divided into regression and classification trees. The essence of this algorithm is the optimization of the objective function value. 38 At each iteration, XGBoost trains a new tree on the errors of the ensemble built in the previous steps, so that samples that were misclassified (or, in regression, predicted with a high error) have the largest influence on the next tree. The most important feature of the XGBoost algorithm is its scalability in all scenarios. The system runs more than 10 times faster than existing popular solutions on a single machine and scales to billions of instances in distributed or memory-limited settings. 39
K-Nearest neighbor
The K-Nearest Neighbor (KNN) algorithm, which is a distance-based ML algorithm, classifies data based on the closest training samples in the feature space. 40 For an unknown sample, the algorithm finds its k nearest neighbors among the known samples and assigns the class label that is most common among them. The value of k is determined according to the size of the data. Since KNN is a non-parametric algorithm, that is, it does not make any assumptions about the distribution of the data, it is a flexible algorithm that can be used in various applications. However, the number k must be supplied by the user, and memory usage is high depending on the size of the data.
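A minimal sketch of this procedure with the Euclidean distance is shown below. The function name, the choice k = 3, and the two-feature toy points are assumptions for illustration only.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature training points forming two well-separated groups.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]

label_a = knn_predict(train_X, train_y, (1.1, 1.0))
label_b = knn_predict(train_X, train_y, (3.0, 3.0))
```

Since the prediction depends directly on distances, features measured on very different scales are usually rescaled before KNN is applied.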
Support vector machine
Support Vector Machine (SVM) separates data into two or more classes by means of a decision boundary: a line in two-dimensional space, a plane in three-dimensional space, and a hyperplane in multi-dimensional space. 41 The method, which is frequently used for linearly separable classes, is also used successfully for the classification of non-linear data: thanks to kernel functions, an input space that cannot be separated linearly is mapped to a higher-dimensional space in which the classes become linearly separable.
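The idea of moving the data to a higher-dimensional space can be illustrated with an explicit feature map. This is a toy sketch with made-up points: a real kernel function computes the required inner products implicitly instead of building the mapped features, and the separating threshold below is chosen by hand rather than learned.

```python
# One-dimensional points: class 1 lies in the middle and class 0 on both
# sides, so no single threshold on x can separate the two classes.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
ys = [0, 0, 1, 1, 1, 0, 0]


def feature_map(x):
    """Map a scalar to the two-dimensional point (x, x^2)."""
    return (x, x * x)


mapped = [feature_map(x) for x in xs]

# In the mapped space the horizontal line x2 = 1.0 separates the classes.
predictions = [1 if x2 < 1.0 else 0 for _, x2 in mapped]
```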
Hyperparameter optimization
An ML model's performance can be enhanced by using a technique called hyperparameter optimization (HO). There are parameters such as learning rate, epoch numbers, and mini-batch size in ML models that are not discovered by the learning algorithms themselves. These settings, known as hyperparameters, govern how the model learns and adjusts to the data. 42
In the literature, there exist hyperparameter optimization techniques such as grid search, random search, Bayesian optimization, babysitting, and metaheuristic algorithms. When the hyperparameter configuration space is small, grid search is considered an effective HO method. This technique evaluates every combination of hyperparameter values that falls within a specified range. The hyperparameters that produce the best outcomes are chosen after all possible combinations have been tested. Hyperparameter tuning can help the model perform better and reduce overfitting. 43
Grid search is a helpful tool for data scientists seeking recommendations for configuration parameters for particular algorithms. It works by attempting all possible combinations of the relevant parameters. We chose grid search because it finds the best hyperparameter values automatically rather than through manual trial and error. Furthermore, evaluating multiple parameter combinations simultaneously speeds up the process and enables more effective use of computational power. 44 The hyperparameter values supplied to the grid search estimator to obtain the optimal parameters for the selected algorithms are displayed in Table 2.
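A grid search of this kind can be sketched with scikit-learn's GridSearchCV. Everything specific in the example is an assumption made for illustration: the data are synthetic, the grid values are not those of Table 2, and the 5-fold internal cross-validation and the accuracy scoring are choices of this sketch rather than settings reported in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the dataset (299 rows, 11 features).
X, y = make_classification(n_samples=299, n_features=11, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical grid; every combination (4 x 2 = 8) is fitted and scored.
param_grid = {"max_depth": [2, 3, 4, 5], "criterion": ["gini", "entropy"]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=5,  # assumption: 5-fold cross-validation inside the training set
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # evaluated on the held-out set
```

The held-out test set is not used during the search, so the final score remains an estimate of performance on unseen data.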
Hyperparameters and their values.
Model validation is required to evaluate the results produced by the models more accurately. The results produced by a model built on the training set can be validated by various methods, such as hold-out validation and k-fold cross-validation. In hold-out validation, the original data set is divided into a training set and a test set, commonly in proportions of 2/3 and 1/3. While the model is built with the training set, the performance is measured with the test set. 45 Another validation method is k-fold cross-validation, an efficient and statistically supported validation technique that may be applied to smaller datasets.
This approach uses one of the k subsets for testing while using the other subsets for training. 46 The initial dataset is partitioned into k distinct subsets in the cross-validation procedure. The value of the parameter k is dependent on the circumstances and the expert's evaluation and is not precisely defined.
Both techniques have their advantages and disadvantages, and the choice between them depends on the specific circumstances and goals of the analysis. Cross-validation can use the entire data set as a test set so that imbalances between test and train may not be considered, but in hold-out validation, as the training and test data are processed separately, the balance is considered. In this paper, we prefer the hold-out validation method since underfitting and overfitting status are better observed for the prediction of survival in patients with HF.
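The partitioning used by k-fold cross-validation can be sketched as an index generator. The function name and the choice k = 5 are hypothetical; shuffling and stratification are left out to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 299 samples, as in the dataset, split into 5 folds.
folds = list(k_fold_indices(299, 5))
test_sizes = [len(test_idx) for _, test_idx in folds]
```

Hold-out validation corresponds to using a single such split, with the test part set aside once and never used for training.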
Model performance evaluation
Model performance refers to how well a model performs its intended task. The effectiveness of a machine learning model can be assessed in various ways. A way to test classification accuracy is the confusion matrix, where true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are specified as the principal components of the matrix. Some of the performance metrics that can be acquired using the confusion matrix are correct classification (accuracy), true positive rate (TPR), and false negative rate (FNR).
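The three metrics named above follow directly from the confusion matrix counts: accuracy = (TP + TN) / (TP + TN + FP + FN), TPR = TP / (TP + FN), and FNR = FN / (TP + FN). A minimal sketch is given below; the function name and the example labels are hypothetical, with 1 taken as the positive class.

```python
def confusion_metrics(y_true, y_pred, positive=1):
    """Return accuracy, true positive rate, and false negative rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)  # share of actual positives that were detected
    fnr = fn / (tp + fn)  # share of actual positives that were missed
    return accuracy, tpr, fnr


# Hypothetical labels: TP = 3, FN = 1, FP = 1, TN = 5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
accuracy, tpr, fnr = confusion_metrics(y_true, y_pred)
```

TPR and FNR always sum to one, so a low FNR and a high TPR express the same property of a classifier.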
Results and Discussion
Data preparation and feature selection
The most informative features are SC and EF, as shown by the unnormalized plot of the feature importances in Figure 3. The analysis in 8 was conducted with only these two features, which is consistent with our analysis.

Feature importances.
In machine learning, splitting data into training and testing datasets is a fundamental step that significantly affects the performance and generalization of models. Stratification, that is, splitting the data so that the class proportions are preserved in each subset, is crucial for various reasons, including the reduction of bias, improved model evaluation, enhanced generalization, support for imbalanced datasets, and the provision of consistent and trustworthy evaluation measures. Thus, well-stratified training and testing datasets ensure that all categories within the data are proportionally represented in both sets.
In this paper, we used the train_test_split method of the Scikit-Learn library, which is very popular for data science in Python. 47 After analyzing different seed values, the value of 42 was chosen since it results in the smallest discrepancy between the train and test sets. Apart from the seed value, the stratify parameter is used to preserve class proportions when splitting the dataset: when it is set, the class proportions in each subset are the same as those in the original dataset. In this study, the target variable we want to predict (death_event) was passed to the stratify parameter so that the ratio of its two classes (1 and 0) is preserved. The aim is to obtain more reliable results by keeping the same class ratio in both sets while training and testing the model.
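A minimal sketch of such a stratified hold-out split is given below. It uses synthetic stand-in data rather than the HF dataset: the feature values and the class ratio (100 positive and 199 negative samples) are assumptions for illustration, while the sample count, the 80%/20% split, the seed value of 42, and the stratify argument follow the description in this paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: 299 samples, 11 features.
X = rng.normal(size=(299, 11))
# Hypothetical class ratio for the binary target (stand-in for death_event).
y = np.array([1] * 100 + [0] * 199)

# Stratified 80/20 hold-out split with seed 42.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, the share of class 1 is nearly identical in each subset.
print("class-1 share, full :", round(y.mean(), 4))
print("class-1 share, train:", round(y_train.mean(), 4))
print("class-1 share, test :", round(y_test.mean(), 4))
```

Repeating the split with several random_state values and comparing the resulting train and test sets is one way a seed could be selected; the comparison criterion itself is left open in this sketch.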
In this paper, the data is split into 80% and 20% as the train and test sets, respectively. Each subplot in Figure 4 shows the distribution of one feature of our dataset in the train and test sets. The aim is to check whether the distributions of the individual features in the two sets are similar. The word “proportion” on the ordinate axis refers to the share of samples in the train or test set that take the given feature value or fall in the given feature range. For example, for the “age” feature it is the ratio of individuals in a certain age group to the total number of individuals in that set. If the train and test curves overlap or are very close to each other, the feature distribution in the two sets is very similar. The similarity of the curves in Figure 4 indicates that the two sets are balanced for learning and evaluation: the model works with the same type of data in both sets, so when tested it can recognize the patterns it has learned from the training data.
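The kind of comparison shown in the distribution plots can be approximated numerically, as in the hedged sketch below. The "age" values and the target are synthetic, and the number of bins (8) and the helper name proportions are illustrative assumptions; only the split settings follow the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "age" feature and binary target (not the HF dataset).
age = rng.normal(loc=60, scale=12, size=299)
target = rng.integers(0, 2, size=299)

age_train, age_test = train_test_split(
    age, test_size=0.2, random_state=42, stratify=target
)


def proportions(values, bin_edges):
    """Share of samples falling in each bin (sums to 1)."""
    counts, _ = np.histogram(values, bins=bin_edges)
    return counts / counts.sum()


# Shared bin edges so that train and test proportions are comparable.
bin_edges = np.histogram_bin_edges(age, bins=8)
p_train = proportions(age_train, bin_edges)
p_test = proportions(age_test, bin_edges)

# Largest per-bin gap between the two sets; small values mean similar shapes.
max_gap = np.abs(p_train - p_test).max()
print("train proportions:", np.round(p_train, 3))
print("test proportions :", np.round(p_test, 3))
print("largest per-bin gap:", round(float(max_gap), 3))
```

A single number such as the largest per-bin gap could also serve as a simple discrepancy score when comparing candidate seed values.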

Distribution of train and test dataset.
The closeness of the performance metric scores on the train and test data indicates whether an ML model overfits or underfits. Table 3 shows that overfitting or underfitting did not occur in any of our models.
Performance metrics of the selected ML techniques with the best hyperparameter values.
Python's Scikit-Learn provides an effective way to perform the grid search method for optimizing hyperparameters on each considered classifier. We perform grid-based hyperparameter tuning to optimize the models in our paper.
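A minimal sketch of grid-based tuning with GridSearchCV is shown below for a decision tree. The synthetic data, the parameter grid, and the 5-fold inner cross-validation are assumptions for illustration; the grids actually searched for each classifier are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 299 samples, 11 features, imbalanced classes.
X, y = make_classification(
    n_samples=299, n_features=11, n_informative=4,
    weights=[0.68, 0.32], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical grid; every combination is evaluated by cross-validation.
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)  # tuning uses the training set only

# Comparing train and test accuracy of the refit best model gives a
# simple check for overfitting or underfitting.
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print("best hyperparameters:", search.best_params_)
print(f"train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

Keeping the test set out of the search means the reported test accuracy is not influenced by the hyperparameter selection.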
Since HF is a common and severe disease with potentially fatal outcomes, caution is needed when selecting a model that forecasts a patient's survival outcome from existing clinical data. When assessing a model used to identify a severe disease, it is important to examine not just its accuracy but also other performance metrics, such as the True Positive Rate (TPR) and False Negative Rate (FNR). Each of these metrics captures a different aspect of the model and offers crucial information for the decision-making process. TPR shows the proportion of people with the disease who are predicted correctly. When diagnosing heart failure, this rate should be as high as possible. FNR shows the proportion of people who are truly sick but are mispredicted by the model (classified as false negatives). Failing to recognize a patient's state can lead to serious health consequences, perhaps resulting in death, so it is extremely important that this rate be as low as possible.
Table 3 displays the classification performance results achieved by utilizing all features except time and identifying the best hyperparameters.
According to the performance metrics presented in Table 3, DT, SVM, and NB can be considered prominent algorithms for the corresponding data. Although DT has the highest accuracy (81%), it has a lower TPR (90%) and a higher FNR (10%) than SVM and NB. The DT model therefore produces more false negatives, which may cause the status of some patients to be missed. When predicting the survival status of a life-threatening disease like heart failure, SVM can minimize the risk of misdiagnosis through its higher TPR and lower FNR, which is critical for early diagnosis and treatment. While SVM offers lower accuracy (69%), it offers a higher TPR (98%) and a lower FNR (2.5%). Similarly, NB has an accuracy of 71% but a higher TPR (95%) and a lower FNR (5%). These performance scores indicate that SVM and NB can accurately predict the majority of patients with HF. These predictions can enable doctors to take the necessary precautions to treat patients.
Given that the results of the DT algorithm are visually interpretable, it is further elaborated in the next subsection.
The tree was further examined using cross-validation to determine the maximum depth that gives the most accurate tree and hence the best flowchart. The maximum depth is determined by evaluating 10-fold cross-validation for depth values from 1 to 9 (see Figure 5). Depths 3 and 4 emerge as the candidates for the best tree structure. When the whole dataset is utilized for training and testing, the accuracy of the DTCs with maximum depths of 3 and 4 is 82.6% and 77.9%, respectively. When the confusion matrices illustrated in Figure 6 are also considered,

Hyperparameter tuning for DT.
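The depth search described above, 10-fold cross-validation over maximum depths from 1 to 9, can be sketched as follows. The data are synthetic, so the depth selected here says nothing about the HF dataset; the seed values and the use of mean accuracy as the selection criterion are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 299 samples, 11 features.
X, y = make_classification(
    n_samples=299, n_features=11, n_informative=4, random_state=0
)

depths = list(range(1, 10))  # candidate maximum depths 1..9
mean_scores = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # 10-fold cross-validated accuracy for this depth.
    scores = cross_val_score(tree, X, y, cv=10, scoring="accuracy")
    mean_scores.append(scores.mean())

best_depth = depths[int(np.argmax(mean_scores))]
for depth, score in zip(depths, mean_scores):
    print(f"max_depth={depth}: mean CV accuracy={score:.3f}")
print("depth with the highest mean CV accuracy:", best_depth)
```

When two depths score similarly, as with the two candidates discussed in the text, the shallower tree is often preferred because the resulting flowchart is easier to read.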

Confusion matrix for the tree with maximum depth of (a) 3 and (b) 4.
The flow chart created according to

Flowchart obtained from DTC for the determination of survival of a patient.
Model characteristics and performances of research in the literature that utilize the same data set as our paper are presented in Table 4. The findings of Table 4 can be presented as follows:
While studies 8,10,11,15,18,20,21,24 applied ML algorithms to a few features selected through various FS algorithms, studies 5,6,12,18,24,29,30,45 used almost all features. Depending on the method and the specific context of the problem, reducing the number of features when building a prediction model can be either good or bad practice. Removing irrelevant or redundant features reduces overfitting and computational complexity and focuses the model on the most meaningful features, which can enhance its performance and interpretability. However, removing features without careful consideration can discard important information and degrade model performance. Omitting features because of biases or incorrect assumptions can produce biased predictions or erroneous conclusions, and reducing the number of features without regard to their relevance to the prediction task can cause underfitting, where the model fails to capture the underlying patterns in the data, leading to poor predictive performance and limited generalization ability. For these reasons, and on the expectation that every variable with the potential to affect the output could contribute to predictive power, our model considers all features except the time feature in order to obtain a more informative predictive model.
Studies 12,19,22,29 considered the accuracy of only the training set while also taking into account the time feature. Since our study uses the test set as well as the train set, we can estimate how our model will perform on new data. This distinction is a significant factor that sets our study apart from these articles.
Studies 4,9,14,17 improved the accuracy by generating synthetic data to address the issue of imbalanced classes in the dataset. In our study, the existing data was instead distributed between the train and test sets in the most balanced way through HO performed with the train_test_split method. In this way, the class ratio is kept consistent between the two sets without generating synthetic data.
Model performance analysis using the UCI Dataset.
Consequently, it has been observed that our ML models, in which we enhanced the explanatory power by incorporating all features except for the time feature, exhibit strong performance and possess the capacity to identify patients more accurately. Additionally, our DT model outperformed our other models, achieving a train accuracy of 81.17% and a test accuracy of 78.33%. The DT model not only presents the data visually through a tree structure; its simplicity and usefulness can also help guide clinicians toward proper decisions.
Focusing on the survival status of HF patients, our study applies ML techniques to a dataset of 299 HF patients with 11 clinical features per patient profile. The data are split into train and test sets of 80% and 20%, respectively, using the train_test_split method with a seed value of 42, which provides the smallest discrepancy between the two sets. Naive Bayes, Support Vector Machines, Decision Trees, Random Forests, K-Nearest Neighbor, Discriminant Analysis, and Extreme Gradient Boosting are employed with GridSearchCV hyperparameter optimization. Recognizing the importance of multiple performance criteria when working with a medical dataset, this study evaluated accuracy, TPR, and FNR. That the performance measures cannot be combined with physicians’ experience can be presented as a limitation. The performance metrics indicate that DT, SVM, and NB are noteworthy choices for the data. DT has the highest accuracy (81%) but a lower TPR (90%) and a higher FNR (10%), so it generates more false negatives and may overlook certain patients’ diagnoses. SVM has lower accuracy (69%) but a higher TPR (98%) and a lower FNR (2.5%), so it can reduce patient misdiagnosis when predicting heart failure survival, which is essential for early diagnosis and treatment. NB, with an accuracy of 71%, a TPR of 95%, and an FNR of 5%, behaves similarly to SVM. As a result, DT is an easy-to-use method because it provides a visual platform for physicians, whereas SVM and NB reduce the risk of a misdiagnosis. Based on these evaluations, the opportunities offered by these techniques will encourage physicians to make more appropriate decisions.
As is common with all artificial intelligence applications, this study is limited by the range and characteristics of the data used. Moreover, since the data include patients from only one country, genetic and demographic characteristics may limit the generalizability and applicability of the proposed models. Therefore, it is recommended to use the models with caution. Investigating the effects of performance metrics on the results of established ML models using various optimization techniques, and selecting a suitable model accordingly, can be presented as future work.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
