Abstract
Background:
Heart disease remains one of the leading causes of mortality worldwide, highlighting the need for early and accurate diagnosis to support effective prevention and treatment strategies.
Methods:
This study presents a machine-learning-based approach for predicting heart disease using clinical and demographic data from a publicly available dataset. Four widely used classification algorithms—Logistic Regression, Random Forest, K-Nearest Neighbors (KNN), and Decision Trees—were evaluated to identify the most effective predictive model. The dataset underwent comprehensive preprocessing, including handling missing values, categorical encoding, and feature normalization, to enhance data quality and model robustness. Model performance was assessed using accuracy, precision, recall, and AUC-ROC metrics.
Results:
Findings show that hyperparameter-optimized models, particularly Random Forest and KNN, demonstrated strong predictive performance. Explainability techniques, specifically SHapley Additive exPlanations (SHAP), were incorporated to improve interpretability, transparency, and clinical trust. SHAP values were used to analyze feature importance and provide explanations for individual predictions.
Conclusion:
The results underscore the potential of interpretable machine-learning models as valuable tools for early diagnosis, risk stratification, and clinical decision support. Future research should employ larger datasets and further investigate real-time predictive applications to enhance the generalizability and clinical utility of these models.
Keywords
Introduction
Heart disease remains a critical global health issue, responsible for significant morbidity and mortality worldwide. Early and accurate diagnosis is essential to improve patient outcomes through timely medical intervention and preventive strategies. 1 Although effective, conventional diagnostic approaches often depend heavily on clinical expertise and comprehensive medical evaluations, which can contribute to delayed or missed diagnoses, particularly in resource-limited settings.2,3
Advances in machine learning (ML) and predictive modeling provide promising tools to support early detection and risk assessment of heart disease. 4 ML algorithms can process extensive clinical and demographic datasets, identify complex patterns, and generate predictive insights that assist healthcare professionals and clinicians in decision-making. However, the successful clinical adoption of ML requires not only high predictive performance but also interpretability and transparency, which are essential for clinician trust and appropriate clinical use.2,5,6
Recent studies have demonstrated the efficacy of various ML techniques in predicting heart disease. For instance, a prior study has shown that feature selection strategies can improve predictive performance and reduce model complexity, thereby enhancing clinical relevance. 7 The accuracy reported for ML algorithms in early heart disease prediction highlights their potential to give clinicians stronger evidence for diagnosis and decision-making. Similarly, another study reported the ML-HDPM model, which achieved over 95% accuracy in cardiac disease prediction by integrating data from multiple sources and applying comprehensive preprocessing techniques. 8 Despite these advances, reported performance metrics often vary across datasets and evaluation strategies, underscoring the importance of transparent methodology and robust validation.
Several challenges continue to limit the integration of ML models into clinical practice, including data quality, handling missing values, feature selection, and interpretability of model outcomes.9,10 In addition, performance estimates derived from cross-validation may differ from results obtained on independent test sets, which can complicate model comparison and selection. Addressing these challenges is crucial for the successful application of ML in heart disease diagnosis and for developing clinically meaningful ML-based decision-support tools. 11 Recent research has focused on developing explainable models to enhance clinician trust and patient acceptance.12-15 For example, SHapley Additive exPlanations (SHAP) values have been used to provide detailed insights into how each feature influences model predictions, thereby improving interpretability. 16
This study aimed to develop an effective, clinically applicable predictive model for early heart disease that could reduce disease burden by improving risk stratification and informed clinical decision-making. We developed an interpretable ML model that integrates multiple existing heart disease datasets. We trained and tested the model using comprehensive, publicly available datasets and evaluated its performance. We systematically applied rigorous preprocessing techniques, including data normalization, feature encoding, and handling missing values, to ensure robust and reliable model performance. We then compared the predictive performance of several widely used machine learning algorithms, including Logistic Regression, 17 Random Forest, 18 K-Nearest Neighbors (KNN), 19 and Decision Trees. 20 Additionally, we enhanced model interpretability by using SHapley Additive exPlanations (SHAP) values, which provide valuable insights into feature importance and the underlying reasoning behind individual predictions. 21
The subsequent sections of this manuscript present the dataset and its characteristics, followed by exploratory data analysis, preprocessing techniques, the model development methodology, results, including performance evaluation and SHAP-based interpretability, and, finally, a conclusion summarizing key insights and future directions.
Dataset
The dataset used is the “Heart Failure Prediction Dataset” published by fedesoriano on Kaggle, a well-established platform that offers datasets for machine learning research. It was accessed on March 28, 2025, and is publicly available under the CC0 Public Domain license, permitting unrestricted academic use. Dataset URL: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction. It encompasses clinical and demographic variables relevant to heart disease diagnosis and integrates several previously independent datasets into a unified, extensive collection. The final combined dataset consolidates multiple cardiovascular datasets into a harmonized structure, containing 918 patient records with 11 predictor variables and 1 binary target variable indicating the presence (1) or absence (0) of heart disease. 22 All features are clinically meaningful and commonly used in heart disease diagnostics.
The predictor variables include age (years) and sex, encoded as male (M) and female (F). Additionally, the dataset records ChestPainType, described by 4 categories: typical angina (TA), atypical angina (ATA), non-anginal pain (NAP), and asymptomatic (ASY). Other clinical measures include Resting Blood Pressure (RestingBP), measured in millimeters of mercury (mmHg), and Cholesterol levels, measured in milligrams per deciliter (mg/dL). The dataset also includes a binary indicator of fasting blood sugar (FastingBS), marked as true (1) if fasting blood sugar exceeds 120 mg/dL and false (0) otherwise.
Furthermore, the dataset incorporates results from the Resting Electrocardiogram (RestingECG), which are classified as normal, the presence of ST-T wave abnormalities (ST), or left ventricular hypertrophy (LVH). It also captures Maximum Heart Rate (MaxHR) achieved during exercise, ranging from 60 to 202 beats per minute (bpm), and Exercise-Induced Angina (ExerciseAngina), recorded as present (Y) or absent (N). Additionally, the dataset includes Oldpeak, which measures the degree of ST depression induced by exercise relative to rest (in ST units), and the slope of the ST segment during peak exercise (ST_Slope), categorized as upsloping (Up), flat, or downsloping (Down). The outcome variable, HeartDisease, is a binary indicator denoting the presence (1) or absence (0) of heart disease in patients. The dataset used in this study comprises 918 patient records, providing a moderately sized sample that is sufficient for training and evaluating machine learning models.
The dataset contains several physiologically implausible entries (eg, RestingBP = 0, Cholesterol = 0, and negative Oldpeak values). Consistent with established practice for this Kaggle dataset, these values were treated as placeholders for missingness and were imputed using the median value of the corresponding training fold during cross-validation. This ensured that no information from the validation or test data influenced the imputations. Table 1 presents an example of the dataset.
An Example of the Heart Disease Dataset.
Table 2 presents the descriptive statistics of the numerical features in the dataset. These statistics summarize key aspects of the data distribution, including central tendency (mean), dispersion (standard deviation), and range (minimum and maximum values), calculated while excluding any missing values. The table covers all relevant numerical variables, including age, resting blood pressure, cholesterol levels, and others that contribute to heart disease prediction.
Descriptive Statistics of Numerical Features in the Dataset.
Exploratory Data Analysis (EDA)
This section aims to provide a thorough understanding of the dataset before preprocessing or applying machine learning algorithms. It covers summary statistics, data visualization, and correlation analysis to reveal patterns and relationships within the data.
Figure 1 displays the distribution of key clinical features within our dataset. Age and MaxHR show approximately normal distributions, indicating a balanced age group and a variety of maximum heart rates among the studied individuals. Resting blood pressure (RestingBP) and Cholesterol exhibit peaks in the moderate-to-high range, underscoring the importance of monitoring these parameters in patients. Additionally, Fasting Blood Sugar (FastingBS) and Oldpeak exhibit skewed distributions, suggesting variability and distinct subpopulations within the dataset. This comprehensive overview of feature distributions provides valuable context for interpreting model predictions and highlights areas for targeted clinical intervention.

Distribution of key clinical features in the dataset. Histograms show the distribution of Age (years), RestingBP (mmHg), Cholesterol (mg/dL), FastingBS (0 = fasting blood sugar ≤ 120 mg/dL, 1 = fasting blood sugar > 120 mg/dL), MaxHR (bpm), Oldpeak (ST units), and the binary HeartDisease label.
The slight imbalance in the distribution of the HeartDisease output, as shown in Figure 1 (508 positive vs 410 negative cases), indicates a modest skew in our dataset. While not perfectly balanced, this distribution remains sufficiently representative to reliably evaluate model performance, enabling meaningful and robust predictive insights.
Since the dataset comprises a mix of continuous, binary, and categorical variables, the Pearson correlation coefficients 23 presented in Figure 2 should be interpreted with caution. Pearson’s r was computed for all numerically encoded variables; however, its interpretation is statistically appropriate only for continuous variables (Age, RestingBP, Cholesterol, MaxHR, and Oldpeak). For binary and one-hot encoded categorical variables (eg, ExerciseAngina, Sex, FastingBS, ChestPainType, ST_Slope), the coefficients correspond to encoded linear associations and are presented strictly for exploratory visualization.

Correlation matrix heatmap showing Pearson correlation coefficients between clinical features and the HeartDisease label. Colors reflect the strength and direction of correlation, with darker red indicating stronger positive correlations and darker purple/black indicating stronger negative correlations. The color bar represents the full range of correlation values (−1 to +1), allowing for a visual assessment of feature associations.
When categorical and binary variables are numerically encoded, the resulting correlations may not fully reflect true statistical associations, and one-hot 24 encoded features can inflate or distribute correlations across multiple dummy variables. Accordingly, the correlation heatmap is presented strictly as exploratory data analysis (EDA), and no inferential or causal interpretation is derived from encoded categorical predictors. 25
Figure 2 presents the Pearson correlation matrix heatmap, illustrating linear relationships among clinical features and their association with the HeartDisease label. The strongest positive association with HeartDisease is observed for ST_Slope (r = .56), suggesting that certain ST-segment slope patterns are linked to increased risk. ExerciseAngina follows (r = .49); for this binary variable, the coefficient corresponds to a point-biserial association. ChestPainType (r = .47) and Oldpeak (r = .40) also show substantial positive associations within this dataset, reflecting the importance of chest pain characteristics and exercise-induced ST depression.
Negative correlations include MaxHR (r = −.40), indicating that lower maximum heart rates are associated with a higher likelihood of a positive HeartDisease label. Weaker relationships are observed for Cholesterol (r = −.23) and RestingBP (r = .11), indicating limited predictive value at the univariate level within this specific dataset.
Sex (r = .31), age (r = .28), and fasting blood sugar (r = .27) also show moderate positive correlations. It is essential to note that these associations reflect patterns unique to this dataset and should not be interpreted as causal or population-wide clinical implications.
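The relationship between Pearson's r and the coefficients appropriate for binary variables can be checked numerically. The following is a minimal sketch on synthetic data (the variable names, sample size, and effect sizes are illustrative assumptions, not values from the study). It shows that Pearson's r between a binary and a continuous variable equals the point-biserial coefficient, and that Pearson's r between two binary variables equals the phi coefficient.

```python
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-ins (assumed for illustration only).
heart_disease = rng.integers(0, 2, size=n)                      # binary label
max_hr = rng.normal(140, 25, size=n) - 15 * heart_disease        # continuous feature
exercise_angina = (rng.random(n) < 0.25 + 0.4 * heart_disease).astype(int)  # binary feature

# Binary vs continuous: Pearson's r coincides with the point-biserial coefficient.
r_pearson = np.corrcoef(heart_disease, max_hr)[0, 1]
r_point_biserial = pointbiserialr(heart_disease, max_hr)[0]


def phi_coefficient(a, b):
    """Phi coefficient computed from the 2x2 contingency table of two 0/1 arrays."""
    n11 = float(np.sum((a == 1) & (b == 1)))
    n10 = float(np.sum((a == 1) & (b == 0)))
    n01 = float(np.sum((a == 0) & (b == 1)))
    n00 = float(np.sum((a == 0) & (b == 0)))
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom


# Binary vs binary: Pearson's r coincides with the phi coefficient.
r_binary_pearson = np.corrcoef(exercise_angina, heart_disease)[0, 1]
r_phi = phi_coefficient(exercise_angina, heart_disease)

print(f"Pearson vs point-biserial: {r_pearson:.4f} / {r_point_biserial:.4f}")
print(f"Pearson vs phi:            {r_binary_pearson:.4f} / {r_phi:.4f}")
```

The numerical agreement explains why the heatmap values remain usable for exploration, even though their interpretation differs by variable type.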
Gender-Based Feature Distributions
To investigate potential gender-based variations in cardiovascular risk indicators, we analyzed the distribution of several categorical clinical variables stratified by sex. Figure 3 presents a composite view of the frequency distributions of 5 key features: Fasting Blood Sugar (FastingBS), Exercise-Induced Angina, Resting Electrocardiogram (RestingECG) results, ST-Segment Slope (ST_Slope), and Chest Pain Type.

Gender-based distributions of cardiovascular features. (A) Fasting blood sugar (FastingBS), (B) Exercise-induced angina, (C) RestingECG, (D) ST_Slope distribution, and (E) ChestPainType. Bars show counts of each category stratified by sex (male vs female).
In Panel C of Figure 3, males exhibit a higher prevalence of Left Ventricular Hypertrophy (LVH) patterns on resting electrocardiograms (ECGs), whereas females display a greater frequency of normal ECG findings and ST-T wave abnormalities. This observation is supported by research indicating that males are more likely to present with ECG patterns indicative of LVH. For instance, a study found that in strain and early strain groups, the prevalence of echocardiographic LVH was significantly higher in men than in women. 27 These findings suggest that sex-specific differences in ECG presentations should be taken into account in the diagnosis and management of cardiovascular conditions. 27
In Panel D, which illustrates the distribution of ST-Segment Slope by sex, we observe that flat and downsloping ST segments—often indicative of cardiac ischemia—are more prevalent in males, whereas females exhibit more upsloping ST segments, which are typically considered less pathological. This observation aligns with findings that ST elevation in the precordial leads can be considered normal depending on age, gender, and race, with young and middle-aged males often exhibiting more pronounced ST elevation than females. 28
Regarding Panel E, which examines ChestPainType by sex, the data indicate that males are more likely to present with asymptomatic (ASY) or typical angina (TA) chest pain types. In contrast, females more frequently report atypical angina (ATA) and non-anginal pain (NAP), reflecting gender-based differences in symptomatology. This is consistent with research indicating that women are more likely than men to present with symptoms such as nausea, vomiting, and indigestion during coronary heart disease events, highlighting the importance of recognizing these differences for accurate diagnosis and treatment.29,30
These findings emphasize that, within this dataset, sex is associated with distinct patterns in cardiovascular features, supporting the potential value of sex-specific stratification in heart-disease risk assessment, while recognizing that these observations require confirmation in external cohorts.
Data Preprocessing
Effective data preprocessing and preparation are essential steps in developing reliable machine-learning models. 26 In this study, the dataset underwent structured preprocessing designed to ensure reproducibility and prevent information leakage. Missing or implausible numerical values (eg, zero entries in RestingBP or Cholesterol) were imputed using the median. Binary variables were imputed with the most frequent value and retained as 0/1 without further scaling. Categorical features (including sex, chest pain type, resting ECG, and ST slope) were imputed using the most frequent category and transformed via one-hot encoding with handle_unknown="ignore" to ensure consistent encoding across folds.27,28 Numerical features (age, RestingBP, Cholesterol, MaxHR, Oldpeak) were standardized using z-score normalization,31,32 with scaling parameters learned exclusively from the training data and then applied to the corresponding validation and test partitions.
All transformations—imputation, encoding, and scaling—were implemented through a unified ColumnTransformer embedded within the cross-validation pipeline, ensuring that preprocessing parameters were derived solely from the training folds.33,34 This leakage-safe design enhances the generalizability, transparency, and reproducibility of model evaluation, providing a solid foundation for the subsequent machine-learning experiments and interpretability analyses.
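The preprocessing design described above can be sketched with scikit-learn as follows. This is an illustrative reconstruction under stated assumptions, not the study's code: the helper names (`mark_implausible_as_missing`, `make_preprocessor`), the toy rows, and the placement of ExerciseAngina among the one-hot encoded columns are all assumptions. The column names follow the Kaggle dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]
BINARY = ["FastingBS"]
# Assumption: ExerciseAngina (Y/N) is handled with the one-hot encoded columns.
CATEGORICAL = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]


def mark_implausible_as_missing(df):
    """Stateless step: turn placeholder values into NaN so the imputer can fill them."""
    out = df.copy()
    for col in ("RestingBP", "Cholesterol"):
        out[col] = out[col].where(out[col] != 0)
    out["Oldpeak"] = out["Oldpeak"].where(out["Oldpeak"] >= 0)
    return out


def make_preprocessor():
    numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                        ("scale", StandardScaler())])
    categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                            ("onehot", OneHotEncoder(handle_unknown="ignore"))])
    return ColumnTransformer(
        [("num", numeric, NUMERIC),
         ("bin", SimpleImputer(strategy="most_frequent"), BINARY),
         ("cat", categorical, CATEGORICAL)],
        sparse_threshold=0,  # always return a dense array (convenient for small data)
    )


# Toy rows for illustration only (not taken from the dataset).
train = pd.DataFrame({
    "Age": [40, 50, 60, 70], "RestingBP": [120, 0, 140, 160],
    "Cholesterol": [200, 0, 240, 280], "MaxHR": [150, 140, 130, 120],
    "Oldpeak": [0.0, 1.0, -0.5, 2.0], "FastingBS": [0, 1, 0, 0],
    "Sex": ["M", "F", "M", "M"], "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
    "RestingECG": ["Normal", "ST", "LVH", "Normal"],
    "ExerciseAngina": ["N", "Y", "N", "Y"], "ST_Slope": ["Up", "Flat", "Down", "Flat"],
})
test = pd.DataFrame({
    "Age": [55], "RestingBP": [0], "Cholesterol": [210], "MaxHR": [135],
    "Oldpeak": [1.5], "FastingBS": [1], "Sex": ["F"], "ChestPainType": ["ATA"],
    "RestingECG": ["Normal"], "ExerciseAngina": ["N"], "ST_Slope": ["Up"],
})

preprocessor = make_preprocessor()
X_train = preprocessor.fit_transform(mark_implausible_as_missing(train))  # fit on training rows only
X_test = preprocessor.transform(mark_implausible_as_missing(test))        # reuse training statistics
train_medians = preprocessor.named_transformers_["num"].named_steps["impute"].statistics_
test_has_nan = bool(np.isnan(X_test).any())
```

Because the medians, scaling parameters, and category lists are learned in `fit_transform` and only reused in `transform`, placing this transformer inside a cross-validated pipeline keeps every fold's statistics confined to its own training portion.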
Methodology
To evaluate model performance and minimize the risk of overfitting, the dataset was divided using a stratified 70/30 train–test split that preserved the class distribution in both partitions. Given the moderate dataset size (n = 918), creating a separate fixed validation subset would have further reduced the training data; hyperparameter tuning was therefore performed using stratified 10-fold cross-validation within the training set. This approach provides efficient model selection while maintaining an independent test set for unbiased evaluation.
Model selection and hyperparameter tuning were conducted within the training partition. During cross-validation within the training set, preprocessing transformers (imputation, encoding, and scaling), the classifier, and the hyperparameter search were fitted on 9 folds and validated on the remaining fold, ensuring complete isolation between training and validation data. All preprocessing operations—including median imputation, one-hot encoding, and z-score standardization—were applied exclusively within the training folds via a unified ColumnTransformer embedded in the pipeline, ensuring that preprocessing parameters were derived solely from the training data. A fixed random seed 38 ensured reproducibility. After identifying the best hyperparameters for each algorithm, the final models were retrained on the complete 70% training data and evaluated once on the untouched 30% test set to confirm external validity.
Hyperparameter optimization was performed using grid search within the inner cross-validation loop, with ROC-AUC as the primary scoring metric. The following parameter grids were used for each model:
Logistic Regression: C ∈ {0.1, 1.0, 3.0, 10.0}; penalty = l2; solver = lbfgs.
K-Nearest Neighbors (KNN): n_neighbors ∈ {3, 5, 7, 9, 11}; weights ∈ {uniform, distance}; p ∈ {1, 2}.
Decision Tree: max_depth ∈ {None, 4, 6, 8, 12}; min_samples_split ∈ {2, 5, 10}; min_samples_leaf ∈ {1, 2, 4}; criterion ∈ {gini, entropy}.
Random Forest: n_estimators ∈ {200, 400, 600}; max_depth ∈ {None, 6, 10, 14}; min_samples_split ∈ {2, 5, 10}; min_samples_leaf ∈ {1, 2, 4}; max_features ∈ {sqrt, log2}.
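The grids listed above can be expressed as a scikit-learn grid search. The sketch below is a hedged illustration, not the study's code: the names `MODELS` and `make_search` are hypothetical, `max_iter=1000` is an assumed setting, and the demo uses synthetic data with a plain `StandardScaler` standing in for the full preprocessing transformer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

SEED = 42

# Estimators paired with the grids from the text ("clf__" targets the pipeline step).
MODELS = {
    # The L2 penalty is scikit-learn's default; max_iter=1000 is an assumption.
    "logreg": (LogisticRegression(solver="lbfgs", max_iter=1000),
               {"clf__C": [0.1, 1.0, 3.0, 10.0]}),
    "knn": (KNeighborsClassifier(),
            {"clf__n_neighbors": [3, 5, 7, 9, 11],
             "clf__weights": ["uniform", "distance"],
             "clf__p": [1, 2]}),
    "tree": (DecisionTreeClassifier(random_state=SEED),
             {"clf__max_depth": [None, 4, 6, 8, 12],
              "clf__min_samples_split": [2, 5, 10],
              "clf__min_samples_leaf": [1, 2, 4],
              "clf__criterion": ["gini", "entropy"]}),
    "forest": (RandomForestClassifier(random_state=SEED),
               {"clf__n_estimators": [200, 400, 600],
                "clf__max_depth": [None, 6, 10, 14],
                "clf__min_samples_split": [2, 5, 10],
                "clf__min_samples_leaf": [1, 2, 4],
                "clf__max_features": ["sqrt", "log2"]}),
}


def make_search(name, preprocessor):
    """Grid search over one model, with preprocessing refit inside every CV fold."""
    clf, grid = MODELS[name]
    pipe = Pipeline([("prep", preprocessor), ("clf", clf)])
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=SEED)
    return GridSearchCV(pipe, grid, scoring="roc_auc", cv=cv)


# Number of candidate configurations per model.
grid_sizes = {name: len(ParameterGrid(grid)) for name, (_, grid) in MODELS.items()}

# Small demo on synthetic data (stand-in for the training partition).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=SEED)
search = make_search("logreg", StandardScaler()).fit(X_demo, y_demo)
best_c = search.best_params_["clf__C"]
print(grid_sizes, best_c, round(search.best_score_, 3))
```

With 10 folds, the Random Forest grid alone implies 216 × 10 = 2160 model fits, which is worth keeping in mind when reproducing the search.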
We evaluated Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, and Random Forest classifiers. Each algorithm’s performance was evaluated across multiple metrics—including accuracy, precision, recall, F1 Score, and ROC AUC—to identify the best-performing model. Each model was assessed based on classification accuracy, class-imbalance handling, and consistency across datasets. 35 To ensure the interpretability of our models, we applied SHapley Additive exPlanations (SHAP) to quantify feature importance and clarify individual prediction contributions. 16 The detailed methodological steps used in this research are outlined in Table 3.
Pseudocode of the Methodology Followed for Data Preparation, Model Development, Tuning, and Evaluation.
All analyses were performed using Python 3.12 with scikit-learn 1.4, pandas 2.2, numpy 1.26, shap 0.45, and matplotlib 3.8. Random seeds were fixed (seed = 42) for reproducibility.
Results
Model Performance
Figure 4 presents a comparative analysis of performance metrics for 4 machine learning algorithms—Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, and Random Forest—evaluated both before and after hyperparameter tuning. The comparison highlights the impact of tuning on model accuracy, F1-score, recall, and precision.

Comparison of model performance metrics (Accuracy, F1-score, Recall, Precision) before and after hyperparameter tuning for Logistic Regression, KNN, Decision Tree, and Random Forest algorithms. Results represent mean stratified 10-fold cross-validation performance on the training partition.
Logistic Regression maintained consistent performance, with an accuracy of 0.88, an F1-score of 0.88, a recall of 0.88, and a precision of 0.87, both before and after tuning. Conversely, KNN classification demonstrated considerable improvement post-tuning, with accuracy increasing from 0.86 to 0.89, F1-score from 0.86 to 0.88, recall from 0.86 to 0.89, and precision from 0.85 to 0.88. The Decision Tree model showed the most substantial enhancements, with accuracy increasing from 0.74 to 0.83, F1-score rising from 0.73 to 0.83, recall improving from 0.74 to 0.84, and precision increasing from 0.73 to 0.82. The Random Forest model also showed improvements, with accuracy, F1-score, recall, and precision all increasing from 0.88 to 0.90 after tuning. Overall, the results indicate that hyperparameter tuning effectively enhances model predictive capabilities, particularly for Decision Tree and KNN models, underscoring the importance of optimal parameter selection for improving model performance.
On the independent test set (Table 4), the KNN model achieved the highest overall predictive performance (accuracy = 0.917, F1 = 0.926, ROC-AUC = 0.946), closely followed by Random Forest (accuracy = 0.902, ROC-AUC = 0.945). Logistic Regression also performed competitively (AUC = 0.933), while the Decision Tree showed the lowest but still acceptable generalization (AUC = 0.904).
Figure 4 reports mean cross-validation performance on the training partition, whereas Table 4 presents evaluation on the independent 30% test set. While Random Forest showed slightly higher mean cross-validation scores, KNN demonstrated stronger generalization on the independent test set. Such ranking shifts are common in moderate-sized datasets (n = 918) due to fold-to-fold variance, sensitivity to class distribution, and model complexity. Final model selection was therefore based on independent test performance and ROC-AUC stability rather than cross-validation averages alone.
Performance Metrics of Optimized Models on the Independent Test Set.
Figure 5 presents the confusion matrix of the best-performing KNN model on the independent test set. The model correctly classified 110 out of 123 negative cases and 143 out of 153 positive cases, yielding only 13 false positives and 10 false negatives. This balanced error distribution indicates that the classifier maintains high sensitivity (recall = 0.935) and precision (0.917), with a specificity of 0.894 (110/123), suggesting that it may be suitable as a component of early heart-disease risk screening, subject to further external validation.

Confusion matrix of the best-performing KNN model on the independent test set.
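The summary metrics follow directly from the four confusion-matrix counts reported for the test set. The short calculation below recomputes them. Only the counts come from the text; the variable names are illustrative. Note that precision and specificity are distinct quantities with different denominators.

```python
# Counts reported for the KNN model on the independent test set.
tn, fp, fn, tp = 110, 13, 10, 143
total = tn + fp + fn + tp                  # 276 test records

sensitivity = tp / (tp + fn)               # recall: share of true cases detected
specificity = tn / (tn + fp)               # share of healthy cases correctly cleared
precision = tp / (tp + fp)                 # share of positive predictions that are correct
accuracy = (tp + tn) / total
f1 = 2 * tp / (2 * tp + fp + fn)           # harmonic mean of precision and recall

for name, value in [("sensitivity", sensitivity), ("specificity", specificity),
                    ("precision", precision), ("accuracy", accuracy), ("F1", f1)]:
    print(f"{name:<12}{value:.3f}")
```

These values reproduce the reported recall (0.935), precision (0.917), accuracy (0.917), and F1 (0.926), and give a specificity of 0.894.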
Figure 6 corresponds to the optimized KNN model selected based on independent test performance. The optimized configuration (selected via inner-loop grid search) used k = 7, distance-based weighting, and p = 2 (Euclidean distance), which are the main complexity controls in KNN. The learning curve shows training accuracy approaching 1.0, while cross-validation accuracy stabilizes around 0.82 to 0.85, indicating a moderate generalization gap. This behavior is consistent with instance-based models such as KNN, which may closely fit training data while maintaining stable validation performance. We therefore acknowledge mild overfitting, but note that generalization remains acceptable because validation performance is stable across training sizes and the model achieves strong independent test performance (Accuracy = 0.917; ROC-AUC = 0.946; after calibration, ROC-AUC = 0.950; Brier = 0.0798).

Learning curves of the optimized KNN model computed using stratified 10-fold cross-validation on the training partition.
Calibration and Probability Reliability Analysis
To assess the reliability of predicted probabilities, model calibration was examined using the Brier score and calibration plots. Before calibration, the best-performing model achieved a Brier score of 0.0809, ROC-AUC = 0.946, and PR-AUC = 0.941, with a positive calibration intercept (0.478) and shallow slope (0.125), indicating mild underestimation of risk. The corresponding calibration curve before isotonic correction is shown in Figure 7.

Calibration curve of the best-performing KNN model before isotonic calibration, evaluated on the independent 30% test set.
Post-hoc isotonic calibration improved these metrics to a Brier score of 0.0798, ROC-AUC = 0.950, and PR-AUC = 0.947, while the calibration intercept and slope approached 0 and 1, respectively. As illustrated in Figure 8, the recalibrated model’s predicted probabilities closely align with observed event frequencies, indicating high reliability across the risk spectrum. These findings demonstrate that the model’s probability outputs are well-calibrated and suitable for potential clinical decision-support use.

Calibration plot of the best-performing KNN model after isotonic calibration, evaluated on the independent 30% test set.
Reliability diagram for the KNN classifier on the independent test set. The solid blue line represents the observed fraction of positives versus mean predicted probability, while the dashed orange line denotes perfect calibration. The model exhibits mild under-prediction at higher probability ranges, consistent with the initial positive calibration intercept (0.478) and slope (0.125).
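One common way to apply isotonic calibration and compare Brier scores is sketched below. It is an assumption-laden illustration rather than the study's procedure: the data are synthetic, and the use of `CalibratedClassifierCV` with 5-fold internal cross-validation is an assumed choice, since the exact calibration setup is not specified in the text. The KNN settings match the reported optimized configuration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the same number of rows and predictors as the dataset.
X, y = make_classification(n_samples=918, n_features=11, n_informative=6,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, stratify=y,
                                          random_state=42)


def make_knn():
    # k = 7, distance weighting, Euclidean metric (p = 2).
    return KNeighborsClassifier(n_neighbors=7, weights="distance", p=2)


uncalibrated = make_knn().fit(X_tr, y_tr)
# Isotonic mapping learned on held-out folds of the training partition only.
calibrated = CalibratedClassifierCV(make_knn(), method="isotonic", cv=5).fit(X_tr, y_tr)

p_raw = uncalibrated.predict_proba(X_te)[:, 1]
p_cal = calibrated.predict_proba(X_te)[:, 1]

# The Brier score is the mean squared difference between probability and outcome.
brier_raw = brier_score_loss(y_te, p_raw)
brier_cal = brier_score_loss(y_te, p_cal)
brier_manual = float(np.mean((p_raw - y_te) ** 2))

print(f"Brier before: {brier_raw:.4f}  after: {brier_cal:.4f}")
```

Lower Brier scores indicate better-calibrated probabilities. Whether calibration helps on a given dataset has to be checked empirically, as the synthetic example makes no claim about the direction of the change.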
As shown in Table 5, isotonic calibration led to modest yet consistent improvements across all evaluation metrics. The reduction in the Brier score and the correction of the calibration intercept and slope confirm the enhanced reliability of the predicted probabilities. At the same time, the marginal increases in ROC-AUC and PR-AUC indicate sustained or slightly improved discrimination. Overall, the calibration process refined probability scaling without altering classification accuracy, yielding a model that not only classifies accurately but also provides trustworthy probability estimates—an essential property for clinical risk prediction.
Model Performance Before and After Isotonic Calibration on the Independent Test Set.
To quantify performance variability, 95% confidence intervals (CIs) for ROC-AUC, PR-AUC, and Brier score were obtained using nonparametric bootstrap resampling (2000 iterations) on the test set. The classification threshold was set to 0.5 across all models to maintain comparability. The resulting 95% CIs were: ROC-AUC = 0.950 [0.934-0.963], PR-AUC = 0.947 [0.928-0.961], and Brier = 0.0798 [0.074-0.085], confirming high discrimination and stable calibration.
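A percentile bootstrap of this kind can be sketched as follows. The function name `bootstrap_ci`, the synthetic probabilities, and the choice to skip resamples containing a single class are assumptions made for illustration. The class counts mirror the test set (153 positive, 123 negative), and the demo uses fewer iterations than the 2000 reported, purely for speed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_ci(y_true, y_prob, metric, n_boot=2000, alpha=0.05, seed=42):
    """Point estimate and percentile bootstrap CI for a probability-based metric."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)       # resample test rows with replacement
        if np.unique(y_true[idx]).size < 2:    # AUC is undefined with a single class
            continue
        stats.append(metric(y_true[idx], y_prob[idx]))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_prob), float(lower), float(upper)


# Synthetic predictions for a test set with 153 positives and 123 negatives.
rng = np.random.default_rng(0)
y_true = np.r_[np.ones(153, dtype=int), np.zeros(123, dtype=int)]
y_prob = np.clip(0.5 + 0.3 * (2 * y_true - 1) + rng.normal(0, 0.25, y_true.size), 0, 1)

auc, auc_lo, auc_hi = bootstrap_ci(y_true, y_prob, roc_auc_score, n_boot=500)
print(f"ROC-AUC = {auc:.3f} [{auc_lo:.3f}-{auc_hi:.3f}]")
```

The same function accepts `average_precision_score` or `brier_score_loss` from `sklearn.metrics` as the `metric` argument, since both take true labels and predicted probabilities in that order.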
Model Interpretability (SHAP Analysis)
To improve the interpretability of the predictive model, we applied SHapley Additive exPlanations (SHAP), 16 a game-theoretic framework that quantifies each feature’s contribution to individual predictions. SHAP allows both global and local interpretability through a unified importance measure.
For global interpretability, we used the SHAP bar plot (Figure 9), which displays the mean absolute SHAP value for each feature. The analysis reveals that ST_Slope, ChestPainType, and Cholesterol have the strongest influence on the model output, followed by ExerciseAngina, Oldpeak, and Age. These SHAP values represent the magnitude of each feature’s average contribution (in model output units) across the entire test set. From a clinical perspective, these SHAP results highlight that, in our model, ST-segment depression patterns and chest pain characteristics exert the strongest influence on predicted heart disease risk, with the lipid profile and exertional angina also contributing meaningfully to the model’s decisions in this dataset.

Global SHAP feature importance for the heart disease prediction model.
For individualized explanations, a local SHAP force plot (Figure 10) was generated using KernelExplainer (suitable for the KNN classifier). This example corresponds to a 63-year-old patient. Age increases predicted disease risk, while other attributes — including ST_Slope = 2.0, Sex = 0, ChestPainType = 1.0, ExerciseAngina = 0, Cholesterol = 195 mg/dL, and MaxHR = 179 bpm — collectively decrease the estimated risk. The force plot visually decomposes the prediction into feature-level contributions, illustrating how each variable influences the model output, either pushing it higher or lower relative to the base value.

Local SHAP force plot for an individual patient (age 63).
To ensure methodological rigor, TreeExplainer was used for tree-based models, LinearExplainer for Logistic Regression, and KernelExplainer for KNN. A randomly sampled background dataset of 100 instances was used to approximate SHAP expectations.
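To make the role of the background dataset concrete, the sketch below computes exact Shapley values by brute force for a small KNN model. This is not the SHAP library and not the study's code: KernelExplainer approximates these quantities by sampling feature coalitions, whereas the hypothetical `exact_shapley` function enumerates all of them, which is feasible only for a handful of features. The synthetic data and the four-feature setup are assumptions.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier


def exact_shapley(predict, x, background):
    """Exact Shapley values for one instance, averaging over a background sample."""
    m = x.shape[0]

    def value(subset):
        # Fix the features in `subset` to x; the rest keep their background values.
        data = background.copy()
        cols = list(subset)
        if cols:
            data[:, cols] = x[cols]
        return float(predict(data).mean())

    phi = np.zeros(m)
    for i in range(m):
        others = [j for j in range(m) if j != i]
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi, value(())


X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, y_train, x_query = X[:300], y[:300], X[300]

knn = KNeighborsClassifier(n_neighbors=7, weights="distance", p=2).fit(X_train, y_train)


def predict_positive(data):
    return knn.predict_proba(data)[:, 1]


rng = np.random.default_rng(42)
background = X_train[rng.choice(len(X_train), size=100, replace=False)]  # 100 instances

phi, base_value = exact_shapley(predict_positive, x_query, background)
prediction = float(predict_positive(x_query[None, :])[0])
print("base value:", round(base_value, 3), "contributions:", np.round(phi, 3))
```

The contributions sum to the difference between the model's prediction for the instance and the base value, which is the additivity property that force plots visualize.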
In Figure 9, bars represent the mean absolute SHAP value for each feature, indicating the average magnitude of that feature’s impact on the model’s output. SHAP values are unitless and quantify each feature’s contribution to increasing or decreasing the predicted probability of heart disease.
Figure 10 presents an example of a local interpretation using SHAP values for a 63-year-old patient. In this case, the patient’s age notably increases the predicted risk for heart disease. Conversely, other clinical features, including ST_Slope, ChestPainType, Cholesterol level, and Maximum Heart Rate (MaxHR), significantly lower the predicted risk. This individualized analysis may help clinicians better understand and communicate patient-specific risk factors if such tools are integrated into clinical workflows.
Although SHAP provides powerful insights, it has inherent limitations. First, correlated features may divide importance across variables, potentially underestimating their true contribution. Second, one-hot encoding of categorical variables can distribute importance across multiple dummy features, which may complicate interpretation. Third, SHAP explanations describe model behavior rather than causal relationships and should not be interpreted as clinical risk factors. Finally, SHAP rankings were checked for stability using bootstrap resampling, confirming that the top features remained consistent, though minor variations are expected due to sampling fluctuations.
Implications for Clinical Practice
The findings of this study offer several new insights with potential clinical applicability. Most notably, the consistent identification of ST_Slope and ChestPainType as the top predictive features suggests that, when combined, these two variables may serve as highly effective early indicators of heart disease risk.36 We therefore hypothesize that, in addition to traditional markers such as cholesterol levels or resting blood pressure, increased diagnostic emphasis on the dynamic response during exercise and the qualitative nature of reported chest pain could improve risk stratification, pending validation in broader clinical cohorts.
This has important implications for streamlining diagnostic workflows in both primary care and emergency settings. For instance, rapid screening tools that integrate ST-segment slope assessment with chest pain classification could offer high-yield risk stratification even before more invasive or costly testing is performed. 37
Moreover, the inclusion of non-traditional features, such as MaxHR and Oldpeak, among the top-ranking predictors suggests that functional exercise data are more informative than static measurements alone. 38 This supports the potential development of lightweight, ML-based triage tools that could be embedded in electronic health records (EHRs) or mobile diagnostics to assist clinicians in real time.
Importantly, the use of SHAP visualizations provides a new level of transparency into the rationale for predictions. Clinicians can now not only receive a risk score but also see which clinical features most influenced the decision, enabling better communication with patients, increased trust in automated tools, and support for shared decision-making. Finally, the balanced performance of models like Random Forest and KNN after tuning illustrates that accurate and interpretable models can be developed without the need for overly complex algorithms, suggesting that such approaches may be practical candidates for future deployment in clinical settings with limited computational resources, once externally validated and prospectively tested.
Ethical and Societal Considerations
The integration of AI in clinical decision support raises important ethical considerations. Although the developed models achieved strong predictive performance, their deployment must ensure fairness and avoid bias toward any demographic or clinical subgroup. Transparency and explainability are also critical to foster clinician trust and accountability, particularly when using machine-learning algorithms whose internal logic may be complex. In addition, patient consent, data privacy, and secure data handling must be upheld in accordance with medical data governance standards. Finally, algorithmic outputs should complement—not replace—clinical judgment, with continuous human oversight to minimize potential harm and misinterpretation from algorithmic errors.
Standardization and Transparency in AI Reporting
Ensuring transparency and adherence to standardized reporting and validation frameworks is vital for the safe and reliable translation of AI models into clinical practice. Guidelines such as the FDA’s Good Machine Learning Practice (GMLP) principles, and the TRIPOD-AI and CONSORT-AI statements, provide structured protocols for documenting model development, data provenance, validation procedures, and performance evaluation. Without such standardization, reported metrics may become inconsistent or non-comparable across studies, potentially leading to clinically misleading interpretations. Consequently, while the current findings are promising, they should be interpreted within the specific context of the dataset and evaluation design used in this study. Future research should align with established reporting standards and use external validation datasets to enhance the reproducibility, transparency, and clinical credibility of AI-driven diagnostic systems.
The inclusion of post-hoc isotonic calibration improved the reliability of the model’s probability estimates, so that predicted risks more closely reflect the observed frequency of events. Well-calibrated probabilities are crucial in clinical contexts, where misestimated risks can lead to inappropriate interventions or missed conditions. Beyond improving statistical validity, calibration contributes to the ethical transparency of AI-driven decision support by providing outputs that clinicians can interpret with confidence and trust. This aligns with emerging regulatory and reporting frameworks, such as those recommended by the FDA and the CONSORT-AI guidelines, which emphasize model transparency, calibration, and performance reproducibility. By explicitly addressing the reliability and uncertainty of predicted probabilities, this work supports the responsible deployment of machine learning tools for heart disease prediction and strengthens the foundation for future external validation on diverse clinical datasets.
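To illustrate what isotonic calibration does, the sketch below fits a monotone step function with the pool-adjacent-violators algorithm and compares Brier scores before and after. It is a simplified stand-in, not the study's implementation: the scores and labels are invented, the function names are hypothetical, tied scores are not pooled, and no interpolation is performed between steps. In practice a library routine such as scikit-learn's `CalibratedClassifierCV(method="isotonic")` would normally be used, with the calibrator fitted on data held out from the evaluation set.

```python
def pav_fit(scores, labels):
    """Fit a non-decreasing step function of labels on scores (pool adjacent violators)."""
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for score, label in sorted(zip(scores, labels)):
        blocks.append([float(label), 1, score])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def pav_predict(steps, score):
    """Calibrated probability: value of the first step whose upper bound covers the score."""
    for upper, value in steps:
        if score <= upper:
            return value
    return steps[-1][1]


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


# Hypothetical uncalibrated scores and outcomes; not study data.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

steps = pav_fit(scores, labels)
calibrated = [pav_predict(steps, s) for s in scores]

# Evaluated on the fitting data for brevity only; this is optimistic.
brier_before = brier_score(scores, labels)
brier_after = brier_score(calibrated, labels)
```

The calibrated values never decrease as the raw score increases, which is the constraint that distinguishes isotonic calibration from an unconstrained remapping of scores.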
Discussion and Related Work
Recent studies on heart disease prediction have increasingly leveraged machine learning (ML) and ensemble techniques to improve diagnostic accuracy and interpretability. Prior peer-reviewed studies using the UCI/Cleveland and Statlog heart-disease benchmarks have explored a range of classifiers, including logistic regression, random forests, SVMs, XGBoost, and ensemble methods, and have applied explainability tools such as SHAP to interpret models (Table 6). Optimization algorithms have been applied to enhance feature selection and model performance, leading to more precise predictions. 39 Hybrid cost-sensitive ensembles have been proposed to address class imbalance in heart disease datasets, thereby improving sensitivity to high-risk cases. 40
Table 6. Review of Machine Learning Studies on Heart Disease: Dataset Size, Algorithms, Validation, and Key Metrics.
Several works combine ensemble learning with explainable AI (XAI) to provide both high predictive performance and interpretability, enabling clinicians to understand the contributions of different risk factors.41-44 Frameworks that enhance traditional classifiers, such as random forests and extreme gradient boosting, using strategies such as SGO optimization have demonstrated improved predictive accuracy. 42 Comparative studies of advanced ML models highlight variability in performance across algorithms and datasets, emphasizing the need for careful model selection. 43
Comprehensive ML frameworks have been developed to systematically evaluate performance, propose best practices, and outline future directions for heart disease prediction. 45 Additionally, integrating attribute evaluators with classifiers and leveraging ensemble techniques further enhances predictive accuracy, robustness, and clinical relevance.46,47 Collectively, these studies illustrate the trend toward hybrid, interpretable, and optimized ML approaches for reliable heart disease risk assessment. Reported performance varies widely, in part because of differences in preprocessing, target definition (binary vs multiclass), choice of validation protocol (single hold-out vs cross-validation-based tuning approaches), and hyperparameter tuning procedures.
Our novelty centers on interpretability for clinical use: beyond aggregate feature-importance, we provide patient-level SHAP explanations, cluster patients by SHAP attribution profiles to identify subgroups with different risk drivers, and demonstrate how to translate model outputs into clinically meaningful action thresholds (decision curves and thresholded sensitivity/specificity tradeoffs). These steps aim to bridge the gap between model performance and clinical decision support. Specifically, we use a stratified train–test split combined with internal cross-validation for hyperparameter tuning, explicit handling of imputed features within folds, and reporting of calibration (Brier score and calibration curves), decision curve analysis, and clinically relevant thresholds (eg, sensitivity at fixed specificity). These practices reduce over-optimistic performance estimates and increase clinical interpretability compared with much prior work that reports single-split results.
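The two threshold-based quantities mentioned above can be sketched in a few lines. Net benefit follows the standard decision-curve definition, TP/n − (FP/n) × p_t/(1 − p_t), and sensitivity at fixed specificity is found by scanning candidate thresholds. The probabilities, labels, and function names are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this study.

```python
def net_benefit(probs, labels, threshold):
    """Decision-curve net benefit of treating patients with predicted risk >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


def sensitivity_at_specificity(probs, labels, min_specificity):
    """Highest sensitivity among thresholds whose specificity meets the target."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = 0.0
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < t and y == 0)
        if tn / negatives >= min_specificity:
            best = max(best, tp / positives)
    return best


# Hypothetical predicted risks and outcomes; not study data.
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

nb_model = net_benefit(probs, labels, threshold=0.5)
nb_all = net_benefit_treat_all(labels, threshold=0.5)
sens = sensitivity_at_specificity(probs, labels, min_specificity=0.75)
```

A model is useful at a given threshold probability only if its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero by definition.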
Reported accuracy/AUC values across the literature are not directly comparable because (i) authors use different preprocessing (eg, exclusion of ambiguous records or re-encoding of categorical fields), (ii) target formulation differs (binary presence/absence vs 5-level labels mapped differently), (iii) validation protocols differ (single train/test split vs k-fold CV vs internal cross-validation for hyperparameter tuning), (iv) hyperparameter tuning sometimes leaks test information when performed outside cross-validation, and (v) small sample size (n ≈ 303 for Cleveland) makes results sensitive to random splits. For these reasons, we focus on methodological rigor (leakage prevention and calibration) and clinically relevant operating points rather than raw leaderboard improvements.
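The leakage-prevention idea can be shown with a minimal sketch of fold-internal imputation: within each cross-validation split, the fill value for missing entries is computed from the training rows only and then applied to the held-out rows. This is an assumed illustration rather than the study's pipeline; the cholesterol values, the use of mean imputation, and the helper names are hypothetical.

```python
import random
from statistics import mean


def stratified_folds(labels, k, seed=0):
    """Split row indices into k folds while keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def impute_within_fold(values, train_idx, test_idx):
    """Fill missing values (None) using the mean of the training rows only."""
    train_mean = mean(values[i] for i in train_idx if values[i] is not None)

    def fill(v):
        return train_mean if v is None else v

    train_filled = [fill(values[i]) for i in train_idx]
    test_filled = [fill(values[i]) for i in test_idx]
    return train_filled, test_filled, train_mean


# Hypothetical cholesterol column with missing entries, and binary labels.
cholesterol = [195, None, 240, 210, None, 289, 180, 230, 250, 205]
labels = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]

k = 5
folds = stratified_folds(labels, k)
fold_means = []
for f in range(k):
    test_idx = folds[f]
    train_idx = [i for g in range(k) if g != f for i in folds[g]]
    train_filled, test_filled, train_mean = impute_within_fold(cholesterol, train_idx, test_idx)
    # The held-out rows never influence the statistic used to fill them.
    fold_means.append(train_mean)
```

The fill value differs slightly from fold to fold because it depends on which rows are in the training portion. Computing a single mean on the full dataset before splitting would let held-out rows influence preprocessing, which is the kind of leakage described in point (iv).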
Conclusion
This study presents a comprehensive machine-learning approach to early heart disease prediction using clinical and demographic data. By evaluating multiple classification algorithms—including Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree, and Random Forest—we identified models that deliver high predictive performance when supported by rigorous preprocessing and hyperparameter tuning.
KNN achieved the highest performance on the independent test set, while Random Forest remained a close second and demonstrated strong robustness across metrics. SHapley Additive exPlanations (SHAP) were used to interpret model predictions, enabling both global and individual-level insight into the most influential features. The analysis revealed that ST_Slope, ChestPainType, and Cholesterol are among the most influential variables, underscoring the clinical importance of functional and symptomatic data alongside traditional biomarkers. The integration of SHAP further enhances model transparency and paves the way for clinical adoption by fostering trust and explainability.
These results demonstrate the feasibility of implementing interpretable machine learning models in clinical settings to support early diagnosis, triage, and personalized treatment planning.
Future directions include the development of real-time predictive systems, integration with electronic health records, and validation of models on larger, more diverse patient cohorts to enhance generalizability and clinical utility.
Despite the strong internal validation achieved through cross-validation and independent test evaluation, the present study lacks external validation on independent datasets. Consequently, the generalizability of the developed models to other populations and clinical settings remains to be established. Future work will focus on validating the models across multi-institutional and larger cohorts to ensure robustness and real-world applicability.
Footnotes
Author Note
The author is eligible for waived Article Processing Charges (APCs) under the EIFL–SAGE Publishing agreement (2025–2028), which provides free open access publishing for corresponding authors from Palestine in SAGE’s fully open access (Gold) journals.
Author Contributions
EQ contributed to the conceptualization, methodology, supervision, interpretation of results, and original draft preparation. QA-W contributed to coding, data analysis, visualization, and the implementation of experiments. NSE contributed to the review and editing of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Nur Sebnem Ersoz was supported by the Scientific and Technological Research Council of Türkiye (TUBITAK) BIDEB 2211-A Programme. The other authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The dataset used in this study is the Heart Failure Prediction Dataset published by fedesoriano on Kaggle. It was accessed on March 28, 2025, and is publicly available under the CC0 Public Domain License, permitting unrestricted academic and research use. Dataset URL:
The dataset integrates multiple previously independent clinical datasets into a unified, comprehensive collection of clinical and demographic attributes relevant to heart disease diagnosis. All processed data and analysis scripts generated during this study are available from the corresponding author upon reasonable request.
Use of AI Software
AI-assisted tools were used solely for language refinement and formatting. All scientific concepts, analyses, interpretations, and conclusions were fully developed by the authors.
