Abstract
Hundreds of people die from heart disease almost every day, which shows how dangerous a delayed diagnosis can be. In an era of intelligent systems, this growing number of deaths can be reduced. This paper focuses on the development of a prediction system for cardiovascular disease, in particular heart disease, using machine learning classifiers: Support Vector Machine (SVM), Decision Tree, and XGBoost. The features were scaled so that attributes with unconstrained ranges fall within a standardized range, which helps model optimization. For efficiency, the attributes were also separated into two categories: independent features and dependent features. Performance measures were used to assess the models and compare the classifiers. After hyper-parameter tuning, XGBoost exhibits the highest accuracy among the trained classifiers. Following this comparative analysis, the best-suited algorithm can be used for heart disease detection in the medical field, with economic benefits given the cost of treatment. This indicates that a non-expert could also attempt a preliminary diagnosis without worrying about expensive procedures.
Keywords
Introduction
Heart disease is one of the leading causes of death around the world. A significant number of people currently suffer from some form of heart disease, and more people died of heart failure last year than of any other cause. The World Health Organization (WHO) has reported that around 12 million deaths occur worldwide every year due to heart disease [13]. Unfortunately, because of complex procedures, dissimilar symptoms, and pathological tests, the precise diagnosis of cardiovascular diseases is a challenging task. There is therefore a substantial need for predictive systems that can support medical experts in the early and reliable diagnosis of heart disease [24]. The health care field holds a tremendous amount of data, and dedicated methods are needed to handle it. Heart disease is simpler to treat when identified early. Accurate and early diagnosis plays a major role in preventing severe types of heart disease and heart failure.
Recent research demonstrates that Machine Learning is one such tool [2, 27]. It is widely used in diverse domains because the same algorithms can be applied to different datasets. The reprogrammable and adaptable nature of machine learning brings many advancements and research opportunities to the medical sciences [18]. The primary aim of machine learning classification is to accurately predict the target class for each case in the data. Because these models are data driven, the data is divided into two parts: one for training and constructing the model, and the other for testing and validating it [5, 14]. In this study, three machine learning (data mining) classification techniques are used to predict heart disease. The dataset is processed in Python using the Support Vector Machine (SVM), XGBoost, and Decision Tree (DT) algorithms to find out which trained model gives the best accuracy for predicting heart disease.
Literature review
It is evident from the literature that several procedures for the prediction of various diseases have been recommended and applied using various methods and techniques [12, 30]. The researchers in [20] suggested a decision support system for dementia patients to detect agitation transitions. The authors in [16] implemented Data Mining (DM) algorithms: to identify heart disease, they combined electrocardiogram (ECG) features with clinical symptoms and applied the Naive Bayes (NB) algorithm, the Decision List algorithm, and the K-Nearest Neighbors (KNN) algorithm. The authors in [29] developed a decision support system that uses a Neural Network (NN) to detect heart disease. They trained their model on 78 patient records and obtained adequate prediction accuracy.
Researchers have also computed the weights of an NN model using a Genetic Algorithm (GA). Classification is performed over five classes of disease severity using the Backpropagation Algorithm (BA), and the final weights of the NN are kept in a weight base. These finalized weights are then used to predict the risk of cardiovascular disease. The accuracy obtained is 94.17% [3].
Braga et al. [3] proposed an SVM architecture in which the selection and optimization of the SVM parameters are conducted using a GA. This method uses fewer input parameters for the support vector machine and was tested on 11 real-world datasets, with an improved precision of 89.6%.
Researchers in [14] analyzed various classification algorithms on the heart disease dataset, namely DT, K-NN, NB, and NN. Of the four algorithms implemented, they observed that NB has greater precision than the other classifiers. Because the large amount of data in the dataset increases classification time, they used an attribute selection technique.
Sharma et al. [22] proposed a system that detects whether a person has heart disease, in terms of Yes or No. The system gives an early indication of a heart status leading to Coronary Artery Disease (CAD). Using a Multi-Layer Perceptron (MLP), it provides users with a prediction of whether their state may lead to CAD. Similarly, several studies present a comparative analysis of traditional machine learning models for heart disease prediction [1, 18]. Researchers have also attempted to use data mining for this purpose [10, 28].
Research methodology
The primary aim of this research was to classify heart disease by implementing some DM techniques and ML models. For this purpose, different ML approaches were trained and applied to obtain maximum accuracy levels with a lower error rate. Each algorithm accepts 13 different clinical features and is trained to predict the presence of heart disease with maximum accuracy level.
Exploratory data analysis
The Cleveland heart disease dataset of 303 instances was taken from the Kaggle repository. The dataset contains records of 165 patients with heart disease and 138 persons without heart disease. Each instance has 13 clinical attributes and 1 target attribute indicating whether the person has heart disease. The description of all attributes is given in Table 1.
Description of attributes in dataset and statistical features
Exploratory Data Analysis is the first and a crucial step in data analysis. It is a way of visualizing, summarizing, and interpreting the information hidden in rows and columns before any formal statistical techniques are applied. On the heart disease dataset, univariate, bivariate, and multivariate analyses were performed on all 14 attributes using histograms, box plots, and correlation matrix techniques. The descriptive statistical analysis of the dataset is tabulated in Table 1.
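The descriptive statistics and correlation analysis mentioned above can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' code: the helper names `describe` and `pearson` are hypothetical, and the sample values are made up for demonstration rather than taken from the Cleveland dataset.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for one numeric attribute."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Made-up illustrative values, NOT taken from the Cleveland dataset.
age = [45, 52, 58, 61, 63, 67]
max_heart_rate = [172, 165, 150, 148, 140, 131]

summary = describe(age)
r = pearson(age, max_heart_rate)  # one cell of a correlation matrix
```

Computing `pearson` for every pair of attributes yields the correlation matrix; in practice a data-frame library would typically produce both the summary table and the matrix in one call.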
In general, a histogram plots the dependent variable (frequency) along the vertical axis and the independent variable along the horizontal axis. Figure 1 presents age on the x-axis and frequency on the y-axis. The histogram of the age feature is roughly bell-shaped and shows that the collected data represents individuals aged about 30 to 80. Breaking this range into smaller groups, individuals aged 55 through 65 are in the majority, which indicates that this age group is the one most often considered in the analysis of heart disease. The curve drops on both sides of this densely populated interval. The fall could have several causes: at older ages survival rates are lower, whereas at younger ages fewer individuals are identified because fewer report the effects of the disease. The criteria used to select age ranges during data collection are unknown and probably depended on individual aspects of the problem. Overall, heart disease is observed from age 28 to over 70, and the number of people suffering from it increases between ages 55 and 68.
Figure 2 presents the distribution of the data by the sex of each individual. The x-axis contains the values zero and one, where 0 denotes male and 1 denotes female. Based on the histogram, females were in the majority in the dataset.

Histogram of age.

Histogram of sex.

Histogram of chest pain.
In Fig. 3, most individuals reported typical angina chest pain, represented by value 0, followed by over 90 individuals who reported non-anginal chest pain, represented by value 2. Over 45 individuals reported atypical chest pain, and only a few individuals reported asymptomatic chest pain, the smallest group. Since the highest frequency falls in the 'typical angina' category, this feature might be the highest contributing factor for the presence of heart disease.
Figure 4 shows the distribution of resting blood pressure (BP) values reported for the individuals in the study. The histogram is roughly bell-shaped. Most of the data lies in the range of 110-140 mmHg, which means most individuals reported high resting blood pressure. As blood pressure values increase beyond this range, the number of individuals drops.

Histogram of resting BP.
Figure 5 shows the distribution of serum cholesterol for individuals with and without heart disease. Over 25% of the individuals have borderline cholesterol readings, that is, readings between 200 and 239 mg/dl. Over 50% of the individuals reported a reading of about 240 mg/dl, which lies in the high cholesterol range, and 75% reported readings around 274 mg/dl, which is alarming when monitoring the condition of heart patients. Some outliers are observed around 564 mg/dl. Because the dataset contains both patients and non-patients, and over 75% of the individuals reported high cholesterol readings, cholesterol alone cannot be considered the contributing factor for the presence of heart disease. However, these readings together with other related factors seem to aggravate heart disease. The histogram shows serum cholesterol ranging from 126 to 564 mg/dl, with most patients in the range 180-300 mg/dl.
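The cholesterol ranges quoted in this paragraph can be turned into a small grouping helper. The sketch below is illustrative only: the function name, the label "below borderline", and the sample readings are assumptions introduced here, while the 200-239 mg/dl borderline range and the 240 mg/dl threshold come from the text.

```python
from collections import Counter


def cholesterol_category(mg_dl):
    """Map a serum cholesterol reading (mg/dl) to the ranges named in the text."""
    if mg_dl < 200:
        return "below borderline"  # label chosen for this sketch
    if mg_dl < 240:
        return "borderline"        # 200-239 mg/dl
    return "high"                  # 240 mg/dl and above


# Made-up readings for illustration, NOT taken from the dataset.
readings = [180, 211, 239, 240, 274, 564]
counts = Counter(cholesterol_category(value) for value in readings)
```

Applied to the full cholesterol column, such counts give the share of individuals per range directly, which is easier to verify than reading proportions off a histogram.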

Histogram of serum cholesterol.
Figure 6 shows fasting blood sugar, where 0 denotes individuals with normal levels and 1 denotes individuals with higher sugar levels. The distribution is uneven: around 255 individuals fall under value 0, whereas fewer than 50 individuals reported high fasting blood sugar levels. From this histogram, it can be observed that individuals without elevated fasting blood sugar outnumber those with it.

Histogram of fasting blood sugar.
Figure 7 shows the resting ECG results of all individuals included in the data collection step. The graph shows that the data is evenly distributed. The distribution suggests that, of the 303 individuals, the 165 heart disease patients might have reported ST-T wave abnormality, while the normal individuals reported a normal ECG. A few outlying records report left ventricular hypertrophy. Further exploratory analysis of normal and abnormal ECG results against the heart disease label would better explain the symptoms of the disease.

Histogram of resting ECG.
Figure 8 shows that the data is spread over the entire scale, from approximately 40 to 210 beats per minute, with no visible outliers. Over 70% of the individuals reported a heart rate greater than 160, which is high given that the normal resting heart rate is 60-100 beats per minute. The histogram also shows a step-like rise over the range 40-140. Only a few individuals reported a normal heart rate, and the distribution is skewed to the left, indicating that even normal individuals had higher rates. For the maximum heart rate achieved, the number of patients increases continuously from 90 up to 165, after which it decreases in sudden steps.

Histogram of maximum heart rate.

Histogram of exercise induced angina.
Figure 9 shows exercise-induced angina, which is usually observed during physical activity. The histogram is a discrete distribution over the values zero and one. Most individuals do not have stable angina: the majority of records fall under label zero, and only about a hundred of the 303 individuals reported stable angina. Again, the reason could be less physical activity in daily routines. From this histogram, it is clear that most of the patients do not have exercise-induced angina.

ST Depression induced by exercise relative to rest.
Figure 10 is right skewed, indicating that the highest number of patients have an ST depression of 0 induced by exercise relative to rest; the majority of individuals reported insignificant ST depression. The frequencies fall as the values increase, and the main body of the distribution ends at about 4.8. A minor outlier is also observed at the right edge, in the range 5.5-6.3. These readings are crucial for identifying heart disease, as higher values of ST depression tend to indicate a higher risk of heart attack and related heart issues.
Figure 11 shows that a large number of people have flat or downsloping peak exercise ST segments. Figure 12 shows the number of people for each count of major vessels (0-3) colored by fluoroscopy, where the highest number of patients have a value of 0.

Histogram of peak exercise ST segment.

Histogram of number of major vessels.
Figure 13 shows that most patients have normal thalassemia, with the highest number of patients at value 2. The dataset has only two labels, i.e. presence of heart disease Yes/No. As stated in the data description, there were 138 normal individuals and 165 individuals with heart disease, which is confirmed by the histogram of the target attribute.

Histogram of Thal.
Data pre-processing is an important part of data science, as real-world data tends to be incomplete, noisy, and inconsistent. It involves data cleansing, handling missing values, and handling inconsistent data. For decision-making, dummy variables were created for the categorical attributes. The dataset contains multiple numerical attributes that differ in units, range, and magnitude. This is challenging because many ML algorithms are sensitive to such differences.
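Creating dummy variables for a categorical attribute can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `make_dummies`, the column name `cp` (standing in for chest pain type), and the three records are hypothetical and are not the authors' implementation or data.

```python
def make_dummies(rows, column):
    """Replace a categorical column by one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every attribute except the categorical one being expanded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical records; "cp" is an illustrative column name.
records = [
    {"age": 63, "cp": 0},
    {"age": 41, "cp": 2},
    {"age": 57, "cp": 1},
]
encoded = make_dummies(records, "cp")
```

Each record ends up with exactly one indicator set to 1, so the model no longer treats the category codes as ordered magnitudes.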
Standard scaling is used to handle such features, and it is the most crucial part of data pre-processing. Standardization is a method in which the values are centered on the mean with a unit standard deviation. To standardize the data, standard scaling is implemented as given in Equation (1):
$$z = \frac{x - \mu}{\sigma} \qquad (1)$$
where $x$ is the original feature value, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation.
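Standardization subtracts the feature mean from each value and divides by the feature standard deviation. The sketch below shows this for a single feature using only the standard library; the function name and sample values are assumptions for illustration, and the population standard deviation is used here as a design choice of this sketch.

```python
import statistics


def standard_scale(values):
    """Center values on their mean and rescale them to unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


# Made-up resting blood pressure values (mmHg), for illustration only.
resting_bp = [110, 120, 130, 140, 150]
scaled_bp = standard_scale(resting_bp)
```

After scaling, features measured in different units (for example mmHg and mg/dl) share a comparable range, which matters for distance- and margin-based models such as SVM.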
The dataset is split into two sets i.e. training and testing set. The training set consists of 80% of the experimental dataset and the remaining 20% is used as the testing set.
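An 80/20 split of the 303 instances can be sketched as a shuffled index split. The function name, the fixed seed, and the rounding rule are assumptions of this sketch; the paper does not state how the split was randomized.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into training and testing indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# 303 instances, as in the dataset described above.
train_idx, test_idx = train_test_split_indices(303)
print(len(train_idx), len(test_idx))  # 242 61
```

With this rounding rule the training set holds 242 instances and the testing set 61; a different rounding convention could shift one instance between the sets.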
Machine learning performance is typically evaluated using a confusion matrix. It contains information about the actual and predicted classifications made by a classifier in terms of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
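The four confusion-matrix counts and the measures commonly derived from them can be computed as in the following sketch. The function names and the toy label vectors are illustrative assumptions; the label 1 is assumed to mean that heart disease is present.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def performance_measures(tp, tn, fp, fn):
    """Derive accuracy, precision, recall, and F1 score from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only (1 = heart disease, 0 = no heart disease).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
measures = performance_measures(tp, tn, fp, fn)
```

In a diagnostic setting, recall deserves particular attention, because a false negative corresponds to a patient whose heart disease goes undetected.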
For the experimental analysis, three classification models were trained. First, the experiment was conducted on the default models to obtain their performance measures; after that, every model was fine-tuned by changing its hyper-parameters. Table 2 shows the accuracies obtained by the different versions of the classifiers trained for heart disease prediction. The untuned SVM obtained an accuracy of 94.63%, with 9 false positives, 4 false negatives, and a high number of correct predictions. This accuracy dropped to 87.60% after parameter tuning, with 21 false positives and 11 false negatives. The untuned Decision Tree reached 98.90% accuracy, with approximately zero false positives and zero false negatives, but a large drop was observed after tuning, when the classifier reached 85.54% accuracy. The untuned XGBoost model reached 98.76% accuracy, with only 1 false positive and 2 false negatives. The tuned version of the same algorithm remained stable, dropping only slightly to 96.28%. Based on this stability, XGBoost is the best performer among the algorithms chosen for classification.
Performance measures on tuned and untuned classifiers on training data
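The relation between misclassification counts and accuracy can be illustrated with the figures reported for the untuned SVM and XGBoost models. The sketch assumes a training set of 242 instances (80% of 303, rounded); this size is an assumption, since the exact number is not stated in the text.

```python
def accuracy_from_errors(n_samples, false_positives, false_negatives):
    """Accuracy is the share of samples that are not misclassified."""
    return (n_samples - false_positives - false_negatives) / n_samples


N_TRAIN = 242  # assumption: 80% of the 303 instances, rounded

svm_untuned = accuracy_from_errors(N_TRAIN, false_positives=9, false_negatives=4)
xgb_untuned = accuracy_from_errors(N_TRAIN, false_positives=1, false_negatives=2)

print(f"{svm_untuned:.2%}")  # 94.63%
print(f"{xgb_untuned:.2%}")  # 98.76%
```

Under this assumption, the reported error counts for both untuned models reproduce the stated training accuracies.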
Once the performance measures of the untuned models had been obtained, the models were tuned by adjusting their hyper-parameters. The results of the tuned models show that XGBoost outperforms SVM and Decision Tree with greater accuracy. Table 3 summarizes the performance of the trained classifiers on test data.
Performance measures on tuned and untuned classifiers on test data
This prediction system estimates the possibility of heart disease. By using data mining techniques, early diagnosis can save numerous lives as well as time, which plays a significant role. The data provided by users is classified and categorized, leading to accurate results under the assumption that the data is correct. Among the SVM, Decision Tree, and XGBoost classifiers, the tuned XGBoost model outperforms the other two, exhibiting the highest accuracy of 79.41%.
