Abstract
Coronary artery disease (CAD) is the most common cardiovascular disease and causes death all over the world. Angiography, an invasive method, is used to diagnose this disease, but it is very costly and has some side effects. Hence, non-invasive methods such as machine learning have been used for diagnosing CAD. One way to detect the presence of CAD is to identify the stenotic artery. The proposed study diagnoses whether each artery is stenotic or not. It aims to provide the best accuracy while balancing the dataset using a spreadsubsample filter. Data pre-processing and feature selection have been performed on the dataset to improve accuracy. Different supervised classifiers were applied to the selected features. The highest accuracies for the left anterior descending (LAD), left circumflex (LCX), and right coronary artery (RCA), all obtained by Random Forest, are 95.70%, 91.41%, and 94.38% respectively. Among all the arteries, LAD has the highest accuracy, indicating that the chances of a person having a stenotic LAD are very high.
Introduction
Cardiovascular diseases (CVD) are the most rampant diseases and cause the majority of deaths. One of the important cardiovascular diseases is CAD. It is caused when one of the three major arteries is blocked or has stenosis. It can cause conditions such as heart attack, angina, and heart stroke. Due to its severity, it is important to diagnose this disease at an early stage so that the number of deaths can be reduced. For this, an invasive method, angiography, is available. Angiography is a successful technique for diagnosing CAD. It can detect whether a person has coronary artery disease or not, which artery is stenotic or blocked, and also the percentage (amount) and the location of stenosis. Although it is a very effective method and the best method for diagnosing CAD, it has certain disadvantages. The major problem with this method is that it is highly expensive, which is one of the main reasons for CVD deaths in low- and middle-income countries. It also has various side effects.
To overcome these disadvantages, many non-invasive methods have been suggested [1, 2, 3] and many surveys have been carried out [4, 5, 6] by researchers. The idea was to use machine learning algorithms for determining CAD. Machine learning is the process of making a machine learn without explicit programming. One of the applications of machine learning is data mining. Data mining is the process of discovering hidden patterns and knowledge in raw data. There are many datasets available on heart disease prediction. For our study, we have considered the extension of the Z-Alizadeh Sani dataset. This dataset contains information about the three major arteries, on which our whole study is based. This is an imbalanced dataset. In our study, this imbalance has come out as an important factor, and we have tried to tackle it as described in Section 3. In most of the previous works where this dataset has been considered, pre-processing was not done because this dataset does not have any missing values.
In [7] a comprehensive review of CAD diagnosis using machine learning techniques has been done. They have shown three main aspects of CAD diagnosis. First is the detection of CAD that is whether a person has CAD or not. In this aspect, a lot of work has been done. Different approaches were made to depict the best accuracy in diagnosing CAD. The second aspect is to find out which artery is blocked or stenotic. Only a few studies have been done on this aspect as we can see in [8]. In our study, we have focused on this aspect. We have predicted the accuracy of each artery. The third aspect is to find out the percentage of stenosis present in each artery. Till now, no work has been done on this aspect. The main contribution of this study is as follows:
Providing a solution for the dataset imbalance problem.
Achieving the best results in the literature by performing feature selection and filtering the dataset.
The organization of the paper is as follows: Section 1 describes the research work done in the field of CAD diagnosis. Section 2 gives the details of the dataset used in this study. It also describes all three arteries. Section 3 describes the proposed work, where the extended Z-Alizadeh Sani dataset [8] has been used. This section describes each step of the data mining approach. The proposed model has been applied to the three major arteries separately, and the best accuracy for each artery has been calculated and explained in Section 4. Section 5 winds up the proposed work and Section 6 gives directions for future research.
In this section, we will be discussing some of the previous work that has been done on this topic. Most of the work was done on the old dataset that is the Z-Alizadeh Sani dataset. Only a few works have been done on the new extended dataset that is an extension of the Z-Alizadeh Sani dataset.
Mohan et al. [9] have proposed a novel method, the hybrid random forest with a linear model (HRFLM), which aims at finding significant features to improve the accuracy of heart disease prediction. With this approach, they achieved an accuracy of 88.7%. Raj et al. [10] have derived an optimal feature selection based medical image classification of lung cancer, Alzheimer’s disease, and brain images. To improve the performance of the optimal classifier, they have used an opposition based crow search algorithm. After performing experiments in MATLAB, they achieved an accuracy of 95.22%, which was the best among the feature selection models they compared. Alizadeh Sani et al. [11] have used different data mining approaches for the prediction of CAD. They have also generated some new attributes, namely the LAD recognizer, LCX recognizer, and RCA recognizer, which helped in improving accuracy. Among the different techniques they used, the SMO classifier achieved the highest accuracy of 94.08%. They have also shown that accuracy improved when the weight-by-SVM method was used for feature selection. Arabasadi et al. [12] have proposed a hybrid method for the diagnosis of CAD. The proposed method has shown that the performance of a neural network combined with a genetic algorithm is 10% higher than the performance of a neural network alone. By using this hybrid method on the Z-Alizadeh Sani dataset, they were able to achieve an accuracy of 93.85%.
Alizadeh Sani et al. [13] have diagnosed coronary artery disease using four data mining algorithms on lab and echo features. They have selected features that were not previously used in any work. They have compared all four algorithms by accuracy. Out of Naïve Bayes, C4.5, AdaBoost, and SMO, SMO has the highest accuracy of 82.16%. Alizadeh Sani et al. [7] have presented a comprehensive review of machine learning-based coronary artery disease diagnosis. The impact of various factors such as dataset features and different machine learning algorithms was investigated in detail. Kodati et al. [14] have analyzed heart disease using the Naïve Bayes algorithm in Weka and achieved an accuracy of 83.7%. Roohallah et al. [15] have suggested a unique approach for predicting the stenotic arteries. This novel approach includes three classifiers. They have also extended the number of records from 303 to 500 and achieved an accuracy of 96.40%. Abdar et al. [16] have proposed a multi-filtering approach for improving the performance of decision trees in diagnosing CAD. For their study, they have used the extended Z-Alizadeh Sani dataset. This model was applied to the three major arteries: the left anterior descending, the left circumflex, and the right coronary artery. The highest accuracies obtained by the Naïve Bayes tree for the left anterior descending, left circumflex, and right coronary artery were 94.40%, 92.97%, and 93.43% respectively. Alizadeh Sani et al. [17] have proposed an ensemble algorithm for diagnosing CAD. They have also used SMO and Naïve Bayes for classification. 10-fold cross-validation was used for calculating the accuracy. The study was conducted on symptom and ECG features. By using their proposed ensemble algorithm, they were able to achieve an accuracy of up to 85%. They have also evaluated certain rules which were not extracted in previous studies. Verma et al. [18] have proposed a novel hybrid approach, where they have used CFS (correlation-based feature subset selection) and the particle swarm optimization (PSO) search method. They have used various supervised filters for comparing performances. They have applied this approach to the Cleveland dataset. After performing all experiments, they have concluded that the highest accuracy, 88.4%, was achieved by MLP (multilayer perceptron). Abdar et al. [19] have used three different types of SVM for their study. To increase the performance of the classifier, they have performed normalization. They have also used a genetic algorithm and PSO for the optimization of the classifier parameters. With these methods, they achieved an accuracy of 93.08%.
From all the above-mentioned works, we can say that different methods were suggested for diagnosing CAD. Different approaches were used for improving the performance of classifiers. In our study, we aim to achieve the highest accuracy while improving the performances of the classifiers using pre-processing and feature selection. While improving the performances, we have taken into account the problem related to our dataset (mentioned in Section 3).
Dataset
For predicting stenosis, we have taken the extended Z-Alizadeh Sani dataset from the UCI dataset library [8]. This dataset is the extended version of the Z-Alizadeh Sani dataset which originally contained 56 attributes and 303 instances. Out of 303 people, 216 people had CAD and the others were healthy. This dataset was collected from the Shaheed Rajaei cardiovascular, medical, and research center of Tehran, Iran. The features are divided into four groups as shown in Table 1. These groups are demographic, symptom and examination, ECG, and laboratory and echo features. Figure 1 shows the ratio of healthy people and non-healthy people.
Features of the extension of Z-Alizadehsani dataset
Pie chart showing the ratio of CAD and NON-CAD people.
The newer version contains three additional features. These new features are the three major arteries: LAD (left anterior descending), LCX (left circumflex), and RCA (right coronary artery). There are two classes for each artery. Class 1 represents people whose artery is stenotic and Class 2 represents people whose artery is normal. The prediction is done for each artery separately. All three arteries are important because if even one of the arteries is blocked or has stenosis of more than 50%, the person will be diagnosed with CAD [15]. Hence, it is important to predict whether the artery is stenotic or not.
Data mining is the process of determining and finding out hidden knowledge from data. It extracts hidden patterns from the raw data. For our dataset, we have applied the data mining process. It consists of various steps.
In the present work, the use of different supervised algorithms for diagnosing CAD is proposed. Some supervised filtering techniques have been applied as a step of data pre-processing. Feature selection has also been done to improve accuracy. All this work has been done using Weka 3-9-3. Weka is a collection of machine learning algorithms for data mining tasks. It provides tools for data pre-processing, classification, clustering, attribute selection, and visualization. Weka provides different components for performing various experiments on the dataset. The proposed work is shown in Fig. 2.
Flowchart of the proposed work.
The following subsection describes each step of the data mining approach where different techniques were adopted in the present work.
Data pre-processing is the step where all the noise, irrelevant data, redundant data, and missing values must be removed or treated. Our dataset does not contain any missing values. However, it does contain class skewness, because one class contains the 216 people who had CAD and the other class contains the remaining healthy people. To reduce this skewness, we have used the Resample filter and then applied the spreadsubsample filter. These filters are supervised. We have used supervised filters because our dataset has a nominal class attribute; otherwise, we would have used unsupervised filters.
In this step, we have applied these filters to our dataset. We have also checked other filters for resampling. After applying other filters and comparing their performances, we have concluded that the supervised filtering techniques, i.e. Resample [16, 20] along with spreadsubsample [20], helped us in achieving the best result. First, we applied the Resample filter two times, and then the spreadsubsample filter was applied. The Resample filter generates a random subsample of the given dataset. It performs sampling either with replacement or without replacement. In our study, we have preferred sampling with replacement.
Spreadsubsample allows us to specify the maximum spread between the majority class and the minority class.
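The balancing step can be illustrated outside Weka with a minimal Python sketch. This is not Weka's implementation: the function names `resample_with_replacement` and `spread_subsample`, the toy records, and the maximum spread of 1.0 (equal class sizes) are assumptions made for illustration only. The 216 : 87 split follows from the 303 instances and 216 CAD cases described above.

```python
import random
from collections import Counter, defaultdict

def resample_with_replacement(rows, size, rng):
    """Draw `size` rows at random; the same row may be drawn more than once."""
    return [rng.choice(rows) for _ in range(size)]

def spread_subsample(rows, label_of, max_spread, rng):
    """Undersample so no class exceeds max_spread * (size of the rarest class)."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    smallest = min(len(members) for members in by_class.values())
    cap = int(max_spread * smallest)
    balanced = []
    for members in by_class.values():
        if len(members) > cap:
            members = rng.sample(members, cap)  # drop surplus rows at random
        balanced.extend(members)
    return balanced

rng = random.Random(0)
# Toy records (hypothetical IDs) mirroring the 216 : 87 class skew.
rows = [("patient%d" % i, "CAD") for i in range(216)]
rows += [("patient%d" % i, "Normal") for i in range(216, 303)]

resampled = resample_with_replacement(rows, len(rows), rng)
balanced = spread_subsample(resampled, lambda r: r[1], 1.0, rng)
counts = Counter(label for _, label in balanced)
print(counts)
```

With a maximum spread of 1.0 the two classes end up the same size; a larger value would tolerate a proportionally larger majority class.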
Feature selection
When we have a lot of features, we have to determine the important features that will help in performing the best classification and give us the best results. The problem with using all the features is that there might be some irrelevant features that are not needed for the prediction. They only add extra time to the computation of results. So it is better to work on the selected features. Therefore, feature selection is an important processing step. In previous studies, some researchers have applied feature selection, and some of them even created features [11]. However, others did not use any feature selection technique and simply used the entire dataset [16]. In the present work, feature selection is applied after the filtering step.
In this work, the GainRatioAttributeEval method along with Ranker as a search strategy is used for selecting features. GainRatioAttributeEval is an attribute evaluator. It evaluates the worth of an attribute by measuring its gain ratio with respect to the class. Its formula is given by Eq. (1).
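Eq. (1) is not reproduced in this text. As a hedged reconstruction, the usual gain ratio definition, which is also the one stated in Weka's documentation for GainRatioAttributeEval, is given below. Here $H$ denotes Shannon entropy; the symbols $C$ (class) and $A$ (attribute) are notation assumed for this sketch.

```latex
\mathrm{GainRatio}(C, A) = \frac{H(C) - H(C \mid A)}{H(A)} \tag{1}
```

The numerator is the information gain of the attribute, and the denominator penalizes attributes that split the data into many small groups.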
Ranker search is a method that ranks all the evaluated attributes by their scores. Since our dataset includes all three arteries, i.e., LAD, LCX, and RCA, we have taken each of them as the class in turn. 30 features were selected with LAD as the class, 26 features with LCX as the class, and 28 features with RCA as the class. All the selected features are listed in Table 2.
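The evaluate-then-rank procedure can be sketched in plain Python as follows. This is an illustrative approximation rather than Weka's code: it assumes nominal attributes only, and the attribute names `attr_a` and `attr_b`, the toy values, and the helper names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain_ratio(attribute, labels):
    """(H(class) - H(class | attribute)) / H(attribute)."""
    total = len(labels)
    groups = {}
    for a, c in zip(attribute, labels):
        groups.setdefault(a, []).append(c)
    # Conditional entropy: entropy of the class within each attribute value,
    # weighted by how often that value occurs.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy(attribute)
    if split_info == 0:
        return 0.0  # attribute has a single value and carries no information
    return (entropy(labels) - conditional) / split_info

def rank_attributes(columns, labels):
    """Score every attribute and sort from highest to lowest gain ratio."""
    scores = {name: gain_ratio(vals, labels) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: attr_a separates the classes perfectly, attr_b not at all.
labels = ["Stenotic", "Stenotic", "Normal", "Normal"]
columns = {
    "attr_a": ["yes", "yes", "no", "no"],
    "attr_b": ["x", "y", "x", "y"],
}
ranking = rank_attributes(columns, labels)
print(ranking)
```

Selecting features then amounts to keeping the top entries of the ranking; how many to keep is a separate choice not shown here.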
Selected features after performing feature selection technique
After selecting features and pre-processing the data, we have applied various supervised techniques to our dataset. These techniques have been compared based on accuracy, sensitivity, specificity, and the time taken to build the model. The comparison is shown in Section 4. While applying the different methods, we have used 10-fold cross-validation, in which the data is partitioned into ten folds and, in each round, nine folds are used for training and the remaining fold for testing (a ratio of 9:1). Different models used were:
SMO
SMO stands for sequential minimal optimization. This algorithm was designed for training support vector machines (SVM). It solves the SVM optimization problem by breaking the large quadratic programming (QP) problem into a series of small QP sub-problems. These small QP problems are solved analytically [21]. It is one of the faster methods and can work on larger datasets easily.
Multilayer perceptron
MLP (multilayer perceptron) is an artificial neural network. It consists of several perceptrons, where each perceptron acts as a linear classifier that divides the input into categories using a straight line. The multilayer perceptron has three types of layers: an input layer, a hidden layer, and an output layer. The number of layers may vary according to the designer or the project. Each of these layers consists of nodes, which are known as neurons.
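The idea of a single perceptron as a linear classifier can be illustrated with a short sketch. The weights, bias, and input points below are hypothetical values chosen only to show how a straight line splits a two-dimensional input space; no training is performed.

```python
def perceptron_predict(weights, bias, inputs):
    """Return 1 if the input lies on or above the line w.x + b = 0, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation >= 0 else 0

# Hypothetical weights and bias defining the line x1 + x2 = 1.5.
weights, bias = (1.0, 1.0), -1.5
above = perceptron_predict(weights, bias, (1.0, 1.0))  # 1 + 1 - 1.5 = 0.5
below = perceptron_predict(weights, bias, (1.0, 0.0))  # 1 + 0 - 1.5 = -0.5
print(above, below)
```

A multilayer perceptron stacks many such units and replaces the hard threshold with a differentiable activation so that the weights can be learned.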
Random forest
It is one of the most powerful machine learning algorithms. It constructs many decision trees during the training phase and finalizes the output by taking the majority decision of the trees. Because many trees are constructed, there is less chance of overfitting. It also provides high accuracy.
AdaBoost
It is an ensemble method that generates a strong classifier by combining various weak classifiers. It was the first boosting algorithm for binary classification. It generates a series of models iteratively, simultaneously rectifying the errors generated from previous models.
Bagging
It is an ensemble method that combines various weak learners to form a single strong learner. Multiple models are trained with the same learning algorithm. The data is divided into a training set and a testing set. Each model is given a different sample drawn with replacement from the training set, so the models generate different outputs. They are then evaluated against the testing set. The final output is the one that is generated by the majority of the models.
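A minimal sketch of this procedure is given below. The base learner (a one-feature threshold "stump"), the toy one-dimensional data, and the function names are assumptions chosen to keep the example small; they are not the configuration used in the experiments.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a threshold classifier: predict True when x >= threshold."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((x >= threshold) != y for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    threshold = best[1]
    return lambda x: x >= threshold

def bagging_fit(data, n_models, rng):
    """Train each model on its own bootstrap sample of the training data."""
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sampling with replacement
        models.append(train_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """Return the output produced by the majority of the models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
# Hypothetical data: one numeric feature, label True when the value is 5 or more.
data = [(v, v >= 5) for v in range(10)]
models = bagging_fit(data, n_models=11, rng=rng)
high = bagging_predict(models, 9)
low = bagging_predict(models, 0)
print(high, low)
```

Using an odd number of models avoids ties in a two-class vote.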
Naïve Bayes
It is a classification algorithm that works on the principle of conditional probability as given by Bayes' theorem. According to Bayes' theorem, we can calculate the conditional probability of an event given that another event has occurred.
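The equation for Bayes' theorem does not appear in this text. Its standard statement is shown below, where $A$ and $B$ are two events with $P(B) > 0$; the symbols and the equation number are assumed here for illustration.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{2}
```

In Naïve Bayes classification, $A$ plays the role of the class and $B$ the observed feature values, with the features treated as conditionally independent given the class.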
Weka provides a lot of performance measures some of which are very important in the medical field. For our study, we have used accuracy, sensitivity, and specificity measures. Based on these measures we have compared the mentioned techniques. These measures are discussed below:
Confusion matrix
A confusion matrix is a table that summarizes the performance of the classifier. It has four fields, as can be seen in Table 3. TP indicates the number of correctly classified instances of class C1. TN indicates the number of correctly classified instances of class C2. FP indicates the number of instances of class C2 falsely classified as instances of class C1. FN indicates the number of instances of class C1 falsely classified as instances of class C2 [11]. This is the description of the confusion matrix for binary classification. If there are multiple classes, the size of the confusion matrix increases accordingly.
Confusion matrix
Through the confusion matrix, we can further calculate three other measures: Accuracy, Specificity, and Sensitivity as shown in Eqs (3)–(5) respectively.
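Eqs (3)–(5) are not reproduced in this text. The standard definitions in terms of the confusion-matrix entries TP, TN, FP, and FN are given below as a hedged reconstruction, numbered in the order named above.

```latex
\begin{align}
\mathrm{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \tag{3} \\
\mathrm{Specificity} &= \frac{TN}{TN + FP} \tag{4} \\
\mathrm{Sensitivity} &= \frac{TP}{TP + FN} \tag{5}
\end{align}
```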
Accuracy defines the proportion of correct predictions out of the total predictions made. We can calculate accuracy using Eq. (3), where the parameters TP, TN, FP, and FN are taken from the confusion matrix.
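The three measures can be computed from a list of actual and predicted labels, as in the sketch below. It uses the standard definitions of accuracy, sensitivity, and specificity; the function names and the six toy predictions are hypothetical, and the stenotic class is taken as class C1.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, TN, FP, FN, treating `positive` as class C1."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """Share of actual C1 instances that are detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual C2 instances that are correctly rejected."""
    return tn / (tn + fp)

# Hypothetical predictions for six records.
actual    = ["Stenotic", "Stenotic", "Stenotic", "Normal", "Normal", "Normal"]
predicted = ["Stenotic", "Stenotic", "Normal", "Normal", "Stenotic", "Normal"]
tp, tn, fp, fn = confusion_counts(actual, predicted, positive="Stenotic")
print(tp, tn, fp, fn)
print(accuracy(tp, tn, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
```

In a medical setting sensitivity is often watched most closely, since a false negative corresponds to a missed stenotic artery.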
All the experiments were performed on Weka 3-9-3. After performing pre-processing and feature selection, we obtained the selected features shown in Table 2. The number of selected features is different for each artery. After applying six different models, namely Random Forest, Multilayer Perceptron, Bagging, SMO, AdaBoost, and Naïve Bayes, we obtained the results in Tables 4–6. These tables report the performance measures accuracy, sensitivity, specificity, and time taken to build the model.
Comparison of different supervised algorithms for class LAD artery
Comparison of different supervised algorithms for class LCX artery
Comparison of different supervised algorithms for class RCA artery
Bar graphs concerning Tables 4–6 have also been shown where the comparison among these techniques and arteries can be seen clearly. These bar graphs correspond to each performance measure i.e., Accuracy, Sensitivity, Specificity, and Time taken to build the model.
Figure 3 shows the comparison of different algorithms for the LAD (left anterior descending) artery. Figure 4 shows the comparison of different algorithms for the LCX (left circumflex) artery. Figure 5 shows the comparison of different algorithms for the RCA (right coronary artery). From these figures, it can be seen that for all arteries, Random Forest has achieved the highest performance among all supervised techniques. The second-best performance is given by MLP, while Naïve Bayes has given the poorest performance. It can also be noted that among the three arteries, LAD has achieved the best performance. The highest accuracies achieved for LAD, LCX, and RCA are 95.70%, 91.41%, and 94.38% respectively. These outcomes are better than the outcomes of previous works.
Comparison of different supervised algorithms for all the three arteries on the Basis of accuracy.
Comparison of different supervised algorithms for all the three arteries on the Basis of sensitivity.
Comparison of different supervised algorithms for all the three arteries on the Basis of specificity.
We have also evaluated the results without performing pre-processing and feature selection, as can be seen in Table 7. To show the importance of pre-processing and feature selection, a bar graph has been drawn in Fig. 6. In this figure, two types of bars are considered, one representing the results with pre-processing and feature selection and the other the results without pre-processing and feature selection. A comparison between these two cases has been shown for all three arteries for the Random Forest classifier. It can be seen that with pre-processing and feature selection we got the best results.
Accuracy comparison of different supervised algorithms for all arteries without pre-processing and feature selection
Comparison of all the three arteries on the Basis of Accuracy for both the cases i.e. with filtering and without filtering.
Without data pre-processing and feature selection, the accuracies are much lower. We can say that pre-processing and feature selection have greatly helped us in achieving the highest accuracy. In Table 7, for LAD, Random Forest has achieved the highest accuracy (79.20%). For LCX and RCA, AdaBoost has achieved the highest accuracy (65.34% and 70.29%). It can be seen here that applying pre-processing and feature selection improves the accuracy. Figure 7 shows a line curve which depicts the time taken to build the model.
Comparison of all the three arteries in terms of the time taken to build the model.
Accuracy comparison of previous works with the proposed work
The proposed study has used various algorithms like SMO, MLP, Naïve Bayes, AdaBoost, bagging, and random forest on our dataset. Before applying any algorithm, the data was pre-processed using supervised filtering techniques, and then feature selection was done. Feature selection plays a very important role in prediction. The accuracy of our prediction is highly dependent on the features that we use. So it is important to select only important features. For each class LAD, LCX and RCA, feature selection was done. All the features are the same except some features like Q Wave, St Elevation, St Depression, T inversion, VHD, and Region with RWMA. 10 fold Cross-validation was used for determining accuracy. After that, all methods were applied.
After getting all the results, we see that for all three classifications, i.e., for LAD, LCX, and RCA, Random Forest has achieved the highest accuracy of 95.70%, 91.41%, and 94.38% respectively. We can also see that among left anterior descending (LAD), left circumflex (LCX), and right coronary artery (RCA), LAD has the highest accuracy of 95.70%. We can say that our results are much better than those of previous works with 303 instances and 56 attributes. We have compared our results with some of the previous works, as shown in Table 8. The works mentioned in Table 8 have used either the first aspect or the second aspect for diagnosing CAD. In this study, we have worked on the second aspect, and until now the highest accuracy for this aspect had been achieved by Moloud Abdar et al. [16], as we can see in Table 8. The researchers mentioned in Table 8 have used different techniques for diagnosing CAD. They have used the two most popular datasets: the Z-Alizadehsani dataset and the Cleveland dataset. The accuracy they were able to achieve is also mentioned there.
In this study, coronary artery disease was diagnosed using the three major arteries LAD, LCX, and RCA. The classification has been done using each artery separately. Filtering and feature selection have been done to get better results. Pre-processing has rectified the imbalance problem of our dataset, due to which we were able to get improved performance. Different supervised machine learning algorithms were applied. The effectiveness of the algorithms was evaluated in terms of accuracy, sensitivity, specificity, and time taken to build the model. The extended Z-Alizadehsani dataset was analyzed. This extended dataset includes three new features, which are the three arteries LAD, LCX, and RCA. We have performed all our experiments using the Weka tool. From the experiments, we have concluded that Random Forest has achieved the highest accuracy for all arteries, LAD (95.70%), LCX (91.41%), and RCA (94.38%), and that among LAD, LCX, and RCA, LAD (95.70%) has the highest accuracy across the different algorithms. This accuracy is higher than the accuracy in previous works. With feature selection, the time taken to build the model is also lower. From this we can also conclude that people might have higher chances of having a stenotic LAD, which can result in CAD. Our analysis suggests that LAD is the major factor among LAD, LCX, and RCA for determining CAD, so more focus should be placed on diagnosing LAD.
Future work
The future aspect of this study is to analyze the different sets of features using different feature selection methods, and the behavior of these arteries according to those features. We aim to consider the third aspect for diagnosing CAD that is determining the percentage of stenosis. Different sampling techniques might be considered for future work for further improving the performance of classifiers.
