Abstract
In this study, we propose a robust model to classify ECG arrhythmia. The proposed approach has two stages: (1) generation of an optimal feature subset using a genetic algorithm (GA) and (2) classification and evaluation of the reduced ECG arrhythmia data using a Support Vector Machine (SVM) ensemble with bagging (bootstrap aggregating). In stage one, we explored a novel GA wrapped around a soft-margin SVM to produce the optimal feature subset. This step greatly improved the quality of classification by eliminating irrelevant features. In stage two, we used an ensemble of SVMs with bagging. We trained each SVM classifier separately on samples drawn at random from the training dataset through a bootstrap technique. These individually trained SVM classifiers were then aggregated using double-layer hierarchical combining to produce a joint decision. To evaluate performance and effectiveness, we applied the proposed model to the UCI Arrhythmia data set (standard 12-lead ECG signal recordings) to classify subjects as normal or abnormal. The proposed model achieved a true classification accuracy of 88.72% under 10-fold cross-validation. The method is also comparable with state-of-the-art classifiers and other methods in the related literature. The experimental outcomes and statistical analysis indicate that our model is a useful and practical approach for ECG arrhythmia classification.
Introduction
Cardiac arrhythmia
A cardiac arrhythmia (dysrhythmia) is a serious heart disorder associated with the rate or rhythm of the heartbeat, and it affects millions of people. During a cardiac arrhythmia, the heart can beat too fast (tachycardia), too slow (bradycardia), or with an irregular rhythm. It is among the most common heart-related problems. Most arrhythmias are not harmful, but some can be critical or even life-threatening (causing sudden death). Timely and accurate analysis and diagnosis are therefore essential. The most common test used to diagnose a cardiac arrhythmia is the electrocardiogram (abbreviated ECG or EKG) [1, 2]. The ability to detect arrhythmias automatically from electrocardiogram signals is very important for clinical diagnosis, analysis, and further treatment.
Electrocardiogram
An electrocardiogram (12-lead ECG) is a graph, produced by an electrocardiograph, that provides valuable data about a patient’s cardiac fitness. ECG recordings consist of large volumes of data. A typical electrocardiogram signal is composed of the P wave, the QRS complex (made up of the Q, R, and S waves), and the T wave. The P wave, QRS complex, and T wave correspond to atrial depolarization (contraction), ventricular depolarization (contraction), and ventricular repolarization (relaxation), respectively. The important parameters for the analysis of cardiac patients are the duration, the shape, and the relationships between the P wave, QRS complex, T wave, and R-R interval. Figure 1 shows the points and elements of the ECG signal. ECGs are routinely used in many clinical applications to assist physicians in diagnosing cardiac disorders. The ECG is a non-invasive diagnostic test that helps to monitor for problems with the electrical activity of the heart (e.g., unexplained chest pain, shortness of breath, irregular cardiac rhythms, and side effects of medicines on the heart). It also helps to monitor the health of the heart when other disorders or risk factors are present, e.g., high blood pressure, high cholesterol, cigarette smoking, diabetes, or a family history of early heart-related medical problems [2, 8, 14, 21].
Figure 1. Schematic representation of a normal electrocardiogram.
Owing to rapid developments in computer technology, new and highly automated diagnostic tools have become possible. Many researchers have proposed computer-aided diagnosis approaches based on machine learning and ensemble classifiers for almost all types of medical applications [18]. Medical decision support systems capable of automatically identifying arrhythmia from ECG recordings help health professionals make more accurate decisions. In this study, we propose a hybrid system that combines a novel elitist genetic algorithm based feature selection method (which reduces the dimension of the input feature space) with an SVM ensemble with bagging (bootstrap aggregating) for the classification of ECG arrhythmia data. To evaluate performance and effectiveness, we applied the proposed model to the UCI Arrhythmia data set to classify subjects as normal (absence of cardiac arrhythmia) or abnormal (presence of cardiac arrhythmia). This benchmark data set of standard 12-lead ECG signal recordings is a good test bed, close to the real-world situation, because it contains missing, incomplete, and ambiguous bio-signals.
Many studies in the literature have performed analyses and simulations on the UCI Arrhythmia dataset. Most of these studies reported true classification accuracies between 70% and 85%. Some important approaches are discussed here. Guvenir et al. proposed the application of the supervised Voting Feature Intervals (VFI5) algorithm and a genetic algorithm for the diagnosis of ECG arrhythmia from standard 12-lead ECG recordings [13]. Gao et al. proposed a diagnostic system using a neural network classifier based on a Bayesian framework [11]. Polat et al. applied an artificial immune recognition system (AIRS) based classifier with fuzzy weighted preprocessing to the ECG dataset and obtained a highest test classification accuracy of 80.77% [25]. Lee et al. proposed a novel classification model based on an artificial neural network with a self-adaptive weighted fuzzy membership function (NEWFM) and achieved 81.32% accuracy [20]. Uyar et al. proposed a confidence-driven serial fusion idea, applying a serial fusion system of SVM and LR classifiers to decide the presence or absence of ECG arrhythmia on the UCI Arrhythmia database [27]. Oveisi et al. introduced a fast and computationally efficient tree-based method for feature extraction (FE) using mutual information (MI) and applied it to six datasets from the UCI repository (including the Arrhythmia dataset) and to the KTH (Kungliga Tekniska Högskolan) action dataset. They obtained 68.5% accuracy on the UCI Arrhythmia dataset [24]. Jadhav et al. analyzed and compared the effectiveness of three ANN models, namely the Multilayer Perceptron (MLP), the Generalized Feedforward Neural Network (GFFNN), and the Modular Neural Network (MNN), for automatically identifying arrhythmias from ECG recordings [16]. Khare et al. applied a hybrid approach of three algorithms, namely rank correlation, PCA, and SVM, to detect the presence or absence of arrhythmia. They claimed 85.98% accuracy [19]. Yilmaz used the Fisher score technique for feature space reduction and the LS-SVM for ECG arrhythmia classification [29]. Jadhav et al. implemented a feature elimination based random subspace ensemble classifier (with the PART tree as base classifier) and evaluated it on the UCI ECG signal data. This method achieved an overall classification accuracy of 91.11% using 90% of the data for training and 10% for testing [15]. Xu et al. compared two popular approaches, SVM and deep neural networks, for arrhythmia classification. They also discussed the application of feature selection methods such as the Fisher discriminant ratio and PCA with these classifiers. Ayar and Sabamoniri proposed a hybrid model based on a genetic algorithm (for feature selection) and a C4.5 decision tree (for classification). They obtained 86.96% true classification accuracy for binary classification [4]. Kadam et al. proposed a novel elitist-population based GA to choose the relevant features and the soft-margin SVM for arrhythmia classification [17].
The rest of this paper is organized as follows. Section 2 elaborates the proposed improved elitist GA-SVM based feature selection technique and the SVM ensemble with bagging. Section 3 describes the UCI dataset, the performance indices, the experiments and results, and the comparisons with other approaches. Conclusions are given in Section 4, along with directions for future study.
Proposed improved elitist GA based feature selection and SVM Ensemble with Bagging
The SVM is one of the most influential and robust approaches for binary classification. To further improve the accuracy, sensitivity, and specificity of medical analysis, we applied an ensemble of SVM classifiers. The proposed approach is a hybrid model that consists of two phases. The first is a complementary feature selection phase, in which we propose an elitist-population based GA with an SVM fitness function [22, 23] to find the relevant and optimal features. The optimal feature set generated in phase 1 is used in phase 2. Phase 2 uses an SVM ensemble with bagging (bootstrap aggregating) as the classifier to classify cardiac arrhythmia into normal and abnormal classes. Figure 2 shows the proposed approach.
Figure 2. Proposed model.
Proposed Elitist-population based Genetic Algorithm assisted by Support Vector Machine.
A large number of attributes (features) in the training dataset degrades the overall performance of many soft computing and machine learning classifiers. High-dimensional data often include redundant and irrelevant dimensions that cause classification errors and reduce accuracy. These irrelevant features must be removed before the training phase to increase the speed, efficiency, and discriminative power of training and testing. Finding the relevant features with the most discriminative power is therefore very important. GA is one of the most adaptive and efficient algorithms for selecting an optimal and relevant feature subset [12]. Many researchers have reported its use as a feature selector [5, 6, 7, 30]. In this study, we propose an elitist-population based genetic algorithm [28] assisted by a Support Vector Machine for the selection of the optimal and relevant feature set. This improved heuristic approach uses the classification error obtained by a ten-fold cross-validated SVM classifier as the fitness value. The primary objective of the proposed elitist GA based feature selector is the minimization of this classification error. Chromosomes (candidate feature subsets) are F-bit strings, where F is the total number of features available in the original dataset and the bit (gene) value at a given position indicates whether the corresponding feature is included in the subset.
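The wrapper search described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB implementation: the selection scheme (binary tournament), single-point crossover, bit-flip mutation, and every parameter value are assumptions, because the GA configuration table is not reproduced in this text. The `fitness` callback stands in for the ten-fold cross-validated SVM error, and `toy_error` is a hypothetical error function used only to make the sketch runnable. The convention that a bit value of 1 marks a selected feature is also an assumption.

```python
import random

def elitist_ga(n_features, fitness, pop_size=20, generations=30,
               n_elite=2, crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Minimise fitness(mask) over bit-string masks of length n_features."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    history = []  # best error seen in each generation

    def pick_parent(scored):
        # Binary tournament (assumed): the lower-error individual of two wins.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] <= b[0] else b[1]

    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in pop), key=lambda t: t[0])
        history.append(scored[0][0])
        # Elitism: the best individuals survive unchanged.
        next_pop = [ind[:] for _, ind in scored[:n_elite]]
        while len(next_pop) < pop_size:
            p1, p2 = pick_parent(scored), pick_parent(scored)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            next_pop.append(child)
        pop = next_pop

    best_error, best_mask = min(((fitness(ind), ind) for ind in pop), key=lambda t: t[0])
    return best_mask, best_error, history

# Hypothetical stand-in for the cross-validated SVM error: features 0, 3 and 5
# are treated as informative, and every other selected feature adds a small penalty.
INFORMATIVE = {0, 3, 5}

def toy_error(mask):
    missed = sum(1 for i in INFORMATIVE if mask[i] == 0)
    extra = sum(bit for i, bit in enumerate(mask) if i not in INFORMATIVE)
    return missed + 0.1 * extra

best_mask, best_error, history = elitist_ga(12, toy_error)
selected = [i for i, bit in enumerate(best_mask) if bit == 1]
```

Because the elite individuals are copied unchanged, the best error recorded in `history` can never increase from one generation to the next, which is the practical benefit of elitism in this kind of search.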
Support Vector Machine
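A Support Vector Machine is a discriminative classifier typically applied to binary classification tasks. The fundamental idea is to find the hyperplane in the feature space that optimally separates the samples into two classes. Support vectors are the data points that lie closest to the decision hyperplane. Let us consider the binary classification problem with a given training set, written here in standard notation as $\{(x_i, y_i)\}_{i=1}^{N}$,
where $x_i \in \mathbb{R}^{d}$ is a feature vector and $y_i \in \{-1, +1\}$ is its class label. The hard-margin SVM seeks the hyperplane $w^{T}x + b = 0$ that solves $$\min_{w,\,b}\ \frac{1}{2}\|w\|^{2} \quad \text{subject to} \quad y_i\,(w^{T}x_i + b) \ge 1, \quad i = 1, \dots, N.$$
This optimization problem is feasible only if the training data are linearly separable. For non-linearly separable cases, the idea of a soft margin is introduced. A soft margin means a hyperplane that separates many, but not all, of the data points. It is an extension of the hard-margin Support Vector Machine. In the formulation of the soft-margin SVM, slack variables $\xi_i \ge 0$ are introduced so that individual points may violate the margin: $$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{N}\xi_i$$
subject to $$y_i\,(w^{T}x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, N.$$
This is called the 1-norm soft-margin problem. Here $C > 0$ is the penalty parameter that controls the trade-off between a wide margin and a small total slack.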
Table 1. UCI arrhythmia dataset (standard 12-lead ECG signal recordings data)
Table 2. Configuration for the elitism-based GA
Proposed Support Vector Machine ensemble with Bagging.
The SVM is one of the most powerful supervised machine learning algorithms and in many cases provides better generalization performance than most state-of-the-art classifiers [9]. To further improve the performance of the SVM, we propose an SVM ensemble (an ensemble of K independent binary SVMs). Ensemble learning is one of the most prominent topics in the field of machine learning [3]. It has proved powerful and successful in a broad range of applications. The main idea of a bagging ensemble is to train many models with the same learning algorithm on different resamples of the training data and to combine their outputs. In this study, we used the SVM with bagging (bootstrap aggregating). We trained each SVM classifier separately on samples drawn at random from the training dataset through a bootstrap technique. These individually trained SVM classifiers were then aggregated using double-layer hierarchical combining to produce a joint decision. In other words, we used another SVM to fuse the decisions of the SVMs in the ensemble: the classification outputs of the individual SVM classifiers in the lower layer are fed into a master SVM in the upper layer. Figure 4 shows the proposed SVM ensemble with bagging.
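The double-layer structure can be illustrated with a short Python sketch. It is a simplified stand-in rather than the authors' implementation: `CentroidClassifier` is a hypothetical nearest-class-mean learner used in place of an SVM so that the example needs no external libraries, and training the master model on the lower-layer outputs for the full training set is an assumption, since the text does not state which data the master SVM was fitted on.

```python
import random

class CentroidClassifier:
    """Tiny stand-in for an SVM: predicts the class with the nearest class mean."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids, key=lambda c: dist2(x, self.centroids[c]))
                for x in X]

def bootstrap(X, y, rng):
    """Draw len(X) training samples with replacement (one bootstrap replicate)."""
    idx = [rng.randrange(len(X)) for _ in X]
    return [X[i] for i in idx], [y[i] for i in idx]

class TwoLayerBagging:
    """Lower layer: models trained on bootstrap replicates. Upper layer: a master model."""
    def __init__(self, n_base=5, make_model=CentroidClassifier, seed=0):
        self.n_base = n_base
        self.make_model = make_model
        self.rng = random.Random(seed)

    def _base_outputs(self, X):
        # One row per sample, one column per lower-layer model.
        columns = [model.predict(X) for model in self.base]
        return [list(row) for row in zip(*columns)]

    def fit(self, X, y):
        self.base = [self.make_model().fit(*bootstrap(X, y, self.rng))
                     for _ in range(self.n_base)]
        # The master model learns to fuse the lower-layer decisions.
        self.master = self.make_model().fit(self._base_outputs(X), y)
        return self

    def predict(self, X):
        return self.master.predict(self._base_outputs(X))

# Two well-separated toy clusters (illustrative data, not ECG features).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
ensemble = TwoLayerBagging(n_base=5, seed=1).fit(X, y)
predictions = ensemble.predict([[0.5, 0.5], [5.5, 5.5]])
```

Replacing plain majority voting with a trained master model lets the upper layer weight the lower-layer classifiers unequally, which is the design choice the double-layer hierarchical combining makes.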
Experiments and results
Dataset and preprocessing
The publicly available ECG arrhythmia dataset from the UCI Machine Learning Repository was used in this study. Details about this dataset are given in Table 1. For binary classification, we divided the dataset samples into two classes (0 and 1). Class 0 includes all normal cases, and class 1 includes all remaining cases. We handled missing attribute values by replacing them with the mean value of that attribute.
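The two preprocessing steps, mean imputation and binary relabelling, can be sketched as follows. The function names are hypothetical, and the sketch assumes the conventions of the UCI data file, in which a missing value is written as `?` and the normal class carries the label 1. Falling back to 0.0 for a column with no observed values is also an assumption.

```python
def impute_column_means(rows, missing="?"):
    """Replace missing entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [float(r[j]) for r in rows if r[j] != missing]
        # Assumed fallback: a column with no observed value is filled with 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] == missing else float(r[j]) for j in range(n_cols)]
            for r in rows]

def to_binary(labels, normal_label=1):
    """Map the normal class to 0 and every other class to 1."""
    return [0 if int(c) == normal_label else 1 for c in labels]

# Small illustrative input, not real ECG records.
raw_rows = [["1", "?"], ["3", "4"], ["?", "8"]]
clean_rows = impute_column_means(raw_rows)
binary_labels = to_binary([1, 2, 16, 1])
```

In a cross-validated experiment the column means would normally be computed on the training folds only; the text does not say whether this was done, so the sketch imputes over all rows.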
Proposed ensemble of SVMs with improved elitist GA based feature selection vs ensemble of SVMs without feature selection.
In the feature selection phase, we used the MATLAB R2018a environment to implement the elitist GA-SVM feature selector. The configuration of the GA used in this study is shown in Table 2. We implemented the SVM based fitness function using three MATLAB functions: ‘fitcsvm’ (with default parameter settings, except that the standardization flag was set to true), ‘crossval’ (to cross-validate the SVM), and ‘kfoldLoss’ (to estimate the 10-fold cross-validation classification error). The feature selection module produced 92 optimal features after the last generation. These 92 features were used in the classification phase.
In the classification phase, we trained each SVM classifier separately on samples drawn at random from the training dataset through a bootstrap technique. We applied one SVM to aggregate the outputs of these SVMs in the proposed ensemble. This formed a hierarchical double layer of SVMs in which the outputs of the lower-layer SVMs were fed into a master SVM in the upper layer. We also varied the number of SVMs in the ensemble to study the effect on accuracy, sensitivity, and specificity (Fig. 5). Section 3.3 describes the performance indices used in this study. Each individual SVM in the ensemble classifier was also built with the MATLAB function ‘fitcsvm’. To find the optimal values of the ‘KernelScale’ parameter (a scaling factor by which all inputs are divided) and the ‘BoxConstraint’ parameter C (penalty term), we used grid search. 10-fold cross-validation was chosen to evaluate the accuracy of the classifier. For comparison, we also ran the same phase 2 experiment on the bagging based SVM ensemble classifier without the feature selection phase.
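The tuning loop, a grid search scored by k-fold cross-validation, can be sketched in plain Python. Everything here is illustrative: `train_and_score` is a hypothetical callback that would train a classifier on the training folds with the given parameters and return the error on the held-out fold, the parameter names `kernel_scale` and `box_constraint` merely mirror the MATLAB options named above, the grid values are made up, and `toy_score` is a placeholder error surface rather than a real SVM.

```python
import itertools
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(train_and_score, X, y, k=10, seed=0, **params):
    """Mean held-out error over k folds for one parameter setting."""
    errors = []
    for held_out in kfold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        errors.append(train_and_score(
            [X[i] for i in train], [y[i] for i in train],
            [X[i] for i in held_out], [y[i] for i in held_out], **params))
    return sum(errors) / len(errors)

def grid_search(train_and_score, X, y, grid, k=10):
    """Try every combination in the grid and keep the lowest cross-validated error."""
    best = None
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        err = cv_error(train_and_score, X, y, k=k, **params)
        if best is None or err < best[0]:
            best = (err, params)
    return best

def toy_score(X_tr, y_tr, X_te, y_te, kernel_scale, box_constraint):
    # Placeholder error surface with its minimum at (1.0, 10.0); a real
    # callback would fit a classifier on (X_tr, y_tr) and score it on (X_te, y_te).
    return abs(kernel_scale - 1.0) / 10 + abs(box_constraint - 10.0) / 100

X = [[float(i)] for i in range(20)]
y = [i % 2 for i in range(20)]
grid = {"kernel_scale": [0.1, 1.0, 10.0], "box_constraint": [1.0, 10.0, 100.0]}
best_error, best_params = grid_search(toy_score, X, y, grid, k=10)
```

Using the same fold assignment (fixed `seed`) for every grid point keeps the comparison between parameter settings fair, because each setting is scored on identical splits.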
Performance indices
We assessed the performance of the proposed classification method using three main indices: classification accuracy, sensitivity, and specificity. We also used the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC).
These terms are defined using the elements of the confusion matrix given in Table 3.
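For reference, the three indices follow from the confusion-matrix counts in the usual way: accuracy = (TP + TN) / (TP + TN + FP + FN), sensitivity = TP / (TP + FN), and specificity = TN / (TN + FP). The short Python sketch below computes them. The counts in the example are not taken from this paper's tables. They are illustrative values chosen on the assumption that the abnormal class is the positive class and that the dataset contains 207 abnormal and 245 normal records; under that assumption they happen to reproduce the percentages reported in the results.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

# Illustrative counts (assumed, see the paragraph above): 207 abnormal, 245 normal.
metrics = binary_metrics(tp=173, fn=34, tn=228, fp=17)
percent = {name: round(100 * value, 2) for name, value in metrics.items()}
```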
Table 3. Confusion matrix of classification
Comparison with other approaches in the literature on UCI Arrhythmia dataset
Figure 6. (a) ROC curve for the proposed ensemble of SVMs with elitist GA-SVM features. (b) ROC curve for the ensemble of SVMs without feature selection.
The experimental results in Fig. 5 show the accuracy, sensitivity, and specificity for different numbers of first-layer SVMs in the ensemble. The proposed approach achieved its best true accuracy of 88.72% with 41 first-layer SVMs using 10-fold CV. With 41 first-layer SVMs, we obtained a sensitivity and specificity of 83.57% and 93.06%, respectively (Fig. 6a shows ROC curve, AUC
Conclusion
Cardiac arrhythmia can be identified by examining the electrocardiogram signal. Computer-based diagnosis and analysis are essential for automatic and reliable classification of arrhythmia from complex ECG signals. The objective of this study was to classify ECG signals as normal or abnormal (cardiac arrhythmia) using an ensemble classifier with evolutionary feature selection. This paper presented and investigated a novel approach that combines elitist GA-SVM wrapper based feature selection, which derives the optimal and significant features, with a bagging based SVM ensemble classifier for the diagnosis of arrhythmia. The model can distinguish arrhythmia with minimal prior medical knowledge. Comparative performance evaluation on the benchmark UCI Arrhythmia dataset shows that the proposed method outperforms most of the other methods available in the literature. Additionally, the bagging based SVM ensemble classifier with feature selection (the proposed approach) was compared with the bagging based SVM ensemble classifier without feature selection. The experimental results allow us to conclude that the proposed approach performs better than both an individual SVM and the SVM ensemble classifier without feature selection on the UCI dataset in terms of overall classification accuracy, which is important in the medical diagnosis of ECG arrhythmia. The experimental results also show that a GA with an SVM fitness function is suitable for effective feature selection. We therefore argue that the proposed model can be a powerful technique for the automated diagnosis of ECG arrhythmia. The framework can be further improved and fine-tuned for practical real-life application. The system can also be customized for other medical diagnosis problems.
