Abstract
Electrocardiogram (ECG) data recorded by medical devices are hard to analyze manually, so it is important to analyze and categorize each heartbeat using machine learning. Recent advances in machine learning have made the classification of complex data easy and fast. However, these algorithms require a sufficient amount of training data, and their performance is limited when the data are imbalanced. In the MIT-BIH arrhythmia dataset, the distribution of training instances is highly imbalanced. Many machine learning algorithms, particularly deep learning algorithms, achieve high overall accuracy on such datasets while the minority classes still have zero accuracy. In this paper, we use transfer learning to improve the accuracy of the minority classes without hurting the accuracy of the other classes. The accuracy of an existing deep learning model is increased from 90.67% to 98.47%.
Introduction
An electrocardiogram (ECG) records the electrical activity generated by the heart and is a non-invasive indicator of how the heart operates. Cardiologists and other medical practitioners use it mainly to diagnose heart-related issues. Earlier, heart-related issues were diagnosed and examined manually, which was a very complicated process with a high margin of error. Disturbances in heart rate, rhythm, or the transmission of heart signals are referred to as arrhythmia [42], which the World Health Organization (WHO) considers to be one of the fatal diseases. Because cardiovascular diseases account for about one-third of deaths around the world, diagnosing them properly is crucial. For a doctor, diagnosing the morphological variations caused by cardiac anomalies [24] is a lengthy and grueling process, because only a limited amount of time can be devoted to a single patient; this creates the need for an automated ECG signal classification system. The Association for the Advancement of Medical Instrumentation (AAMI) has proposed certain types of heartbeat, which are shown in Table 1. Apart from these, a person’s heartbeat can also depend on many other factors such as stress, exercise, excitement and emotional state. The lack of an appropriate classification system and the variation in heartbeats are therefore major challenges to address.
ECG heartbeat classes as per AAMI recommendation
ECG classification is one of the challenging problems in the field of machine learning. The major issue with the ECG waveform is its variation: waveforms differ considerably between patients. For example, the waveform may be identical for the heartbeats of different patients, and may also be distinct for the heartbeats of the same individual at different times. The varying nature of heartbeats therefore makes heartbeat detection a complex issue. Several state-of-the-art algorithms have been developed for this problem, including wavelet transforms [11], filter banks [32], Hidden Markov Models (HMM) [6], Support Vector Machines (SVM) [41], and Artificial Neural Networks (ANN) [12]. Most of the time, ECG classification systems perform well on training datasets but poorly on test datasets due to the above issues. In recent years, machine learning, and especially deep neural networks (DNNs) such as Deep Convolutional Networks (DCNs), has gained a lot of attention. DCNs learn features automatically, which eliminates the need to extract and process hand-crafted features. DCNs consist of various hidden layers, including 1D or 2D convolution layers, sub-sampling layers and fully connected layers, followed by a final classification layer.
As previously stated, a heart arrhythmia or dysrhythmia refers to an irregular heartbeat [42] (i.e., too fast or too slow), indicating a heart problem that needs to be remedied in the near future to prevent additional complications or a heart attack. Arrhythmia can be caused by many factors, including coronary artery disease, high blood pressure, valve disorders, cardiomyopathy, damage from a prior heart attack, electrolyte imbalances in the blood, and other medical conditions. Arrhythmia may also occur in healthy hearts because of medications, caffeine, nicotine, alcohol, cocaine, diet pills, etc. Depending on gender and age, heartbeats can vary from person to person. A regular heart rate is between 60 and 100 beats per minute. There are four forms of arrhythmia: extra beats (starting in the upper chamber); tachycardia (rapid heartbeats, i.e., more than 100 beats per minute); bradycardia (sluggish heartbeats, i.e., less than 60 beats per minute); and ventricular arrhythmia (originating in the lower heart chamber).
Most machine learning methods for classifying ECG heartbeats include a pre-processing step in which features of the heartbeat signal are extracted and then used for classification. Given all the above-mentioned issues, signal feature representation is one of the most significant steps in the classification of ECG heartbeat signals, so several researchers have reported pre-processing and feature extraction techniques for these signals. Nevertheless, because heartbeat signals differ between patients, these hand-crafted and low-level feature representations do not perform well for all patients. In addition, post-processing such as dimensionality reduction and feature normalization increases the computational complexity of the process and makes it more challenging to build a compact heartbeat monitoring system. On well-known datasets such as MIT-BIH (explained in Section 3.1), various popular frameworks report accuracy above 90%. It is worth mentioning that most of these models give zero accuracy for two classes of MIT-BIH: the Fusion beat class and the Unknown beat class. The main reason for the high overall accuracy is the first class, the Normal beat class. The Normal beat class has 82.77% of the training instances, while the Fusion beat and Unknown beat classes have 6.61% and 0.73%, respectively, which makes the MIT-BIH dataset highly imbalanced. The Normal beat class has high accuracy during training, which lifts the overall accuracy of the system even though the other two classes have zero accuracy. With all these issues in mind, we study the effect of deep learning models on the performance of ECG heartbeat classification systems and report the evaluated results of various deep convolutional network models [20].
In this article, we compare and evaluate well-known existing Deep Convolutional Network (DCN) models [2, 33] that were found to perform best in their respective tasks. We retrain these deep learning models from scratch on ECG heartbeat classification data while preserving their configuration and architecture, in order to evaluate their performance for ECG classification and to see how well they can classify the ECG shapes of heartbeats for the normal case and for cases affected by different arrhythmias and myocardial infarction. We evaluated the same models with and without transfer learning to measure how much the transfer learning technique can improve the classification results of ECG classification systems. Due to the lack of standardization in the development and validation requirements of these algorithms, other classification methods for ECG classification, such as SVM and HMM, could not be compared.
This work is organized as follows. The literature review is presented in Section 2. Our proposed learning model is given in Section 3, where data processing is explained first. After the pre-processing of ECG heartbeat signals, we discuss the process of deep feature extraction using different deep convolutional models with and without transfer learning, and describe the architectures of the DCN models used for classification. Section 4 is dedicated to implementation details, experimental results and the evaluation of the different models, followed by conclusions in Section 5.
The use of computers in the field of medicine has increased significantly in the recent past, and it now plays a vital role in the advancement of medical diagnostics [1, 39]. Machine learning is a technique for teaching machines to recognize patterns in complex data and to use these patterns to find an optimal solution to a problem with the aid of different algorithms [7, 23]. In essence, machine learning applies statistical models to complex real-world data using computing technology [3]. As life becomes busier day by day, medical practitioners do not have enough time to go through the specifics of each and every patient. There is an increasing need for a computer system that can help not only the medical practitioner but also the patients themselves, simply by feeding the data into the processing system. This could help to save countless lives in cases where a serious event is identified in time.
Authors in [30] use a shallow convolutional neural network to detect myocardial infarction and achieve 84.54% accuracy with 85.33% sensitivity and 84.09% specificity. In [36], a fully convolutional neural network is applied to the PTB ECG dataset using 10-fold cross-validation, reaching 93.3% sensitivity and 89.7% specificity. Feng et al. [8] applied a 16-layer convolutional neural network combined with a long short-term memory network for 1-level myocardial infarction and achieved 95.4% precision, 98.2% sensitivity, 86.5% specificity and an F1 score of 96.8%.
Deep learning enhances learning algorithms, particularly for automatic pattern extraction from large collections of data, thus advancing machine learning capabilities. Recently, due to its optimization ability, it has gained a lot of attention in the fields of prediction, classification and decision-making [38]. Authors in [5] have developed a recurrent neural network (RNN) model with a gated recurrent unit that uses patient history to predict diagnoses and medications. Deep learning is a machine learning method that has gained popularity among researchers because of its ability to process extremely large amounts of data for healthcare purposes. Deep learning approaches are commonly used for numerous purposes in various areas of life, including classification of objects, process simulation, modeling, image recognition, fault diagnosis and detection, etc. Authors in [13] used the extreme learning machine method to detect cardiac disease using the age, sex, cholesterol, blood sugar, etc. of 300 patients and achieved 80% accuracy. In order to find an optimal solution to real-life problems, deep learning methods are used to construct an appropriate model.
Labeling medical images is an expensive, complex and time-consuming task. To address this challenge, authors in [37] suggest a new technique for image augmentation based on a statistical shape model and a three-dimensional thin plate spline, which can produce multiple virtual images from a limited number of real images. Initially, the statistical shape model is built from the shape data of the actual labeled images, and a series of virtual shapes is created by sampling from this model. To produce the virtual images, the virtual shapes are filled with detail using the three-dimensional thin plate spline. Finally, virtual images and actual images are used together for training DNNs. The suggested approach is a generic form of data augmentation that can be used with any DNN architecture in any geometric segmentation task.
Authors in [10] suggest a temporal transfer learning method that uses knowledge from corresponding points in time to generate a predictive model of early cardiac arrest that is robust in accuracy and also preserves the computational complexity of parameters. This method calculates the parameters of logistic regression concurrently to share knowledge from various measurement periods. This approach can solve small sample size problems, resulting in a reliable determination of the model parameters. However, this scheme does not give optimal results on large datasets. Similarly, [34] presents a loop-locked system combining arrhythmia diagnosis, label query, and model fine-tuning based on transfer learning and active learning. A multi-input DNN is pre-trained using labeled data from an existing training set. The model can be used for the identification of arrhythmia in practical applications.
Authors in [40] suggested to employ greedy deep dictionary learning compared to prior surface dictionary learning. They follow layer-by-layer training philosophy to improve the hidden layer so that neighborhood knowledge between the layers may be trained to preserve their features by reducing the likelihood of overfitting. It also ensures that every layer is convergent i.e., optimizing training and learning performance. Furthermore, [22] mentions that it is a key challenge to improve deep learning methods for healthcare prediction and decision making. Following these steps, [26] applied deep convolutional neural network algorithms for detecting central retinal vein occlusion using ultrawide-field fundus ophthalmoscopy and achieved sensitivity of 98.4%, and a specificity of 87.5%.
A dataset for evaluating cardiac MRI is presented in [4]; it includes multi-equipment MRI measurements with valid calculations and classifications by medical professionals. The overall purpose of that paper is to measure the degree to which state-of-the-art machine learning approaches can be applied to the evaluation of MRI, i.e., the segmentation of the myocardium and the ventricles, and the classification of their respective dysfunctions. The authors report the outcomes of the deep learning methods provided by various researchers for the segmentation and classification tasks. A full assessment of the segmentation outcomes, however, shows that deep learning approaches often yield overly exaggerated findings, unlike medical professionals.
Authors in [19] formulate a deep learning method for calculating the cardio-thoracic ratio (CTR) from chest X-rays, which is an important parameter used in heartbeat measurements. A convolutional neural network is used to segment the chest X-ray images and measure the CTR. Using Bland-Altman regression, linear correlation graphs and intra-class correlation observations, the CTR measures generated by the deep learning model are compared to the default average. The diagnostic efficiency of the heart enlargement detection model is tested and compared with some other machine learning approaches and with radiologists. Diagnostic accuracy, coherence, and positive predictive value were comparable between the techniques. Deep learning demonstrated a comparatively high tolerance and negative predictive value. In this pilot study, the efficiency of the deep learning method for calculating CTR is shown to be outstanding; whether the method can be implemented well in clinical settings must be tested prospectively in further research.
Classification accuracy in machine learning is generally not adequate for two main reasons: the selection of data used for training and testing can vary, and the amount of training data may not be appropriate. Furthermore, most solutions in machine learning produce black-box models that are hard to understand. To tackle these issues, authors in [14] combine transductive transfer learning, semi-supervised learning, and TSK fuzzy method. To train the machine, two learning algorithms have been proposed. Their findings indicate that the methods proposed can deliver better results than many state-of-the-art models for seizure classification. However, some work must be carried out to reduce the algorithm’s computational cost.
Proposed model architecture
This work evaluates four different CNN models used to classify ECG heartbeats into the five classes N, S, V, F and Q of the MIT-BIH database.
This section is organized as follows: Section 3.1 describes the dataset and the related pre-processing; Section 3.2 explains the deep learning models used for transfer learning and as baselines, namely those of Mansar [21], T. J. Jun et al. [15], Baber et al. [2], and Shan et al. [33]; Section 3.3 explains the transfer learning framework.
ECG heartbeat dataset
PhysioNet, the research resource for complex physiologic signals, provides various challenging datasets that are publicly available. We worked with the PhysioNet MIT-BIH Arrhythmia dataset. The number of samples in this collection is large enough for training a deep neural network and is suitable for the experimental evaluation of the CNN models [16]. The dataset is composed of 48 two-channel ambulatory ECG recordings; each recording is a 30 min segment selected from 24-hour recordings of 47 subjects. Each ECG recording is filtered, sampled at a frequency of 360 Hz, and annotated by two cardiologists.
For the experiments, we use lead II heartbeats re-sampled to a sampling frequency of 125 Hz as input. We use 44 ECG records from the MIT-BIH arrhythmia database for training and evaluating the different CNN models. The remaining four records are excluded from the experimental evaluation as per the AAMI recommendation, as they do not have good signal quality and are considered unreliable for experiments [16]. According to the AAMI recommendation, the ECG dataset can be classified into five categories (N, S, V, F, Q); details on these categories are given in Table 1. The distribution of the training instances is shown in Table 2.
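As a rough illustration of the re-sampling step, the sketch below converts a signal from 360 Hz to 125 Hz by linear interpolation. It is not the pre-processing pipeline of [16]: the function name `resample_linear` and the synthetic sine input are assumptions made for this example, and linear interpolation applies no anti-aliasing filter, so a polyphase resampler would usually be preferred in practice.

```python
import math

def resample_linear(signal, fs_in, fs_out):
    """Resample a 1D signal from fs_in to fs_out Hz by linear interpolation.

    Illustrative helper only (hypothetical name); no anti-aliasing is applied.
    """
    if len(signal) < 2:
        return list(signal)
    duration = (len(signal) - 1) / fs_in            # time of the last input sample (s)
    n_out = int(math.floor(duration * fs_out)) + 1  # output samples that fit in that span
    out = []
    for k in range(n_out):
        pos = (k / fs_out) * fs_in                  # fractional index into the input
        i = min(int(math.floor(pos)), len(signal) - 2)
        frac = pos - i
        out.append((1.0 - frac) * signal[i] + frac * signal[i + 1])
    return out

# One second of a synthetic 5 Hz sine sampled at 360 Hz (stand-in for an ECG lead).
fs_in, fs_out = 360, 125
x = [math.sin(2 * math.pi * 5 * n / fs_in) for n in range(fs_in)]
y = resample_linear(x, fs_in, fs_out)
print(len(x), "->", len(y))
```

The ratio 125/360 reduces to 25/72, so one second of input (360 samples) maps to 125 output samples.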
MIT-BIH dataset classes distribution
An ECG signal is a recording of the electrical activity of the human heart made using electrodes. It is the tool that cardiologists use to diagnose diseases and conditions related to the heart. A convolutional neural network (CNN) can extract strong and robust features from a given input signal or image.
As mentioned above, the aim of this article is to evaluate different DCN models which have high accuracy in their respective tasks, for the classification of electrocardiogram (ECG) shapes of heartbeats for the normal case and the cases affected by different arrhythmias and myocardial infarction. Figure 1 shows the generalized approach that is used to evaluate all the models.

Abstract flow diagram for designing deep CNN.
The training data are pre-processed before they are passed to the classifiers. The pre-processing of the ECG signals and the extraction of the beats in our experiments are similar to what is done in [16]. A Convolutional Neural Network (CNN) is used to train the classifiers. A CNN is mainly composed of a feature extraction part and a classification part. In the feature extraction part, the CNN model is responsible for extracting robust and effective features from the given input, whereas in the classification part, its role is to assign the extracted features to one of the predefined classes. This whole process is a form of supervised learning, in which annotated data are provided as input to the CNN model.
In this paper, four deep models with different configurations are used for evaluation. CNN-based deep models are generally designed to deal with 2D data. Since we have 1D ECG signals, we need to adjust the structure of the CNN models.
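To make the 1D adaptation concrete, the following minimal sketch shows what a single 1D convolution filter, a ReLU activation and a max pooling step do to a sequence. The toy beat, the kernel values and the pooling size are assumptions chosen for illustration and do not correspond to any layer of the evaluated models.

```python
def conv1d_valid(signal, kernel, bias=0.0):
    """'Valid' 1D cross-correlation, the operation a 1D convolution layer applies."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(n - k + 1)]

def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

# Toy 10-sample "beat" with one sharp peak (hypothetical data).
beat = [0.0, 0.1, 0.0, -0.2, 1.0, -0.3, 0.0, 0.2, 0.1, 0.0]
kernel = [-1.0, 2.0, -1.0]   # hypothetical filter that responds to sharp peaks

features = max_pool1d(relu(conv1d_valid(beat, kernel)), size=2)
print(len(beat), "->", len(features))
```

A kernel of width 3 slides along the single time axis, so 10 input samples give 8 convolution outputs and 4 pooled features; a 2D layer would instead slide over both axes of an image.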
The first evaluated CNN model was proposed by Mansar [21] and is a modified version of [16]; the only difference is that it has no residual blocks. As stated above, the given ECG signals are 1D in nature, whereas CNNs generally accept 2D signals. The architecture and configuration of this first model are taken from the source provided by the author, for the purpose of the evaluation in this article.
In [15], the authors proposed a model with 2D convolution and transformed the ECG signals into 128 × 128 grayscale images. They performed data augmentation to increase the diversity of the data, applied batch normalization, and used ELU as the activation function. The main structure of this model is similar to VGGNet [35], with various optimizations to reduce over-fitting and improve classification accuracy. However, for a fair comparison of the results, we replaced the 2D convolutions with 1D convolutions and kept the ECG signals as they are instead of transforming them into images. Table 3 lists the sequence of layers in the DCN proposed by [15].
CNN layer configuration [15]
Baber et al. proposed a deep CNN model (DCNN) for facial expression recognition. Their model has 5 hidden layers: 3 convolution layers, each followed by a max pooling layer, and two final fully connected layers. The configuration of each layer of their DCNN model is shown in Table 4. Their simple yet effective model performs well on facial expression recognition. The authors apply 2D convolution for the recognition of facial expressions; we convert it into 1D convolution while keeping the other configurations of the layers as they are. The model created using the architecture in [2] is shown in Figure 2.
No. of Convolution filters and their sizes used in each layer along with pooling size [2]

1D convolution DCN model for heartbeat classification & DCN layers configuration [2].
The last model evaluated is the ResNet20 model [33]. The authors use the ResNet20 architecture on CIFAR-10 in order to detect adversarial attacks on neural networks, whereas in our work we use the ResNet20 architecture to evaluate ECG heartbeat classification. We apply this model to the same pre-processed MIT-BIH dataset. As mentioned, we implement the models with 1D convolution for the processing of ECG signals and compare all these CNN models for ECG heartbeat classification.
In machine learning, there exists an effective and faster way of performing classification using pre-trained models, known as transfer learning. It is a method in which an already trained model is reused as the starting point for solving a different classification problem, drawing on the knowledge the model acquired while solving a similar problem. It is an easier and faster process than training a DCN from scratch with randomly initialized weights. The generalized form of transfer learning is presented in Figure 3.

DCN architecture with transfer learning.
Transfer learning is very useful and often gives good performance on the required task, but it depends on a few factors and constraints that should be met to obtain good results. The first factor is the similarity between the new dataset and the original dataset, and the other is the size of the dataset. One important point to keep in mind when working with DCNs is that features are more generic in the early layers and more specific to the original dataset in the last few layers. There are therefore four scenarios on which the usefulness and effectiveness of transfer learning depend. First, the new dataset is small but similar to the dataset on which the pre-trained network was trained. CNN models over-fit on small datasets; with transfer learning, the network pre-trained on a similar dataset supports the smaller dataset, and the chance of over-fitting is reduced. Second, the new dataset is small and also not similar to the original dataset. It is then better to train a classifier on activations from somewhere earlier in the network instead of from the upper layers, because the top of the network contains features that are specific to the original dataset, which is quite different from the new dataset. Third, the new dataset is large and similar in content to the original dataset of the pre-trained network. In this case, one can be more confident that the model will not over-fit. Fourth, the new dataset is large and different. A classifier can then be trained from scratch; however, in practice it is still beneficial to initialize it with the weights of a related pre-trained model, since there are enough data and confidence to fine-tune through the entire network.
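The four scenarios can be summarized as a small lookup. The function below is only a reading aid: the name `transfer_strategy` and the wording of the returned strings are ours, and the mapping restates the rules of thumb given above rather than any fixed procedure.

```python
def transfer_strategy(new_data_is_small, similar_to_source):
    """Return the rule of thumb for one of the four transfer learning scenarios.

    Hypothetical helper; the strings paraphrase the scenarios in the text.
    """
    if new_data_is_small and similar_to_source:
        return "keep pre-trained layers fixed and train a new classifier on top"
    if new_data_is_small and not similar_to_source:
        return "train a classifier on activations from earlier layers"
    if not new_data_is_small and similar_to_source:
        return "fine-tune the pre-trained network; over-fitting is less likely"
    return "initialize from pre-trained weights and fine-tune the entire network"

for small in (True, False):
    for similar in (True, False):
        print(small, similar, "->", transfer_strategy(small, similar))
```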
The first case is very similar to our problem. Therefore, we fine-tuned the pre-trained model while keeping its layers fixed to minimize the chance of over-fitting. Furthermore, two possibilities of transfer learning are exploited in the experiments: 1) the base model is pre-trained on the MIT-BIH database and is fine-tuned and retrained on the same dataset; 2) the base model is trained on CinC2017 and is fine-tuned and retrained on the MIT-BIH dataset.
Transfer learning is an optimization that allows rapid progress or improved performance when modeling the second task.
In this paper, we have two fine-tuned models. 1) The first fine-tuned model is the combination of [2] and [21], as these two models are simple and effective. We used [21] as the base model and [2] as the head model, and both the base and head models are trained on the MIT-BIH dataset. 2) In the second model, we trained the base model on the CinC2017 dataset and fine-tuned it on the MIT-BIH database. The architecture of the base model is taken from [21], and the architecture of the head model is the same as that of [2]. To summarize the two fine-tuning possibilities: in the first, we used the same dataset before and after fine-tuning the network, and we call this MIT-BIH-FT; in the second, we trained the model on the CinC2017 dataset and fine-tuned it on MIT-BIH, and we call this Cin-MIT-BIH-FT. Experiments show that MIT-BIH-FT is more effective. Figure 3 shows that the last fully connected (FC) layers are removed from the pre-trained base models in order to do transfer learning. After removing the FC layers of the pre-trained models, we have an intermediate representation of the input heartbeat signals. This intermediate representation is then used as the input to the newly added layers to obtain the new fine-tuned DCN model, which builds on the previous knowledge of the pre-trained models. This new model is then used for testing and evaluation on the test dataset.
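The idea of reusing a base without its FC layers and training only newly added layers can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the Keras implementation used in the experiments: the base weights are random stand-ins for a pre-trained network, the layer sizes and data are invented, and the new head is a single softmax layer. In Keras, the same freezing is normally expressed by setting `trainable = False` on the reused layers.

```python
import math
import random

random.seed(0)

def dense(x, weights, biases):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(z):
    m = max(z)
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]

# Stand-in for a pre-trained base with its FC layers removed (random weights,
# hypothetical sizes). These weights are never updated below.
N_IN, N_FEAT, N_CLASSES = 8, 6, 5
base_w = [[random.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_FEAT)]
base_b = [0.0] * N_FEAT

def base_features(x):
    """Intermediate representation produced by the frozen base."""
    return relu(dense(x, base_w, base_b))

# Newly added head: the only trainable part.
head_w = [[0.0] * N_FEAT for _ in range(N_CLASSES)]
head_b = [0.0] * N_CLASSES

def predict(x):
    return softmax(dense(base_features(x), head_w, head_b))

def train_head(samples, labels, lr=0.05, epochs=100):
    """SGD on softmax cross-entropy; only head_w and head_b are updated."""
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = base_features(x)
            p = softmax(dense(f, head_w, head_b))
            for c in range(N_CLASSES):
                g = p[c] - (1.0 if c == y else 0.0)  # gradient w.r.t. logit c
                head_b[c] -= lr * g
                for j in range(N_FEAT):
                    head_w[c][j] -= lr * g * f[j]

# Toy data: one hypothetical "beat" per class.
samples = [[random.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_CLASSES)]
labels = list(range(N_CLASSES))

base_before = [row[:] for row in base_w]
train_head(samples, labels)
probs = predict(samples[0])
print("base unchanged:", base_w == base_before)
```

Because gradients are applied to the head only, the base keeps exactly the representation it had before fine-tuning, which is the property that limits over-fitting on a small dataset.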
Keras with the TensorFlow computational library is used to implement the framework explained above, and softmax cross-entropy is used as the loss function with a learning rate of 0.001 [17]. The upper limit for epochs is set to 1000, and a schedule adjusts the learning rate during training.
The schedule follows the technique known as learning rate annealing, which recommends starting with a relatively high learning rate and gradually lowering it during training while monitoring the accuracy of the network. This trick helps to move quickly from the initial parameters to a range of good parameter values, after which smaller learning rates help to refine the parameters within that range (from Karpathy’s CS231n notes). Another form of learning rate annealing is step decay, where the learning rate is decreased by some percentage after a set number of iterations or training epochs.
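Step decay can be written as a one-line schedule. In the sketch below, only the initial rate of 0.001 comes from the setup described above; the drop factor of 0.5 and the interval of 100 epochs are assumed values chosen for illustration, and the function name `step_decay` is ours.

```python
def step_decay(epoch, initial_lr=0.001, drop=0.5, epochs_per_drop=100):
    """Learning rate after `epoch` epochs under step decay.

    `drop` and `epochs_per_drop` are illustrative assumptions, not reported values.
    """
    return initial_lr * drop ** (epoch // epochs_per_drop)

for epoch in (0, 99, 100, 250, 999):
    print(epoch, step_decay(epoch))
```

A function of this shape can typically be passed to a framework's learning rate scheduler callback so that the rate is updated at the start of each epoch.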
The arrhythmia classifier is evaluated on 109446 samples with a sampling frequency of 125 Hz and 5 classes, where the test dataset contains 4079 heartbeats (about 819 from each class); the results are shown in Table 5. It can be seen that both possibilities of fine-tuning improve the accuracy of deep learning. For both possibilities, MIT-BIH-FT and Cin-MIT-BIH-FT, the base models are [2, 21]. However, MIT-BIH-FT performs better in the experiments. The test dataset contains heartbeats that were not used during network training and validation. Figure 4 shows the confusion matrices evaluated on the training and test MIT-BIH datasets using the models proposed in [2] and [21]. It can be seen that transfer learning increases the accuracy of the minority classes. The models without transfer learning have competitive overall accuracy despite the fact that the minority classes have zero accuracy, as can be seen in Figure 4.
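The gap between overall accuracy and minority-class accuracy can be read directly from a confusion matrix. The matrix below is hypothetical and is not taken from Table 5 or Figure 4; it is constructed only to show how a model can exceed 90% overall accuracy while two classes are never predicted correctly.

```python
def overall_accuracy(cm):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def per_class_accuracy(cm):
    """Per-class accuracy (recall): diagonal entry divided by the row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

classes = ["N", "S", "V", "F", "Q"]
# Hypothetical counts; rows are true classes, columns are predicted classes.
cm = [
    [900, 5, 5, 0, 0],   # N
    [10, 40, 0, 0, 0],   # S
    [8, 0, 52, 0, 0],    # V
    [6, 0, 4, 0, 0],     # F: never predicted correctly
    [10, 0, 0, 0, 0],    # Q: never predicted correctly
]

print("overall accuracy:", round(overall_accuracy(cm), 4))
for name, acc in zip(classes, per_class_accuracy(cm)):
    print(name, round(acc, 4))
```

Here 992 of 1040 samples are classified correctly, so the overall accuracy is above 95%, yet the per-class accuracy of F and Q is zero because the dominant N class carries the average.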

In this paper, we evaluated different DCN models with and without transfer learning using 1D convolutional neural networks. ECG recordings from the MIT-BIH arrhythmia database are used for this purpose. This dataset is widely used to evaluate machine learning algorithms, and the distribution of its classes is highly imbalanced. The minority classes have zero accuracy with various deep learning models, yet the overall accuracies remain above 90%. Our experimental evaluation of transfer learning shows that the knowledge learned from the MIT-BIH database can be improved further by transferring it to a new model that is retrained on the same, small dataset. We used two configurations of transfer learning, MIT-BIH-FT and Cin-MIT-BIH-FT, to increase the accuracy of the minority classes without hurting the overall accuracy. The deep learning models used in this paper are simple and effective, and can easily be trained on CPU machines. The experiments also indicate that fine-tuning is more effective when the same dataset is used for the base and extended models in transfer learning.
