Research on massive ECG data in XGBoost

Abstract

A huge amount of ECG data is available for heart disease diagnosis, but it is difficult to handle. Recently, many researchers have focused on mining diagnostic data to uncover hidden patterns and their relevant features. Mining biomedical data is one of the predominant research areas, in which clustering techniques are emphasized for heart disease diagnosis. However, few studies deal with large heart disease datasets and then classify them according to heart disease features. We propose an anomaly-threshold method based on multiple classifiers, which is well suited to datasets containing abnormal data, and use the XGBoost algorithm as a sub-classifier to process massive ECG data. This research focuses on the heart disease classification problem. The data set is first divided into two categories and then classified into more specific categories; experimental results show that this method can improve classification accuracy. The experiments are conducted on massive instances of different heart diseases obtained from actual hospital cases and on two UCI data sets. We compared the SVM, C4.5, Naive Bayes, Logistic, RandomForest and XGBoost algorithms, and found that tree-based classifiers are the best fit for predicting arrhythmia. The proposed method is of great significance to processing and forecasting systems for large medical data sets, and promotes the development of intelligent medical care.

Keywords

ECG, heart disease diagnosis, heart disease, intelligent medical

1 Introduction

Nowadays, the aging of the population, chronic cardiovascular diseases and unhealthy lifestyles are increasing the incidence of heart disease. It is therefore of great significance to find an effective approach to heart disease detection and prevention. An important part of this is heart disease classification, that is, obtaining heart disease data, analyzing and classifying it, and extracting the useful information.

The data usually come from the electrocardiogram (ECG), the process of recording the electrical activity of the heart over a period of time using electrodes placed on the skin. These electrodes detect the tiny electrical changes on the skin that arise from the heart muscle’s electrophysiologic pattern of depolarizing and repolarizing during each heartbeat [11, 15]. As a very commonly performed cardiology test and the main tool for recording changes in the biological potential of the human heartbeat, it provides valuable information on cardiac rate and cardiovascular system function. However, since the ECG is a biological signal, signs of disease may appear in it at random. In addition, large amounts of data and many patients are common. It therefore takes a long time to observe the ECG signal pattern and heart rate variation in order to diagnose heart disease [2], especially when doctors analyze the waveform manually, which not only ties up medical staff but also adversely affects the treatment of other patients. Establishing automatic systems that reduce human workload is a general trend in society, and it applies equally to the medical field, where it is called wise information technology of 120 (WIT120), a combination of bioscience and information technology [14, 16]. WIT120 aims at making real-time and reliable prediction and detection of diseases, achieving intelligent matching in the biosphere, and building an interactive platform for information sharing through the review of previous electronic health records. The medical service model of WIT120 is centered on patient data and is mainly supported by data classification and knowledge discovery technology [19].

To increase the efficiency of ECG data analysis, we should first understand the available data analysis and knowledge discovery tools in order to find a better approach for medical applications. Machine learning, a data-driven discipline, is the main analysis method in recent classification research, and it consists of two important steps: feature extraction and classification model training [1]. Many kinds of models for data mining and heart disease classification have appeared in recent years. The classification methods in use are mainly artificial neural network (ANN) based methods, support vector machine (SVM) related methods, associative methods and tree-based models, and they are often used in combination. For example, among the methods based on artificial neural networks, a hybrid method using fuzzy weighted pre-processing and an artificial immune recognition system (AIRS) performed effectively on machine learning benchmark problems and medical classification problems [13]. It used k-fold cross-validation and a confusion matrix [9] to obtain high accuracy. Others, such as the RBF neural network and the BP neural network, have their own advantages. However, the obvious drawback of this kind of method is its complexity, which makes it difficult to train [1]. Among the vector machine related models, the relevance vector machine (RVM), used as a time-frequency analysis classifier and combined with rough k-means to form a hybrid recognition method, can achieve good performance, but it neglects data pre-processing, which is very important in machine learning. Associative classification, a recent technique that integrates association rule mining and classification, is a good model for prediction [9], but it is not suitable for large data sets. Tree-based methods offer good performance: regression trees for predicting and classifying HF sub-types in a population-based sample of patients from Ontario, Canada proved valid [10].
The XGBoost model used in this paper, a gradient tree boosting method, is also tree-based. It has been shown to produce state-of-the-art results on multiple machine learning problems [7, 12]. As a scalable machine learning system for tree boosting and an effective and extensible gradient boosting framework, it includes an efficient linear model solver and a tree learning algorithm [3, 8], which make it converge faster than a deep belief network (DBN) [5] and perform better on different datasets. These properties make it stand out from other tree-based models. In addition, users can modify the package to customize their own objectives. It is widely used in data mining and machine learning projects, and its impact has been widely recognized in a number of machine learning and data mining challenges. For example, among the 29 winning solutions of the machine learning competitions hosted by Kaggle in 2015, 17 used XGBoost. Of these, eight used XGBoost alone to train the model, while most of the others combined XGBoost with neural nets in ensembles. The second most popular method, deep neural nets, was used in only 11 solutions. XGBoost achieved even greater success in KDDCup 2015, where it was used by every winning team in the top 10 [17]. In this research, a large dataset consisting of different types of attributes is used to train the classification model. To handle complicated situations, we repeat the experiments several times and propose multiple schemes in search of a better solution, finally obtaining a good model and satisfactory results in both the training and testing stages.

2 The experiment data materials

2.1 Question definition and algorithms

Classifying Heart Disease: The classification of information makes for orderliness; but orderliness, for most individuals, without the appreciation of value, is not sufficiently rewarding to sustain the effort of classifying [18]. The purpose of this paper is to build a model that can discriminate cardiac arrhythmias in massive ECG instances.

This paper constructs a threshold-based multi-classifier classification task in which quantile values of the features (such as quartiles) serve as C4.5 split points. The split points divide the range of the data into different parts, each of which is used to construct a sub-classifier, and the average classification accuracy of these sub-classifiers is computed to evaluate the whole model. Different algorithmic models are used as sub-classifiers for comparison.

Why is XGBoost the best choice of sub-classifier?

Improved regularization;

Parallel processing (multi-threading on a stand-alone CPU, GPU acceleration, and distributed computing);

A high degree of flexibility (users can customize optimization objectives and evaluation criteria);

Pruning (split to the specified maximum depth and then prune);

Support for multiple language interfaces.

The accuracy of the model was verified by 10-fold cross-validation.
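As an illustration of this validation protocol, the following is a minimal sketch of k-fold cross-validation written with the Python standard library only. The callables `fit` and `predict` are hypothetical placeholders for any of the compared classifiers; the fold assignment (shuffled, round-robin) is an assumption, since the source does not state how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return the mean held-out accuracy over k folds.

    fit(X_train, y_train) -> model and predict(model, X_test) -> labels are
    hypothetical callables standing in for any classifier under comparison.
    """
    folds = k_fold_indices(len(X), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predicted = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Each instance is used for testing exactly once and for training in the other k - 1 rounds, so the reported accuracy is always measured on data the model has not seen.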

2.2 Model and algorithm

XGBoost (eXtreme Gradient Boosting) uses gradient-based optimization to combine weak learners (models) into a strong learner. The name actually refers to the engineering goal of pushing the limit of computational resources for boosted tree algorithms [6].

2.2.1 Regularized learning objective

Given a data set D = {(x_i, y_i)} (|D| = n, x_i ∈ R^m, y_i ∈ R), where n is the number of samples and m is the number of features, the output is predicted by a tree ensemble model that uses K additive functions: ${\hat{y}}_{i} = φ (x_{i}) = \sum_{k = 1}^{K} f_{k} (x_{i}), f_{k} \in F$ (1) where F = {f (x) = w_{q(x)}} (q : R^m → T, w ∈ R^T) is the space of regression trees (CART). Here q represents the structure of each tree, which maps an example to the corresponding leaf index, and T is the number of leaves in the tree. Each f_k corresponds to an independent tree structure q and leaf weights w. Regression trees are easily confused with decision trees, but they differ: each regression tree holds a continuous score on each leaf. The decision rules given by q assign an example to a leaf in each tree, and the final prediction is obtained by summing the scores (weights w) of the corresponding leaves.
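To make Equation (1) concrete, the sketch below builds two depth-1 trees and sums their leaf weights. The feature order, the split thresholds (100 and 450) and the leaf weights are hypothetical values chosen only for illustration; they are not taken from the experiments.

```python
def make_tree(q, w):
    """Build f(x) = w[q(x)] from a structure q (example -> leaf index) and leaf weights w."""
    return lambda x: w[q(x)]


def predict(trees, x):
    """Equation (1): the prediction is the sum of the K tree outputs."""
    return sum(f(x) for f in trees)


# Two hypothetical depth-1 trees over the feature vector x = (heart_rate, qt).
tree_1 = make_tree(lambda x: 0 if x[0] < 100 else 1, [-0.4, 0.9])
tree_2 = make_tree(lambda x: 0 if x[1] < 450 else 1, [-0.2, 0.5])

# The example falls into leaf 1 of tree_1 and leaf 0 of tree_2.
score = predict([tree_1, tree_2], (120, 400))
```

For a classification task the summed score is a raw margin; it is turned into a class label or probability by the link function of the chosen loss.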

To learn the set of functions used in the model, we minimize the following regularized objective. $L (φ) = \sum_{i} l ({\hat{y}}_{i}, y_{i}) + \sum_{k} Ω (f_{k})$ (2)

where $Ω (f) = γ T + \frac{1}{2} λ {∥ w ∥}^{2}$. Here l is a differentiable convex loss function that measures the difference between the prediction ${\hat{y}}_{i}$ and the target y_i, and Ω penalizes the complexity of the regression tree functions. The additional regularization term Ω helps to smooth the final learned weights to avoid over-fitting, so the regularized objective tends to select simple and predictive functions for the model. A similar regularization technique is used in the regularized greedy forest (RGF) [6] model; the objective here is similar but simpler, and the objective and the corresponding learning algorithm are easier to parallelize. Setting the regularization parameters matters: when they are set to zero, the objective falls back to traditional gradient tree boosting. This is discussed in more detail later. In general, the objective function consists of two parts, the loss function and the regularization term. The loss function is related to the training task, and the regularization term is related to the model complexity.

2.2.2 Gradient tree boosting algorithm

The tree ensemble model in Equation (2) includes functions as parameters, so it cannot be optimized with traditional optimization methods in Euclidean space. Instead, the model is trained in an additive manner. At the t-th iteration, the objective of Equation (2) becomes: $L^{(t)} = \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}^{(t - 1)} + f_{t} (x_{i})) + Ω (f_{t})$ (3)

Here ${\hat{y}}_{i}^{(t)}$ is the prediction of the i-th instance at the t-th iteration; we greedily add the f_t that most reduces the objective. To quickly optimize the objective in the general setting, we approximate it using the second-order Taylor expansion: $L^{(t)} \approx \sum_{i = 1}^{n} [l (y_{i}, {\hat{y}}_{i}^{(t - 1)}) + g_{i} f_{t} (x_{i}) + \frac{1}{2} h_{i} f_{t}^{2} (x_{i})] + Ω (f_{t})$ (4) where $g_{i} = \partial_{{\hat{y}}^{(t - 1)}} l (y_{i}, {\hat{y}}^{(t - 1)})$ and $h_{i} = \partial_{{\hat{y}}^{(t - 1)}}^{2} l (y_{i}, {\hat{y}}^{(t - 1)})$ are the first- and second-order gradient statistics of the loss function. Removing the constant terms gives the following simplified objective at step t: ${\bar{L}}^{(t)} = \sum_{i = 1}^{n} [g_{i} f_{t} (x_{i}) + \frac{1}{2} h_{i} f_{t}^{2} (x_{i})] + Ω (f_{t})$ (5)
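As a brief aside, the gradient statistics g_i and h_i can be written out for one common choice of l. The sketch below assumes the logistic loss with labels recoded to {0, 1} (the data sets in this paper use ‘–1’ and ‘1’, so this recoding is an assumption), for which g_i = p_i - y_i and h_i = p_i (1 - p_i), with p_i the sigmoid of the current raw score.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad_hess(y_true, raw_scores):
    """First- and second-order statistics g_i, h_i of the logistic loss.

    y_true holds labels in {0, 1}; raw_scores holds the predictions of the
    previous iteration before the sigmoid is applied.
    """
    grads, hessians = [], []
    for y, score in zip(y_true, raw_scores):
        p = sigmoid(score)
        grads.append(p - y)            # d l / d score
        hessians.append(p * (1.0 - p)) # d^2 l / d score^2
    return grads, hessians
```

Only these two numbers per instance are needed to evaluate Equation (5), which is why the same tree-building procedure works for any twice-differentiable loss.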

Expanding the regularization term Ω, Equation (5) can be rewritten as: ${\bar{L}}^{(t)} = \sum_{i = 1}^{n} [g_{i} f_{t} (x_{i}) + \frac{1}{2} h_{i} f_{t}^{2} (x_{i})] + γ T + \frac{1}{2} λ \sum_{j = 1}^{T} w_{j}^{2}$ (6)

Define I_j = {i | q (x_i) = j} as the instance set of leaf j. Since f_t (x_i) = w_j for every i ∈ I_j, grouping the sum in Equation (6) by leaf gives: ${\bar{L}}^{(t)} = \sum_{j = 1}^{T} [(\sum_{i \in I_{j}} g_{i}) w_{j} + \frac{1}{2} (\sum_{i \in I_{j}} h_{i} + λ) w_{j}^{2}] + γ T$ (7)

For a fixed structure q (x), we can compute the optimal weight $w_{j}^{*}$ of leaf j by $w_{j}^{*} = \frac{- \sum_{i \in I_{j}} g_{i}}{\sum_{i \in I_{j}} h_{i} + λ}$ (8) and calculate the corresponding optimal objective function value by ${\bar{L}}^{(t)} (q) = - \frac{1}{2} \sum_{j = 1}^{T} \frac{{(\sum_{i \in I_{j}} g_{i})}^{2}}{\sum_{i \in I_{j}} h_{i} + λ} + γ T$ (9)

Equation (9) can be used as a scoring function to measure the quality of a tree structure q. This score is like the impurity score for evaluating decision trees, except that it is derived for a wider range of objective functions [4]. It is usually infeasible to enumerate all possible tree structures q, so we instead use a greedy algorithm that starts from a single leaf and iteratively adds branches to the tree. Let I_L and I_R be the instance sets of the left and right nodes after a split, and let I = I_L ∪ I_R. The loss reduction after the split is then given by: $L_{split} = \frac{1}{2} [\frac{{(\sum_{i \in I_{L}} g_{i})}^{2}}{\sum_{i \in I_{L}} h_{i} + λ} + \frac{{(\sum_{i \in I_{R}} g_{i})}^{2}}{\sum_{i \in I_{R}} h_{i} + λ} - \frac{{(\sum_{i \in I} g_{i})}^{2}}{\sum_{i \in I} h_{i} + λ}] - γ$ (10)

This formula is usually used in practice for evaluating the split candidates [4].
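The optimal leaf weight of Equation (8) and the split gain of Equation (10) can be evaluated directly from the per-instance gradient statistics. The sketch below is a plain-Python illustration; the gradient values, λ = 1.0 and γ = 0.1 are hypothetical numbers chosen for the example.

```python
def leaf_weight(g, h, lam):
    """Equation (8): optimal weight of a leaf holding gradients g and hessians h."""
    return -sum(g) / (sum(h) + lam)


def leaf_score(g, h, lam):
    """One summand of Equation (9) without the leading -1/2: G^2 / (H + lambda)."""
    return sum(g) ** 2 / (sum(h) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Equation (10): loss reduction obtained by splitting I into I_L and I_R."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma


# Hypothetical statistics: two instances per side with opposite gradient signs.
g_left, h_left = [-0.5, -0.5], [0.25, 0.25]
g_right, h_right = [0.5, 0.5], [0.25, 0.25]
gain = split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.1)
```

A candidate split is worth keeping only when the gain is positive, so γ acts as a minimum loss reduction that every new leaf must pay for.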

2.3 Dataset description

In our experiments, a large heart disease dataset with 202594 instances was collected from the database; all the original data in our research come from real cases.

We used three data sets. The first is from the First People’s Hospital of Yunnan Province (New Kunhua Hospital). The task is to predict the likelihood of heart disease in patients. The ECG data set contains the heart rate, wave lengths, wave widths, and measurements from the limb leads and chest leads. QT and PR are wave intervals, RV1 and SV1 are lead measurements, and P_WIDTH is the width of the P wave. All attributes are continuous. In this experiment, we simplified the task to a binary classification, in which heart disease is represented by ‘–1’ and analysis by ‘1’. The data set contains 13 cardiac attributes; their minimum and maximum values, together with the calculated averages and standard deviations, are listed in Table 1.

Table 1
Datasets of Binary classification

Attributes	Min	Max	Average	Standard deviation
HEART_RATE	3	5000	82.6	123.18
PR	–100	5000	242.9	671.82
QT	2	5644	373.6	91.45
QRSDZ	–179	5000	43.6	46.00
QRS_WIDTH	8	5000	88.1	98.61
P_WIDTH	–179	179	38.4	36.92
R_WIDTH	–179	5000	43.6	46.00
QTC	–2414	5691	404.5	94.67
P	–12	5000	216.4	737.08
RV1	–0.06	18.7	0.21	0.24
SV5	0	30.6	0.4	0.37
RV5	–0.37	50	1.27	0.65
SV1	–0.01	20.73	0.8	0.48

To further classify these data by different risk factors, we extracted the ‘1’ data from the first data set to produce the second data set, and divided it into six categories in the experiments: ‘1’ for sinus rhythm, ‘2’ for sinus bradycardia, ‘3’ for sinus arrhythmia, ‘4’ for pacing rhythm, ‘5’ for right bundle branch block, and ‘6’ for atrial flutter.

The third comes from the UCI repository, one of the most commonly used sources of benchmark data, and consists of the Ecoli and SPECTF data sets. The UCI data sets are used to study the data imbalance problem with the SMOTE algorithm, which is described later in this paper.

3 Experiments

We conduct experiments to evaluate the proposed algorithm. Because of errors caused by the experimental instruments, inconsistent measurements due to individual differences, and unreliable diagnoses that are for reference only, there are many abnormal instances in the ECG data from the First People’s Hospital of Yunnan Province (New Kunhua Hospital). Simply deleting the abnormal instances would cause a sizable loss of instances and make the classification model useless for new abnormal instances. The ECG data from the hospital are labeled for binary classification; a comparison of part of the attribute values is visualized in Fig. 1.

Fig.1

Comparison of attribute values.

The red part in the figure stands for the features of normal instances, and the blue part for cardiac arrhythmia. Fig. 1 shows that the arrhythmia region is larger, and that abnormal values of both kinds appear frequently among particularly large and particularly small values.

3.1 No data pre-processing

Instead of pre-processing the data, we analyze them using medical knowledge and find many abnormal attribute values. We classify the non-pre-processed data with the XGBoost, SVM, C4.5, Naive Bayes, Logistic and RandomForest classifiers.

The highest accuracy is obtained by the XGBoost classifier, with 76.68% under 10-fold cross-validation.

The built-in feature selector of XGBoost scores each feature by its weight in the model’s training, and features are chosen according to these scores. The scores for the data without pre-processing are shown in Fig. 2.

Fig.2

Feature score.

3.2 Data pre-processing

The pre-processing of the massive data mainly targets the abnormal instances. There are two ways to do this: one is to delete the abnormal instances directly, and the other is interpolation.

The former, called outlier deletion (see Table 2), is used for unimportant missing variables and attributes with abnormal values when the number of instances is large. This method is usually used for clustering and classification. When the number of instances is small, when the missing values are key, or when the instances vary regularly with time, interpolation is preferable. Commonly used interpolation methods are shown in Table 2.

Table 2
Abnormal interpolation method

Interpolation method	Description
Outlier deletion	Directly delete the abnormal instances
Fixed value replacement	Replace the absent value with a fixed constant
Mean value/median/mode interpolation	Interpolate with the mean value, median or mode of the attribute value according to the value type
Neighbor interpolation	Interpolate with the nearest attribute value in the record
Regression method	Interpolation according to the existing data fit model
Calorie interpolation	Find the similar object, and fill with the value
Multi-interpolation	Replace every absent value with a series of probable values
Formula	Interpolate with the mean value of the row and column combined

3.2.1 Outliers deletion

We use outlier deletion to handle these instances, namely, we delete the instances that contain abnormal attribute values. After the deletion, 80758 instances remain.

The classification accuracy improves after the deletion, and the XGBoost classifier performs best, with 85.58%.
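A minimal sketch of such a deletion step is given below. The attribute names follow Table 1, but the plausible ranges in `PLAUSIBLE_RANGES` are hypothetical values chosen for illustration; the thresholds actually used in the experiments are not listed in this paper. The sketch also assumes that an instance is dropped as soon as any one of its attributes is out of range.

```python
# Hypothetical plausible ranges (assumptions for illustration, not the paper's thresholds).
PLAUSIBLE_RANGES = {"HEART_RATE": (20, 300), "QT": (200, 700)}


def is_abnormal(record, ranges=PLAUSIBLE_RANGES):
    """True if any listed attribute falls outside its plausible range."""
    return any(not (low <= record[name] <= high)
               for name, (low, high) in ranges.items())


def delete_outliers(records, ranges=PLAUSIBLE_RANGES):
    """Keep only the records whose listed attributes are all within range."""
    return [r for r in records if not is_abnormal(r, ranges)]


records = [
    {"HEART_RATE": 82, "QT": 374},
    {"HEART_RATE": 5000, "QT": 380},  # maximum HEART_RATE value reported in Table 1
    {"HEART_RATE": 75, "QT": 2},      # minimum QT value reported in Table 1
]
kept = delete_outliers(records)
```

Because one bad attribute removes the whole instance, this step can discard a large share of the data, which is consistent with the drop from 202594 to 80758 instances reported above.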

3.2.2 Interpolation

As shown in Table 2, there are multiple methods for handling abnormal attribute values. We choose three of them in this experiment: fixed value replacement, mean value interpolation and Calorie interpolation.

We determine the outliers based on medical knowledge and the peak values of the instrument. The three interpolation methods do not increase the classification accuracy much; it fluctuates around 75%. The best algorithm among them, however, is still XGBoost.
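Two of the replacement strategies from Table 2 can be sketched as follows. The range limits passed to the functions are hypothetical, and taking the mean over the in-range values only (rather than over the whole column) is an assumption made here so that the outliers do not distort the fill value.

```python
from statistics import mean


def replace_with_mean(values, low, high):
    """Replace values outside [low, high] with the mean of the in-range values.

    Raises statistics.StatisticsError if no value lies inside the range.
    """
    in_range = [v for v in values if low <= v <= high]
    fill = mean(in_range)
    return [v if low <= v <= high else fill for v in values]


def replace_with_constant(values, low, high, constant):
    """Replace values outside [low, high] with a fixed constant."""
    return [v if low <= v <= high else constant for v in values]


heart_rate = [80, 5000, 90, 3, 70]
by_mean = replace_with_mean(heart_rate, 20, 300)
by_constant = replace_with_constant(heart_rate, 20, 300, 75)
```

Unlike outlier deletion, both strategies keep every instance, at the cost of replacing the abnormal measurement with a value that carries no information about that patient.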

3.3 Multi-classifier based on abnormal thresholds

To deal with these data, we establish a multi-classifier for the instances within the abnormal thresholds. The entries are first divided by a decision tree into two parts according to the thresholds, and a classifier is built for each part. An overview of the model is illustrated in Fig. 3. The experimental results on the 202594 instances show that threshold partitioning gives better classification accuracy than a single classifier, with up to 80% for data 1 and 79.5% for data 2. Performance is best when the data are divided into 8 parts and 8 classifiers are built for them.

The decision tree model used for the threshold partition is C4.5. Data 1 and data 2 are produced from the instances by the C4.5 classification. In the same way, these two data sets are each split by C4.5 to produce data 3, data 4, data 5 and data 6. We finally obtain 8 parts after the partition and build 8 sub-classifiers with different algorithms. More classifiers may give higher accuracy, but they also easily cause over-fitting, so the analysis should be done for the specific case, balancing classification accuracy against over-fitting. The C4.5 algorithm uses data quantiles for the threshold partition.

Fig.3

Threshold classifier.

As an example, we can use the quantiles of every column to set the thresholds. We first set the quantiles at 0.05 and 0.95. That is, instances whose attribute values lie between the 0.05 and 0.95 quantiles are assigned to one threshold space, and those outside this range to another. Further divisions can be made in the same way. The intention of the quantile partition is to filter the threshold range with C4.5, and then to adjust the threshold range in turn according to the average classification accuracy of the 8 sub-classifiers.

This adjustment, however, has not yet been fully automated. The classification accuracy of the sub-classifier algorithms is shown in Table 3. We run them on the binary classification data set and on the six-class data extracted from the cardiac arrhythmia data, namely sinus rhythm, sinus tachycardia, sinus arrhythmia, pacing rhythm, right and left bundle branch block, and atrial flutter, with 387 real cases of each kind. The abnormal-threshold-based multi-classifier also improves accuracy for the six-class task.

We use 6 algorithms for the 8 classifiers and compute the average classification accuracy of the 8 to evaluate the entire model. For the other methods as well, the threshold partition for multi-classifier building proved effective on data with many abnormal values. The threshold partition may sometimes cause data imbalance. Large imbalance did not occur in this data set, so we added experiments on data that may be imbalanced. Taking UCI benchmark data sets as an example and combining them with the SMOTE algorithm, good results were obtained, which addresses the imbalance that this method may cause.
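The first level of this partition can be sketched as below. This is a simplification that applies the quantile thresholds directly instead of running C4.5, and the linear-interpolation quantile is an assumption, since the source does not say which quantile definition was used.

```python
def quantile(sorted_values, q):
    """Linearly interpolated q-quantile of an ascending list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def column_thresholds(rows, q_low=0.05, q_high=0.95):
    """Per-column (low, high) thresholds at the given quantiles."""
    columns = list(zip(*rows))
    return [(quantile(sorted(c), q_low), quantile(sorted(c), q_high))
            for c in columns]


def partition(rows, thresholds):
    """Split rows into those with every attribute inside its range and the rest."""
    inside, outside = [], []
    for row in rows:
        in_range = all(low <= v <= high for v, (low, high) in zip(row, thresholds))
        (inside if in_range else outside).append(row)
    return inside, outside
```

Applying `partition` again to each of the two resulting parts, with thresholds recomputed on that part, gives four parts, and a third level gives the eight parts used for the eight sub-classifiers.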

Table 3

Contrast of models’ accuracy

Classifier	Model	Accuracy (%)
		Tr_A	Te_A
Classifier1	XGBoost	94.81%	88.82%
	SVM	100%	56%
	C4.5	91.21%	88.84%
	Naive Bayes	78.27%	78.35%
	Logistic	66.34%	65.77%
	RandomForest	99.98%	89.32%
Classifier2	XGBoost	85.47%	83.86%
	SVM	76.30%	72.45%
	C4.5	96.42%	93.83%
	Naive Bayes	78.50%	78.51%
	Logistic	66.63%	66.62%
	RandomForest	89.32%	87.24%
Classifier3	XGBoost	92.43%	90.47%
	SVM	79.32%	72.40%
	C4.5	97.57%	96.07%
	Naive Bayes	77.68%	77.69%
	Logistic	66.51%	66.52%
	RandomForest	100%	96.57%
Classifier4	XGBoost	93.56%	90.76%
	SVM	82.31%	80.68%
	C4.5	98.51%	98.04%
	Naive Bayes	79.94%	79.95%
	Logistic	66.59%	66.52%
	RandomForest	100%	98.20%
Classifier5	XGBoost	95.32%	90.34%
	SVM	78.32%	74.29%
	C4.5	100%	99.93%
	Naive Bayes	73.24%	73.20%
	Logistic	75.58%	75.50%
	RandomForest	100%	100%
Classifier6	XGBoost	93.45%	92.36%
	SVM	78.78%	75.32%
	C4.5	99.75%	99.65%
	Naive Bayes	67.20%	67.18%
	Logistic	65.57%	65.53%
	RandomForest	100%	99.65%
Classifier7	XGBoost	87.34%	82.31%
	SVM	73.23%	72.17%
	C4.5	100%	100%
	Naive Bayes	69.98%	69.98%
	Logistic	65.64%	65.55%
	RandomForest	100%	100%
Classifier8	XGBoost	94.21%	91.59%
	SVM	79.32%	78.24%
	C4.5	92.15%	87.35%
	Naive Bayes	72.59%	72.62%
	Logistic	76.00%	75.95%
	RandomForest	91.23%	90.08%
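As a small illustration of how a model-level figure can be derived from Table 3, the sketch below averages the Te_A column over the eight sub-classifiers for two of the models. The unweighted mean is an assumption; the source does not say whether the partitions are weighted by their number of instances.

```python
# Te_A values copied from Table 3, one per sub-classifier (Classifier1..Classifier8).
TEST_ACCURACY = {
    "XGBoost": [88.82, 83.86, 90.47, 90.76, 90.34, 92.36, 82.31, 91.59],
    "C4.5": [88.84, 93.83, 96.07, 98.04, 99.93, 99.65, 100.0, 87.35],
}


def average_accuracy(per_classifier):
    """Unweighted mean over the sub-classifiers (weighting by size is not assumed)."""
    return sum(per_classifier) / len(per_classifier)


averages = {model: average_accuracy(values)
            for model, values in TEST_ACCURACY.items()}
```

If the partitions differ greatly in size, a mean weighted by the number of instances in each partition would be the more faithful summary of overall accuracy.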

Table 4

Information on data sets

Dataset	Classes	Instances (per class)	Attributes	Maximum imbalance
Ecoli	5	327(20/35/52/77/143)	8	7.2
Spectf	2	267(212/55)	44	3.9

The two UCI data sets are processed with the SMOTE method, which balances the data set by oversampling before the different algorithms are applied. We merged the SPECTF training and testing data sets into a new one. Table 4 describes the basic information of the data.

Imbalanced data can be handled by oversampling or undersampling. The former increases the number of minority-class instances, and the latter reduces the number of majority-class instances. SMOTE is an oversampling method that synthesizes new minority-class instances.
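The core idea of SMOTE, interpolating between a minority instance and one of its nearest minority neighbours, can be sketched as follows. This is a simplified standard-library illustration rather than the implementation used in the experiments; the function name `smote`, the default k = 5 and the brute-force neighbour search are choices made here.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points between minority samples and their k nearest
    minority neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment from base towards the neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority instances, the method enlarges the minority class without simply duplicating existing rows.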

Table 5

Classification accuracy of different algorithms

Data	Model	Accuracy(%)
		Tr_A	Te_A
Ecoli	XGBoost	94.20%	87.81%
	SVM	78.60%	78.29%
	C4.5	91.13%	83.49%
	Naive Bayes	89.30%	87.46%
	Logistic	89.60%	86.24%
	RandomForest	100%	87.46%
Ecoli	XGBoost+Smote	96.50%	88.05%
	SVM+Smote	76.32%	74.95%
	C4.5+Smote	95.70%	87.28%
	Naive Bayes+Smote	89.60%	88.85%
	Logistic+Smote	90.22%	87.67%
	RandomForest+Smote	100%	91.98%
Spectf	XGBoost	100%	78.24%
	SVM	100%	79.4%
	C4.5	98.50%	74.91%
	Naive Bayes	68.91%	68.54%
	Logistic	89.51%	79.40%
	RandomForest	100%	80.90%
Spectf	XGBoost+Smote	100%	86.07%
	SVM+Smote	100%	75.78%
	C4.5+Smote	97.83%	80.43%
	Naive Bayes+Smote	76.40%	75.16%
	Logistic+Smote	89.13%	76.40%
	RandomForest+Smote	100%	86.65%

Table 5 shows the results on the data processed by the SMOTE method, which improves the classification accuracy and robustness for most of the models. Among the models, tree-based classifiers, such as C4.5, RandomForest and XGBoost, performed best.

We built classification models for the ECG data under three settings: no data pre-processing, data pre-processing, and the abnormal-threshold-based multi-classifier. We performed binary classification on the 202594 instances from the hospital, followed by six-class classification of the cardiac arrhythmia data. The experiments show clearly that tree-based classifiers match the medical data better. For the imbalance that might arise when building the abnormal-threshold-based multi-classifier, we tested the SMOTE algorithm on the UCI data sets and obtained an improvement. We choose XGBoost as the sub-classifier because it supports multi-threaded CPU computation, which greatly speeds up training. Moreover, the test results can be improved, and over-fitting avoided, by adjusting the parameters.

4 Discussion

The massive cardiac arrhythmia instances and the experimental results show that a multi-classifier based on abnormal thresholds achieves better classification accuracy, and its good performance in heart disease classification offers a suitable solution for bioinformatics applications. The data used in this experiment come from a real medical scenario. We built multiple classifiers based on abnormal thresholds (data quantiles) for the binary and six-class classification tasks, and compared the classification accuracy of XGBoost, SVM, C4.5, Naive Bayes, Logistic and RandomForest. The multi-classifier proved to be an effective and improved classification model for noisy data. For the data imbalance that this method may cause, we made a comparison at different levels of imbalance using the UCI data sets, and found that SMOTE combined with a tree-based algorithm achieved the highest classification accuracy.

5 Conclusion

The method proposed in this paper for massive ECG data is abnormal-threshold-based multi-classifier classification with a tree-based model, XGBoost. When encountering massive ECG data, we can first perform a binary classification to pick out the cardiac arrhythmia instances for further classification. The method used in the experiments is a kind of ensemble learning: building multiple classifiers on different thresholds yields higher accuracy, which not only avoids wasting a large number of instances with abnormal values but also produces a more robust classifier. Although XGBoost does not reach the classification accuracy of C4.5 in some sub-classifiers, we choose it because of its many other advantages.

For the imbalance problem, testing and comparing on public data sets is also a new idea, and the best combination, SMOTE with a tree-based model, can provide an effective way of classifying heart disease in the medical field. The whole approach can also be applied to other areas.

Acknowledgments

This work has been supported by the National Science Foundation of China Grant No. 61762092, “Dynamic multi-objective requirement optimization based on transfer learning”, National Science Foundation of China Grant No. 61762089, “The key research of high order tensor decomposition in distributed environment”, Open Foundation of Key Laboratory in Software Engineering of Yunnan Province under Grant NO. 2017SE204, “Research on extracting software feature models using transfer learning”.

References

1. Chen, Fu, Zuo et al. Radar emitter classification for large data set based on weighted-XGBoost[J], Iet Radar Sonar & Navigation 11(8) (2017), 1203–1207.

2. Kshirsagar P.R., Akojwar S.G. and Dhanoriya, Classification of ECG-signals using Artificial Neural Networks[C], International Conference on Electrical, Computer and Communication Technologies, 2017.

3. Fitriah, Wijaya S.K., Fanany M.I. et al. EEG Channels Reduction using PCA to Increase XGBoost’s Accuracy for Stroke Detection[C], Iscpms AIP Publishing LLC, 2017, pp. 2489–2492.

4. Ramraj, Nishant, Sunil, et al., Experimenting XGBoost Algorithm Prediction and Classification of Different Datasets[C], National Conference on Recent Innovations in Software Engineering and Computer Technologies, 2017.

5. Chen, He, Benesty et al. XGBoost: Extreme Gradient Boosting[J], 2016.

6. Chen and Guestrin, XGBoost: A Scalable Tree Boosting System[J], 2016, pp. 785–794.

7. Pavlyshenko B.M. Linear, machine learning and probabilistic approaches for time series analysis, IEEE First Int Conf on Data Stream Mining & Processing (DSMP), Lviv, Ukraine, 2016, pp. 377–381.

8. Zhang and Johnson, Learning nonlinear functions using regularized greedy forest, IEEE Transactions on Pattern Analysis and Machine Intelligence 36(5) (2014).

9. Jabbar M.A., Deekshatulu B.L. and Chandra, Heart disease prediction system using associative classification and genetic algorithm[J], Computer Science (2013).

10. Austin P.C., Tu J.V., Ho J.E. et al. Using methods from the data mining and machine learning literature for disease classification and prediction: A case study examining classification of heart failure sub-types[J], Journal of Clinical Epidemiology 66(4) (2013), 398.

11. Gonçalves, Guizzardi and J.G.P. Filho, Using an ECG reference ontology for semantic interoperability of ECG data[J], Journal of Biomedical Informatics 44(1) (2011), 126.

12. Burges C.J., From ranknet to lambdarank to lambdamart: An overview, Learning 11 (2010), 23–581.

13. Polat, Günes and S. Tosun, Diagnosis of heart disease using artificial immune recognition system and fuzzy weighted pre-processing[J], Pattern Recognition 39(11) (2006), 2186–2193.

14. C.P. and Reilly R.B., A patient-adapting heartbeat classifier using ECG morphology and heartbeat interval features[J], IEEE Transactions on Biomedical Engineering 53(1) (2006), 2535–2543.

15. Güler İ. and Übeyli E.D., ECG beat classifier designed by combined neural network model[J], Pattern Recognition 38(2) (2005), 199–208.

16. Israel S.A., Irvine J.M., Cheng et al. ECG to identify individuals[J], Pattern Recognition 38(1) (2005), 133–142.

17. Bekkerman, The present and the future of the kdd cup competition: An outsider’s perspective.

18. Willis H.J., Morris D.C. and Wayne A.R., The use of the New York Heart Association’s classification of cardiovascular disease as part of the patient’s complete Problem List[J], Clinical Cardiology 22(6) (1999), 385.

19. Kotch J.B., The effectiveness of medical care: Validating clinical wisdom, by barbara starfield[J], Journal of Public Health Policy 7(2) (1986), 268–270.