A data mining scheme for detection and classification of diabetes mellitus using voting expert strategy

Abstract

In this work, an efficient scheme is proposed for the computer-aided detection of diabetes, a widespread disease. The scheme applies data mining techniques to a patient's medical record to estimate the likelihood of diabetes. It also attempts to classify the type of diabetes (Type-I or Type-II) and to determine the level of risk currently associated with the affected patient. Four algorithms, namely decision tree, Naive Bayes, support vector machine (SVM), and AdaBoost-M1, are used to label the records as either diabetic or non-diabetic. A voting expert then compares their outputs and adopts the best prediction among them. The proposed work gives satisfactory diagnostic results when compared with the ground-truth data. An overall accuracy of 95% is achieved with $k$-fold cross-validation ( $k=10$ ). A comparison with other state-of-the-art schemes has also been performed, and it favors the proposed work.

Keywords

Diabetes mellitus, data mining, AdaBoost, voting expert

1. Introduction

Diabetes is less a disease in itself than an inability of the human body to break down blood sugar into its constituents. The technical name for this condition is diabetes mellitus [1]. The resulting health condition may range from mild to critical. A person with this condition must monitor his/her blood sugar level with the utmost care; failing to do so may lead to critical and sometimes fatal situations. There are several theories about the cause. The pancreas may be unable to produce insulin, the hormone responsible for keeping blood sugar in balance. Alternatively, the insulin produced may not be effective enough to balance the blood sugar. Several other theories exist as well. Based on these pre-conditions, diabetes mellitus is categorized into three types:

1. Type-1 diabetes mellitus (DM): A person with this condition is unable to secrete sufficient insulin from his/her pancreas. The cause of this condition is still an open research question in medical science and thus unknown.
2. Type-2 DM: In this condition, the body of the person becomes resistant to insulin and thus cannot break down blood sugar.
3. Gestational DM: This condition occurs due to high blood pressure and hypertension over a long period of time. Generally, pregnant women suffer from this type.

As per the national diabetes statistics report, approximately 10% of US citizens suffer from DM [2]. Worldwide, more than 420 million people (men and women equally) have DM.

Data mining refers to the extraction of meaningful information from a large and varied repository. It is treated as a knowledge discovery in databases (KDD) process (Fig. 1). The analysis is based on known patterns, and the challenge is to predict the nature of an unknown pattern. In this regard, data mining is considered a boon to medical science and is gaining importance day by day. With it, numerous diseases and medical conditions can be predicted accurately and with minimal human intervention. Its applications in medical science include the analysis of conditions such as cancer, diabetes, heart problems, liver conditions, and brain-related conditions. The diagnosis and analysis of DM in particular is gaining importance worldwide, as this condition is spreading rapidly.

Figure 1.
General overview of a KDD system [3].

1.1 Related works

Numerous analytical tools and techniques have been proposed for the diagnosis and detection of DM. These schemes generally employ concepts such as Naive Bayes, kernels, support vector machines, and probabilistic models [4, 5, 6].

In [7], the authors present a scheme called CoLe that can predict the early stage of diabetes. CoLe is an integrated framework and can be treated as a pure KDD process. In [8], different types of classifiers are implemented individually. These algorithms include KNN (k-nearest neighbor), Naive Bayes, LDA (linear discriminant analysis), C4.5, and ID3. The C4.5 algorithm proved to be the better classifier for the prediction of DM, with an accuracy of 91%. In [9], an expert clinical system is proposed for the diagnosis of DM. It uses XCS (extended classifier system) as the constituent classifier. In [10], the apriori scheme is used for the intra-classification of type-2 DM. Association rule mining is the key classifier in this context, and suitable pre-processing techniques are also applied. In [11], another framework called the duo-mining tool is introduced for the diagnosis of diabetes. It also focuses on the intra-classification of type-2 DM. In [12], a hierarchical clustering technique is used to discover different models for predicting the controlling mechanism for DM. In [13], CART is used for the prediction of diabetes. It also predicts the risk factor associated with the medical condition of the person. In [14], both SVM and Naive Bayes are used for predicting DM. This work focuses on diabetic retinopathy and gives an overall accuracy of 84%. In [15], a database is proposed for the diagnosis of DM through several data mining techniques. A disease correlation approach is proposed in [16], using the Naive Bayes method for the diagnosis of heart-related diseases occurring due to DM.

It is observed from the literature that none of these methods focuses on an ensemble model. An ensemble makes it possible to select, at any given instance, the algorithm that predicts the medical condition most accurately. The proposed work focuses on such an ensemble, which always takes the most accurate of its constituent algorithms and uses it for diagnosis and prediction.

2. Proposed work

A general overview of the proposed work has been shown in Fig. 2.

The proposed work employs an ensemble that uses the voting expert scheme. This scheme takes into account three different algorithms, namely decision tree, Naive Bayes, and SVM, to generate the output. Only if multiple algorithms give an identical prediction label is the output considered consistent and given as the final output. This reduces the chance of misclassification, which is quite helpful because correct prediction is essential where medical conditions are concerned. The constituent algorithms are discussed below in sequence.

Figure 2.

General overview of the proposed work.

Figure 3.

Sample representation of decision tree.

2.1 Decision tree (J-48)

Decision trees (Fig. 3) have been used successfully for business decision analysis. They can also be used for the prediction of DM, as in our case. A decision tree has internal nodes that test certain criteria and external (leaf) nodes that carry the predicted label. The tree structure is built from the training data. Each internal node has an information gain value ( $I_{g}$ ), which measures how well that node classifies the given set of data. The formula for computing $I_{g}$ is given below:

$\displaystyle I_{g}=E_{\textit{initial}}-E_{\textit{split}}$ (1)

$E_{\textit{initial}}$ is the initial entropy before splitting the dataset through the particular node.

$\displaystyle E_{\textit{split}}=-\sum\limits_{i=1}^{\max}f_{i}\times p(i)\log p(i)$ (2)

where the sum is taken over the classes present in the training dataset, $E_{\textit{split}}$ is the entropy after splitting the dataset, and $f_{i}$ is the fraction of samples belonging to class $i$ . In general, the internal nodes are the key attributes and the leaf nodes are class labels. In our case the class labels are DM-1, DM-2, DM-3, and Healthy. A sample representation of the decision tree is shown in Fig. 3.
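As a minimal sketch of Eq. (1), the snippet below computes the information gain of one candidate split. It assumes the common convention in which $E_{\textit{split}}$ is the size-weighted average entropy of the partitions produced by the split; the function names and the toy labels are illustrative and do not come from the experiments reported here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent_labels, child_partitions):
    """I_g = E_initial - E_split.

    Assumption: E_split is the size-weighted average entropy of the
    partitions created by the split.
    """
    total = len(parent_labels)
    e_initial = entropy(parent_labels)
    e_split = sum(len(part) / total * entropy(part) for part in child_partitions)
    return e_initial - e_split


# Hypothetical node: 4 diabetic and 4 healthy records, split by a made-up
# threshold on one attribute.
parent = ["DM"] * 4 + ["Healthy"] * 4
left = ["DM", "DM", "DM", "Healthy"]
right = ["DM", "Healthy", "Healthy", "Healthy"]
gain = information_gain(parent, [left, right])
print(round(gain, 4))
```

A split that separates the two classes perfectly would yield the maximum gain of 1 bit for this balanced toy node, so a tree builder would prefer it over the split shown.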

2.2 Naive Bayes

Bayes' theorem for conditional probability is adopted for this purpose. The features (attributes) are analyzed independently of one another, and each is considered equally important. Naive Bayes is powerful in the sense that it can be applied to very large datasets. In this case, our data is split into train/test samples in the ratio 60/40.
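The independence assumption can be illustrated with a small sketch. It assumes a Gaussian likelihood for every numeric attribute, which is one common choice and not necessarily the configuration used in the experiments; the two-feature records (glucose, BMI) are made up for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the label with the highest log-posterior.

    Each feature contributes its own log-likelihood term, which is the
    'naive' independence assumption.
    """
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training records: [plasma glucose, body mass index].
rows = [[150, 35], [160, 38], [170, 33], [90, 22], [100, 25], [95, 24]]
labels = ["diabetic"] * 3 + ["non-diabetic"] * 3
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [155, 36]))
```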

2.3 SMO (sequential minimal optimization)

In general, SMO is used to solve quadratic programming problems such as the one that arises when training an SVM. It uses the quadratic kernel from the SVM. Thus, it is an analytical process and can be used in our case. It is robust, as it can fill the missing values in a processed feature set. The core inputs to this process are binarized values obtained by conversion from nominal values. The same train/test ratio (60/40) is used in this case as well. A simple snapshot of the SMO flowchart is shown in Fig. 4.
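The conversion from nominal to binarized values mentioned above can be sketched as a simple indicator encoding. The helper name and the example attribute values are hypothetical; the exact filter used in the experiments is not specified here.

```python
def binarize_nominal(values):
    """Map each nominal value to a 0/1 indicator vector, one slot per category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        encoded.append(vector)
    return categories, encoded


# Hypothetical nominal attribute with three levels.
categories, encoded = binarize_nominal(["low", "high", "medium", "low"])
print(categories)
print(encoded)
```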

Figure 4.

A simple snapshot of the SMO flowchart.

2.4 Adaboost

AdaBoost stands for adaptive boosting. It is a meta-algorithm for pattern classification that is generally combined with a few weak classifiers. A weighted-sum strategy is used to boost the classifier. The term boosting refers to the fact that the weak classifiers combined by this method are adapted over successive learning iterations so that misclassification is reduced. It is thus less prone to the over-fitting problem. A symbolic representation of this theory is shown in Fig. 5.
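The adaptation step can be made concrete with a sketch of a single reweighting round. It assumes the standard AdaBoost.M1 update, in which correctly classified samples are scaled by $\beta=\epsilon/(1-\epsilon)$ and the classifier receives the vote weight $\log(1/\beta)$; the function name and the five-sample example are illustrative only.

```python
import math


def adaboost_m1_round(weights, predictions, targets):
    """One AdaBoost.M1 reweighting round.

    Returns the updated, normalised sample weights and the vote weight
    log(1 / beta) of the weak classifier that produced `predictions`.
    """
    error = sum(w for w, p, t in zip(weights, predictions, targets) if p != t)
    if error <= 0 or error >= 0.5:
        raise ValueError("AdaBoost.M1 needs a weak learner with 0 < error < 0.5")
    beta = error / (1.0 - error)
    # Correctly classified samples are scaled down by beta (< 1), so the next
    # weak learner concentrates on the samples that were missed.
    updated = [
        w * beta if p == t else w
        for w, p, t in zip(weights, predictions, targets)
    ]
    total = sum(updated)
    return [w / total for w in updated], math.log(1.0 / beta)


# Five samples with uniform weights; the weak learner misses the last one.
weights = [0.2] * 5
targets = ["DM", "DM", "Healthy", "Healthy", "DM"]
predictions = ["DM", "DM", "Healthy", "Healthy", "Healthy"]
new_weights, vote_weight = adaboost_m1_round(weights, predictions, targets)
print(new_weights, vote_weight)
```

After the update, the misclassified sample carries half of the total weight, which is what forces the next weak classifier to attend to it.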

Figure 5.

General theory for ada-boosting.

2.5 SVM

SVM has been used successfully as an efficient classifier for many non-linear classification problems. The objective of an SVM classifier is to generate the hyperplane that separates two classes of objects in a distribution while simultaneously maximizing the width of the margin. The corresponding objective is given below:

$\displaystyle\textit{Maximize}\quad\frac{2}{\|\textit{margin}\|}$ (3)

such that $(\textit{margin}\cdot x+\textit{constant})\geqslant 1,\ \forall x$ in Class-A, and $(\textit{margin}\cdot x+\textit{constant})\leqslant -1,\ \forall x$ NOT in Class-A. The steps involved are listed below in sequence:

Define a margin with optimum width;

Introduce a penalty value for misclassification and thus extend the margin width definition (linearly non-separable cases);

Map the data points to a space in which they can be classified with a linear surface;

Output the final optimized margin width.
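The first two steps can be sketched numerically. In the snippet below, `w` and `b` stand for the quantities written as margin and constant in Eq. (3), class membership is encoded as $y=+1$ (Class-A) or $y=-1$, and the penalty is taken to be the usual hinge slack $\max(0, 1-y(w\cdot x+b))$; this is an assumption about the penalty form, and the hyperplane and points are made up.

```python
import math


def margin_width(w):
    """Width of the margin, 2 / ||w||, for the hyperplane w.x + b = 0."""
    return 2.0 / math.sqrt(sum(c * c for c in w))


def hinge_penalty(w, b, points, labels):
    """Total slack: sum of max(0, 1 - y * (w.x + b)) with y in {+1, -1}."""
    total = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += max(0.0, 1.0 - y * score)
    return total


# Hypothetical hyperplane and four two-dimensional samples.
w, b = [1.0, 0.0], -3.0
points = [[5.0, 1.0], [4.0, 2.0], [2.0, 0.0], [3.5, 1.0]]
labels = [+1, +1, -1, -1]
print(margin_width(w))
print(hinge_penalty(w, b, points, labels))
```

Only the last sample violates its constraint here, so it alone contributes to the penalty; a soft-margin SVM trades this penalty off against the margin width.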

The train and test samples used for the SVM set up are taken to be 400 and 300 samples respectively.

2.6 Dataset

The Pima Indians Diabetes Database is used for the experimental evaluation of the proposed work. The dataset is available at [17]. It belongs to the National Institute of Diabetes and Digestive and Kidney Diseases. It contains 768 instances, and each instance contains at least 8 feature attributes. The features are listed below:

• Number of times pregnant.
• Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
• Diastolic blood pressure (mm Hg).
• Triceps skin fold thickness (mm).
• 2-hour serum insulin (mu U/ml).
• Body mass index (weight in kg/(height in m) ${}^{2}$ ).
• Diabetes pedigree function.
• Age (years).
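Numeric attributes such as those listed above usually need cleaning before they are passed to a classifier. The following is a minimal sketch of missing-value replacement followed by standardization, assuming that missing entries are marked as `None`, that they are replaced by the column mean, and that each column is then z-scored; the helper name and the toy rows are hypothetical.

```python
import math


def impute_and_standardize(rows):
    """Replace None by the column mean, then z-score every column."""
    cleaned_columns = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        mean = sum(present) / len(present)
        # Filling with the mean leaves the column mean unchanged.
        filled = [mean if v is None else v for v in col]
        std = math.sqrt(sum((v - mean) ** 2 for v in filled) / len(filled))
        cleaned_columns.append(
            [(v - mean) / std if std > 0 else 0.0 for v in filled]
        )
    return [list(row) for row in zip(*cleaned_columns)]


# Toy rows: [number of pregnancies, plasma glucose]; one glucose value missing.
raw = [[1, 100.0], [3, None], [5, 140.0]]
clean = impute_and_standardize(raw)
print(clean)
```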

Out of these data, a total of 700 samples are considered in our case. From these, a different number of train/test samples is used for each classifier. The training-set sizes are given below:

• For J-48: $|\textit{train}|=420$.
• For AdaBoost-M1: $|\textit{train}|=300$.
• For Naive-Bayes: $|\textit{train}|=450$.
• For SVM: $|\textit{train}|=200$.
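A split with the training-set sizes listed above could be produced as follows. How the samples were actually selected is not stated, so the seeded random shuffle is an assumption, and the integer ids merely stand in for the 700 records.

```python
import random


def split_train_test(samples, n_train, seed=0):
    """Shuffle a copy of `samples` and return (train, test) with n_train training items."""
    if not 0 < n_train < len(samples):
        raise ValueError("n_train must lie strictly between 0 and len(samples)")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Training-set sizes quoted in the list above; ids stand in for the records.
TRAIN_SIZES = {"J-48": 420, "AdaBoost-M1": 300, "Naive-Bayes": 450, "SVM": 200}
records = list(range(700))
splits = {name: split_train_test(records, n) for name, n in TRAIN_SIZES.items()}
for name, (train, test) in splits.items():
    print(name, len(train), len(test))
```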

Table 1
Performance comparison (rates of accuracy)

Evaluation/methods	J-48	Naive-Bayes	SMO	AdaBoost	SVM	Voting
True-classified	600	615	665	632	674	632
False-classified	100	85	35	68	26	68
Kappa	0.4985	0.4588	0.3594	0.3946	0.3685	0.3946
MSE	0.24	0.22	0.11	0.16	0.06	0.16
RMSE	0.0039	0.0037	0.0029	0.0036	0.0023	0.0029
Total	700	700	700	700	700	700
Accuracy in % ( $k$ -fold)	82	83.5	85.5	91	92	82
Accuracy in % (Precision)	82.25	83	85.75	90	92.25	82.25

2.7 Working steps

Mentioned below is the list of working steps which are executed in sequence for implementing the proposed work.

Proposed algorithm (Voting expert with the quad)

1. Load the raw dataset.
2. Apply data pre-processing and filtering with the target attribute as the last attribute (class label), using the Weka tools.
3. Replace and standardize missing values to generate the final dataset $D$ .
4. Apply J-48 to $D$ and record the results.
5. Apply AdaBoost-M1 to $D$ and record the results.
6. Apply SMO to $D$ and record the results.
7. Apply Naive Bayes to $D$ and record the results.
8. Apply SVM to $D$ and record the results.
9. Test the above four models using a separate test dataset with a $k$ -fold ( $k=10$ ) cross-validation setup.
10. Evaluate all the four algorithms through the voting expert and output the label that receives the most predictions (both DM-1 and DM-2).
11. In case of a tie, output the label DM-3.
12. Output the diagnosed result.
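The voting step at the end of the algorithm can be sketched as a plain majority vote with the stated tie rule. The function name is hypothetical, and the per-classifier predictions below are made up purely to exercise the two branches.

```python
from collections import Counter


def voting_expert(predictions, tie_label="DM-3"):
    """Return the label predicted by most classifiers; use tie_label on a tie."""
    counts = Counter(predictions.values()).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


# Made-up predictions for one patient record.
votes = {
    "J-48": "DM-2",
    "AdaBoost-M1": "DM-2",
    "SMO": "DM-1",
    "Naive Bayes": "DM-2",
    "SVM": "DM-1",
}
tied_votes = {
    "J-48": "DM-1",
    "AdaBoost-M1": "DM-2",
    "SMO": "DM-1",
    "Naive Bayes": "DM-2",
}
print(voting_expert(votes))
print(voting_expert(tied_votes))
```

With an odd number of voters and two candidate labels a tie cannot occur, so the tie rule only matters when an even number of classifiers take part or more than two labels are in play.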

3. Experimental evaluation

The algorithm is executed using Weka, an open-source toolkit developed at the University of Waikato. Satisfactory results are obtained with the proposed scheme. Its performance is measured using two different techniques separately. The first is $k$-fold cross-validation with $k=10$. The second is the precision and recall method, which considers the TP (true positive), TN (true negative), FP (false positive), and FN (false negative) counts. Here the accuracy is computed as per the formula given below:

$\displaystyle\textit{Accuracy}=\frac{\textit{TP}+\textit{TN}}{\textit{TP}+\textit{TN}+\textit{FP}+\textit{FN}}$
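Both evaluation techniques can be sketched in a few lines. The accuracy function follows the formula above, precision and recall use their standard definitions, and the fold generator assumes contiguous, unshuffled folds; the confusion counts are hypothetical and are not taken from Table 1.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of true positives that were detected."""
    return tp / (tp + fn)


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical confusion counts for one classifier.
print(accuracy(tp=50, tn=40, fp=6, fn=4))
print(precision(tp=50, fp=6), recall(tp=50, fn=4))

# Ten folds over 700 sample indices.
folds = list(k_fold_indices(700, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

In a cross-validation run, a classifier would be trained on each `train_idx` and scored on the matching `test_idx`, and the ten scores would then be averaged.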

Figure 6.

Comparison of the rate of recognition accuracy of constituent schemes individually.

The results so obtained are depicted in Table 1. It can be observed that, for the particular dataset, the SMO scheme outperforms others with respect to the overall rate of accuracy.

Comparative analysis of the proposed scheme is also carried out against three other state-of-the-art schemes, namely neural network, CoLe, and RBF-NN (radial basis function neural network), on the same dataset. The corresponding plot is shown in Fig. 6. It can be noted that the proposed scheme outperforms the others in terms of overall accuracy. This validates the robustness and effectiveness of the proposed scheme.

4. Conclusion

A novel data mining approach has been proposed for the detection and classification of diabetes mellitus. An ensemble of analytic algorithms is designed from four different algorithms (the quad). A voting expert strategy is implemented to predict the correct label for the patient's medical condition, namely DM1, DM2, DM3, or healthy. The proposed work has been validated on a benchmark dataset after proper pre-processing and missing-value replacement. A satisfactory accuracy (95%) has been achieved, which validates the robustness of the proposed approach. The scheme has also been compared with three state-of-the-art schemes from the related work, and it outperforms them in terms of accuracy. Future work may include establishing the correlation between DM and other related medical conditions that arise mainly due to excess sugar in human blood.

References

1. http://www.diabetes.org/diabetes-basics/type.

2. National Diabetes Statistics Report. 2014. Available from: http://www.cdc.gov/diabetes/pubs/statsreport14/national-diabetes-report-web.pdf.

3. Pei, Han and Kamber, Data Mining: Concepts and Techniques, Elsevier, 2011.

4. Burges C.J., A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, Springer, The Netherlands 2(2) (1998), 121–167.

5. Stuart and Peter, Artificial Intelligence: A Modern Approach, Prentice Hall, 2003.

6. Arndt and Arndt, A tree kernel based on classification and citation data to analyse patent documents, Data Analysis, and Knowledge Organization, Springer, Berlin, Heidelberg, 2010.

7. Gao J.R. and Denzinger, Cole: A cooperative data mining approach and its application to early diabetes detection, in: Proceedings of the 5th International Conference on Data Mining (ICDM-05), 2005.

8. Rajesh and Sangeetha, Application of data mining methods and techniques for diabetes diagnosis, International Journal of Engineering and Innovative Technology (IJEIT) 2(3) (2012), 224–229.

9. Afrand, Yazdani N.M., Moetamedzadeh, Naderi and Panahi M.S., Design and implementation of an expert clinical system for diabetes diagnosis, Global Journal of Science (2012), 23–31.

10. Patil B.M., Joshi R.C. and Toshniwal, Association rule for classification of type-2 diabetic patients, in: 2nd International Conference of IEEE on Machine Learning and Computing, 2010, pp. 67–73.

11. Jaya R.K.V.V., Chandra S.D.V., Satya P.R. and Rao K.R.H., An empirical study about type-2 diabetes using duo mining approach, International Journal of Computational Engineering Research 2 (2012), 33–42.

12. Mandal and Dubey, Implementation and evaluation of diabetes management system using clustering technique, Special Issue of International Journal of Computer Science and Informatics 2 (2012), 33–46.

13. Kavitha and Sarojamma R.M., Monitoring of diabetes with data mining via cart method, Special Issue of International Journal of Computer Science and Informatics 2 (2012), 157–162.

14. Ananthapadmanaban K.R. and Parthiban, Prediction of chances – diabetic retinopathy using data mining classification techniques, Indian Journal of Science and Technology 7 (2014), 1498–1503.

15. Breault J.L., Data mining diabetic databases: Are rough sets a useful addition? in: Proceedings of the 4th National Conference, INDIA Com-2010 Computing for Nation Development, 2010.

16. Parthiban S.G. and Rajesh, Diagnosis of heart disease for diabetic patients using naive bayes method, International Journal of Computer Applications 24 (2011), 75–87.

17. Lichman, UCI machine learning repository, 2013. [Online]. Available: http://archive.ics.uci.edu/ml.