Abstract
Breakthrough classification performances have been achieved by utilizing ensemble techniques in machine learning and data mining. Bagging is one such ensemble technique that has outperformed single models in obtaining higher predictive performances. This paper proposes an ensemble technique by utilizing the basic bootstrap aggregating technique on hybridization of two base learners namely Naïve Bayes (NB) and Decision Tree (DT). Before induction of the DT, NB algorithm is employed for eliminating mislabeled or contradictory instances from the training set. Consequently, bagging approach is applied on hybrid NBDT as the base learner. The resultant Bagged Naïve Bayes-Decision Tree (BNBDT) algorithm is then used for improving the classification accuracy of various multi-class problems. This algorithm iteratively trains the base learner from random samples of the training set, and then performs majority voting of their predictions. The proposed algorithm is compared with both ensemble and single classification techniques such as Random Forest, Bagged NB, Bagged DT, NB, and DT. Experimental results over 52 UCI data sets with bag size 100 demonstrate that the proposed algorithm significantly outperforms the existing algorithms.
Keywords
Introduction
In the area of machine learning and data mining, supervised classification mechanisms are considered important tools for decision making. Classification can be defined as a technique for grouping or categorizing the data instances into known classes. It has been applied to numerous applications like bioinformatics, fraud detection, banking, and other areas [1, 2]. As a kind of inductive learning algorithm, decision tree algorithms have been successful to build classifiers with the aim to maximize the classification accuracy. Among probabilistic induction techniques, Naïve Bayes has been widely used due to its competitive behavior with the best inductive learning methods and their inherent resilience to noise [3]. The ensemble classification approaches hybridize multiple hypotheses induced by different base learners to obtain better classification performance [4].
In the last few decades, one of the major issues which has received great attention in classification problems is how to obtain better classifications. This problem intensifies even more when the number of classes are increased. In multi-class scenario, it is assumed that class-labels are independent of one another and therefore most of the proposed techniques attempt to improve the performance of classifiers based on it [5]. Many real-world applications related to disease diagnosis, credit risk scoring, information retrieval and other predictive domains are associated with multi-class classification problems [6]. Such problems have drawn growing interest recently due to their classification difficulty caused by multiple classes. These problems arise in the data instances that belong to one specific class among more than two different classes. In particular, many ensemble and hybrid approaches have been proposed to deal with such multi-class issues. However, most efforts so far only focus on utilizing single learner technique for dealing with binary or multi class datasets. Moreover, various challenges are posed by the multi-class classification problems which need to be effectively investigated [7, 8].
During past years, numerous works have been performed to improve the performance of the learners [9]. The techniques proposed in the existing studies belong to one of the three categories: (1) designing better learning approaches, (2) applying some type of transformation on training dataset, and (3) modifying the predictions produced by the learner. The concept of ensemble learning which is composed of bagging and boosting multiple classifier systems falls into type (1). Most existing classification systems employ individual learners usually designed for solving binary-class problems. Such classification systems are less effective and are unable to deal with multi-class tasks. Although many solutions were proposed for multi-class classification problems, most attention in the literature was given to hybrid or ensemble strategies [10]. Among multi-classifier systems, several aspects on the fusion process of individual classifiers such as each classifier’s accuracy, diversity and independence need to be studied for robust construction of ensemble of hybrid classifiers. Therefore, it is desirable to develop an effective and efficient technique for handling the multi-class classification problems [11, 12].
In this paper, a novel bagged naïve-bayes decision tree methodology has been proposed for solving multi-class classification problems. Firstly, the algorithm utilizes NB classifier for searching troublesome and arduous instances from the training dataset and then removes these samples from the training dataset before construction of the DT classifier for obtaining decisions. The presence of noisy instances in the dataset may produce many potential negative consequences which can decrease the classification accuracy. In addition, the complexity of the classifier may increase due to a number of redundant training samples. Researchers utilize NB classifier for detection of noisy or contradictory instances by taking into consideration the independence assumption and class conditional probabilities of the features. With removal of noisy samples from the data, the quality of the data gets improved which results in building of more accurate and precise models. The real-world data contains various types of errors which are implicit and explicit in nature. Previous researches have shown that among various classifiers, NB is the most noise tolerant algorithm which attempts to reduce the noise levels and improves the quality of data [13, 14]. Due to this reason, the DT classifier might suffer from the problem of overfitting leading to reduced classification accuracy. Furthermore, for datasets with huge number of attributes, the calculation of class conditional independence by a naïve bayes classifier enhances the computational cost. The performance of the proposed bagged NB-DT classifier is evaluated against three ensemble approaches namely bagged NB, bagged DT, RF and two existing individual classifiers viz. DT and NB taking classification accuracy as the performance parameter. An extensive empirical experimental analysis was conducted using 10 runs of 10-fold CV on 52 benchmark datasets from University of California (UCI) machine learning database [15]. Additionally, we compare the proposed and existing algorithms using two-tailed student’s T-test having confidence level of 95%. On the basis of statistical methods, the difference between the proposed and existing algorithm is statistically significant only if the P-value between both the algorithms for a T-test is less than 0.05. The experimental outcomes reveal that the proposed method exhibits encouraging classification results in solving real-life multi-class problems. More specifically, the method proves to be more robust during the presence of noisy instances in the dataset, and therefore produces more effective and accurate classification results.
The remainder of the paper is organized as follows. In Section 2, related work and some background on bagging approach, Naïve Bayes and decision trees classification algorithms is provided. Section 3 addresses the preliminary concepts and descriptions about the classification techniques. Section 4 introduces our novel proposed bagged NB-DT framework, followed by the experimental analysis and discussions in Section 5. Finally, we conclude the paper in Section 6 with suggestions for future works.
Literature review and background
This section reviews the recent researches performed on decision tree, naïve bayes and bagging-based learning models in solving numerous challenging real-life classification problems. Such problems are usually categorized into binary and multi-class tasks. Decision tree classification methodology is the most widely used data mining technique which develops prediction systems using large number of covariates and a target variable. It can efficiently deal with complicated and huge size datasets, and are robust to outliers. The complexity of constructed decision tree can be controlled by employing the stopping criteria and pruning method which can improve the classification accuracy. De Caigny et al. [16] presented a hybrid classification algorithm which utilizes decision trees and logistic regression classifiers for customer churn prediction. The proposed logit leaf model (LLM) exploits the capabilities of both techniques where in DT can effectively deal with interaction effects among variables and LR can handle linear relations among attributes very efficiently. This technique enhances the predictive performance of the classifiers as well as interpretability of the results. Kotsiantis [17] proposed a decision tree technique by combining local application of naïve bayes classifier to increase the classification accuracy of the model. Further, the proposed algorithm was compared with the other state-of-the-art methods on 30 benchmark datasets for attaining the performance parameters. Carvalho and Freitas [18] proposed a hybrid-based decision tree/ genetic algorithm describing the data mining concept of small disjuncts. In this approach, two genetic algorithms (GAs) have been developed for discovery of rules belonging to small disjuncts as opposed to the DT approach that generates rules to classify instances belonging to large disjuncts.
Panhalkar and Doye [19] presented a literature survey about various hybridized methods for DTs based on integration of the basic DT with several approaches like AVL tree, clustering, genetic algorithm, Naïve Bayes and fuzzy techniques. Wang et al. [20] developed a novel self-adaptive NBTree methodology which creates a hybrid of DT and NB. The NB classifier that builds DT can deal with continuous variables and solves the overgeneralization as well as overspecialization problems that are mostly observed in DT. Polat and Gunes [21] presented a new hybrid classification algorithm for solving multi-class problems which are based on the combination of C4.5 DT learner and one-against-all technique. Classification accuracy, sensitivity and specificity results were obtained with 10-fold cross validation on UCI datasets. Lee et al. [22] proposed a novel bagging C4.5 approach that supports wrapper based feature selection for improving the diagnostic accuracy of clinical decision support systems. The proposed sampling method (S-C4.5-SMOTE) not only deals with the data distortion problem but also boosts the overall classification performance of system by maintaining a balanced dataset. Singh and Verma [23] proposed a multi-classifier model for predicting the faulty modules of software system. The proposed model which is an amalgamation of NB, SVM and RF provides better results than each of the individual classifiers, AIR1, AIR2 and proves to be efficient and robust in fault prediction of numerous software projects.
During the past few years, ensemble or multiple classifier systems (MCS) have proved their excellence in terms of accuracy and competency when compared with single classifier models. Ala’raj and Abbod [24] presented a novel combination technique based on consensus of heterogeneous classifiers for integrating MCS which possess diverse classification algorithms. The 5 major base classifiers used were neural networks, SVMs, RFs, DT and NB. The performance results of the proposed consensus approach were compared with two benchmark algorithms viz. logistic regression (LR) and multivariate adaptive regression splines (MARS) on 5 real-world credit scoring datasets. Also, the model was validated over 7 traditional combination methods using 4 different performance evaluation measures namely accuracy, AUC, H-measure and Brier score. Sun et al. [25] proposed a DT ensemble technique for enterprise and bank credit evaluation based on SMOTE (Synthetic Minority Over-sampling Technique) for class-imbalanced data and bagging ensemble method with DSR (Differentiated Sampling Rates), known as DTE-SBD (Decision Tree Ensemble based on SMOTE, Bagging and DSR). The model not only solves the problem of class imbalance but also enhances the diversity among base classifiers during ensemble modeling of DT. The experiments were conducted 100 times on financial data, and DTE-SBD significantly outperformed when compared with the 5 other models namely pure DT, bagging DT, SMOTE DT, over under-sampling DT and oversampling DT.
Another commonly used data mining tool is the NB classifier used for prediction and classification purposes due to its high computational efficiency, simplicity and better classification accuracy. This classifier is predominantly designed for huge size datasets. It is based on the assumption of strong conditional independence among attributes which in most cases decreases the prediction performance Wu et al. [26] proposed a novel self-adaptive variable weighing method based on the concept of Artificial Immune System (AIS) for classifying naïve bayes. The proposed method AISWNB accurately calculates the conditional probability and increases the assumption of conditional independence. Chandra and Gupta [27] presented a robust NB classification algorithm (R-NB) for overcoming its two main drawbacks viz. overfitting and underflow for prediction and classification of high-dimensional gene data. The underflow problem is solved by adding the logarithms of the probabilities instead of multiplying them, and the overfitting problem by using the estimate approach. The novel approach uses a robust function for determining the NB classifierprobabilities.
Among ensemble classification approaches random forests has proved its merit in solving several multi-class classification problems in the domain of machine learning. It can be regarded as a meta-estimator that fits a large number of DT classifiers on many subsamples of the dataset It uses averaging to control overfitting thus improving the classification accuracy. Wei et al. [28] developed a novel cascade random forest (CRF) algorithm to resolve the issue of data imbalance that occurs in prediction of Protein-Protein Interaction sites (PPIs). Lin et al. [29] proposed a random forest based ELM ensemble approach for multi-regime time series prediction. The regularized ELM combined with ensemble technique improves the prediction accuracy as well as the performance of generalization. Chaudhary [9] et al. presented an improved random forest classifier (RFC) for multi-class disease classification problem. The proposed approach is an integration of RF, attribute evaluation method and a filter method which shows best performance among the five benchmark datasets. Silva-Palacios et al. proposed two approaches to solve the multi-class classification difficulties by hierarchically decomposing the original problem and decreasing the number of classes of each local subproblem [30].
Supervised classification methods
Classification is one of the most frequently used data mining and machine learning mechanisms used for prediction and decision making. The following sub-sections discuss about the three state-of-the-art supervised learning techniques i.e. Naïve-Bayes, Decision Tree and Random Forest utilized for performing the comparative study.
Decision tree induction
A decision tree algorithm generates trees as well as rules for data classification. A common DT algorithm is C4.5 which is an improvement over the ID3 algorithm since it deals with continuous and discrete variables as well as missing values. The final output of DT is either a fully constructed tree from root to leaves or a ruleset that assigns class label to a new data sample. For a given training dataset
The information needed after using attributes
The information gained by branching on attribute
Naïve Bayes (NB) based on the Bayes’ theorem is a probabilistic classifier [31] with assumptions of attribute-independence. It predicts the class membership probabilities as described in the following equation:
where P (C
i
/X) denotes the posterior probability, P (X/C
i
) as likelihood, P (C
i
) as class prior probability and P (X) as the predicted prior probability. NB classifiers assume that the effect of an attribute value on a given class is independent of the values of other attributes. Thus, it is based on the assumption of class conditional independence. Further, let the training dataset be
In Equation (5), 1 ≤ j ≤ m, j ≠ i. Further from Equation (4), since P (X) is constant for all the given classes, the term P (X/C
i
) P (C
i
) needs to be maximized. In case, the prior probabilities of the classes are unknown, it is presumed that classes are having equally likely distribution such that P (C1) = P (C2) = … = P (C
m
)and therefore maximize X/C
i
or else maximize P (X/C
i
) P (C
i
). The prior probabilities of the classes are computed using P (C
i
) = |Ci,D|/|D|, where |Ci,D| represents the number of training instances of class C
i
in dataset D. Among high dimensional datasets, it becomes extremely expensive to compute P (X/C
i
). Therefore, given the class label of a tuple, NB class independence assumes that the attribute values are conditionally independent of one another. Consequently, this reduces the computation in evaluating P (X/C
i
) which is defined as
The probabilities P (x1/C
i
) , P (x2/C
i
) … P (x
n
/C
i
) can be estimated from the training tuples where x
k
denotes the value of attribute A
k
for tuple X. For each of the given attribute, we determine whether the attribute is categorical or continuous-valued. In computation of P (X/C
i
), if A
k
is categorical, then P (x
k
/C
i
) is the probability of determing x
k
for A
k
given the class C
i
in the dataset D. If A
k
is continuous-valued, then it is assumed to have Gaussian distribution with mean μ and standard deviation σ given by
Random Forest proposed by Breiman [32] is an ensemble classification approach which integrates multiple tree predictors in such a way that each tree depends on the values of an independently sampled random vector with similar distribution among all the trees in the forest. The RF algorithm produces a random vector θk, independent from the past random vectors θ1, …, θ
Proposed Bagged Naïve Bayes-Decision Tree approach
This section discusses the proposed Bagged Naïve Bayes-Decision Tree technique based on the application of ensemble approach to the integration of C4.5 decision tree with NB classifier. For a given training dataset D = {x1, x2, …, x n }, each sample is depicted as x i = {x i 1, x i 2, …, x i h}, and the variables are shown as {A1, A2, …, A n }. The set of class labels in the training data are defined as C = {C1, C2, …, C m }. Firstly, the NB classifier is applied to dataset D for prediction of each x i . Then, for each class C i ∈ D, the prior probability P (C i ) and for each variable A ij , the class conditional probability P (A ij |C i ), are calculated for dataset D. Further, each training sample is classified by utilizing these probabilities and finally the sample belongs to class C i which has the highest posterior probability P (C i |x i ). Thus, all the wrongly classified training instances are removed from the dataset D. These misclassified samples are the troublesome instances which adversely affect the classification accuracy and are therefore eliminated by the NB classifier. Such noisy samples contain improper features and contradictory class labels as compared to the original data. Due to the presence of noisy samples, the DT learner might overfit or its accuracy may decrease.

Workflow of the proposed Bagged Naïve-Bayes Decision Tree algorithm.
After removal of noisy instances from D, the decision tree learner is built from the newly constructed noise free dataset D′. For DT construction, the best splitting attribute is obtained which produces highest information gain. Further, the accuracy of the proposed model can be increased by optimizing the DT parameters such as tree size, split parameter etc. The flow chart of the proposed model is shown in Fig. 1.
As an ensemble-based technique, bagging integrates the predictions of multiple classifiers to produce an individual classifier. The resulting single classifier is more accurate than any of the individual classifiers forming the ensemble. Previous researches have demonstrated that a better ensemble can only be formed when the single learners forming the ensemble are accurate and generate their errors on distinct regions of the input space. One of the key factors which improves the performance as well aspredictive accuracy of the bagging technique is stability whereas instability refers to small modifications in the dataset which leads to crucial changes in the prediction results. Bagging creates accurate ensembles on the basis of resampling techniques, also known as bootstrapping methods by obtaining varied training datasets for each individual learner. This technique is more efficient on learning algorithms such as neural networks and decision trees which are unstable in nature [33]. Bagging improves the classification accuracy of unstable machine learning algorithms but not of the stable machine learning algorithms. It helps reduce variance and avoids overfitting of the classifier.
In this paper, two approaches NB and DT are selected as weak leaners for ensemble learning. Firstly, both NB and DT are hybridized into one single entity over which the bagging approach is applied with creation of 100 such bags. Finally, the 100 prediction outputs are combined through majority voting technique as a classifier consensus system. The class label obtaining the highest number of votes is regarded as the final predicted class for the test sample. The basic steps of proposed approach are as follows: Suppose T = {(x1, y1) , (x2, y2) , …, (x
m
, y
n
)} depicts a training dataset where m is the total number of samples and their corresponding class labels are represented by n, such that an individual instance with its associated class is shown as (x
i
, y
i
). A uniform sampling with replacement technique is adopted for obtaining a new training dataset D with some repeated observations, with D having approximately 63.2% of unique samples of T and the remaining left samples are duplicates. Due to this, the overall bootstrap sampled dataset consists of some repeatable instances whereas some samples are eliminated. Finally, the hybrid NB-DT algorithm is applied to the different bootstrap data bags for obtaining their respective predictions, to which majority voting is applied for attaining an integrated output. The Bagged Naïve Bayes Decision Tree approach is explained as below inalgorithm 1:
The detailed experiments were performed on 52 benchmark datasets. The missing values were removed from the dataset as a part of preprocessing mechanism. The flowchart of the proposed work has been shown as Fig. 1. Furthermore, the benchmark data and parameters, comparison with the baseline methods, tabular analysis with the accuracy as performance parameter is shown.
Details of the experimental analysis results on classification accuracy and standard deviation (SD) % on b = 100
Details of the experimental analysis results on classification accuracy and standard deviation (SD) % on b = 100
Two-tailed Students’ t-test on classification accuracy (ACC).
Two-tailed Students’ t-test on classification accuracy (ACC).
The whole experiment is conducted in the MATLAB platform, which runs on Windows 8 operating system with Intel® Core™ i7-4770 CPU (3.4GHz) and 8 GB of RAM. The proposed method is implemented on 52 benchmark datasets from the UCI repository [34] and some other benchmark databases. Appendicitis dataset [35], three datasets namely ring, titanic, and twonorm from delve repository [36], and saheart dataset [37] utilized in the experimental analysis are obtained from different sources other than UCI. The different datasets includes data from life science, physical sciences and other areas. As a data preprocessing step, we have removed those instances from the dataset which contain missing values. The results of classification are achieved by experimental analysis of ten runs of ten-fold CV with proposed model obtained from training data and performance parameters evaluated from thetesting data.
In our experiments, the selected algorithms are evaluated using classification accuracy (measured by ACC) as the performance parameter. The accuracy of each algorithm is computed using the percentage of correctly classified instances in the testing dataset. ACC is defined by
In order to perform a comparative analysis, the proposed algorithm has been compared with the below given baseline methods:
Table 1 reports the detailed results of the average accuracy (ACC) and the corresponding standard deviation values of Bagged NB-DT and other baseline algorithms. The ACC values of competing algorithm are associated with either a or b or none. Here a represents that the ACC values of the algorithm is significantly higher (upgradation) than the proposed algorithm. Similarly, b shows that the ACC value of the competing algorithm is significantly lower (degradation) than that of the proposed algorithm. Neither a nor b write up along with the ACC value of any competing algorithm means that there is no significant difference between the ACC values of that competing algorithm and the proposed algorithm. For both a and b the level of significance denoted by p is less than 0.05 i.e. the confidence level is 95 percent. Thus, the symbols a and b represent statistically significant upgradation and degradation respectively over the proposed Bagged NB-DT with the p-value less than 0.05.
The performance of the proposed Bagged NB-DT, which uses the concept of bagging applied to hybridization of Naïve Bayes and Decision Tree is determined in terms of classification accuracy (ACC). In the Fig. 2, data points below the y = x diagonal line are those datasets on which the proposed Bagged NB-DT algorithm achieves better results than the competing algorithm.

Bagged NB-DT vs. other algorithms: classification accuracy (ACC).
In addition, Table 2 illustrates the compared results of two-tailed t-test, in which each entry w/t/l means that the algorithm in the corresponding row wins in w datasets, ties in t datasets, loses in l datasets on the 52 varied datasets compared to the algorithm in the corresponding column. Altogether, the results of Table 2 can be summarized as follows: The proposed Bagged NB-DT algorithm shows the best performance when compared with Bagged NB, wins 22 datasets, ties 28 datasets, loses 2 datasets on ACC. Comparison with Bagged DT shows poor performance (4 wins and 12 losses). With RF, the performance is slightly degraded with 9 wins and 11 losses and 32 ties. When compared to standard algorithms namely NB and DT, it depicts better performance (24 wins, 2 losses and 26 ties) with NB and (23 wins, 3 losses and 26 ties) with DT. DT shows worst performance when compared with Bagged DT and RF with 0 wins on any dataset. It shows bad performance with Bagged NB (12 wins and 15 losses) and almost ties NB (14 wins and 15 losses). NB is inferior to Bagged NB (0 wins and 2 losses), Bagged DT (5 wins and 21 losses) and RF (4 wins and 25 losses). RF slightly outperforms Bagged NB. The results show that RF has a slightly better performance (22 wins and 15 losses) and ties Bagged DT (3 wins and 3 losses). Bagged DT shows better performance (18 wins and 4 losses) compared with Bagged NB. It ties RF (3 wins and 3 losses), shows better significantly outperforms on NB (21 wins and 5 losses), and depicts excellent performance on DT (27 wins and 0 losses
Overall, our experiments reveal that the compared algorithms cannot achieve good performance with respect to accuracy (ACC) measure. Among the competing algorithms, Bagged-DT and RF show slightly better performance on various datasets.
In this paper, we have proposed bagged naïve bayes decision tree algorithm for solving multi-class classification problems. During the first phase, both the classifiers naïve-bayes and decision tree are hybridized to become a single entity which acts as base classifier. This is achieved through removal of misclassified or noisy instances by NB classifier and then the modified data is used for DT induction and classification. The second phase consists of bagging the hybridized classifier obtained from the previous phase. This hybrid classifier acts as a base learner for the bagging algorithm. Considering the optimal bag size as 100, majority voting combination method was used for obtaining the final prediction. The performance of the proposed model was compared with five other classification algorithms namely, Bagged NB, Bagged DT, RF as ensemble techniques and NB, DT as individual models on 52 benchmark datasets from UCI machine learning database. The experimental results exhibit that the conceptualized bagged NB-DT classifier significantly outperforms Bagged NB, Bagged DT, RF, NB and DT. As future work, other types of learners can be hybridized with bagging, boosting and stacking ensemble approaches or genetic algorithms can be used to solve real-life dynamic multi-class problems.
