Abstract
Almost all real-world datasets contain missing values. If missing values are not handled correctly, they can adversely affect the performance of a classifier. A common approach to classification with incomplete data is imputation, which transforms incomplete data into complete data by estimating the missing values. Single imputation methods are mostly less accurate than multiple imputation methods, which in turn are often computationally much more expensive. This study proposes an imputed feature selected bagging (IFBag) method, which combines multiple imputation, feature selection and bagging ensemble learning to construct a number of base classifiers that classify new incomplete instances without any need for imputation in the testing phase. In bagging ensemble learning, the data is resampled multiple times with replacement, which introduces diversity in the data and thus results in more accurate classifiers. The experimental results show that the proposed IFBag method is considerably fast and gives 97.26% accuracy for classification with incomplete data as compared to commonly used methods.
Introduction
Data classification is one of the tasks performed in data mining and machine learning. Classification consists of a training phase and a testing phase [1]. The training phase builds a classifier on a specific training dataset, while the testing phase classifies new instances using the learned classifier. Classification has been applied successfully in domains such as medical diagnosis, credit card fraud detection and facial recognition. Predictive algorithms such as support vector machines, neural networks and logistic regression mostly require complete data and do not work with incomplete data, that is, data with missing values. Missing values are quite common in real-world datasets [2]. About 45% of all the datasets available in the UCI machine learning repository, which is the most commonly used repository, have missing values [3]. Values can go missing during data collection for various reasons, for example machine failure during an industrial experiment, respondents ignoring some questions in social surveys, tests that are not carried out on every patient, and data changes in financial datasets [4, 5]. Thus, handling missing values is an important part of data mining. Inappropriate handling of missing values can seriously degrade the performance of predictive models and lead to invalid conclusions.
The simplest way to handle missing values is to delete every instance that contains even a single missing attribute [6]. However, this listwise deletion can introduce bias, as demonstrated in [7], and this is more likely to happen when the missing values are not randomly distributed. Thus, the most commonly used approach to classification with incomplete data is to substitute missing values with probable values using imputation methods [11]. Many imputation methods have been developed so far [8, 9, 10]. For example, mean imputation substitutes all missing values of a feature with the mean of the observed values of the same feature, thus providing complete data on which any predictive algorithm can be used. Simple (single) imputation methods such as mean imputation are not very accurate, because they treat all values, including imputed ones, as true values and do not take the uncertainty of the missing values into consideration [12]. More powerful methods such as multiple imputation are often more precise than single imputation, but they are also more computationally intensive [13]. When classification algorithms are combined with imputation, it is not easy to determine whether the combination will be both effective and efficient, particularly during the testing phase.
Ensemble learning is a machine learning approach in which a collection of base classifiers is used for a classification problem, which can improve classification accuracy [14]. When applied to classification with missing data, an ensemble learning model builds multiple classifiers in the training phase. In the testing phase, each trained classifier classifies the incomplete instance, and the class with the majority of votes is given as the output [15]. Nonetheless, with the available ensemble methods, classification is often not possible for datasets with a large number of missing values [16]. Hence, work must be done to enable ensemble learning models to classify instances even when a dataset has numerous missing values. Feature selection is the process of selecting significant features from the initial features, and it is used extensively on complete datasets to improve classification [17]. Feature selection has also been studied on incomplete data [18], but in current approaches the missing values of incomplete instances are usually estimated by imputation before classification. Feature selection can decrease the number of incomplete instances by removing redundant features, thus helping to increase the precision and speed of incomplete instance classification.
We propose IFBag, which uses multiple imputation, feature selection and a bagging ensemble for classification and improves performance significantly. The proposed IFBag method is tested on eight incomplete real-world datasets taken from UCI, in which the number of incomplete instances varies from low to high. IFBag is tested with three different classification algorithms to analyze its performance. The proposed method decreases computational complexity by using a bagging ensemble with a fixed set of base classifiers, instead of using missing patterns to build a large number of classifiers, which is costly particularly where the data has a large quantity of missing values. In this study, we show how to increase accuracy and accelerate the testing phase when classifying instances with missing values. Multiple imputation is used to compensate for the uncertainty due to missing data and to transform the incomplete training data into complete training data [19]. The training data features are then pruned by feature selection. Afterwards, the proposed method applies a bagging ensemble learner to the training data. Bootstrapping helps to minimize the effects of uncertainty due to sampling fluctuation, while the ensemble creates multiple base classifiers, which are trained on sub-datasets made by bootstrapping the complete training data. These base classifiers are able to classify incoming incomplete instances without any need for imputation. We used four benchmark methods to compare and validate the relative performance of the proposed IFBag method, and we applied three different classification algorithms to it on the eight UCI datasets. The results show considerable performance improvement as compared to the benchmark methods.
The study is aimed at improving on the classification performance of existing methods for incomplete data. Its main objectives are as follows. Firstly, to use an imputation approach that gives better accuracy and has lower computational complexity when used for classifying incomplete data. Secondly, to select the feature selection method and the ensemble learner in such a way that they give high accuracy and take less execution time to classify new incomplete instances. Lastly, to evaluate the proposed method and show that it gives higher accuracy and is less computationally expensive than other ensemble methods.
The remainder of the study is organized as follows. Section 2 presents related work. Section 3 describes the proposed IFBag method. Section 4 provides insight into the experimental setup. The results are discussed in Section 5. Section 6 concludes the study and provides future directions.
Related work
This section first discusses the traditional approaches that have generally been used for classifying incomplete data. It then discusses classification of incomplete data using ensemble learning techniques. Finally, it describes how feature selection has been used with incomplete data.
Numerous approaches have been proposed for classification with incomplete data. The most popular approach for classifying incomplete data is imputation [20]. Imputation can be of two types: single imputation and multiple imputation. Single imputation methods estimate only a single value to replace each missing value. Ratolojanahary et al. [21] showed that k-nearest neighbors (kNN) imputation is the most effective single imputation method for estimating missing values. kNN imputation checks the closest
An ensemble learner is a method that builds a collection of classifiers in the training phase for a classification task. In the testing phase, each trained base classifier predicts the class label of an instance, and these predictions are combined into a single class label. Sets of multiple classifiers have been shown to be much more accurate than traditional single classifiers. Ensemble learning has also been used for classification of incomplete data. Tran et al. [23] proposed an approach in which feature selection and imputation are applied to the incomplete training data, after which an ensemble of base classifiers is created based on the patterns of missing values. The benefit of this approach is that it classifies new incomplete instances without any imputation. However, empirical results show that it sometimes fails to classify a new instance when the feature pattern of the instance does not match that of any trained classifier. Khan et al. [24] proposed a multiple imputation method with a bagging ensemble to classify instances with missing values. This ensemble is able to model uncertainty and diversity in the data while handling a high number of missing values. Their results showed that bootstrapping with multiple imputation using expectation maximization performs better than single imputation at a high missingness ratio and similarly at a low missingness ratio. Twala and Cartwright [25] proposed a bagging ensemble approach that creates sub-samples of the incomplete data, all of which are used to train decision tree classifiers. The resulting ensemble is optimized in size, as it selects only decorrelated decision trees, and the combined outputs of the classifiers give the result. No imputation was used in this approach initially; a few multiple imputation methods were added later, but it is not clear how multiple imputation was used. Baneshi and Talei [26] applied multiple imputation with the MICE method to incomplete data, then applied bootstrapping to the imputed data and aggregated the results using statistical techniques. Valdiviezo et al. [27] combined missing data procedures with tree-based classification methods after single and multiple imputation. They concluded that single imputation is enough for small missingness, while multiple imputation with tree bagging should be used for large missingness. Schomaker and Heumann [28] studied how best to calculate randomization-valid confidence intervals when using multiple imputation on bootstrapped samples and bootstrapping on multiply imputed datasets. They also suggested that multiple imputation of bootstrap samples should be used when missingness is low, and bootstrapping of multiply imputed data when missingness is high.
Researchers have also explored other types of classifier techniques to handle incomplete data. Su et al. [29] presented an approach that uses an ensemble classifier to handle missing values. They removed a fixed amount of attribute values from the incomplete data, thus creating multiple versions of the original incomplete dataset. All the datasets are then imputed, a separate classifier is trained on each, and the classification results are combined. Their results showed that an ensemble learner with a Bayesian base classifier and expectation maximization imputation gives improved performance as compared to single classifiers. A drawback of this approach is that it removes further values from already incomplete data, which reduces accuracy. Nanni et al. [30] proposed an approach that uses random subspaces with multiple imputation. The main idea is to divide the incomplete data into a fixed number of clusters and, in each cluster, to replace the missing values of incomplete data objects with the center (mean) value of the cluster. Compared with replacing missing values by the mean of the full data, this reduces the information loss caused by mean imputation. The random subspace method is then executed multiple times on the imputed data to build an ensemble. This method shows increased performance on several medical datasets, and its performance does not drop when missingness increases to 30%. However, it is necessary to further investigate ensemble methods for classifying incomplete data.
Feature selection is an approach used to extract relevant subsets of features from a dataset by removing unnecessary and insignificant features [17], thus reducing training time and improving classification accuracy. Feature selection can be divided into two parts: an evaluation technique and a search technique. The evaluation technique measures the value of a feature subset, and the search technique searches for new feature subsets. Both techniques affect the result of feature selection, as pointed out by Chandrashekar and Sahin [31]. Evaluation measures can be subdivided into wrapper and filter methods. The wrapper approach tests feature subsets using a classification algorithm, while the filter approach scores feature subsets using a measure such as mutual information. Because wrapper approaches use a classification algorithm to evaluate feature subsets, they tend to be more accurate. Filter approaches are often more efficient than wrapper approaches, because they are independent of any classification algorithm. Hancer et al. [32] explain that search techniques for feature selection are divided into deterministic and evolutionary search techniques. Examples of deterministic search techniques are sequential forward selection (SFS) and sequential backward selection (SBS). Evolutionary search techniques used for feature selection include particle swarm optimization (PSO), genetic programming (GP) and genetic algorithms (GAs). Evolutionary search techniques can search for the global best solution without any assumptions about the search space or domain knowledge. Qian and Shu [33] point out that although feature selection has been applied to data with missing values, most classification algorithms still cannot use such data directly. Thus, further study is needed on how to use feature selection with data having missing values in such a way that it improves the performance of classification algorithms.
Proposed IFBag method
In this section, we discuss the proposed IFBag method. We first present the overall structure and concept behind the proposed method, and then explain the training and testing phases in detail.
A detailed overview of proposed IFBag method
The proposed IFBag method can be divided into two phases: a training phase and a testing phase. The training phase builds a collection of classifiers trained on the training data. The testing phase classifies new instances using the trained base classifiers. The proposed method does not require any prior domain knowledge. The flowchart of the IFBag method is shown in Fig. 1. The proposed IFBag method rests on three fundamental concepts. The first concept is to build an ensemble of base classifiers. These base classifiers are trained on sub-datasets extracted by bootstrapping the training data. Bootstrapping introduces diversity in the training data and thus yields diverse and accurate classifiers. The ensemble allows incomplete instances to be classified without any need for estimating missing values using imputation.
Complete overview of the proposed IFBag method.
The second fundamental concept is to use imputation on incomplete instances during the training phase and not during the testing phase. Powerful imputation methods tend to create high-quality training data, so classifiers built on the imputed data tend to be more accurate. The biggest drawback of powerful imputation methods is that they are computationally expensive. However, in most situations the training phase has no time limit, so the high computation cost of multiple imputation is not a problem there. In the testing phase, on the other hand, there can be limited time to classify instances, and the high computation cost of multiple imputation makes it impractical. The third concept is to apply feature selection to further enhance the training data by removing unnecessary and insignificant features. Feature selection can also remove missing values along with the discarded features, thus reducing the time needed to estimate missing values using imputation.
The main purpose of the training phase is to create a collection of trained classifiers that are used to classify instances with missing values. The basic steps of training phase are shown in Algorithm 1 which are domain independent in nature. The algorithm takes training data
Algorithm 1: Training phase
Training phase can be divided into three parts: imputation, feature selection and bagging ensemble.
Imputation
We used MICE to estimate the missing values in the training data. Because MICE can be computationally expensive, it is used only in the training phase. Algorithm 2 describes MICE imputation [34]. The MICE algorithm first initializes all missing values with the mean value of each column to get a complete dataset
Algorithm 2: MICE
Shows working of MICE imputation.
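The chained-equations idea can be illustrated with a heavily simplified sketch. It is not the authors' Algorithm 2: it assumes exactly two numeric columns, uses ordinary least squares as the per-column model, and is deterministic, whereas full MICE conditions each column on all other columns, draws imputations from a predictive distribution and produces several imputed datasets. The function names `fit_line` and `chained_impute` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def chained_impute(rows, n_iter=10):
    """Simplified, deterministic chained-equations imputation (two columns only)."""
    missing = [[v is None for v in r] for r in rows]
    data = [list(r) for r in rows]
    # Step 1: initialise every missing cell with the mean of its column.
    for j in range(2):
        observed = [r[j] for r in rows if r[j] is not None]
        col_mean = sum(observed) / len(observed)
        for r in data:
            if r[j] is None:
                r[j] = col_mean
    # Step 2: cycle over the columns, re-estimating the missing cells of each
    # column from a regression on the current values of the other column.
    for _ in range(n_iter):
        for j in range(2):
            other = 1 - j
            train = [(r[other], r[j]) for r, m in zip(data, missing) if not m[j]]
            a, b = fit_line([t[0] for t in train], [t[1] for t in train])
            for r, m in zip(data, missing):
                if m[j]:
                    r[j] = a + b * r[other]
    return data


# Toy data (illustrative only): the second column is twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [None, 8.0], [5.0, 10.0]]
completed = chained_impute(rows)
```

Observed cells are never overwritten; only the cells that were missing in the input are re-estimated in each cycle.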
Feature selection is done to remove redundant and irrelevant features from the dataset. If the search and evaluation methods are not chosen carefully, feature selection can become an expensive process. Wrapper methods, while more accurate, are often computationally expensive as compared to filter methods. This is because wrapper methods run a classification algorithm to score feature subsets, whereas filter methods use measures such as mutual information. Wrapper methods tend to be especially expensive with large training data or with classifiers such as the multilayer perceptron. Some filter methods, such as correlation feature selection (CFS) [35] and the minimal-redundancy-maximal-relevance criterion (mRMR) [36], have accuracy close to that of wrapper methods while being computationally less expensive. Thus, as the datasets we use contain numerous features and instances, we use the CFS filter method. The heuristic [37] used by CFS is given in Eq. (1).
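As a minimal sketch, assuming Eq. (1) is the standard CFS merit heuristic, Merit_S = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the number of features in subset S, r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation, the score can be computed as follows. The function name and the correlation values are hypothetical.

```python
from math import sqrt


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    if k == 0:
        return 0.0
    denominator = sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return k * mean_feature_class_corr / denominator


# Hypothetical correlation values, for illustration only.
# Same relevance to the class, but different redundancy among the features.
redundant = cfs_merit(k=4, mean_feature_class_corr=0.5, mean_feature_feature_corr=0.9)
diverse = cfs_merit(k=4, mean_feature_class_corr=0.5, mean_feature_feature_corr=0.1)
```

With equal relevance to the class, the subset whose features are less correlated with one another receives the higher merit, which is why the measure favours relevant but non-redundant features.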
In Eq. (1),
Feature selection method working mechanism. (a) Detail working of feature selection method irrespective of dataset. (b) Working of feature selection method when using mammographic dataset.
Evolutionary search techniques such as GA and PSO have been shown to give better performance for feature selection [23]. Thus, to perform feature selection efficiently, we used GA to search for feature subsets and CFS to evaluate each subset. Figure 3a shows the working of the feature selection method in detail: GA searches for feature subsets, the CFS evaluator scores each subset on the basis of the correlation between features and sends back its MeritS score, and this score is then used to select the best subset of features. Figure 3b shows how feature selection works with the mammographic dataset, which is one of the eight datasets used in the experiments. The features of the mammographic dataset, including BI-RADS, Age, Margin and Density, are shown in Fig. 3b. The GA search method searches for different feature subsets that are then sent for evaluation to CFS, which in turn obtains the
After imputation and feature selection, the training data is complete and free from irrelevant features and missing values. A bagging ensemble technique is then applied to the training data. It creates twenty-five base classifiers, each with its own diverse sub-dataset created by bootstrap sampling of the training data with replacement. Bootstrap sampling helps to minimize the effects of uncertainty due to sampling fluctuation in the training dataset, which occurs when using cross-fold validation. A bagging ensemble trains its base learners independently of one another on these bootstrap samples, rather than sequentially as in boosting, and achieves better performance by introducing diversity in the training data, thus making diverse and accurate classifiers.
This ensemble learner can use any classification algorithm: each base classifier is trained on a different bootstrap sample, and the ensemble output is the majority vote over the predictions of the base classifiers. Depending upon the number of base classifiers
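The bagging step can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: a 1-nearest-neighbour predictor stands in for the base classifier, the toy data is invented, and the helper names (`bootstrap`, `one_nn`, `train_bagging`, `predict_majority`) are hypothetical. Only the ensemble size of 25 and the sampling with replacement come from the text.

```python
import random
from collections import Counter


def bootstrap(X, y, rng):
    """Draw len(X) instances with replacement, so an instance can repeat."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]


def one_nn(X_train, y_train):
    """Return a 1-nearest-neighbour predictor (stand-in base classifier)."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xt)) for xt in X_train]
        return y_train[dists.index(min(dists))]
    return predict


def train_bagging(X, y, n_classifiers=25, seed=0):
    """Train each base classifier independently on its own bootstrap sample."""
    rng = random.Random(seed)
    return [one_nn(*bootstrap(X, y, rng)) for _ in range(n_classifiers)]


def predict_majority(ensemble, x):
    """Combine the base predictions by majority vote."""
    votes = Counter(clf(x) for clf in ensemble)
    return votes.most_common(1)[0][0]


# Toy complete training data (illustrative only).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]
y = ["a", "a", "a", "b", "b", "b"]
ensemble = train_bagging(X, y)
label = predict_majority(ensemble, [0.1, 0.1])
```

Because the base classifiers do not depend on one another, the training loop could be parallelised; an odd ensemble size such as 25 also avoids tied votes in two-class problems.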
Algorithm 3: Testing phase
Apply each TC to reduced X
The testing phase is where new data instances are classified using ensemble of trained classifiers from the training phase. Testing phase steps are shown in Algorithm 3. Selected features SF, new data instance
Experimental setup
In this section, we explain the setup of the experiments. We describe the datasets, the methods used for comparison and the parameters used in the experiments.
Datasets
Experiments were performed on eight real-world incomplete datasets extracted from the UCI machine learning repository [3]. Table 1 summarizes the datasets: name, number of instances, number of features and type of data (real/integer/nominal), percentage of incomplete instances, and number of classes. In this study we used both quantitative and qualitative attributes. Quantitative attributes are easy to handle, whereas qualitative attributes require pre-processing to convert them into quantitative values. The selected datasets contain different percentages of incomplete instances, ranging from 1.98% in the heart-c dataset to 100% in the heart-h dataset. An instance with even a single missing feature value is counted as incomplete. In the heart-h dataset, every instance has at least one missing feature value, so all instances are considered incomplete. The datasets also vary in size, from the mammographic dataset with 961 instances down to the hepatitis dataset with 155 instances, and in dimensionality, from 24 features in the chronic dataset down to 5 in the mammographic dataset. The datasets used to benchmark the proposed method include both binary and multi-class labels.
During experimentation, we used ten-fold cross-validation, in which one fold is used for testing and the remaining nine folds for training, iterating ten times so that each fold serves as the test set once, as seen in Fig. 4. To reduce statistical variation, we repeated the cross-validation 30 times on each dataset.
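The evaluation protocol can be sketched as an index generator. This is a minimal illustration that assumes plain shuffled folds (the text does not say whether the folds were stratified) and uses the repetition number as the shuffle seed; the function names are hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for one shuffled k-fold split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def repeated_k_fold(n, k=10, repeats=30):
    """Repeat k-fold cross-validation with a different shuffle each time."""
    for r in range(repeats):
        yield from k_fold_indices(n, k, seed=r)


# 155 is the number of instances in the hepatitis dataset.
splits = list(repeated_k_fold(n=155))
```

Ten folds repeated 30 times give 300 train/test splits per dataset, and the reported mean and standard deviation would be taken over the accuracies of these splits.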
List of datasets used
Ten-fold cross-validation.
To assess how our proposed IFBag method performs, we compared its accuracy and run time with four benchmark methods. The first two benchmark methods apply imputation to incomplete data before classification. The remaining two benchmark methods use an ensemble approach for classifying incomplete data and thus do not need imputation on new incomplete instances. The details of the four benchmark methods are as follows:
The first benchmark method is named kNNI. Single kNN imputation is used to estimate missing values in both training and test instances. Because the benchmark method provides complete training and test data, any classification algorithm can be used.
The second benchmark method is named MICEI. It uses MICE to estimate missing values in both training and testing instances. The IFBag method and the MICEI benchmark method both use MICE to replace missing values. However, the MICEI method uses MICE to replace missing values in test instances before classifying them, while the IFBag method replaces missing values with static values before classifying.
The third benchmark method is named Ensemble [23]. This is an ensemble method that can classify incomplete data. Both the Ensemble [23] benchmark and the proposed IFBag method apply feature selection to reduce the features in the training data, and both use powerful imputation to turn the incomplete training data into complete training data. The difference is that the IFBag method builds a fixed number of base classifiers, each trained on a sub-dataset made by bootstrap sampling with replacement, so that each instance can be picked more than once. The benchmark method, in contrast, builds a base classifier for each missing value pattern found in the training data. If the training data has a high amount of missing values, this can translate into many base classifiers, which can considerably slow down the method. The proposed method always builds the same number of base classifiers, so its running time does not grow with the number of missing patterns in the training data.
The final benchmark method is named Bagging ensemble [24]. It uses an ensemble of classifiers to predict instances with missing values. Both the IFBag method and the Bagging ensemble [24] method build multiple base classifiers that are trained on sub-datasets made from the training data. These sub-datasets are made using bootstrap sampling with replacement, so that each instance can be picked more than once. The first difference between the two methods is that the benchmark method applies imputation to each incomplete sub-dataset to estimate missing values before training a base classifier, whereas the IFBag method applies imputation to the incomplete training dataset before it is divided into sub-datasets. The second difference is that the benchmark method uses no technique to reduce the number of missing values, so more time is spent estimating values by imputation, whereas the proposed IFBag method uses feature selection, which can reduce the missing values in the incomplete training data and thus the number of values to be estimated, saving time. The comparison is carried out to demonstrate the edge of the IFBag method over this Bagging ensemble [24] benchmark.
Experimentation parameters
Imputation
The experiments were performed using MICE imputation, implemented with the fancyimpute Python package. MICE imputation was used to obtain a complete training dataset by estimating all missing values.
Feature selection
During the experiments, instead of a wrapper method for feature selection we used a filter-based method, which takes less time. To evaluate feature subsets, we used the CFS measure [35], because it evaluates both the lack of correlation between features and the correlation of each feature with the class simultaneously. CFS has also been shown to be as accurate as wrapper methods in most cases while taking less execution time [35]. For searching feature subsets, we used GA, as it has been widely used for feature selection. The GA parameters were set as follows: population size 50, maximum number of generations 100, crossover probability 0.9 and mutation probability 0.1. Feature selection was implemented using the python-weka-wrapper Python package.
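For illustration, a GA search over feature subsets with a CFS-style fitness can be sketched in plain Python. This is not the Weka implementation used in the experiments: binary tournament selection, single-point crossover, bit-flip mutation and keeping the best subset seen so far are all assumptions, as is treating the two probabilities (0.9 and 0.1) as crossover and mutation rates. The correlation values and the names `merit` and `ga_search` are hypothetical.

```python
import random
from math import sqrt


def merit(mask, r_cf, r_ff):
    """CFS-style merit of the feature subset encoded by a 0/1 mask."""
    feats = [i for i, bit in enumerate(mask) if bit]
    k = len(feats)
    if k == 0:
        return 0.0
    mean_cf = sum(r_cf[i] for i in feats) / k
    pairs = [(i, j) for i in feats for j in feats if i < j]
    mean_ff = sum(r_ff[i][j] for i, j in pairs) / len(pairs) if pairs else 0.0
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)


def ga_search(r_cf, r_ff, pop_size=50, generations=100,
              p_crossover=0.9, p_mutation=0.1, seed=0):
    """Search for a high-merit feature subset with a simple genetic algorithm."""
    rng = random.Random(seed)
    n = len(r_cf)

    def fitness(mask):
        return merit(mask, r_cf, r_ff)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            # Binary tournament selection of two parents.
            a = max(rng.sample(pop, 2), key=fitness)
            b = max(rng.sample(pop, 2), key=fitness)
            child = list(a)
            if n > 1 and rng.random() < p_crossover:
                cut = rng.randrange(1, n)  # single-point crossover
                child = a[:cut] + b[cut:]
            # Bit-flip mutation applied independently to each gene.
            child = [1 - g if rng.random() < p_mutation else g for g in child]
            children.append(child)
        pop = children
        best = max(pop + [best], key=fitness)  # remember the best subset so far
    return best, fitness(best)


# Hypothetical correlations: features 0 and 1 are relevant but redundant.
r_cf = [0.6, 0.58, 0.1, 0.5]
r_ff = [
    [1.0, 0.95, 0.1, 0.1],
    [0.95, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
best_mask, best_score = ga_search(r_cf, r_ff)
```

The fitness function only needs precomputed correlations, so no classifier is trained during the search; this is the reason a filter evaluator keeps the search cheap compared with a wrapper.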
Learning algorithms
To compare the proposed IFBag method with the benchmark methods, we used three learning algorithms: a decision tree algorithm (J48), kNN and the multilayer perceptron (MLP). All three classification algorithms are implemented in the Python sklearn package. The number of classifiers in the IFBag method and in Bagging ensemble [24] was set to 25, as in [24]. Ensemble [23], meanwhile, automatically determines the number of classifiers based on the training data provided to it.
Different classification methods accuracy (mean and standard deviation) when using different classifiers
Here, we will discuss the experimental results of the proposed IFBag method in comparison to benchmark methods.
Accuracy comparison
Table 2 presents the classification accuracies of the proposed and benchmark methods, each given as a mean and standard deviation. The first column of the table lists the dataset name and the second column names the classification algorithm used. The accuracy of the proposed IFBag method is given in the third column. The last four columns show the accuracies of the four benchmark methods. The first benchmark method, “kNNI”, uses kNN-based imputation to estimate missing values. The second benchmark method, “MICEI”, uses MICE imputation to estimate missing values. “Ensemble [23]” and “Bagging ensemble [24]” are the two ensemble benchmark methods. Each value in the table is the accuracy obtained by a classification algorithm (row) with a specific method (column) on the given dataset. The values show average classification accuracy
To evaluate the results correctly, we must choose a suitable statistical test. The test must be a multiple-comparison test rather than a pairwise test, as we are comparing the proposed IFBag method with multiple methods, and it should be non-parametric, since non-parametric tests, unlike parametric tests, do not require the data to be normally distributed. We therefore used the Friedman test [39], a popular non-parametric multiple-comparison test, to assess the significance of the results. The Friedman test shows the existence of a significant difference between the results of the methods for each classifier and each dataset. The Holm procedure [39] is then used to perform pairwise tests between two methods. To show that IFBag method is better, the symbol
Figure 5 shows the Precision, Recall and F1 score of all methods on the eight datasets using the J48 classifier. J48 is used because the IFBag method outperforms the other methods more clearly with J48 than with the other classifiers. As seen in Fig. 5, the IFBag method has higher Precision, Recall and F1 score in most cases. Precision is the percentage of correctly predicted positive instances out of all positive predictions. Eq. (3) shows how Precision is calculated, where TP is the number of true positives and FP is the number of false positives.
Recall is calculated using Eq. (4) and gives the percentage of actual positive instances that are correctly predicted as positive. Here TP is the number of true positives and FN is the number of false negatives. Recall divides TP by the sum of TP and FN.
Lastly, the F1 score is the harmonic mean of Precision and Recall and is calculated using Eq. (5): the product of Precision and Recall is divided by their sum, and the result is multiplied by 2.
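The three measures, as described in words for Eqs. (3) to (5), can be written as short functions. The counts below are hypothetical and serve only as a worked example; returning 0.0 for an empty denominator is a convention chosen here, not something stated in the text.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """F1 = 2 * Precision * Recall / (Precision + Recall)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, fn = 8, 2, 4
p = precision(tp, fp)   # 8 / 10
r = recall(tp, fn)      # 8 / 12
f1 = f1_score(p, r)     # equals 2 * TP / (2 * TP + FP + FN) = 16 / 22
```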
Shows comparison of Precision, Recall and F1 score of different methods on eight different datasets.
Figure 6 compares the results of each individual benchmark method with those of the proposed method and shows the percentage of cases in which the proposed method performs better, similarly or worse. IFBag performs better than the other methods in most cases. It achieves similar or better results in 68% of the cases when compared to kNNI, and in 69% of the cases when compared to the MICEI benchmark method.
This is possible due to feature selection, which removes unnecessary and insignificant features. This feature reduction improves the classification accuracy of the IFBag method as compared to the imputation-based methods. In addition, because the IFBag method uses multiple classifiers, it is able to generalize better.
Comparing results of individual benchmark methods with the proposed IFBag method and its performance in percentage.
As seen in Fig. 6, the proposed IFBag method achieves greater accuracy than the other benchmark ensemble methods in most cases. Compared to bagging ensemble [24], the IFBag method is more accurate in 46% and similar in 30% of the cases. Compared to ensemble [23], it is more accurate in 58% and similar in 6% of the cases. The proposed IFBag method is more accurate than bagging ensemble [24] because the feature selection used in IFBag removes redundant and irrelevant features, which improves the quality of the training data. It is more accurate than ensemble [23] because it uses a bagging ensemble to build a set of base classifiers, instead of building base classifiers for each missing value pattern and then combining them by majority voting. Bagging helps reduce variance and avoid overfitting, whereas a voting ensemble simply outputs the majority vote. This complete analysis with different datasets shows that the results of the proposed IFBag method vary with the dataset and classifier used, and that it sometimes gives slightly lower performance than other methods.
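The bagging idea referred to above (resample the training data with replacement, train one base classifier per resample, then aggregate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the IFBag implementation: the base learner is a toy 1-nearest-neighbour classifier rather than J48, kNN or MLP as used in the study, aggregation is a plain plurality vote as in standard bagging, and all function names and data points are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) training rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def train_1nn(sample):
    """Toy base learner (assumption): 1-nearest-neighbour on the given sample."""
    def predict(x):
        nearest = min(
            sample,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
        )
        return nearest[1]
    return predict


def bagging_fit(rows, n_estimators=25, seed=0):
    """Train one base classifier per bootstrap sample."""
    rng = random.Random(seed)
    return [train_1nn(bootstrap_sample(rows, rng)) for _ in range(n_estimators)]


def bagging_predict(models, x):
    """Aggregate the base classifiers by plurality vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Made-up, complete training rows: (features, class label).
rows = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b"),
]
models = bagging_fit(rows)
print(bagging_predict(models, (0.05, 0.05)))
print(bagging_predict(models, (1.0, 1.0)))
```

Because each base classifier sees a slightly different resample, individual mistakes tend to be outvoted, which is the variance-reduction effect the paragraph attributes to bagging.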
Comparison with other classification algorithms
The proposed IFBag method gives better accuracy on datasets with a small number of missing values, and it also gives better accuracy on datasets with many missing values, as can be seen in Table 2. For example, the IFBag method gives better accuracy on the mammographic dataset, which contains only 13.63% incomplete instances. It also gives better accuracy with J48 or kNN on the chronic dataset, which contains as many as 60.5% incomplete instances. Thus, the IFBag method gives better performance in most cases when using the kNN and J48 classifiers.
Comparison of different methods with various incomplete datasets
Table 2 compares the different classification methods on various incomplete datasets. On the heart-c dataset, which has 1.98% incomplete instances, the IFBag method shows considerable improvement over the other methods with the J48 and kNN classifiers and similar performance with MLP. On the chronic dataset, which has 60.5% incomplete instances, the J48 and kNN classifiers show considerable performance gains with the IFBag method, whereas with MLP the IFBag method performs worse than the other benchmark methods. On the credit dataset, which has 5.36% incomplete instances, the IFBag method shows higher performance than the other benchmark methods: the J48 and MLP classifiers show considerable improvement, while kNN shows a small gain over kNNI and MICEI and is slightly worse than the other ensemble methods. On the heart-h dataset, in which 100% of the instances are incomplete, the IFBag method shows an improvement with kNN and J48 but performs worse than the bagging ensemble [24] method. On the housevotes dataset, which contains 46.67% incomplete instances, the IFBag method shows better results with the kNN classifier. On the mammographic dataset, which has 13.63% incomplete instances, both MLP and J48 achieve good performance gains compared to kNN. On the tumor dataset, which has 61.01% incomplete instances, the IFBag method performs worse than the other methods. On the hepatitis dataset, which has 48.39% incomplete instances, the IFBag method performs similarly to the other methods with the different classifiers but never outperforms any of them.
Comparison of different classification algorithms with the proposed IFBag method and its performance in percentage.
Overall, J48 with the IFBag method gives the best accuracy on the heart-c, chronic, housevotes and mammographic datasets. MLP with the IFBag method gives the best accuracy on the credit dataset alone. kNN, meanwhile, does not give better overall performance than the other classification algorithms. However, when only the kNN results are compared across the benchmark methods, IFBag gives the best accuracy on the heart-c, chronic, housevotes, mammographic and heart-h datasets. Figure 7 shows that with J48 the IFBag method gives better performance in 62% and similar performance in 28% of the cases. With kNN, it gives better performance in 63% and similar performance in 15% of the cases. Among the three classification algorithms, MLP performs worst with the IFBag method relative to the benchmark methods: it gives better performance in 38% and similar performance in only 2% of the cases, the lowest among the classifiers used with the IFBag method. Thus, J48 seems to benefit the most from the proposed IFBag method. A possible reason is that feature selection is able to remove most of the irrelevant and redundant features. In addition, because J48 performs feature selection while constructing the classifier, it can be more tolerant of missing values.
Comparison of weighted average computation time (milliseconds) and weighted average classification accuracy using J48 classifier.
Comparison of weighted average computation time (milliseconds) and weighted average classification accuracy using kNN classifier.
Comparison of weighted average computation time (milliseconds) and weighted average classification accuracy using MLP classifier.
When performing classification, we focus on the computation time taken to classify new instances in the testing phase. Figures 8–10 compare the weighted average computation time for classifying instances in the testing phase and the weighted average classification accuracy when using the J48, kNN and MLP classifiers, respectively. Equation (6) was used to find the weighted average computation time and weighted average accuracy of each method, where
In most cases, the proposed IFBag method takes less computation time than the methods that use imputation and than the other ensemble methods. It is clear from Figs 8–10 that the IFBag method is several times faster than MICEI irrespective of which classifier is used. MICEI is slower because MICE, being a multiple imputation method, takes longer to estimate missing values, while the IFBag method does not need to estimate any missing values. Figures 8–10 also show that kNNI is computationally less expensive than the IFBag method, even though IFBag uses no imputation to estimate missing values. The longer computation time of the IFBag method can be attributed to the multiple base classifiers that are applied to classify each instance.
Comparison with other ensemble methods
Figures 8–10 show that the proposed IFBag method is computationally less expensive than the other benchmark methods. This is possible due to the use of feature selection before the classifiers are built. Feature selection removes redundant and irrelevant features from the training data, which leads to simpler classifiers. Feature selection can also reduce the number of missing values in various datasets, thus reducing the need for imputation. Unlike ensemble [11], which builds a classifier for each missing value pattern, the IFBag method builds only a specific number of base classifiers. The IFBag method can therefore classify instances much faster than the other ensemble benchmark methods. It also helps in reducing instances in the testing phase, thus taking less time. In short, the IFBag method not only reduces computation time but also improves accuracy.
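The claim that feature selection can lower the number of missing values can be illustrated with a small sketch: once instances are projected onto the selected features, a value that was missing in a discarded feature no longer makes the instance incomplete. The data, the selected indices and the function names below are made up for illustration; in the study the subset is chosen by CFS with a GA search, which this sketch does not implement.

```python
def is_incomplete(instance):
    """An instance is incomplete if any feature value is missing (None)."""
    return any(value is None for value in instance)


def project(instance, selected):
    """Keep only the features at the selected indices."""
    return [instance[i] for i in selected]


def incomplete_fraction(instances):
    """Fraction of instances that contain at least one missing value."""
    return sum(is_incomplete(row) for row in instances) / len(instances)


# Hypothetical dataset with four features; None marks a missing value.
data = [
    [5.1, None, 1.4, 0.2],
    [4.9, 3.0, None, 0.2],
    [6.2, 2.9, 4.3, None],
    [5.9, 3.0, 5.1, 1.8],
]
selected = [0, 3]  # assumed output of a feature selection step
reduced = [project(row, selected) for row in data]

print(incomplete_fraction(data))     # 0.75
print(incomplete_fraction(reduced))  # 0.25
```

In this toy case three of four instances are incomplete before selection and only one afterwards, because two of the missing values sit in features that were dropped.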
Critical discussion
Here, we discuss whether and how the proposed IFBag method fulfils the desired requirements. The first requirement was that the imputation approach used must give higher accuracy and take less computation time on incomplete data. So, instead of using an imputation method such as MICE in both the training and testing phases, we used it only in the training phase and replaced missing values with a static value in the testing phase. Comparing the IFBag method with kNNI and MICEI in Fig. 8, we see an increase in accuracy of at least 2.51% and 2.49%, respectively, with the J48 classifier, together with a 73.35% decrease in computation time relative to MICEI and an almost similar computation time relative to kNNI. With the kNN classifier, the IFBag method shows an increase in accuracy of at least 3.85% and 3.91% over kNNI and MICEI, respectively, as seen in Fig. 9. The execution time, on the other hand, decreases by 73% relative to MICEI and is comparable to that of kNNI. The MLP classifier results in Fig. 10 show that the IFBag method gives accuracy comparable to kNNI and MICEI while taking 73.68% less execution time than MICEI. Thus, we can conclude from Figs 8–10 that the proposed method meets this requirement.
The second requirement was that using a feature selection method with an ensemble learner should increase accuracy and reduce execution time. The proposed IFBag method used CFS as the evaluator and GA as the search method for feature subsets, and a Bagging ensemble learner for classification. Fig. 8 shows that, compared with Ensemble [23], the IFBag method is 9.88% more accurate while taking 90.39% less computation time when using the J48 classifier. Similarly, with the kNN classifier, the IFBag method increases accuracy by 6.34% and decreases execution time by 90.25% compared to Ensemble [23], as seen in Fig. 9. With the MLP classifier, the IFBag method produced better results, with an 11.88% increase in accuracy and a 90.1% reduction in execution time, as seen in Fig. 10. This comparison shows that the second requirement is met as well.
The last requirement was that, overall, the model must have higher accuracy and be less computationally expensive than the other benchmark ensemble methods. Compared with other ensemble approaches such as Ensemble [23] and Bagging ensemble [24], the IFBag method gives an average increase in accuracy of 9.33% and 7.87%, respectively, while taking on average 90.40% and 97.56% less execution time across the three classifiers. Figures 8–10 show the results with the different classification algorithms, from which we can conclude that the proposed IFBag method performs much better in both accuracy and execution time. Thus, the IFBag method achieved all the requirements set for this study.
Conclusion and future work
In data mining, classifying incomplete data is a challenging task. This study therefore proposed the IFBag method, which classifies incomplete data using imputation, feature selection and ensemble learning. The proposed IFBag method handles incomplete data by imputing the training data to obtain complete training data. Feature selection is then applied to remove unnecessary and insignificant features from the complete dataset, giving a reduced complete dataset. The IFBag method then uses bootstrapping, which introduces diversity into the training sub-datasets, to build an ensemble of classifiers that classifies incomplete instances. Combining these ideas helps in obtaining a diverse set of classifiers that is more accurate and computationally less expensive. Experiments were conducted to compare four benchmark methods with the proposed IFBag method in terms of classification accuracy and computation time. Of the four benchmark methods, two use imputation in both the training and testing phases with a single classifier. The other two use ensemble learning with imputation only in the training phase to classify incomplete data. The results indicate that the proposed IFBag method achieved higher classification accuracy while taking comparatively less computation time in most cases compared to the other benchmark methods.
In future work, we would like to further investigate why datasets with multiple classes give lower accuracy during classification with both IFBag method and benchmark methods.
Acknowledgments
First of all, thanks to Allah Almighty who gives me strength and confidence to complete this paper. After that, I would like to express my profound appreciation to my supervisor, Dr. Basit Raza, for providing guidance and feedback throughout this paper. I would also like to thank my parents for their love and support throughout my life, without whom, I would have never been in a position to do this research.
