A fuzzy-based methodology for accurate classification and prediction in large datasets

Abstract

Data mining and machine learning methods have been used successfully to identify and forecast meaningful patterns in data repositories from diverse application domains. However, the high number of dimensions and instances in large datasets poses great technical challenges to existing methods of classification and prediction. The presence of noisy data and missing values makes accurate prediction even harder. A number of hybrid methodologies combining dimensionality reduction, feature selection and noise removal have been proposed in the literature. However, the majority of these techniques force analysts to compromise on the accuracy of classification and prediction results. There is therefore a strong need for a methodology that not only scales well with the size and volume of data but also provides near-accurate classification and prediction results by effectively handling noise in the data variables. This paper proposes a fuzzy-based methodology which ranks the dimensions in order of importance and exploits Fuzzy Nearest Neighbor (FNN) approaches for accurate classification and prediction. An experimental evaluation on real-world datasets, taken from the UCI machine learning repository, shows that the proposed approach outperforms existing classification and prediction methods by employing only a subset of important features to achieve high prediction accuracy rates at multiple levels of data abstraction.

Keywords

Classification; Fuzzy nearest neighbor; Prediction; Large datasets; Feature selection; Pattern recognition

1 Introduction

In the pursuit of discovering useful and interesting patterns in large datasets, data mining and machine learning methods have been used extensively in a variety of application domains [1–7]. These methods have not only enabled analysts to discover strongly associated patterns but also empowered them to predict future trends. Association rule mining and classification are two well-known techniques in the machine learning and pattern recognition area [2, 9]. Association rule mining allows analysts to find hidden correlations among the data variables, whereas classification permits labelling and prediction of unseen instances based on previously seen data. The two problems have been dealt with separately in the literature: rule mining techniques aim to find strong association rules which help in understanding the underlying data, whereas classification focuses on predicting the appropriate class of new instances.

Recently, researchers have started using these techniques in a hybrid fashion [5, 10–16]. This integrated use is termed associative classification. Its purpose is to provide a set of useful associations from which classification models can learn and predict new instances more effectively. Although such integrated use of methods has produced meaningful and interesting patterns, a number of issues remain unresolved in the previously reported work [2, 17].

Firstly, the previously reported mechanism for ranking dimensions, which filters out less important features from a large feature set, is not exhaustive enough to identify the features that play a vital role in achieving accurate predictions.

Secondly, the prediction accuracies achieved through associative classification, namely Predictive Apriori, at multiple levels of data abstraction are extremely low. Such low accuracy rates indicate that the associations are interesting for describing the underlying data, but even such strong associations are not precise enough to train a classification model for accurate prediction of new unlabeled data. Lastly, the Apriori-based prediction methods exploited in the past are sensitive to noisy data and to variables with a large number of missing values. Real-world datasets usually exhibit both, so such Apriori-based methods are poorly suited to predicting future trends and patterns in large datasets.

This paper proposes a fuzzy-based methodology to overcome the aforementioned issues. To select and rank the important features, the data distribution of each variable is tested for normality using statistical tests. Furthermore, Fuzzy Nearest Neighbor (FNN) based approaches, an improvement on the well-known KNN method, have been adopted to address the issue of noisy data, as these methods have been reported to handle noisy data efficiently [18–21]. This combined use of normality-based ranking and fuzzy-based classification permits analysts to achieve high accuracy rates efficiently with a limited number of highly ranked, informative dimensions.

An experimental evaluation performed on real-world datasets, namely Adult and ForestCoverType taken from the UCI machine learning repository [22], validates that the proposed methodology meets its objectives by allowing analysts to predict patterns at various levels of data abstraction with high precision and accuracy in a timely manner.

The rest of the paper is organized as follows: Section 2 presents the review of similar approaches reported in the literature. Section 3 gives an overview of the proposed methodology, followed by Section 4 covering the experimental results and findings along with the discussion. Finally, Section 5 presents the conclusion and possible future work in this direction.

2 Review of related work

In this section, we provide a brief review of related work. It is important to highlight that the majority of past work deals with the classification and association problems separately. In this review, however, we identify the work which relates closely to our proposal and the aspects in which the proposed work differs from it.

Brunzel et al. [17] proposed a methodology for feature reduction in order to improve prediction accuracy rates. The authors emphasized that feature reduction is a very important step for processing data efficiently and that selecting informative variables enables classification algorithms to achieve higher prediction accuracy. Their proposal was to use linear transformations of the data variables. Such transformations create linear combinations of the original features, which were observed to be better approximated by a normal distribution. In this paper, the normality of data variables is considered, but linear transformation of the variables is not suggested, as it requires manual inspection of each variable afterwards. For high-dimensional data, it would be laborious to first perform the transformations and then visually inspect each of them.

A similar approach of selecting the important features at the data pre-processing stage was proposed in [23]. The information gain measure was used to rank the features and to improve classification accuracy by selecting the top-ranked, highly important variables. Although the authors reported improvements as empirical findings, the accuracies did not improve consistently across diverse datasets. For instance, the J48 classification algorithm achieved 95% accuracy on the IRIS data set after information gain based pre-processing. On the other hand, the same pre-processing was not able to improve the accuracy rate on a separate dataset using the same classifier. We argue that there should be a pre-processing method that allows consistent results on diverse datasets. However, we are aware that this goal is extremely hard to achieve given the different nature of data in various application domains.

The authors in [24] emphasized that accuracy rates as well as the training time needed to build the prediction models are key factors in evaluating any classification algorithm. They conducted a study on model building time for Bayesian and decision tree based approaches. It was identified that the Bayesian approaches are more robust than the tree based approaches; however, analysts have to compromise on accuracy rates with the Bayesian approach. We also believe that achieving the highest accuracy on a given dataset is one important aspect of evaluation but that the time required for training is equally vital. In this paper, we record the training time of a number of classifiers to highlight which classifier works best for large datasets.

In a different approach, the authors in [25] proposed a feature selection method for classification algorithms that exploits hierarchical relations between features. The method uses correlation and information gain to select the best features for the classification algorithms. It works in two steps: the first step removes the redundant features along the hierarchy's paths, and the second prunes the remaining set based on each feature's predictive power. This approach provided better classification accuracy when tested on 5 real-world datasets. The authors suggest that the methodology may be extended to association rule mining, clustering and outlier detection.

Instead of feature reduction, the authors of [26] used hierarchical classification in an Automated Essay Scoring system in order to achieve better classification accuracy. They found that classification algorithms provide a notable advantage in such systems when applied at multiple levels of a hierarchy, and suggested using different variables at different levels of the hierarchy when applying classification algorithms to evaluate the scoring. They tested their approach on a data set of 1243 and found that their model provides 92% adjacent accuracy between the predicted scores and human scores. The proposal lacks validation on datasets from diverse application domains.

Naeini et al. [27] worked on improving the Correct Classification Rate (CCR) of ordinary classifiers in complex classification tasks. The authors suggest that classification algorithms perform better when applied at multiple levels. Their methodology improves the CCR of an ordinary classifier and is also robust to noise. The model was tested on a real data set and achieved an improvement of 10%. In this paper, we also apply the fuzzy-based methodology at multiple levels of data abstraction. Moreover, our proposal is more appropriate for diverse datasets than the work in [27].

The authors in [28] identified that most classification algorithms do not provide good accuracy on huge data sets containing several groups of data. They proposed that K-means clustering may be used to create different clusters of data, with classification algorithms then applied at the level of each cluster. They tested the approach with the Naïve Bayes algorithm, which resulted in better accuracy than its application to the whole data set. However, it is not evident whether using K-means clustering with other classification algorithms also improves the accuracy of those algorithms. We also believe that accuracies are better when they are tested at multiple hierarchical levels. In this regard, our proposal differs from the work in [28] in that we use a hierarchical clustering algorithm instead of the K-means algorithm. Moreover, we apply a variety of classification algorithms at various hierarchical levels and evaluate the effect of each algorithm at multiple levels of data abstraction.

We conclude this section by reporting the closely related work which also formed the basis of our proposal. The work in [2] applied hierarchical clustering to generate groups of data at hierarchical data levels. The authors used Predictive Apriori to generate association rules at multiple levels of the hierarchy. The importance of the generated rules was tested using statistical measures, and it was shown that the generated rules are informative and statistically significant. However, the proposal lacked evidence of prediction accuracy. The reported accuracy results relied on a single classification algorithm, which was not enough to establish the accuracy of the results. Moreover, the authors compared the accuracy of the algorithm on raw data and in the presence of a schema structure.

We consider their work substantial in terms of discovering diverse association rules. However, we strongly argue that the accuracy rates achieved with their proposal were not adequate for acceptance. Traditional classification algorithms produce more accurate results on the datasets they used, but there is a need to evaluate the accuracy rates of these algorithms at different levels of the hierarchy and with the diverse data present in different data clusters. Moreover, there is a need to evaluate different classification algorithms with respect to model building time.

Like the authors of [24], we also believe that the training time of a classifier is a vital aspect of its evaluation. For large datasets in particular, it has to be tested properly so that a classifier which works well at that scale can be identified. This motivated us to develop a methodology which not only selects important features for prediction but also achieves high accuracy rates at different levels of data abstraction. In the next section, we present an overview of our proposed methodology to highlight the methods used for classification and prediction on large datasets.

3 Proposed fuzzy-based methodology for accurate classification and prediction

This section explains the methodological steps which assist data analysts towards precise classification and prediction. Figure 1 depicts the steps of the proposed methodology. An overview of each step is given in the following sub-sections.

3.1 Apply hierarchical clustering

In the first step, agglomerative hierarchical clustering is applied on the numeric variables of the given dataset. The purpose of using the clustering technique at the first step is to find natural groupings in data at various hierarchical levels. After obtaining the clusters, proper abbreviations are given. For instance, when the dataset splits into two clusters then one is abbreviated as C1 and the other is C2. Similarly, when C1 cluster splits into two child clusters then the child clusters are named as C11 and C12.
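The clustering and labelling step can be illustrated with a minimal sketch. It is not the tool used in the experiments; it assumes average linkage and Euclidean distance (the linkage criterion is not stated here), and the names agglomerate and label_clusters as well as the toy points are hypothetical. The naive merge loop is only suitable for very small inputs.

```python
import math
from itertools import combinations

def agglomerate(points):
    """Naive agglomerative clustering with average linkage (an assumed choice).

    Returns the root of a binary merge tree. Each node is a tuple
    (member_indices, left_child, right_child); leaves have no children.
    """
    nodes = [((i,), None, None) for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(math.dist(points[i], points[j]) for i in a[0] for j in b[0])
        return total / (len(a[0]) * len(b[0]))

    while len(nodes) > 1:
        a, b = min(combinations(nodes, 2), key=lambda pair: linkage(*pair))
        nodes = [n for n in nodes if n is not a and n is not b]
        nodes.append((a[0] + b[0], a, b))
    return nodes[0]

def label_clusters(root, levels):
    """Name clusters top-down: the dataset splits into C1 and C2,
    C1 splits into C11 and C12, and so on, down to `levels` levels."""
    labels = {}

    def walk(node, name, level):
        members, left, right = node
        if name:
            labels[name] = sorted(members)
        if level == levels or left is None:
            return
        prefix = name or "C"
        walk(left, prefix + "1", level + 1)
        walk(right, prefix + "2", level + 1)

    walk(root, "", 0)
    return labels

# Toy numeric records forming two well-separated groups (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = label_clusters(agglomerate(points), levels=2)
for name in sorted(labels):
    print(name, labels[name])
```

Which child receives the suffix 1 and which receives 2 is arbitrary in this sketch; only the parent-child naming convention matters.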

3.2 Rank variables on normality distribution

The second step of the proposed methodology is to rank the variables present in each data cluster. To do so, we apply the Kolmogorov-Smirnov (K-S) test to check the distribution of each variable and pick the ones which pass the p-value thresholds. Our ranking process is based on the normal distribution, which describes data in a variable that is symmetrical with a single central peak at the mean of the data. The shape of the normal distribution curve is described as bell-shaped; it is also termed the Gaussian curve. In a perfect normal distribution, fifty percent of the distribution lies to the left of the mean and fifty percent lies to the right. The mean, median and mode are equal in a normal distribution. Figure 2 shows a perfect normal distribution curve. In cases where multiple variables return similar calculated values for the K-S test, we plot histograms to visually inspect the distribution curves and rank higher the variables whose curves are closer to the bell shape.
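As an illustration of the ranking idea, the following sketch computes the one-sample K-S statistic D of each variable against a normal distribution fitted to that variable, and orders the variables by increasing D. It is a simplification under stated assumptions: it ranks on D alone rather than on p-value thresholds, the function names are hypothetical, and the three columns are synthetic.

```python
import math
import statistics

def ks_statistic_normal(values):
    """K-S statistic D between the empirical distribution of `values` and a
    normal distribution whose mean and standard deviation are estimated
    from the same sample."""
    xs = sorted(values)
    n = len(xs)
    mu = statistics.fmean(xs)
    sigma = statistics.stdev(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
        # Largest gap just after and just before the empirical step at x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def rank_by_normality(columns):
    """Order variable names from most to least normal-looking (small D first)."""
    return sorted(columns, key=lambda name: ks_statistic_normal(columns[name]))

# Synthetic columns (hypothetical): bell-shaped, flat, and strongly right-skewed.
bell = [statistics.NormalDist(40, 10).inv_cdf((i + 0.5) / 50) for i in range(50)]
columns = {
    "bell": bell,
    "flat": [float(i) for i in range(50)],
    "skewed": [math.exp(i / 8) for i in range(50)],
}
ranking = rank_by_normality(columns)
print(ranking)
```

A statistics package would normally be used to obtain the p-values on which the thresholds are applied; the sketch only shows how the distance to a fitted normal curve can drive the ordering.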

3.3 Build fuzzy classification model

In this step, we use the ranking obtained in the previous step to build classification models with various fuzzy based classifiers [18–21]. The justification for using fuzzy classifiers at this step is that these nearest neighbor (NN) based fuzzy classifiers handle noisy data more efficiently than conventional classifiers. Moreover, they tend to predict more accurately because they exploit a weighting mechanism that gives more importance to the neighboring objects which are closer to the selected object.
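The weighting mechanism can be sketched as follows. This is an assumption-laden illustration rather than the classifiers evaluated in this paper: it uses the classic distance-weighted fuzzy k-nearest-neighbor formulation with crisp training labels, where each of the k neighbors contributes a weight of 1/d^(2/(m-1)) to its own class. The function names, the fuzzifier value m = 2, and the toy records are hypothetical.

```python
import math
from collections import defaultdict

def fuzzy_knn_memberships(train, query, k=3, m=2.0, eps=1e-9):
    """Return {label: membership} for `query`; memberships sum to 1.

    train: list of (feature_tuple, label) pairs with crisp labels.
    m:     fuzzifier (> 1); larger values flatten the distance weighting.
    """
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    weights = defaultdict(float)
    for features, label in neighbours:
        d = math.dist(features, query)
        # Closer neighbours receive larger weights; eps avoids division by zero.
        weights[label] += 1.0 / (d ** (2.0 / (m - 1.0)) + eps)
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

def fuzzy_knn_predict(train, query, k=3, m=2.0):
    memberships = fuzzy_knn_memberships(train, query, k, m)
    return max(memberships, key=memberships.get)

# Toy training set (hypothetical); the last record is a mislabelled, noisy point.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"), ((5.5, 4.5), "B"),
    ((1.1, 1.3), "B"),
]
query = (1.05, 1.0)
memberships = fuzzy_knn_memberships(train, query, k=3)
prediction = fuzzy_knn_predict(train, query, k=3)
print(prediction, memberships)
```

In the example the noisy neighbor is among the three nearest records, but its contribution is small because it lies farther from the query than the two correctly labelled ones.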

3.4 Evaluate fuzzy classification models

After building the fuzzy classification models on training data, the models are evaluated with test data. At this stage, we use 10-fold cross validation for evaluating the models. We calculate the accuracy rate of each model, but we prefer to select a model that not only gives a high accuracy rate on test data but also takes the least time compared to the other fuzzy classifiers. The reason for considering time as a selection factor is that many models predict accurately when the dataset is small but are unable to handle high-dimensional and large data volumes. Therefore, in this step the model with the lowest model building time and an acceptable accuracy rate is selected for final prediction.
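The evaluation and selection logic of this step can be sketched as below. The sketch makes several assumptions that are not part of the methodology as stated: folds are contiguous and unshuffled, "acceptable accuracy" is interpreted as being within a tolerance of the best accuracy, and the model names and figures in the results table are purely hypothetical.

```python
import time
from collections import Counter

def k_fold_splits(n, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds over n records."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def cross_validate(build, predict, rows, k=10):
    """Return (accuracy, total_build_seconds) over k folds.

    rows: list of (features, label); build(train_rows) -> model;
    predict(model, features) -> label.
    """
    correct, build_seconds = 0, 0.0
    for train_idx, test_idx in k_fold_splits(len(rows), k):
        train_rows = [rows[i] for i in train_idx]
        started = time.perf_counter()
        model = build(train_rows)
        build_seconds += time.perf_counter() - started
        correct += sum(predict(model, rows[i][0]) == rows[i][1] for i in test_idx)
    return correct / len(rows), build_seconds

def select_model(results, tolerance=0.02):
    """results: {name: (accuracy, build_time)}. Among models whose accuracy is
    within `tolerance` of the best one, return the fastest to build."""
    best = max(acc for acc, _ in results.values())
    acceptable = {n: t for n, (acc, t) in results.items() if best - acc <= tolerance}
    return min(acceptable, key=acceptable.get)

# Demonstration with a trivial majority-class "classifier" on toy rows.
rows = [((float(i),), "no" if i % 4 == 0 else "yes") for i in range(40)]
build = lambda train_rows: Counter(lbl for _, lbl in train_rows).most_common(1)[0][0]
predict = lambda model, features: model
accuracy, build_seconds = cross_validate(build, predict, rows, k=10)

# Hypothetical (accuracy, build time in seconds) pairs, for illustration only.
results = {"model_a": (0.77, 0.060), "model_b": (0.76, 0.040), "model_c": (0.70, 0.010)}
chosen = select_model(results)
print(accuracy, chosen)
```

The tolerance makes the trade-off explicit: a slightly less accurate model is preferred when it is cheaper to build, while a fast but clearly less accurate model is excluded.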

3.5 Predict unseen instances via best model

In the last step of the proposed methodology, prediction is carried out using the best model selected in the previous step. It is important to clarify that the prediction accuracies in the previous step were calculated on the basis of test data. In the last step, however, a validation set, a separate portion of the data held out only for validation, is used for the final prediction.
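A minimal sketch of holding out such a validation set is given below. The 20% fraction, the fixed seed and the function name hold_out_split are assumptions made for illustration; the proportion actually held out is not specified here.

```python
import random

def hold_out_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and set aside a validation portion that is
    never used while building or cross-validating the models."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * validation_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for labelled records
modelling_rows, validation_rows = hold_out_split(rows, validation_fraction=0.2)
print(len(modelling_rows), len(validation_rows))
```

The modelling portion would then be passed to cross validation, and the validation portion would only be touched once, by the selected model.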

It can be seen from the methodological steps that the proposed methodology enables analysts to adopt the most suitable fuzzy based classifier for a given dataset. Moreover, the methodology achieves its objective of providing ranked variables to analysts as part of the feature selection process, so that unseen data can be predicted accurately and efficiently.

4 Experimental results on real world datasets

In this section, experimental results on two real-world datasets, taken from the UCI Machine Learning repository, are presented to validate our proposal.

4.1 First dataset – Adult

For the first experiment, the Adult [29] dataset was selected. The data set has a mixture of numerical and nominal variables (13 in total). The whole data set contains 48,842 records. Eight variables of the dataset are nominal and the remaining five variables are numeric. Numeric variables include Age, Fnl_Wgt, Edu_Num, Cap_Gain, Cap_Loss, Hrs_per_week. The details of the nominal variables, along with the distinct values present in each variable, can be found in the aforementioned reference. Since the available data set contains missing values, we removed the records with missing values and used about 62% of the total records (30,162) for our case study.

In the first step, we apply agglomerative hierarchical clustering to all data based on the numerical variables. We used the HCE Explorer tool, proposed by Rosario in [30], to perform hierarchical clustering. Clusters at different levels of hierarchy or data abstraction were generated and their respective values were stored for further processing. The tool generates a dendrogram structure in which each level presents a set of clusters. These clusters split further into smaller clusters at lower levels. As explained in the previous section, we labeled each cluster with a simple abbreviation. For instance, the dataset is split into C1 and C2 at the first level, and at the next level C2 is split into C21 and C22.

In our next step, variables were ranked in each cluster at multiple levels. This process helps to select the highly ranked variables in the cluster under analysis. We rank variables based on the results of the normality test, i.e. the K-S test. If a variable contains data which satisfies the normality test and has lower skewness than the other variables, it is ranked higher. We review all variables having the same calculated values for the K-S test by their normality curves and pick the top three variables from the whole set for training the model. Figure 3 shows the histogram of the Age variable, which passed the K-S test.
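The tie-breaking rule can be approximated in code. In the sketch below, variables whose K-S values coincide after rounding are ordered by the absolute value of their sample skewness, which stands in for the visual inspection of normality curves; this automation, the rounding precision, and all names and values are assumptions introduced for illustration.

```python
def skewness(values):
    """Moment-based sample skewness: third central moment divided by the
    1.5 power of the second central moment. Zero for symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def top_ranked(ks_values, columns, top_n=3, digits=2):
    """Sort variables by K-S value (rounded to `digits`); break ties by
    lower absolute skewness; keep the first `top_n`."""
    def key(name):
        return (round(ks_values[name], digits), abs(skewness(columns[name])))
    return sorted(ks_values, key=key)[:top_n]

# Hypothetical variables and K-S values, chosen so that three of them tie.
columns = {
    "var_a": [1, 2, 3, 4, 5],    # symmetric
    "var_b": [1, 1, 1, 2, 10],   # strongly right-skewed
    "var_c": [1, 2, 3, 4, 6],    # mildly right-skewed
    "var_d": [1, 1, 1, 1, 20],
}
ks_values = {"var_a": 0.10, "var_b": 0.10, "var_c": 0.104, "var_d": 0.30}
selected = top_ranked(ks_values, columns)
print(selected)
```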

Similarly, the other variables are ranked in each cluster at multiple levels of the hierarchy. The rankings of all variables for clusters C2, C21, and C22 are shown in Fig. 4.

For the Adult dataset, we find that, in cluster C1, Occupation, Education, Work Class and Relationship are the nominal variables which are ranked on top and thus can be used to build the training models for classification for this particular cluster.

As per the next step of our proposed methodology, the top-ranked variables from each cluster are then used to build the fuzzy classification models. We use state-of-the-art fuzzy classifiers to build our models in the Weka tool. However, in order to evaluate our models fairly, three well-known and widely used classifiers from three diverse categories have been chosen. From the tree based category, J48 has been selected. Likewise, Predictive Apriori and Naïve Bayes have been picked to evaluate and to show the strength of the fuzzy based classification models.

The next step of the proposed methodology is to evaluate the classifiers. As described in Section 3, we suggest evaluating the classifiers in terms of percentage accuracy using test data and model building time using training data.

Firstly, we looked into the percentage accuracy for the three clusters C1, C11 and C12 of the Adult dataset. The results are given in Fig. 5.

In Fig. 5, fuzzy based classifiers are depicted in blue and all other classifiers are shown in red. It can be seen from Fig. 5 that the fuzzy based classifiers outperform the classifiers taken from the tree based and rule based categories in each cluster.

For instance, Vaguely Quantified Nearest Neighbor (VQNN) achieves 77% accuracy for cluster C1, whereas J48, Naïve Bayes and Apriori achieve accuracies of 76%, 71% and 41% respectively. VQNN appears to be the best choice as far as accuracy is concerned, as it outperformed all the other classifiers in the three clusters.

It is important to point out that the results in Fig. 5 were generated by predicting the Work_Class variable using the 3 top-ranked variables for each cluster. In order to highlight the effectiveness of our proposed ranking procedure based on normality tests, we performed another experiment to predict the same Work_Class variable using the complete set of variables instead of the top 3 ranked variables. The results of this experiment are given in Fig. 6.

It is clear from Fig. 6 that our proposed ranking procedure does not reduce the percentage accuracies in any given cluster. The accuracies achieved by any of the classifiers using all the variables can also be achieved with the limited number of top-ranked variables. This not only improves the performance of the classifiers but also preserves the precision of prediction. Again, we can see that fuzzy based classifiers such as VQNN, FRNN and OWANN perform significantly well with fewer variables.

For example, in cluster C1 the VQNN classifier provides 76.84% accuracy using all variables to predict the class variable, and almost the same accuracy is achieved with the top-ranked variables. This clearly shows the impact of our ranking procedure, which provides feature selection without compromising accuracy. It is important to highlight that our proposed ranking mechanism provides acceptable results not only for fuzzy based classifiers but for other classifiers too. However, we argue that the ranked set of variables alone is not adequate for selecting the best classifier. There is a need to evaluate the model building time of each classifier to identify the one which can work for large datasets.

After determining the top fuzzy classifiers in terms of accuracy rates, and ensuring that the top-ranked variables are significant enough to achieve the best accuracies, the next step of our proposal is to compare the model building time of these fuzzy classifiers in order to select the one that not only predicts accurately but also takes a reasonable time to build from the training data. We purposely do not compare the training time of the non-fuzzy classifiers, as those did not provide the highest classification rates. Figure 7 depicts the model building time of the fuzzy based classifiers. The time has been calculated using only the 3 top-ranked variables to predict the class variable.

It can be seen from Fig. 7 that the fuzzy classifiers vary with respect to model building time. For instance, VQNN, which had the highest accuracy for the three clusters, is not the most efficient classifier. It takes 20 ms more than the FuzzyNN and OWANN classifiers. Similarly, it is evident that FRNN, which also achieved high accuracy rates, takes twice the time of FuzzyNN and OWANN. We emphasize that, when selecting a particular fuzzy classifier, the training time must be considered so that the algorithm with the best performance is selected for classification and prediction on large datasets.

If we compare the accuracies of VQNN and FuzzyNN in Fig. 5, it can be seen that in each cluster the difference in accuracy is not more than 2%. In other words, FuzzyNN achieves nearly the same accuracy for each cluster of the Adult dataset, but the time it takes is considerably lower than that of VQNN. Therefore, for prediction and classification on large datasets, FuzzyNN is selected as the best candidate for predicting the unseen instances, which is the last step of our proposed methodology.

4.2 Second dataset – ForestCoverType

For the second experiment, we have taken a much larger dataset, namely ForestCoverType [31] to validate our proposed methodology. In this section, we will focus on the results and findings. Due to lack of space the details of the implementation have been purposely skipped. However, our implementation is in line with the first experiment on Adult dataset.

The ForestCoverType dataset has a mixture of numerical and nominal variables (13 in total). The whole dataset contains 581,012 records. Ten variables of the dataset are numeric and the remaining three variables are nominal. Numeric variables include Elevation, Aspect, Slope, Horizontal Distance To Hydrology, Vertical Distance To Hydrology, Horizontal Distance To Roadways, Hillshade 9 am, Hillshade Noon, Hillshade 3 pm, Horizontal Distance To Fire Points. We show the nominal variables and their distinct values in Table 1.

For the first step of applying hierarchical clustering, we used stratified random sampling to obtain an unbiased sample in view of the unbalanced nature of the class distribution in the dataset. In this process we divided the records into homogeneous subgroups, defined by the cover type class variable prior to sampling, thus improving the representativeness of each class in the sample. We used a sample size of 40,000 records to generate clusters at multiple levels of data abstraction. Although a stratified sampling method was adopted to produce a sample that retains the diversity of classes, the sample size was rather small at approximately 7% of the total volume of records. As a consequence, overall information loss was high enough to prevent the accurate depiction of all trends and patterns which exist in the dataset.
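Stratified random sampling of this kind can be sketched as follows. The sketch assumes proportional allocation, in which each class contributes to the sample in proportion to its share of the records, with at least one record per class; the allocation scheme, the function name and the toy class labels are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, class_of, sample_size, seed=0):
    """Draw a sample in which each class keeps roughly its original share.

    records:  list of records
    class_of: function mapping a record to its class label
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[class_of(record)].append(record)
    sample = []
    for members in strata.values():
        # Proportional quota, but never fewer than one record per class.
        quota = max(1, round(sample_size * len(members) / len(records)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Toy unbalanced dataset with hypothetical class labels.
records = ([(i, "type_a") for i in range(900)]
           + [(i, "type_b") for i in range(90)]
           + [(i, "type_c") for i in range(10)])
sample = stratified_sample(records, class_of=lambda rec: rec[1], sample_size=100)
counts = Counter(label for _, label in sample)
print(counts)
```

Because quotas are rounded per class, the total can differ slightly from the requested size on other inputs; here the shares divide evenly.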

After generating the dendrogram from the small sample, we distributed the original (non-sampled) data by allocating each record to the cluster whose centroid was closest to that record in Euclidean terms. The dendrogram was populated with the entire dataset in order to eliminate the problem of small volume preventing the identification of certain patterns.
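The allocation of the full dataset to the sample-derived clusters can be illustrated with a short sketch. The cluster contents and records below are toy values, and the function names are hypothetical; only the rule itself, nearest centroid by Euclidean distance, is taken from the description above.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    return tuple(sum(component) / len(points) for component in zip(*points))

def assign_to_clusters(sample_clusters, records):
    """Allocate every record to the cluster with the nearest centroid.

    sample_clusters: {cluster_name: [points from the sample]}
    records:         full-dataset points to distribute
    """
    centroids = {name: centroid(pts) for name, pts in sample_clusters.items()}
    assignment = {name: [] for name in centroids}
    for record in records:
        nearest = min(centroids, key=lambda name: math.dist(centroids[name], record))
        assignment[nearest].append(record)
    return assignment

# Clusters found on the sample (toy values) and the records to distribute.
sample_clusters = {"C1": [(0.0, 0.0), (2.0, 0.0)], "C2": [(10.0, 10.0), (12.0, 10.0)]}
records = [(1.0, 1.0), (9.0, 9.0), (3.0, 2.0), (12.0, 12.0)]
assignment = assign_to_clusters(sample_clusters, records)
print(assignment)
```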

As in the previous experiment, the clusters were labelled and the variables in each cluster were ranked on the basis of the outcome of the non-parametric normality test. Figure 8 presents the ranking obtained after applying the K-S test of normality.

The top ranked variables from each cluster are then used to build the fuzzy classification models.

The next step of the proposed methodology is to evaluate the classifiers. As with the previous dataset, we suggest evaluating the classifiers in terms of percentage accuracy using test data and model building time using training data.

Firstly, we looked into the percentage accuracy for the three clusters C1, C11 and C12 of the ForestCoverType dataset. The results are given in Fig. 9.

Fig. 9 shows that the fuzzy based classifiers outperform the classifiers taken from the tree based and rule based categories in each cluster for this larger dataset as well.

As in the previous study, Vaguely Quantified Nearest Neighbor (VQNN) achieves the highest accuracy, 85.55% for cluster C1, whereas J48, Naïve Bayes and Apriori achieve accuracies of 84%, 74% and 64% respectively. VQNN again appears to be the best choice as far as accuracy is concerned, as it outperformed all the other classifiers in the three clusters.

We have already shown in the previous case study that our approach of using top-ranked variables for classification works well with all classifiers. We are able to achieve almost the same accuracy with all classifiers using the 3 top-ranked variables. Due to lack of space, we do not provide a comparison of both cases for this dataset.

Again, we observed in this study that fuzzy based classifiers such as VQNN, FRNN and OWANN perform significantly well with fewer variables.

After determining the top fuzzy classifiers in terms of accuracy rates, and ensuring that the top-ranked variables are significant enough to achieve the best accuracies, the next step of our proposal is to compare the model building time of these fuzzy classifiers in order to select the one that not only predicts accurately but also takes a reasonable time to build from the training data. In this study, we also compared the training time of the non-fuzzy classifiers, in order to highlight the fact that non-fuzzy classifiers such as J48 and Naïve Bayes require much longer than fuzzy classifiers. For instance, J48 takes 1320 ms to build the model with only the 3 top-ranked variables. Figure 10 depicts the model building time of the fuzzy based classifiers. The time has been calculated using only the 3 top-ranked variables to predict the class variable.

It can be seen from Fig. 10 that the fuzzy classifiers vary with respect to model building time. For instance, VQNN, which had the highest accuracy for the three clusters, is not the most efficient classifier. It takes 40 ms more than the FRNN classifier. We emphasize that, when selecting a particular fuzzy classifier, the training time must be considered so that the algorithm with the best performance is selected for classification and prediction on large datasets.

If we compare the accuracies of VQNN and FRNN in Fig. 9, it can be seen that in each cluster the difference in accuracy is not more than 1%. In other words, FRNN achieves nearly the same accuracy for each cluster of the ForestCoverType dataset, but the time it takes is considerably lower than that of VQNN. Therefore, for prediction and classification on large datasets, FRNN is selected as the best candidate for predicting the unseen instances, which is the last step of our proposed methodology.

The experiments performed on two large datasets from the UCI machine learning repository show that our proposed methodology achieves its two major objectives. Firstly, it allows analysts to pick the most suitable features from the data for the purpose of classification and prediction. Secondly, it enables analysts to evaluate the fuzzy based classifiers in a realistic manner and to adopt the one which not only gives high accuracy rates but also computes the results efficiently. We believe that for large datasets it is very important to process the classification problem efficiently and, to the best of our knowledge, no previous study has evaluated fuzzy based classifiers against the other predominant classification methods to highlight the effectiveness of prediction in large datasets.

5 Conclusion and future work

In the last few decades, both data mining and machine learning algorithms have been exploited for the purpose of accurate classification and prediction. However, these algorithms face a number of technical challenges when applied in domains where data is available in high volume. Real world datasets are typical examples of large datasets and, in addition to their sheer size, these datasets contain noisy data variables and missing values. These aspects of the data make it really challenging for data analysts and algorithm developers to handle such problems effectively. A number of hybrid methodologies combining dimensionality reduction, feature selection and noise removal methods have been proposed in the literature. However, the majority of these techniques force the analysts to compromise on the accuracy of classification and prediction results.

This paper identifies the need for a methodology which allows the best possible features to be selected from the data in order to classify and predict unseen data efficiently. Such a methodology has been proposed in this paper; it not only scales well with the sheer size and volume of data but also provides near-accurate classification and prediction results using fuzzy classifiers. The proposed fuzzy-based methodology ranks the dimensions in order of importance using statistical tests and exploits Fuzzy Nearest Neighbor (FNN) approaches for accurate classification and prediction. An experimental evaluation on real world datasets validated that the proposed approach outperforms the existing classification and prediction methods by employing only a subset of important features to achieve high prediction accuracy rates at multiple levels of data abstraction.

We conclude that our methodology can help in a variety of application domains such as business, engineering, education and agriculture, as all these domains consistently store data for discovering knowledge from large repositories. Using the proposed methodology, analysts from these domains can find the best features for building their fuzzy classification models. These models will allow them to efficiently classify and predict new instances and future trends. Moreover, the findings from this study suggest that achieving good accuracy rates is only one aspect of evaluating a classifier; another important measure is the computational performance of each classifier. In many cases, the classifiers which produce accurate results fail to work on large datasets.

Future work is directed towards the evaluation of fuzzy rule-based classification. The main motivation is to generate a good set of rules that lets analysts understand the underlying trends and patterns without losing prediction accuracy. For instance, one of the main advantages of the Apriori-based approach is that it allows analysts to form meaningful and useful rules which can help in summarizing the hidden trends and patterns in data, although such rules are not adequate for prediction. In future, we aim to develop a methodology that not only gives accurate predictions but also generates meaningful rules from data clusters for decision making purposes.
