Abstract
Selecting the most suitable classification algorithm for each data set under analysis remains an unsolved research problem. This paper therefore proposes a meta-learning based framework that helps both practitioners and non-expert data mining users to make informed decisions about the goodness and suitability of each available technique for the data set at hand. In short, the framework is supported by an experimental database that is fed with the meta-features extracted from training data sets and the performance obtained by a set of classifiers applied to them, with the aim of building an algorithm recommender using regressors. For a new, unseen data set, this allows the end-user to obtain the predicted accuracy of each algorithm in this set, with the algorithms ranked by that value. The experimentation performed and discussed in this paper evaluates which meta-features are most significant and useful for characterising data sets, with the end goal of building algorithm recommenders, and tests the feasibility of these recommenders. The study is carried out on data sets from the educational arena, in particular data sets aimed at predicting students’ performance in e-learning courses.
Introduction
In this new era of so-called datification [26], there is an urgent need for both data scientists and tools that make data management and analysis easier. Even though new technologies are arising in the data mining field, the automation of the Knowledge Discovery in Databases (KDD) process is still an open problem. As is well known, the KDD process [28], which aims at identifying valid, novel, potentially useful, and ultimately understandable patterns in data, comprises several phases (preprocessing, modeling, mining and testing), and each one, in turn, includes a large number of tasks which must be performed. Most of these steps can be chained by means of scientific workflows, but the choice of the most suitable techniques and algorithms is not yet automated. In particular, this work is a step towards the dynamic selection of the classifier to be applied to the data set at hand. As follows from the no-free-lunch theorem, no learning algorithm outperforms all others across the set of all real-world problems [10]; mechanisms that help data miners to select the most appropriate technique must therefore be developed.
Rice [15] was the first to formulate this issue, and since then different approaches have been proposed, for instance: a) a traditional approach based on a costly trial-and-error procedure; b) the use of ensemble methods to obtain better predictive performance; or c) an approach based on meta-learning, able to automatically provide guidance on the best alternative from a set of meta-features.
The proposed framework follows this last approach, since the characterisation of data sets and the analysis of the behaviour of different machine learners applied to them have been shown to be suitable and efficient (see Section 2). Although there are several works on meta-learning, only a few have focused on algorithm recommendation and even fewer have studied which meta-features are the most suitable for this goal. This paper thus advances both research lines with the following contributions: a) an explanation of how to build algorithm recommenders supported by an experimental database; b) a description of the set of meta-features that can be used for the characterisation of data sets; and c) a sound case study that analyses the behaviour of different sets of meta-features on data sets from the educational arena and their suitability for building algorithm recommenders.
This framework could be used by practitioners to create and feed their own experimental database and use it as a benchmark [16], as well as to build recommenders that help non-expert data miners to take advantage of mining techniques while hiding the choice and configuration of those techniques [19]. This is the case of teachers involved in virtual education who, owing to the lack of face-to-face contact, need to analyse the activity performed in the e-learning platform in order to guide learners [9].
This work is a widely extended version of [19], in which the framework and the experimental database are described more clearly and a new experimental study has been carried out using a higher number of algorithms and some different data sets in the training phase. Moreover, instead of one, two algorithm recommenders have been built and then compared to draw conclusions about the feasibility of the proposal, one based on linear regression and the other on a Multilayer Perceptron. Related work has also been reviewed.
This paper is organised as follows: Section 2 introduces the meta-learning field and the set of meta-features that can be used, and then briefly describes the different elements which comprise our framework. Section 3 describes the methodology followed in the case study and the setting of the experiments performed. Section 4 presents and discusses the results obtained, showing the feasibility of the proposal. Finally, conclusions and future work are outlined in Section 5.
Background and an overview of our framework
Meta-learning is a subfield of machine learning which applies learning algorithms to meta-features extracted from machine learning experiments in order to better understand how these algorithms can become flexible in solving different kinds of learning problems, and hence to improve the performance of existing learning algorithms [25] or to help the user determine the most suitable learning algorithm(s) for the problem at hand [1], among other goals.
In this paper, meta-learning is used with the aim of learning the relationship between the meta-features extracted from the data sets and the performance of the algorithms applied to them. Accordingly, a meta-learning system consists of two main stages: a training phase and a prediction phase. In the training stage, data sets are first characterised by a set of measurable meta-features and then a set of classifiers is executed on them. The performance of these models, generally measured by their accuracy (although other measures such as f-measure or error rate can be used), is next linked to the meta-features of each data set involved. Later, a learning algorithm is trained on the collected meta-features, yielding a model which is used to predict which algorithm is the best one to apply to a new data set. On other occasions, instead of selecting a single algorithm, a ranking of algorithms is provided [7]. Different approaches for building the predictor are found, mainly based on classification [18, 22] and regression [5, 23].
Recently, a new approach called meta-learning template [24] has arisen with the aim of recommending a hierarchical combination of algorithms. On the other hand, Britto et al. [2] studied the classification problem complexity using multiple classifier systems based on the dynamic selection of classifiers.
Regarding the type of meta-features, many have been proposed and applied. They are commonly categorised as follows:
Simple meta-features, such as the number of attributes, the number of instances, the type of attributes (numerical, categorical or mixed), the number of values of the target attribute and the dimensionality of the data set, i.e., the ratio between the number of attributes and the number of instances.
Statistical meta-features, such as skewness and kurtosis, among others, which characterise the data distribution [25, 27].
Information-theoretic meta-features, used for characterising data sets containing categorical attributes, such as class entropy or noise-to-signal ratio [20].
Model-based meta-features, which collect the features of the mining model built, for instance, the structural shape and size of a decision tree trained on the data sets [30].
Landmarkers, which are meta-features calculated as the performance measures achieved by simple classifiers [4].
Complexity meta-features, which characterise the apparent complexity of data sets for supervised learning. These are provided by DCoL (data complexity library) [29]. They have been used in several meta-learning works [7, 18] and, recently, Herrera et al. [14] applied them to obtain the domains of competence of a classifier, which makes it possible to predict whether a given data set will be suitable for that learning method.
Contextual meta-features, i.e., characteristics related to the domain of the data set [7, 18].
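As an illustration of the first two categories, the following minimal Python sketch computes simple and statistical meta-features for a purely numeric data set. The function names, the dictionary keys and the use of population (uncorrected) moments are assumptions made for this example; statistical libraries often apply bias corrections, so their values may differ slightly.

```python
from statistics import mean


def central_moment(values, order):
    """Population central moment of the given order."""
    mu = mean(values)
    return sum((v - mu) ** order for v in values) / len(values)


def skewness(values):
    """Third standardised moment (0.0 for a constant attribute)."""
    m2 = central_moment(values, 2)
    return central_moment(values, 3) / m2 ** 1.5 if m2 > 0 else 0.0


def kurtosis(values):
    """Excess kurtosis: fourth standardised moment minus 3."""
    m2 = central_moment(values, 2)
    return central_moment(values, 4) / m2 ** 2 - 3.0 if m2 > 0 else 0.0


def simple_meta_features(rows):
    """rows: list of instances, each a list of attribute values (class excluded)."""
    n_instances = len(rows)
    n_attributes = len(rows[0])
    return {
        "n_instances": n_instances,
        "n_attributes": n_attributes,
        # Dimensionality as defined in the text: attributes / instances.
        "dimensionality": n_attributes / n_instances,
    }


def statistical_meta_features(rows):
    """Minimum, maximum and average skewness and kurtosis over all attributes."""
    columns = list(zip(*rows))
    skews = [skewness(col) for col in columns]
    kurts = [kurtosis(col) for col in columns]
    return {
        "skew_min": min(skews), "skew_max": max(skews), "skew_avg": mean(skews),
        "kurt_min": min(kurts), "kurt_max": max(kurts), "kurt_avg": mean(kurts),
    }
```

Aggregating per-attribute statistics into a minimum, a maximum and an average keeps the meta-feature vector at a fixed length regardless of how many attributes each data set has.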
For the sake of a better understanding, a modular schema of the proposed framework is depicted in Fig. 1. As can be observed, the framework makes use of four workflows in the training phase: one for extracting the meta-features of the data sets (WF1); another for generating models with each classifier under study (WF2); a third, responsible for loading the descriptive information of each experiment performed on each data set into the database (WF3); and a fourth, in charge of building a regressor for each type of algorithm used in the training phase, taking as target the value of the desired performance measure (accuracy, f-measure, and so on) (WF4). Accuracy is the measure most frequently used [6]. Regarding the predictive phase, only one workflow is required. It is responsible for reading each new data set, extracting its meta-features, applying the previously built regressors to this meta-data set and showing the algorithms ranked according to the accuracy (or whichever evaluation measure was chosen) predicted by those regressors.
The database schema used to gather the experiments is depicted in Fig. 2. It is based on the one proposed in [16], which collects machine learning experiments. That schema had to be extended, however, to store the meta-features extracted from each data set (Fields and DatasetMetaFeatures classes) and the set of meta-features which describes each mining model built (MiningModels and Measures classes).
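The predictive phase can be sketched in a few lines of Python. This is only an illustrative outline, assuming that each regressor built in the training phase is available as a callable that maps a meta-feature dictionary to a predicted accuracy; the function name and this interface are hypothetical and are not part of the framework's actual workflows.

```python
def rank_algorithms(regressors, meta_features):
    """Rank classifiers by the accuracy each regressor predicts for one data set.

    regressors: {classifier name: callable(meta_features) -> predicted accuracy}
    meta_features: meta-features extracted from the new, unseen data set
    Returns a list of (classifier name, predicted accuracy), best first.
    """
    predicted = {name: predict(meta_features) for name, predict in regressors.items()}
    return sorted(predicted.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical regressors with made-up coefficients, for illustration only.
example_regressors = {
    "ClassifierA": lambda m: 0.50 + 0.10 * m["landmarker"],
    "ClassifierB": lambda m: 0.90 - 0.20 * m["landmarker"],
}
ranking = rank_algorithms(example_regressors, {"landmarker": 1.0})
```

Keeping one regressor per classifier means that a new classifier can be added to the recommender without retraining the existing regressors.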
Meta-learning in educational arena
Meta-learning has not yet been well explored in the educational arena, despite the huge quantity of data available in learning platforms and the urgent need to analyse and improve learning processes in order to raise academic performance and reduce the worrying dropout rates. Although many issues have been addressed according to the surveys in [3, 8, 9], few works have relied on a meta-learning based solution. Romero et al. [21] proposed the use of meta-learning for the automatic setting of two parameters of the J48 algorithm with the aim of increasing the model accuracy for predicting students’ performance. Later, this research group built a recommender for selecting the best classifier using a nearest neighbor (1-NN) approach on statistical, complexity and domain meta-features [7] and then applied a multi-label learning algorithm [13] for the same goal. On the other hand, Zorrilla et al. [19] evaluated the possibility of building a recommender to be wrapped in a data mining service using only the classifier that performed best in the training phase of the meta-learning process. The results were positive but limited because the number of algorithms could not be increased owing to the reduced number of training data sets available. Some authors, such as [17], addressed this issue by first clustering algorithms according to the similarity of their behaviour and then recommending a set of algorithms instead of a single one, whereas others followed an approach based on regression [19, 23]. According to Lemke et al. [6], this last approach has barely been studied; the experimental study carried out in this work is thus intended to fill this gap.
Experiment design
This mining experiment aims at studying the relevance and advantages that each group of meta-features provides for the building of algorithm recommenders as well as assessing the predictive power of these recommenders. The process followed is the one detailed in Fig. 1.
All data sets included in the experiment came from thirty different blended and virtual learning courses hosted on a Moodle platform. They gather the activity carried out by the students, measured by means of metrics defined at course level and at the level of each tool available in the course, such as the total number of sessions opened by each student in the course, the time spent per session, the number of self-tests performed and the number of messages posted and answered in the forum, among others. All attributes are numeric except the class attribute, which records whether the learner failed (positive class) or passed (negative class) the course.
Once the training data sets were loaded, their meta-features were extracted. Concretely, the following ones were used: a) the number of instances, the number of attributes and the dimensionality as simple meta-features; b) the minimum, the maximum and the average value of the skewness and kurtosis of all attributes of the data set as statistical meta-features, calculated by means of the Apache Commons Math 3 Java library; c) the fourteen complexity meta-features offered by the DCoL software [29], namely the maximum Fisher’s discriminant ratio (F1), the directional-vector maximum Fisher’s discriminant ratio (F1v), the overlap of the per-class bounding boxes (F2), the maximum (individual) feature efficiency (F3), the collective feature efficiency (sum of each feature efficiency) (F4), the minimized sum of the error distance of a linear classifier (L1), the training error of a linear classifier (L2), the nonlinearity of a linear classifier (L3), the fraction of points on the class boundary (N1), the ratio of average intra/inter class nearest neighbor distance (N2), the leave-one-out error rate of the one-nearest neighbor classifier (N3), the nonlinearity of the one-nearest neighbor classifier (N4), the fraction of maximum covering spheres (T1) and the average number of points per dimension, i.e., the ratio of the number of examples in the data set to the number of attributes (T2); d) the accuracy achieved by the following weak classifiers as landmarkers: LinearDiscriminant (LD), BestNode with gain-ratio criterion (BN), RandomNode (RN), NaïveBayes (NB) and 1-NN, all available in Weka or RapidMiner.
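To give a concrete idea of what a complexity meta-feature measures, the sketch below computes the maximum Fisher’s discriminant ratio (F1) for a two-class numeric data set, using the commonly cited formulation in which each attribute is scored as the squared difference of the class means divided by the sum of the class variances. It is a simplified illustration, not the DCoL implementation; the function name and the choice of population variance are assumptions.

```python
from statistics import mean, pvariance


def fisher_discriminant_ratio(rows, labels):
    """Maximum over attributes of (mean_a - mean_b)^2 / (var_a + var_b).

    Assumes exactly two class labels and numeric attributes; attributes with
    zero variance in both classes are skipped to avoid a division by zero.
    """
    class_a, class_b = sorted(set(labels))  # fails if there are not two classes
    best = 0.0
    for column in zip(*rows):
        a = [v for v, y in zip(column, labels) if y == class_a]
        b = [v for v, y in zip(column, labels) if y == class_b]
        spread = pvariance(a) + pvariance(b)
        if spread > 0:
            best = max(best, (mean(a) - mean(b)) ** 2 / spread)
    return best
```

A high value indicates that at least one attribute separates the two classes well on its own, which is why this measure is often read as an indicator of how easy the classification problem is.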
Since the data sets only had numeric features, no information-theoretic measures were used. Likewise, owing to the great variability of the algorithms used in the training phase, model-based meta-features were discarded. Table 1 shows that the meta-features extracted from the training data sets take a wide range of values.
Next, eighteen classifiers from different approaches were run on these thirty data sets using their default settings. The implementations chosen were AdaBoostM1, ADTree, Bagging, BayesNet, BFTree, J48, JRip, Logistic, MultiBoostAB, MultilayerPerceptron, NNge, OneR, PART, RandomForest, Ridor, SimpleCart, SMO and VotedPerceptron, all of them available in Weka. The accuracy achieved by each classifier on each data set was stored in the database. The validation process followed was leave-one-out.
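The leave-one-out validation mentioned above can be illustrated with the following Python sketch. It uses a plain 1-NN classifier as a stand-in for the Weka implementations, purely to keep the example self-contained; the function names and the prediction interface are assumptions of this example.

```python
def nearest_neighbour_predict(train_rows, train_labels, query):
    """Label of the training instance closest to `query` (squared Euclidean distance)."""
    def distance(row):
        return sum((a - b) ** 2 for a, b in zip(row, query))
    closest = min(range(len(train_rows)), key=lambda i: distance(train_rows[i]))
    return train_labels[closest]


def leave_one_out_accuracy(rows, labels, predict):
    """Hold out each instance once, train on the rest, and count correct predictions."""
    hits = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if predict(train_rows, train_labels, rows[i]) == labels[i]:
            hits += 1
    return hits / len(rows)
```

Leave-one-out uses every instance for testing exactly once, which is attractive for small data sets such as course logs, at the cost of training as many models as there are instances.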
Then, eighteen meta-data sets were generated, one for each classifier. Each meta-data set contained the meta-features of the training data sets along with the accuracy achieved by that specific classifier. For the sake of studying the behaviour of each group of meta-features, we built different linear regression models by using different combinations of meta-features: a) all the meta-features available; b) only the meta-features which belong to each group (simple, statistical, complexity or landmarkers), separately; and c) only the most relevant meta-features chosen by a feature-selection algorithm.
For this last task, the ClassifierSubsetEval attribute evaluator offered by Weka was executed. It was configured with BestFirst as the search method and LinearRegression as the base classifier. The leave-one-out method was used for its evaluation. In order to choose features according to their relevance, ten thresholds were defined: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 95%.
All the linear regression models (meta-learners) were generated using the leave-one-out strategy as the evaluation process. The accuracy and the root-mean-square error (RMSE) computed for each regression model were both stored in the experimental database. Later, the same task was repeated but, this time, the MultilayerPerceptron algorithm was used as the base classifier. The analysis and discussion of the results are presented in Section 4.
Finally, one algorithm recommender was built with the best setting chosen from the analysis of meta-features previously performed. Next, two new data sets were loaded into the system with the aim of obtaining the ranking of algorithms that the recommender offered as outcome and checking how good this recommendation was. The results and the conclusions drawn are presented in the next section.
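The construction of the per-classifier meta-data sets can be sketched as follows. The data structures (nested dictionaries keyed by data set and classifier name) and the function name are assumptions chosen for readability; they do not reflect the schema of the experimental database.

```python
def build_meta_datasets(meta_features, accuracies):
    """Build one meta-data set per classifier.

    meta_features: {data set name: {meta-feature name: value}}
    accuracies:    {classifier name: {data set name: accuracy}}
    Returns {classifier name: list of rows}, where each row holds the
    meta-features of one data set plus the accuracy as regression target.
    """
    meta_datasets = {}
    for classifier, per_dataset in accuracies.items():
        rows = []
        for dataset, accuracy in per_dataset.items():
            row = dict(meta_features[dataset])  # copy, so the input is not modified
            row["accuracy"] = accuracy
            rows.append(row)
        meta_datasets[classifier] = rows
    return meta_datasets
```

With thirty training data sets and eighteen classifiers, this yields eighteen meta-data sets of thirty rows each, which is why the number of meta-features used per regressor matters.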
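The leave-one-out RMSE of a regression meta-learner can be illustrated as below. To stay self-contained, the sketch fits an ordinary least-squares line on a single meta-feature, whereas the study uses Weka's multi-attribute LinearRegression; this simplification and the function names are assumptions of the example.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x


def leave_one_out_rmse(xs, ys):
    """Fit on all points but one, predict the held-out point, and pool the squared errors."""
    squared_errors = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        prediction = slope * xs[i] + intercept
        squared_errors.append((prediction - ys[i]) ** 2)
    return sqrt(sum(squared_errors) / len(squared_errors))
```

Because each prediction is made for a data set the model has not seen, this error estimates how well the meta-learner would predict a classifier's accuracy on a genuinely new data set.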
Results and discussion
This section is organised in two parts: first, the analysis of meta-features is presented and discussed, and next, the feasibility and suitability of meta-learning based recommenders are demonstrated.
Meta-feature analysis
With the aim of studying and comparing the behaviour of each group of meta-features, two different regressors were used: LinearRegression and MultilayerPerceptron. Table 2 displays the RMSE obtained when building each linear regression model for each algorithm by using all meta-features (“all”), only the complexity ones (“comp”), only the simple ones (“simp”), only the statistical ones (“stat”) and, finally, only the landmarkers (“land”). The last column, labelled “Avg. Acc.”, gathers the average RMSE computed as the mean of the RMSE achieved for each classifier on each data set. This is used as the base value in our experiment. Columns marked with * denote the RMSE obtained when a feature selection process with a threshold of 95% was applied to all meta-features (“all*”) or to one group of them (“comp*”, “stat*”, “simp*”, “land*”). The value in bold marks, for each classifier, the set of meta-features that builds the best meta-learner, that is, the one with the lowest RMSE.
The reason why this threshold was chosen can be found in Fig. 3, which shows the average RMSE that results from applying the feature selection process to all meta-features with different thresholds for the eighteen classifiers. As can be observed, the RMSE decreases as the threshold increases, and 95% was therefore selected. This is a consequence of the fact that, with LinearRegression, the ClassifierSubsetEval process tends to assign very high relevance values to most of the meta-features, so that the improvement is barely significant. Suffice it to note that the RMSE is lower than 0.08 with a threshold of 90%, and is 0.061 with 95%.
Returning to the results shown in Table 2, it can be stated that the feature selection process notably improves the RMSE of all the linear regression models built, independently of the classifier or the set of meta-features chosen. It is worth noting the improvement in the prediction of the accuracy of classifiers such as VotedPerceptron, for which the RMSE falls from 0.229 to 0.092. Moreover, some linear regression models achieve an RMSE lower than 0.05 and even 0.04 when feature selection is applied to all the meta-features. That is the case of AdaBoostM1, ADTree and J48, with an RMSE of 0.036, 0.033 and 0.038 respectively. Furthermore, nine out of the eighteen algorithms obtained their best linear regression model when feature selection was applied to all the meta-features. The same conclusions can be drawn from the results obtained by applying feature selection to the complexity meta-features, where the average RMSE decreases from 0.134 to 0.076. In fact, this setting is the best in five cases.
The linear models built from simple and statistical meta-features are considerably worse than the previous ones, with an RMSE higher than 0.1 regardless of whether feature selection is used. However, the behaviour of landmarkers must be highlighted, since they achieved models with a very low RMSE. In fact, the average improvement obtained by applying the feature selection process is smaller than in the other cases, as can be seen more clearly in Fig. 4.
Finally, it is worth noting that the average RMSE for predicting the accuracy of an algorithm using meta-learning is far lower than the base RMSE (see column “Avg. Acc.”). This base error is higher than 0.1 in eleven out of eighteen algorithms, and higher than 0.08 in the rest, whereas the value obtained using all meta-features, only the complexity ones or only the landmarkers is considerably lower (0.061, 0.076 and 0.063 respectively). Moreover, the p-value of the paired t-test performed to compare these values with the ones in the column “Avg. Acc.” is, in all cases, lower than 0.01, so it can be concluded that the meta-learning process not only achieves better results, but that the difference is highly significant.
Next, Table 3 summarises how often (as a percentage) each meta-feature was selected when the feature selection with a threshold of 95% was performed. As can be observed, there are four landmarkers (BestNode, RandomNode, NaïveBayes and 1-NN) and three complexity measures (F1v, N1 and N4) that were chosen at least 50% of the time, with RandomNode standing out at 83.33%. This fact explains why landmarkers alone achieve a good RMSE in most of the linear regression models, even when the feature selection process is not applied. By contrast, simple and statistical meta-features are not, in general terms, highly relevant. Only the number of attributes (N# attributes) shows a selection ratio of 55.56%, with the rest of them below 30% and even 20%.
Thus, these results show that building algorithm recommenders based on linear regressors for predicting the accuracy of each classifier is viable and, furthermore, highly suitable, since the error is very low. Landmarkers and complexity measures have the highest predictive power; however, the best regression models for the majority of the classifiers are built when all meta-features are used, including the simple and statistical ones.
Next, the same analysis is performed for meta-learners built with MultilayerPerceptron, a more complex technique than linear regression. Table 4 and Figs. 5 and 6 show the results achieved. As can be observed, the feature selection process has, in this case, a minor effect on the improvement of the RMSE. Moreover, the best results are obtained, on average, with a threshold of 80%. In this scenario, the best average RMSE, 0.076, is achieved by using only landmarkers after applying feature selection, closely followed by the average RMSE obtained by using all meta-features, 0.079. Unlike with linear regression, complexity measures are not so effective, since the models have an average RMSE of 0.104, although, if the results are analysed classifier by classifier, there are four of them (MultilayerPerceptron, NNge, Ridor and SimpleCart) for which the complexity measures give the best results.
Table 5 shows the percentage of times that each meta-feature was selected when feature selection with a threshold of 80% was performed. These results explain why the feature selection process does not achieve, in this case, the same degree of improvement as the one obtained when LinearRegression was used and, furthermore, why the RMSE obtained with MultilayerPerceptron is, in general, higher. As can be observed, the only remarkable meta-feature is a landmarker, RandomNode, which was selected 66.67% of the time. The other landmarkers and the complexity meta-features are barely selected, 27.78% and 22.22% of the time respectively. In fact, some complexity meta-features that were relevant with LinearRegression, such as F1v, F4 and L3, are never included in these regression models. Simple and statistical meta-features are also rarely chosen, no more than 11.11% of the time.
Finally, Table 6 compares the number of times that the LinearRegression and MultilayerPerceptron algorithms obtained a lower RMSE after applying a feature selection process, that is, the number of times each built the better model. It is worth noting that LinearRegression achieves a better model for fourteen out of eighteen classifiers when all meta-features are used. This fact is even more remarkable when only complexity meta-features or landmarkers are selected. In the light of these results, it can be stated that using a simple technique such as LinearRegression for building meta-learners leads to better models than applying more complex techniques like MultilayerPerceptron. In both cases, the feature selection process brings improvements and should therefore always be performed.
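For readers unfamiliar with the test, the statistic of a paired t-test can be computed as in the sketch below. The function name is hypothetical, and the sketch returns only the t statistic and the degrees of freedom; obtaining the p-value additionally requires the Student's t distribution, which is left to a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(sample_a, sample_b):
    """Return (t, degrees of freedom) for two paired samples of equal length.

    Each pair would hold, for one classifier, the base RMSE and the RMSE of
    the meta-learner; the test is run on the per-pair differences.
    """
    differences = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(differences)
    t = mean(differences) / (stdev(differences) / sqrt(n))
    return t, n - 1
```

Pairing is appropriate here because both errors refer to the same classifier, so the comparison is made classifier by classifier rather than between two independent groups.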
Predictive power of algorithm recommenders
Next, the predictive power of an algorithm recommender built with LinearRegression after having applied a feature selection process with a threshold of 95% is demonstrated. To this end, two unseen data sets were loaded into the system and the resulting recommendations were compared with the best real model built. Columns “P. Acc.” and “R. Acc.” in Tables 7 and 8 contain the predicted (expected) accuracy and the real accuracy obtained for each classifier respectively, and columns “Rank” and “R. Rank” gather their positions in the corresponding rankings. As can be observed, the recommended classifier for the first data set, SMO, achieves the second highest real accuracy among all classifiers, 85.94%. Moreover, the classifiers with the third, fourth and fifth highest accuracies are ranked in the second, third and fourth positions. However, the best real classifier, NNge, is ranked in fifth place. Although the system has not recommended the best classifier, it has been able to recommend the second best one and to rank some of the better classifiers at the top. Furthermore, the six worst classifiers are ranked at the bottom of the list.
The results for the second data set can be observed in Table 8. The first and the second classifiers reach the third highest real accuracy, whereas two of the classifiers with the best real accuracy, Bagging and MultiBoost, are ranked in the third and fourth positions. AdaBoost and RandomForest are ranked in the first two positions because the error in the prediction of their accuracy is higher than for the rest. For example, the predicted accuracy for MultiBoost is 88.69% and the real accuracy is 88.73%, a prediction error of only 0.04%, whereas the prediction error for RandomForest is 8.93%. On the other hand, the four worst classifiers, BFTree, PART, SMO and VotedPerceptron, are indeed at the bottom of the ranking, so it can be concluded that the meta-learner, as happened previously, works well at detecting algorithms with low performance, and thus does not recommend them. Moreover, the difference between the accuracy of the fourth worst classifier, PART, with 80.28%, and that of one of the best classifiers, MultiBoost, with 88.73%, is higher than 8%, whereas the difference between the real accuracy of one of the recommended classifiers, RandomForest, with 85.92%, and that of MultiBoost is less than 3%. Hence, it can be stated that the system works suitably, ranking the classifiers that performed better in the first places and the worst ones in the last positions.
Another important conclusion drawn from this experiment is the need to work with a wide variety of algorithms because, depending on the meta-features of the data set, an algorithm such as SMO can be the best (first data set) or the worst (second data set) choice. This corroborates that, if the data mining process is to be automated for non-expert miners with a certain degree of quality, a meta-learning based algorithm recommender is a feasible solution.
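The comparison between the recommended and the real ranking can be reproduced with a small helper such as the one sketched below. The function name and the row layout are assumptions; in the usage example only the MultiBoost figures come from the text, while the other two entries are placeholders invented to make the example runnable.

```python
def compare_rankings(predicted, real):
    """Compare predicted and real accuracies (both in %) per algorithm.

    Returns one row per algorithm, ordered by predicted rank, with the
    predicted rank, the real rank and the absolute prediction error.
    Ties are broken by insertion order.
    """
    by_predicted = sorted(predicted, key=predicted.get, reverse=True)
    by_real = sorted(real, key=real.get, reverse=True)
    return [
        {
            "algorithm": name,
            "rank": by_predicted.index(name) + 1,
            "real_rank": by_real.index(name) + 1,
            "abs_error": abs(predicted[name] - real[name]),
        }
        for name in by_predicted
    ]


# MultiBoost values are taken from the text; the other entries are placeholders.
predicted = {"MultiBoost": 88.69, "PlaceholderA": 91.00, "PlaceholderB": 80.00}
real = {"MultiBoost": 88.73, "PlaceholderA": 85.00, "PlaceholderB": 81.00}
rows = compare_rankings(predicted, real)
```

A large absolute error on an algorithm placed at the top of the predicted ranking, as in the placeholder case, is the situation described above for AdaBoost and RandomForest.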
Conclusions
This paper describes a meta-learning based framework which allows practitioners to discover which algorithm is the most suitable for a given two-class data set. Unlike previous works [6], in this paper regressors have been used as meta-learners to offer a ranking of algorithms instead of only the best one, since the difference in accuracy between algorithms is often negligible. Likewise, a thorough study of which meta-features are most useful for this end has been carried out on data sets from the educational arena. In the light of the results analysed, landmarkers are the most informative meta-features regardless of which meta-regressor is used, although the best meta-learners are achieved when all meta-features are used and a feature selection process is first run on them. Another interesting point is that a simpler algorithm such as LinearRegression behaves better as a meta-learner than more complex ones such as MultilayerPerceptron. Finally, the suitability and feasibility of the framework have been tested, and very satisfactory results on data sets from the educational arena have been achieved.
In the near future, the framework will be extended to address multi-class problems and nominal attributes, where there are still many open issues [11]. Likewise, other regressors, based on decision trees or rules, will be tested and compared with the aim of determining which one is the most suitable for the educational domain. In order to increase the number of training sets, the artificial generation of data sets will also be assessed.
Acknowledgments
This work has been partially funded by the Spanish Government under grant TIN2014-56158-C4-2-P (M2C2).
