Abstract
Meta-Learning has been largely used over the last years to support the recommendation of the most suitable machine learning algorithm(s) and hyperparameters for new datasets. Traditionally, a meta-base is created containing meta-features extracted from several datasets along with the performance of a pool of machine learning algorithms when applied to these datasets. The meta-features must describe essential aspects of the dataset and distinguish different problems and solutions. However, if one wants the use of Meta-Learning to be computationally efficient, the extraction of the meta-feature values should also show a low computational cost, considering a trade-off between the time spent to run all the algorithms and the time required to extract the meta-features. One class of measures with successful results in the characterization of classification datasets is concerned with estimating the underlying complexity of the classification problem. These data complexity measures take into account the overlap between classes imposed by the feature values, the separability of the classes and distribution of the instances within the classes. However, the extraction of these measures from datasets usually presents a high computational cost. In this paper, we propose an empirical approach designed to decrease the computational cost of computing the data complexity measures, while still keeping their descriptive ability. The proposal consists of a novel Meta-Learning system able to predict the values of the data complexity measures for a dataset by using simpler meta-features as input. In an extensive set of experiments, we show that the predictive performance achieved by Meta-Learning systems which use the predicted data complexity measures is similar to the performance obtained using the original data complexity measures, but the computational cost involved in their computation is significantly reduced.
Introduction
In Machine Learning (ML), bias has been defined as the choice of a specific generalization hypothesis over others, restricting the search space and model representation, making learning from data possible [51, 29]. Due to the lack of exact knowledge about the real data distribution, when deciding which algorithm has the most adequate bias for a new dataset, several algorithms are usually tried. This process, known as trial-and-error, is laborious and subjective. An alternative to supporting the automatic selection of an ML algorithm for a new dataset is to use Meta-Learning (MtL) [4]. By using knowledge from the previous application of ML algorithms to several datasets, a meta-model can be induced, which is able to recommend a suitable algorithm for a new dataset.
To use MtL, a meta-base must often be constructed. In this meta-base, typically, each meta-example is associated with a dataset, from which a set of descriptive characteristics is extracted. These characteristics are called meta-features and can be classified into five groups, whose combination we call here standard (STD) meta-features. The groups are simple measures based on general statistics and information theory [4], landmarks representing the performance of simple algorithms applied to the dataset [36] and internal features extracted from models induced by an ML algorithm when applied to the dataset [2]. Each meta-example can then be labeled according to the performance obtained by a set of ML algorithms when applied to the dataset represented by the meta-example. This process results in a meta-base, where the predictive and target features are the meta-features and the performance of the ML algorithms, respectively.
The next step is the induction of a meta-model using the meta-base. The meta-model can be induced by different ML algorithms and can be used in a recommendation system to predict [47], for a given dataset: (i) an algorithm that shall present the best performance; (ii) a ranking of the algorithms according to their performance; and (iii) the value of a performance metric achieved by each algorithm. It is important to notice that a theoretical support and a preprocessing step are needed in most of the cases, to provide a refinement of the recommendation framework [47].
However, the lack of an in-depth analysis of the STD meta-features is an open issue in most MtL studies [42, 43]. These studies usually show which group of STD meta-features can better characterize the problems [2, 3, 14, 9, 35, 13]. Besides, more recent studies [37, 45] proposed frameworks to systematize the extraction of STD meta-features, which are able to deal with aspects that affect the reproducibility and generalization of experiments. Even though these works cover essential gaps in the literature, they do not consider the cutting edge meta-features, their asymptotic computational cost, the degree of information presented or the importance of the meta-features for the investigated problems. We believe that this deficiency can be mitigated by adding data complexity measures (CM) to the characterization of the datasets.
The CM are values that can estimate the expected difficulty of a classification problem by extracting descriptions of the overlap between classes imposed by feature values, the separability and distribution of the data points, and certain structural characteristics of the problem based on the training data [21, 34, 27]. Most of the measure values are highly correlated with the predictive performance of diverse classification models, as demonstrated in previous studies [31, 8]. Therefore, one may expect they can play an important role in increasing the systematic comprehension of the ML models and improve MtL performance.
The CM have been successfully used in diverse MtL tasks, such as classifier recommendation [18], noise identification [16] and dealing with unbalanced data [1], to name a few. Although there are gains in terms of characterization of the dataset using them, they have a high asymptotic computational complexity, preventing their widespread usage in MtL. The main goal of this paper is to develop a MtL-based approach able to estimate the CM values for classification datasets with a lower computational cost. For such, low cost descriptive STD meta-features are used to design a novel MtL model able to predict the CM values for a given classification dataset. We will call these measures predicted complexity measures (PCM). In order to analyze the descriptive power of the PCM, we compare the predictive performance of a meta-model induced in three MtL experiments, each using a different set of meta-features, namely: STD, CM and PCM.
Looking at the experimental results, whilst the computational cost of obtaining the CM values was largely reduced by using PCM, the predictive performance of the meta-models was similar, confirming that the descriptive capability of the measures was maintained. Finally, the PCM was assembled into an R package called Simulated Complexity Library (SCoL). The SCoL package is publicly available at the GitHub1
The interest in reducing the computational burden of computing the CM values was previously addressed in [23]. Their proposal was to reduce the dataset size by selecting only representative prototypes, from which the CM values are extracted. However, the results were not consistent for all measures tested. The MtL setup in the current study differs from the previous attempt in the following ways. Firstly, a larger set of classification CM is considered. This set contains measures from recent literature implemented in the Extended Complexity Library package (ECoL) [17]. The use of MtL here is aimed at estimating the CM values for classification datasets with a lower computational cost. For such, the STD meta-features from the MtL literature are used to design a novel MtL model able to predict the CM values for a given classification dataset. Additional evaluations are performed to analyze the descriptive power of the PCM as meta-features for new MtL studies, revealing consistent results for all meta-features considered.
The rest of this research paper is organized as follows. Section 2 addresses the fundamental bibliographical synthesis that covers MtL recommendation systems and the CM. Section 3 describes the methodology adopted in this work to estimate the PCM, while Section 4 summarizes the experiments performed in their validation, as well as their results. Section 5 concludes this paper and points out possible future research directions.
This section presents the background information necessary to describe the proposed approach: Section 2.1 explains the MtL framework, including the process of building a meta-base. Section 2.2 presents the CM and their asymptotic computational complexity.
Meta-learning
The algorithm selection problem was initially addressed by Rice [44]. In this study, the author proposed an abstract model to systematize the algorithm selection problem. The main goal of this model is to predict the best algorithm to solve a given problem when more than one algorithm is available. There are four components in this model: (i) the problem instance space (
Smith-Miles [47] improved this abstract model by proposing generalizations that can also be applied to the algorithm design problem. In this proposal, some components are added: the set of MtL algorithms; the generation of empirical rules or algorithm rankings; and the examination of the empirical results by domain expert, which may guide theoretical support to refine the algorithms.
One crucial component of the previous models is the definition of the set of STD meta-features (
Simple: meta-features that are easily extracted from data [43], with low computational cost [41]. They are also called general measures [7].
Statistical: meta-features that capture statistical properties of the data [43], mainly indicators of localization and distribution, such as average, standard deviation, correlation and kurtosis. They can only characterize numerical attributes [7].
Information-theoretic: meta-features based on information theory [7], usually entropy estimates [46], which capture the amount of information in (subsets of) a dataset [47].
Model-based: meta-features extracted from a model induced from the data [43]. They are often based on properties of decision tree (DT) models [2, 35], when they are referred to as decision-tree-based meta-features [2].
Landmarking: meta-features that use the performance of simple and fast learning algorithms to characterize the datasets [47]. The algorithms must have different biases and should capture relevant information with a low computational cost.
A full description of the STD meta-features can be found in Rivolli et al. [45]. They proposed frameworks to systematize the extraction of STD meta-features by providing guidelines to enable the reproduction of empirical research in MtL, in agreement with the formalization presented and making the implementation available of the main STD meta-features used in MtL. Additionally, the paper surveys other measures, including the CM, and points out the computational cost involved in their computation.
The definition of the set of problem instances (
The algorithm space (
After extracting the STD meta-features from the datasets and evaluating the performance of a set of algorithms for these datasets, the next step is to label each meta-example in the meta-base. Brazdil et al. [5] summarize the three main properties frequently used to label the meta-examples in MtL: (i) the algorithm that presented the best performance on the dataset (a classification task); (ii) the ranking of the algorithms according to their performance on the dataset (a ranking classification task), where the algorithm with the best performance is top-ranked; and (iii) the performance value obtained by each evaluated algorithm on the dataset (a regression task).
The CM were first proposed by Ho and Basu [21] aiming to capture the underlying difficulty of a classification problem. They are measures extracted from a training dataset that characterize aspects such as overlapping of the classes, density of manifolds and shape of the decision boundary. Since various aspects may influence the complexity of a classification dataset, the authors defined a set of 12 measures able to capture different perspectives. Other recent works proposed more measures complementing the initial set of CM [25, 16, 34].
Table 1 presents the CM adopted in this work, which are briefly described next. This table presents the category of each measure, the name, acronym and asymptotic worst-case computational cost. All measures are computed from the training dataset
Characteristics of the complexity measures
Characteristics of the complexity measures
Some measures are defined only for binary classification problems. To measure the complexity of a multiclass dataset, the original classes are first decomposed using a OVO (one-versus-one) or pairwise strategy, producing
Most of the measures in this category consider a classification problem simpler if it has at least one feature which allows it to discriminate the classes perfectly.
Maximum Fisher’s discriminant ratio (F1): considers the discrimination ability of the features according to the Fisher’s discriminant criterion [30]:
where
The Directional-vector Maximum Fisher’s Discriminant Ratio (F1v): F1v searches for a vector which can separate the two classes after the examples have been projected into it according to a directional Fisher criterion [28]:
where
Volume of Overlapping Region (F2): F2 calculates the overlap of the distributions of the feature values within the classes. It takes the range of the overlapping interval, normalized by the range of the values in both classes, as shown in Eq. (5) [49].
where:
Maximum Individual Feature Efficiency (F3): F3 estimates the efficiency of each feature in separating the classes, and takes the feature with the best efficiency. The efficiency of each feature is given by the ratio between the number of examples that are in an overlapping region and the total number of examples. Based on those concepts, F3 can be estimated as:
where the numerator gives the number of examples that are in the overlapping region for feature
Collective Feature Efficiency (F4): this measure provides an overview of how the features work together, by successively applying the F3 measure [34]. First, the feature with the most discrimination power according to F3 is selected. All examples that can be separated by this feature are disregarded and the previous procedure is repeated: the next feature with the most discrimination ability according to F3 is selected, excluding the examples already discriminated. This procedure is repeated until all the features have been examined and can be stopped whenever no example remains. The final result of F4 is the percentage of remaining examples in the dataset.
These measures check if the classes are linearly separable. In this case, the problem can be considered simpler than another requiring a non-linear decision boundary. As in [34], to obtain the linear classifier, a linear Support Vector Machine (SVM) [10] is used. The SVM seeks the hyperplane able to discriminate the classes with a maximum margin of separation. In such a process, it minimizes the norm of the hyperplane weight vector and also the training errors, which are modelled by a slack variable
Sum of the Error Distance by Linear Programming (L1): this measure considers the sum of the distances of incorrectly classified examples to a linear boundary used in their separation. The distance of the erroneous instances to the decision hyperplane can be assessed by the
Error Rate of Linear Classifier (L2): L2 computes the error rate of the linear classifier previously described.
Non-Linearity of a Linear Classifier (L3):L3 uses a methodology proposed by [22] to produce a new dataset. For such, pairs of examples from the same class are chosen randomly and are linearly interpolated. Then, a linear SVM trained on the original data has its error rate measured in the new data points.
These measures analyze the neighborhood of the data points in order to capture the shape of the decision boundary, the overlapping of the classes and their internal structure.
Fraction of Borderline Points (N1): first a Minimum Spanning Tree (MST) is built from the data, in which each vertex corresponds to an example and the edges are weighted according to the distance between them. N1 is given by the percentage of vertices incident to edges connecting examples of different classes in the MST.
Ratio of Intra/Extra Class Nearest Neighbor Distance (N2): N2 considers the ratio of two sums: (i) intra-class, that is, the sum of the distances between each example and its closest neighbor from the same class; and (ii) extra-class, which is the sum of the distances between each example and its closest neighbor from another class (aka the nearest enemy). Here the following equation is taken:
where
Error Rate of the Nearest Neighbor Classifier (N3): N3 takes the error rate of a 1-nearest neighbor (1NN) classifier, estimated using a leave-one-out procedure in the training dataset.
Non-Linearity of the Nearest Neighbor Classifier (N4): N4 is similar to L3, but uses the NN classifier instead of the linear predictor to obtain the error rates in the new interpolated dataset.
Fraction of Hyperspheres Covering Data (T1): T1 builds hyperspheres centered at each one of the examples and their radius is progressively increased until the hypersphere reaches a hypersphere of another class. Smaller hyperspheres contained in larger hyperspheres are eliminated and T1 is the ratio between the number of the remaining hyperspheres and the total number of examples in the dataset.
Local Set Average Cardinality (LSC): The Local-Set (LS) of an example
The cardinality of the LS of an example is an indicative of how close it is to the decision boundary and the narrowness of the gap between the classes. The local set average cardinality measure (LSC) is calculated here as:
Morais and Prati [32] and Garcia et al. [16] propose to capture structural information from a dataset by modelling it as a graph. Using this representation, the following measures are based on statistical characterization of complex networks [24].
The
Average density of the network (Density): This measure takes the number of edges that are retained in the graph, normalized by the maximum number of edges that could be formed between
Clustering coefficient (ClsCoef): The clustering coefficient of a vertex
where
Hub score (Hubs): The hub score of a node is given by the number of connections it has to other nodes, weighted by the number of connections these neighbors have.
The values of
The methodology adopted in the experiments performed for this study is summarized in Fig. 1. First, the STD meta-features are extracted from the datasets to construct meta-models able to obtain the PCM. In this step, the meta-examples are labeled according to the values of the CM. Afterward, the STD, the PCM, and the CM meta-features are extracted from the datasets and used in a new MtL experiment in order to predict the performance of some classifiers. This step is called stacking. Finally, the execution time required to extract each set of meta-features is compared with the others.
Evaluation methodology followed in the experiments.
For the induction of the meta-regressors in the prediction step, left side of Fig. 1, three regression algorithms with distinct biases are used: Random Forests (RF) [20, 6] with 500 DTs, Support Vector Regressors (SVR) [10] with radial basis kernel and Distance Weighted
On the stacking side of Fig. 1, the different sets of meta-features and measures are used as input to the regression algorithms that will induce the meta-regressors, which will be used to predict the predictive performance of a pool of classifiers induced by supervised ML algorithms. The ML algorithms are: Artificial Neural Networks (ANN) [19] trained with backpropagation (learning rate of 0.3, momentum of 0.5 and one hidden layer); C4.5 algorithm [40] with pruning; Support Vector Machine (SVM) [10] with radial basis kernel; RF with 500 DTs; and 3-NN classifier. These algorithms were chosen because of their different inductive biases. As in the first experiment, the regression algorithms used to induce the meta-regressors are DWNN, RF and SVR. We obtained the RMSE using the same 10-fold cross-validation folds from the previous task, such that the independence of the models is maintained.
One additional analysis in the stacking level is the trade-off between the computational cost of obtaining each set of meta-features and the cost of evaluating all classifiers in a cross-validation setup. For such, we run these alternatives on 100 random selected datasets in a cluster node with two Intel Xeon E5-2680v2 processors and 128 GB DDR3 into a single thread. These datasets were selected from 400 benchmark classification datasets from the OpenML repository [50]. They represent diverse application contexts and domains, and were chosen with limits of up to a maximum number of 10,000 examples, 500 features and 10 classes. All these datasets have no missing values.
The STD meta-features were extracted using the mfe package [45], whereas the CM were extracted using the ECoL package [27]. Additional details of the implementation can be found at the GitHub site4
This section presents the results from the previously described experimental setups. Section 4.1 reports the predictive performance of the meta-models in the CM prediction task. Section 4.2 accesses the PCM values in a MtL stacking approach and Section 4.3 compares the running cost of using CM against the cost of using PCM.
Prediction of the complexity measure values
Figure 2 shows the average RMSE of the meta-regressors in the CM prediction task, estimated using 10-fold cross-validation. Each plot represents one CM and the boxplots summarize the RMSE values of each meta-regressor. The last plot summarizes the average RMSE values for all of the CM. The boxplots for the results of the meta-regressors induced by the three regression algorithms (DWNN, RF and SVR) are shown in white, whereas for the baselines (RD and DF) they are colored in gray.
The RMSE of each meta-regressor to predict the CM values.
According to the previous plots, the meta-regressors outperformed the baselines, with better predictive performance in all cases. This suggests that the combination of STD meta-features was able to capture the necessary knowledge to represent the CM. Furthermore, the meta-regressors induced by DWNN, RF and SVR presented a more stable behavior than the baselines, which have elongated boxplots and larger standard deviation. DF represents a more strict baseline, since it uses the CM values average, whilst RD uses random values. Among the meta-regressors, in most of the cases, RF performed better, followed by SVR and DWNN. This information can be confirmed in the last plot, which takes the average performance for all CM values.
In order to analyze if there is a direct relationship between the CM and the PCM values, we calculated the Pearson’s correlation between them. This evaluation identifies the meta-regressor models that are more sensitive to the variation of the CM values. Figure 3 shows a heatmap of the ordered correlation between CM and PCM. Each column and row corresponds to the CM value and the PCM value predicted by the corresponding meta-regressor, respectively. Each box is colored according to the ranking of the meta-regressors, from white (highest ranking) to gray (lowest ranking). The correlation values are also shown inside the heatmap’s cells.
Heatmap of the correlation between the CM and PCM.
As expected, the PCM values obtained by the meta-regressors induced by DWNN, RF and SVR are highly correlated with the CM values. Moreover, RF performed better than SVR and DWNN, achieving values close to 0.9 of correlation for most of the cases. The baselines presented a very low correlation and, in some cases, with negative correlation values. The general results from Figs 2 and 3 are very similar, except for the F3 and ClsCoef measures. While in the boxplot the SVR algorithm had a better performance for those measures, the RF algorithm had a higher correlation in the heatmap. The answer for this divergence is the low standard deviation of the RF predictions compared to those of the SVR meta-models. It is also worth noting that the prediction of the linearity measure L1 had the worst correlation, with values around 0.7, probably explained by the more complex concept involved in this particular measure, which takes into account the distance of incorrectly predicted examples to a linear boundary used in their separation.
In order to assess how the STD meta-features contribute to the prediction of the PCM values, Fig. 4 shows the 15 top-ranked STD meta-features selected by RF in all runs. We chose RF because it has the lowest RMSE in Fig. 2 and the highest Pearson’s correlation values in Fig. 3. In this figure, the
Top-ranked STD meta-features selected by the RF meta-regressor.
The STD meta-features regarded as the most important are those based on landmarking and information theory. The landmarking measures are mainly related to the performance of simple meta-models induced by the
Since the previous meta-regressors can predict all the CM with high predictive performance, in this section we assess their use in a MtL task, estimating the performance of a set of classifiers. For such, different groups of meta-features, including the STD meta-features, the CM and the PCM, are used independently or combined. The PCM used are those generated by the best meta-model (the RF) previously induced. Ideally, the PCM will be able to either maintain the performance achieved by the CM or outperform this performance.
Figure 5 shows the performance of each meta-regressor (DWNN, RF and SVR) when different groups of meta-features and measures are used to predict the accuracy of five classifiers (ANN, C4.5, SVM, RF and 3-NN). Each plot represents the performance of one meta-regressor to predict the accuracy of one classifier and the boxplots summarize the RMSE values of each group of meta-features and measures estimated by a 10-fold cross-validation process. The baseline (RD and DF) performances are not shown in that figure because they have high RMSE values for all the classifiers.
The RMSE of each group of meta-features to predict the performance of the classifiers.
The RMSE, although similar between the groups of meta-features and measures, was lower for the SVR and RF algorithms. This indicates that they are more accurate to predict the performance of the classifiers. Analyzing the groups of meta-features and measures, the PCM provided the best input variables for the DWNN and SVR meta-regressors for all the classifiers, and an intermediate performance for RF when used individually. When PCM or CM were combined with the STD meta-features, they had the best performance for RF. This indicates that the PCM, even when combined with the STD meta-features, can improve the dataset descriptions for the MtL tasks.
A Friedman statistical test [11] with 95% of confidence value was applied to compare the predictive performance of the groups of meta-features and measures. Statistical differences between the PCM, CM and STD were found in some cases. For the DWNN and SVR meta-regressors, the PCM performed better than STD for 7 of 10 meta-bases. For the RF meta-regressor there are no statistical differences between CM, PCM and STD. In general, the STD was never better than PCM and no difference was detected between PCM and CM. This indicates that the use of PCM can improve the performance for a vast group of MtL tasks.
In MtL scenarios, the trade-off between the runtime of the characterization process and the evaluation of all alternative data modeling algorithms considered to solve the task under study should favor the former. Otherwise, the trial-and-error approach would be preferable. Therefore, to evaluate this trade-off for the proposed approach, we performed a runtime analysis of the MtL stacking level. This analysis exhaustively compares:
The time taken for extracting the CM with the time taken for extracting the STD meta-features and running all classification algorithms. The time taken for extracting the PCM - which sums up the time for extracting the STD meta-features and running the RF meta-model – with the time taken for extracting the STD meta-features and running all classification algorithms.
These comparisons are shown in Fig. 6. To improve the visualization, the time is presented on a log-scale. Each point represents a dataset and the diagonal line indicates when both times are similar. Values above the diagonal indicate that the strategy represented in the
Runtime of the CM and PCM.
According to this figure, it was clearly faster to compute the PCM than the CM. The extraction of CM presented a higher runtime when compared with the runtime time for the extraction of the STD meta-features and for training the classifiers, in, respectively, 94% and 24% of the cases. The main bottleneck of the CM is related to high numbers of examples, which can overhead the computation of the neighborhood measures. Meanwhile, the PCM presented a lower runtime in both cases. Compared to the STD meta-features, the execution time of the PCM is similar, as they need to extract the STD meta-features and obtain the prediction values of the RF meta-regressor. Compared to running all of the classifiers, the PCM had a higher runtime execution for only 4% of the datasets. Additionally, it is important to remember that in some cases the use of PCM was able to improve the predictive performance when compared to the use of the CM and STD meta-features in the stacking experiments. Therefore, we can clearly notice that the PCM are suitable substitutes for the CM, and can be computed at a lower computational cost.
Originally proposed to estimate the expected difficulty of a classification problem by extracting descriptions of the overlap between classes, the separability and distribution of the data points and certain structural characteristics of the problem, the CM are a good alternative for characterizing datasets in MtL tasks. However, the CM have a high asymptotic computational complexity, which prevents their widespread use in MtL. This paper presented a study on how to estimate the CM values at a reduced computational cost, which is achieved by inducing meta-models able to predict the CM values by using STD meta-features, which have a lower computational cost. The results also indicate that the PCM values obtained are similar to the CM values and can be obtained at a much lower computational cost. Moreover, when the PCM are contrasted with different groups of meta-features in a MtL task, they can improve the characterization of the datasets.
To investigate this issue, a meta-base consisting of several classification datasets was created. Each dataset was described by STD meta-features and labeled according to each CM value. The meta-regressors DWNN, RF and SVR were used to predict the CM values and had their performance compared to two baseline recommenders. The experimental results showed that the RF meta-models were able to predict the CM with high predictive performance. Additional analysis indicates that the landmarking and information-theoretic measures were the essential STD meta-features to predict the CM.
To validate the PCM values, an analysis of the groups of meta-features and their runtime execution for a MtL regression task was evaluated in a stacking level. The groups of meta-features and measures used were the STD meta-features, the CM and PCM. The MtL task is to predict the performance of distinct classifiers. In some cases, the PCM outperformed the CM. Moreover, the time required to obtain the PCM values is largely reduced compared to that of obtaining the CM values.
Future work shall look for the PCM that best distinguishes the performances of classifiers and increase the interpretability of the meta-models results. We would also like to: (i) optimize the PCM; (ii) evaluate other MtL approaches such as ranking the classifiers; (iii) investigate hyperparameter tuning for the classification algorithms; and (iv) study further in which cases the classifier recommendation outperforms the default classification algorithm.
Footnotes
Acknowledgments
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) , Finance Code 001. The authors would also like to thank the São Paulo Research Foundation (FAPESP), grants 2013/07375-0 (CEPID CeMEAI), 2012/22608-8, 2016/18615-0 and 2018/14819-5, and Intel for the hardware and software server used in part of the experiments.
