Boosting meta-learning with simulated data complexity measures

Abstract

Meta-Learning has been largely used over the last years to support the recommendation of the most suitable machine learning algorithm(s) and hyperparameters for new datasets. Traditionally, a meta-base is created containing meta-features extracted from several datasets along with the performance of a pool of machine learning algorithms when applied to these datasets. The meta-features must describe essential aspects of the dataset and distinguish different problems and solutions. However, if one wants the use of Meta-Learning to be computationally efficient, the extraction of the meta-feature values should also show a low computational cost, considering a trade-off between the time spent to run all the algorithms and the time required to extract the meta-features. One class of measures with successful results in the characterization of classification datasets is concerned with estimating the underlying complexity of the classification problem. These data complexity measures take into account the overlap between classes imposed by the feature values, the separability of the classes and distribution of the instances within the classes. However, the extraction of these measures from datasets usually presents a high computational cost. In this paper, we propose an empirical approach designed to decrease the computational cost of computing the data complexity measures, while still keeping their descriptive ability. The proposal consists of a novel Meta-Learning system able to predict the values of the data complexity measures for a dataset by using simpler meta-features as input. In an extensive set of experiments, we show that the predictive performance achieved by Meta-Learning systems which use the predicted data complexity measures is similar to the performance obtained using the original data complexity measures, but the computational cost involved in their computation is significantly reduced.

Keywords

Meta-learning, meta-features, complexity measures

1. Introduction

In Machine Learning (ML), bias has been defined as the choice of a specific generalization hypothesis over others, restricting the search space and model representation, making learning from data possible [51, 29]. Due to the lack of exact knowledge about the real data distribution, when deciding which algorithm has the most adequate bias for a new dataset, several algorithms are usually tried. This process, known as trial-and-error, is laborious and subjective. An alternative that supports the automatic selection of an ML algorithm for a new dataset is Meta-Learning (MtL) [4]. By using knowledge from the previous application of ML algorithms to several datasets, a meta-model can be induced, which is able to recommend a suitable algorithm for a new dataset.

To use MtL, a meta-base must often be constructed. In this meta-base, typically, each meta-example is associated with a dataset, from which a set of descriptive characteristics is extracted. These characteristics are called meta-features and can be classified into five groups, whose combination we call here standard (STD) meta-features. The groups are simple, statistical and information-theoretic measures [4], landmarks representing the performance of simple algorithms applied to the dataset [36], and model-based features extracted from models induced by an ML algorithm when applied to the dataset [2]. Each meta-example can then be labeled according to the performance obtained by a set of ML algorithms when applied to the dataset represented by the meta-example. This process results in a meta-base, where the predictive and target features are the meta-features and the performance of the ML algorithms, respectively.

The next step is the induction of a meta-model using the meta-base. The meta-model can be induced by different ML algorithms and can be used in a recommendation system to predict [47], for a given dataset: (i) an algorithm that shall present the best performance; (ii) a ranking of the algorithms according to their performance; and (iii) the value of a performance metric achieved by each algorithm. It is important to notice that a theoretical support and a preprocessing step are needed in most of the cases, to provide a refinement of the recommendation framework [47].

However, the lack of an in-depth analysis of the STD meta-features is an open issue in most MtL studies [42, 43]. These studies usually show which group of STD meta-features can better characterize the problems [2, 3, 14, 9, 35, 13]. Besides, more recent studies [37, 45] proposed frameworks to systematize the extraction of STD meta-features, which are able to deal with aspects that affect the reproducibility and generalization of experiments. Even though these works cover essential gaps in the literature, they do not consider the cutting edge meta-features, their asymptotic computational cost, the degree of information presented or the importance of the meta-features for the investigated problems. We believe that this deficiency can be mitigated by adding data complexity measures (CM) to the characterization of the datasets.

The CM are values that can estimate the expected difficulty of a classification problem by extracting descriptions of the overlap between classes imposed by feature values, the separability and distribution of the data points, and certain structural characteristics of the problem based on the training data [21, 34, 27]. Most of the measure values are highly correlated with the predictive performance of diverse classification models, as demonstrated in previous studies [31, 8]. Therefore, one may expect they can play an important role in increasing the systematic comprehension of the ML models and improve MtL performance.

The CM have been successfully used in diverse MtL tasks, such as classifier recommendation [18], noise identification [16] and dealing with unbalanced data [1], to name a few. Although they improve the characterization of the dataset, they have a high asymptotic computational complexity, which prevents their widespread usage in MtL. The main goal of this paper is to develop an MtL-based approach able to estimate the CM values for classification datasets with a lower computational cost. To this end, low-cost descriptive STD meta-features are used to design a novel MtL model able to predict the CM values for a given classification dataset. We will call these measures predicted complexity measures (PCM). In order to analyze the descriptive power of the PCM, we compare the predictive performance of a meta-model induced in three MtL experiments, each using a different set of meta-features, namely: STD, CM and PCM.

Looking at the experimental results, whilst the computational cost of obtaining the CM values was largely reduced by using PCM, the predictive performance of the meta-models was similar, confirming that the descriptive capability of the measures was maintained. Finally, the PCM was assembled into an R package called Simulated Complexity Library (SCoL). The SCoL package is publicly available in a GitHub repository (https://github.com/lpfgarcia/SCoL).

The interest in reducing the computational burden of computing the CM values was previously addressed in [23]. Their proposal was to reduce the dataset size by selecting only representative prototypes, from which the CM values are extracted. However, the results were not consistent for all measures tested. The MtL setup in the current study differs from the previous attempt in the following ways. Firstly, a larger set of classification CM is considered. This set contains measures from recent literature implemented in the Extended Complexity Library package (ECoL) [17]. The use of MtL here is aimed at estimating the CM values for classification datasets with a lower computational cost. For such, the STD meta-features from the MtL literature are used to design a novel MtL model able to predict the CM values for a given classification dataset. Additional evaluations are performed to analyze the descriptive power of the PCM as meta-features for new MtL studies, revealing consistent results for all meta-features considered.

The rest of this research paper is organized as follows. Section 2 addresses the fundamental bibliographical synthesis that covers MtL recommendation systems and the CM. Section 3 describes the methodology adopted in this work to estimate the PCM, while Section 4 summarizes the experiments performed in their validation, as well as their results. Section 5 concludes this paper and points out possible future research directions.

2. Background knowledge

This section presents the background information necessary to describe the proposed approach: Section 2.1 explains the MtL framework, including the process of building a meta-base. Section 2.2 presents the CM and their asymptotic computational complexity.

2.1 Meta-learning

The algorithm selection problem was initially addressed by Rice [44]. In this study, the author proposed an abstract model to systematize the algorithm selection problem. The main goal of this model is to predict the best algorithm to solve a given problem when more than one algorithm is available. There are four components in this model: (i) the problem instance space ($P$), which consists of datasets in MtL; (ii) the instance feature space ($F$), which contains the meta-features used to describe the datasets; (iii) the algorithm space ($A$), which contains the pool of ML algorithms that might be recommended; and (iv) the evaluation measure space ($Y$), responsible for assessing the performance of the ML algorithms in solving the problem instances contained in $P$. Using the previous sets, the MtL system can obtain an algorithm able to map a dataset $x$, described by the meta-features $f$, into one (or more) algorithms $\alpha$ able to solve the problem with a good predictive performance according to $Y$, i.e., with maximum $y(\alpha(x))$.
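
The mapping from meta-features to a recommended algorithm can be illustrated with a minimal sketch. Everything in it is hypothetical: the meta-base values are made up, and the nearest-neighbor rule in meta-feature space is only one possible way to realize the mapping, not the model used in this paper.

```python
import math

# Hypothetical meta-base: one entry per dataset in P, holding its
# meta-feature vector f and the measured performance y of each
# algorithm in A. All numbers are invented for illustration.
META_BASE = [
    ((0.10, 0.80), {"RF": 0.91, "SVM": 0.85, "3-NN": 0.80}),
    ((0.90, 0.20), {"RF": 0.70, "SVM": 0.88, "3-NN": 0.75}),
    ((0.50, 0.50), {"RF": 0.82, "SVM": 0.81, "3-NN": 0.86}),
]


def recommend(meta_features, meta_base=META_BASE):
    """Return the algorithm with maximum y on the closest known dataset."""
    _, performance = min(
        meta_base, key=lambda entry: math.dist(entry[0], meta_features)
    )
    return max(performance, key=performance.get)
```

Calling `recommend((0.85, 0.25))` looks up the second meta-example, which is the closest one, and returns its best-performing algorithm.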

Smith-Miles [47] improved this abstract model by proposing generalizations that can also be applied to the algorithm design problem. In this proposal, some components are added: the set of MtL algorithms; the generation of empirical rules or algorithm rankings; and the examination of the empirical results by domain experts, which may guide theoretical support to refine the algorithms.

One crucial component of the previous models is the definition of the set of STD meta-features ( $F$ ) used to describe general properties of datasets. These meta-features must be able to provide evidence about the future performance of the algorithms in $A$ [48, 41] and to discriminate, with a low computational cost, the performance of a group of algorithms. The main STD meta-features used in the MtL literature can be divided into five groups:

•
Simple: meta-features that are easily extracted from data [43], with low computational cost [41]. They are also called general measures [7].
•
Statistical: meta-features that capture statistical properties of the data [43], mainly indicators of localization and distribution, such as average, standard deviation, correlation and kurtosis. They can only characterize numerical attributes [7].
•
Information-theoretic: meta-features based on information theory [7], usually entropy estimates [46], which capture the amount of information in (subsets of) a dataset [47].
•
Model-based: meta-features extracted from a model induced from the data [43]. They are often based on properties of decision tree (DT) models [2, 35], when they are referred to as decision-tree-based meta-features [2].
•
Landmarking: meta-features that use the performance of simple and fast learning algorithms to characterize the datasets [47]. The algorithms must have different biases and should capture relevant information with a low computational cost.

A full description of the STD meta-features can be found in Rivolli et al. [45]. They proposed frameworks to systematize the extraction of STD meta-features, providing guidelines that enable the reproduction of empirical research in MtL, in agreement with the formalization presented, and making implementations of the main STD meta-features used in MtL available. Additionally, the paper surveys other measures, including the CM, and points out the computational cost involved in their computation.

The definition of the set of problem instances ($P$) is another concern, since ideally a large number of diverse datasets would be used to induce a reliable meta-model. To reduce the bias in this choice, datasets from several data repositories, such as UCI (https://archive.ics.uci.edu/ml/index.php) [12] and OpenML (http://www.openml.org/) [50], can be used. Other strategies to increase the number of datasets are using active learning for instance selection and datasetoids, which is a data manipulation method used to obtain new datasets from existing ones [38, 39].

The algorithm space ($A$) represents a set of candidate algorithms to be recommended in the algorithm selection process. Ideally, these algorithms should also be sufficiently different from each other and represent all regions in the algorithm space [33]. The models induced by the algorithms can be evaluated by different measures. For classification tasks, most MtL studies use accuracy. However, other indices, such as $F_{\beta}$, AUC and kappa coefficient, can also be used. For regression problems, Mean Squared Error (MSE) or Root MSE (RMSE) (or normalized versions of such measures) are usually employed.

After extracting the STD meta-features from the datasets and evaluating the performance of a set of algorithms for these datasets, the next step is to label each meta-example in the meta-base. Brazdil et al. [5] summarize the three main properties frequently used to label the meta-examples in MtL: (i) the algorithm that presented the best performance on the dataset (a classification task); (ii) the ranking of the algorithms according to their performance on the dataset (a ranking classification task), where the algorithm with the best performance is top-ranked; and (iii) the performance value obtained by each evaluated algorithm on the dataset (a regression task).

2.2 Complexity measures

The CM were first proposed by Ho and Basu [21] aiming to capture the underlying difficulty of a classification problem. They are measures extracted from a training dataset that characterize aspects such as overlapping of the classes, density of manifolds and shape of the decision boundary. Since various aspects may influence the complexity of a classification dataset, the authors defined a set of 12 measures able to capture different perspectives. Other recent works proposed more measures complementing the initial set of CM [25, 16, 34].

Table 1 presents the CM adopted in this work, which are briefly described next. This table presents the category of each measure, the name, acronym and asymptotic worst-case computational cost. All measures are computed from the training dataset $T$, which contains $n$ examples $\mathbf{x}_{i}$, each described by $m$ predictive features and labeled with one class $y_{i}\in\{1,2,\ldots,n_{c}\}$. Most of the measures are computed at a quadratic cost in the number of examples in the dataset, which can be costly for large datasets. All measures assume values in the interval $[0,1]$.

Table 1
Characteristics of the complexity measures

Category	Name	Acronym	Asymptotic cost
Feature-based	Maximum Fisher’s discriminant ratio	F1	$O(m\cdot n)$
	Directional vector maximum Fisher’s discriminant ratio	F1v	$O(m\cdot n\cdot n_{c}+m^{3}\cdot n_{c}^{2})$
	Volume of overlapping region	F2	$O(m\cdot n\cdot n_{c})$
	Maximum individual feature efficiency	F3	$O(m\cdot n\cdot n_{c})$
	Collective feature efficiency	F4	$O(m^{2}\cdot n\cdot n_{c})$
Linearity	Sum of the error distance by linear programming	L1	$O(n^{2})$
	Error rate of linear classifier	L2	$O(n^{2})$
	Non linearity of linear classifier	L3	$O(n^{2}+m\cdot l\cdot n_{c})$
Neighborhood	Fraction of borderline points	N1	$O(m\cdot n^{2})$
	Ratio of intra/extra class NN distance	N2	$O(m\cdot n^{2})$
	Error rate of NN classifier	N3	$O(m\cdot n^{2})$
	Non linearity of NN classifier	N4	$O(m\cdot n^{2}+m\cdot l\cdot n)$
	Fraction of hyperspheres covering data	T1	$O(m\cdot n^{2})$
	Local set average cardinality	LSC	$O(m\cdot n^{2})$
Network	Density	Density	$O(m\cdot n^{2})$
	Clustering Coefficient	ClsCoef	$O(m\cdot n^{2})$
	Hubs	Hubs	$O(m\cdot n^{2})$

Some measures are defined only for binary classification problems. To measure the complexity of a multiclass dataset, the original classes are first decomposed using an OVO (one-versus-one) or pairwise strategy, producing $\frac{n_{c}(n_{c}-1)}{2}$ subproblems [26]. The final measure value is the average of the values obtained for all binary subproblems. A full description of the CM can be found in Lorena et al. [27].
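
The one-versus-one decomposition can be sketched as follows. The function name `ovo_average` and the `binary_measure` callable are hypothetical placeholders for any measure defined on two classes; this is an illustration of the averaging step, not the ECoL implementation.

```python
from itertools import combinations


def ovo_average(X, y, binary_measure):
    """Average a two-class measure over all n_c * (n_c - 1) / 2 class pairs.

    X is a list of feature rows, y the list of class labels, and
    binary_measure(X_sub, y_sub) any function defined for two classes.
    """
    classes = sorted(set(y))
    values = []
    for a, b in combinations(classes, 2):
        # Keep only the examples belonging to the current pair of classes.
        idx = [i for i, label in enumerate(y) if label in (a, b)]
        values.append(binary_measure([X[i] for i in idx], [y[i] for i in idx]))
    return sum(values) / len(values)
```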

2.2.1 Feature-based measures

Most of the measures in this category consider a classification problem simpler if it has at least one feature that discriminates the classes perfectly.

Maximum Fisher’s discriminant ratio (F1): considers the discrimination ability of the features according to the Fisher’s discriminant criterion [30]:

$\displaystyle r_{f_{i}}=\frac{\sum_{j=1}^{n_{c}}n_{c_{j}}\left(\mu_{c_{j}}^{f_{i}}-\mu^{f_{i}}\right)^{2}}{\sum_{j=1}^{n_{c}}\sum_{l=1}^{n_{c_{j}}}\left(x_{li}^{j}-\mu_{c_{j}}^{f_{i}}\right)^{2}},$ (1)

where $f_{i}$ is a particular feature, $n_{c_{j}}$ is the number of examples in class $c_{j}$, $\mu_{c_{j}}^{f_{i}}$ is the mean of feature $f_{i}$ across examples of class $c_{j}$, $\mu^{f_{i}}$ is the mean of the $f_{i}$ values for all examples, and $x_{li}^{j}$ denotes the individual value of the feature $f_{i}$ for an example from class $c_{j}$. F1 takes the maximum of the $r_{f_{i}}$ values found, which corresponds to the feature with the highest discrimination ability in the dataset. In this work, we use the following formulation:

$\displaystyle F1=\frac{1}{1+\max_{i=1}^{m}r_{f_{i}}},$ (2)
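
A minimal sketch of Eqs. (1) and (2) is given below. The function names are hypothetical, and the handling of a zero within-class scatter (returning an infinite ratio when the class means differ, zero otherwise) is an assumption that the text does not specify.

```python
def fisher_ratio(values, labels):
    """Per-feature discriminant ratio r_{f_i} of Eq. (1)."""
    overall_mean = sum(values) / len(values)
    between = within = 0.0
    for c in set(labels):
        members = [v for v, l in zip(values, labels) if l == c]
        class_mean = sum(members) / len(members)
        between += len(members) * (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in members)
    if within == 0.0:
        # Assumption: no within-class scatter means perfect separation
        # whenever the class means differ.
        return float("inf") if between > 0.0 else 0.0
    return between / within


def f1(X, y):
    """F1 of Eq. (2): values near 0 indicate a highly discriminative feature."""
    ratios = [fisher_ratio([row[i] for row in X], y) for i in range(len(X[0]))]
    return 1.0 / (1.0 + max(ratios))
```

With the inversion in Eq. (2), a larger discriminant ratio yields a smaller F1, so lower values correspond to simpler problems.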

The Directional-vector Maximum Fisher’s Discriminant Ratio (F1v): F1v searches for a vector which can separate the two classes after the examples have been projected onto it according to a directional Fisher criterion [28]:

$\displaystyle dF=\frac{\mathbf{d}^{t}\mathbf{B}\mathbf{d}}{\mathbf{d}^{t}\mathbf{W}\mathbf{d}},$ (3)

where $\mathbf{d}$ is the directional vector onto which data are projected for maximizing class separation, $\mathbf{B}$ is the between-class scatter matrix and $\mathbf{W}$ is the within-class scatter matrix. Afterwards, F1v is given by:

$\displaystyle\textit{F1v}=\frac{1}{1+dF}$ (4)

Volume of Overlapping Region (F2): F2 calculates the overlap of the distributions of the feature values within the classes. It takes the range of the overlapping interval, normalized by the range of the values in both classes, as shown in Eq. (5) [49].

$\displaystyle F2=\prod_{i=1}^{m}\frac{\max\{0,\min\max(f_{i})-\max\min(f_{i})\}}{\max\max(f_{i})-\min\min(f_{i})},$ (5)

where:

$\displaystyle\min\max(f_{i})=\min\left(\max(f_{i}^{c_{1}}),\max(f_{i}^{c_{2}})\right),$ (6)

$\displaystyle\max\min(f_{i})=\max\left(\min(f_{i}^{c_{1}}),\min(f_{i}^{c_{2}})\right),$ (7)

$\displaystyle\max\max(f_{i})=\max\left(\max(f_{i}^{c_{1}}),\max(f_{i}^{c_{2}})\right),$ (8)

$\displaystyle\min\min(f_{i})=\min\left(\min(f_{i}^{c_{1}}),\min(f_{i}^{c_{2}})\right).$ (9)

$\max(f_{i}^{c_{j}})$ and $\min(f_{i}^{c_{j}})$ are the maximum and minimum values of each feature in a class $c_{j}$ .
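
The overlap computation for a two-class dataset can be sketched as below. The function name is hypothetical. The sketch returns the non-negative product of the per-feature overlap ratios, which keeps the value inside the $[0,1]$ range stated for all measures, and it assumes that no feature is constant (otherwise the denominator would be zero).

```python
def f2(X, y):
    """Product over features of (overlap range) / (total range), two classes."""
    c1, c2 = sorted(set(y))
    product = 1.0
    for i in range(len(X[0])):
        a = [row[i] for row, label in zip(X, y) if label == c1]
        b = [row[i] for row, label in zip(X, y) if label == c2]
        minmax = min(max(a), max(b))  # Eq. (6)
        maxmin = max(min(a), min(b))  # Eq. (7)
        maxmax = max(max(a), max(b))  # Eq. (8)
        minmin = min(min(a), min(b))  # Eq. (9)
        # Assumption: maxmax > minmin, i.e. the feature is not constant.
        product *= max(0.0, minmax - maxmin) / (maxmax - minmin)
    return product
```

A single feature whose class ranges do not overlap drives the whole product to zero, which is why F2 is sensitive to any perfectly separating feature.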

Maximum Individual Feature Efficiency (F3): F3 estimates the efficiency of each feature in separating the classes, and takes the feature with the best efficiency. The efficiency of each feature is given by the ratio between the number of examples that are in an overlapping region and the total number of examples. Based on those concepts, F3 can be estimated as:

$\displaystyle F3=\min_{i=1}^{m}\frac{\sum_{j=1}^{n}{I(x_{ji}>\max\min(f_{i})\wedge x_{ji}<\min\max(f_{i}))}}{n},$ (10)

where the numerator gives the number of examples that are in the overlapping region for feature $f_{i}$ . In Eq. (10), $I$ is the indicator function, which returns 1 if its argument is true and 0 otherwise.
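
Eq. (10) can be sketched for a two-class dataset as follows. The function name is hypothetical, and the strict inequalities are taken directly from the equation as written.

```python
def f3(X, y):
    """Smallest fraction of examples lying in a feature's overlap region."""
    c1, c2 = sorted(set(y))
    n = len(X)
    fractions = []
    for i in range(len(X[0])):
        a = [row[i] for row, label in zip(X, y) if label == c1]
        b = [row[i] for row, label in zip(X, y) if label == c2]
        lower = max(min(a), min(b))  # max min(f_i)
        upper = min(max(a), max(b))  # min max(f_i)
        inside = sum(1 for row in X if lower < row[i] < upper)
        fractions.append(inside / n)
    return min(fractions)
```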

Collective Feature Efficiency (F4): this measure provides an overview of how the features work together, by successively applying the F3 measure [34]. First, the feature with the most discrimination power according to F3 is selected. All examples that can be separated by this feature are disregarded and the previous procedure is repeated: the next feature with the most discrimination ability according to F3 is selected, excluding the examples already discriminated. This procedure is repeated until all the features have been examined and can be stopped whenever no example remains. The final result of F4 is the percentage of remaining examples in the dataset.
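
The successive procedure can be sketched for two classes as below. The function name is hypothetical, and two details are assumptions not stated in the text: ties between features are broken arbitrarily, and the loop stops early if the remaining examples all belong to one class.

```python
def f4(X, y):
    """Fraction of examples left after applying the F3 criterion feature by feature."""
    c1, c2 = sorted(set(y))
    active = list(range(len(X)))       # indices of examples not yet separated
    features = set(range(len(X[0])))   # features not yet examined
    while active and features:
        a_rows = [j for j in active if y[j] == c1]
        b_rows = [j for j in active if y[j] == c2]
        if not a_rows or not b_rows:
            break  # assumption: stop once one class is exhausted
        best = None
        for i in features:
            lower = max(min(X[j][i] for j in a_rows), min(X[j][i] for j in b_rows))
            upper = min(max(X[j][i] for j in a_rows), max(X[j][i] for j in b_rows))
            inside = [j for j in active if lower < X[j][i] < upper]
            if best is None or len(inside) < len(best[1]):
                best = (i, inside)
        # Discard the chosen feature and keep only the still-overlapping examples.
        features.discard(best[0])
        active = best[1]
    return len(active) / len(X)
```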

2.2.2 Measures of linearity

These measures check if the classes are linearly separable. In this case, the problem can be considered simpler than another requiring a non-linear decision boundary. As in [34], to obtain the linear classifier, a linear Support Vector Machine (SVM) [10] is used. The SVM seeks the hyperplane able to discriminate the classes with a maximum margin of separation. In such a process, it minimizes the norm of the hyperplane weight vector and also the training errors, which are modelled by a slack variable $\varepsilon_{i}$ .

Sum of the Error Distance by Linear Programming (L1): this measure considers the sum of the distances of incorrectly classified examples to a linear boundary used in their separation. The distance of the erroneous instances to the decision hyperplane can be assessed by the $\varepsilon_{i}$ values. For correctly classified examples, $\varepsilon_{i}$ will be zero. Otherwise, it indicates the distance of the example to the linear boundary. Here we take L1 as:

$\displaystyle L1=\frac{\sum_{i=1}^{n}\varepsilon_{i}}{n+\sum_{i=1}^{n}\varepsilon_{i}}$ (11)

Error Rate of Linear Classifier (L2): L2 computes the error rate of the linear classifier previously described.

Non-Linearity of a Linear Classifier (L3): L3 uses a methodology proposed by [22] to produce a new dataset. To do so, pairs of examples from the same class are chosen randomly and linearly interpolated. Then, a linear SVM trained on the original data has its error rate measured on the new data points.
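
The interpolation step can be sketched as follows. The function names, the number of generated points, the uniform choice of class and the uniform interpolation coefficient are all assumptions; the `predict` argument stands for any classifier already trained on the original data, such as the linear SVM mentioned above.

```python
import random


def interpolate_same_class(X, y, n_new, seed=0):
    """Create n_new points by linear interpolation of random same-class pairs."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    labels = sorted(by_class)
    new_X, new_y = [], []
    for _ in range(n_new):
        label = rng.choice(labels)
        a, b = rng.choice(by_class[label]), rng.choice(by_class[label])
        t = rng.random()  # interpolation coefficient in [0, 1)
        new_X.append([u + t * (v - u) for u, v in zip(a, b)])
        new_y.append(label)
    return new_X, new_y


def error_rate(predict, X, y):
    """Fraction of points for which predict(row) differs from the label."""
    return sum(predict(row) != label for row, label in zip(X, y)) / len(y)
```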

2.2.3 Neighborhood measures

These measures analyze the neighborhood of the data points in order to capture the shape of the decision boundary, the overlapping of the classes and their internal structure.

Fraction of Borderline Points (N1): first a Minimum Spanning Tree (MST) is built from the data, in which each vertex corresponds to an example and the edges are weighted according to the distance between them. N1 is given by the percentage of vertices incident to edges connecting examples of different classes in the MST.
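
A minimal sketch of N1 is shown below, building the MST with Prim's algorithm in $O(n^{2})$ distance evaluations. The function name is hypothetical and the Euclidean distance is an assumption, since the text does not fix the distance function.

```python
import math


def n1(X, y):
    """Fraction of vertices touching an MST edge that joins different classes."""
    n = len(X)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    parent = [-1] * n
    best_dist[0] = 0.0
    borderline = set()
    for _ in range(n):
        # Pick the closest vertex that is not yet part of the tree.
        u = min((j for j in range(n) if not in_tree[j]), key=best_dist.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0 and y[u] != y[parent[u]]:
            borderline.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(X[u], X[v])
                if d < best_dist[v]:
                    best_dist[v], parent[v] = d, u
    return len(borderline) / n
```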

Ratio of Intra/Extra Class Nearest Neighbor Distance (N2): N2 considers the ratio of two sums: (i) intra-class, that is, the sum of the distances between each example and its closest neighbor from the same class; and (ii) extra-class, which is the sum of the distances between each example and its closest neighbor from another class (aka the nearest enemy). Here the following equation is taken:

$\displaystyle N2=\frac{\sum_{i=1}^{n}{d(\mathbf{x}_{i},nn(\mathbf{x}_{i}))}}{\sum_{i=1}^{n}{d(\mathbf{x}_{i},nn(\mathbf{x}_{i}))}+\sum_{i=1}^{n}{d(\mathbf{x}_{i},ne(\mathbf{x}_{i}))}},$ (12)

where $d(\mathbf{x_{i}},nn(\mathbf{x}_{i}))$ corresponds to the intra-class distance of each example $\mathbf{x}_{i}$ and $d(\mathbf{x_{i}},ne(\mathbf{x}_{i}))$ denotes the extra-class distance of $\mathbf{x}_{i}$ .
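
Eq. (12) can be sketched as below. The function name is hypothetical, the Euclidean distance is an assumption, and the sketch requires every class to contain at least two examples so that a same-class neighbor exists.

```python
import math


def n2(X, y):
    """Intra-class distance sum divided by the intra plus extra-class sums."""
    intra = extra = 0.0
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Closest neighbor from the same class, excluding the example itself.
        intra += min(
            math.dist(xi, xj)
            for j, (xj, yj) in enumerate(zip(X, y))
            if j != i and yj == yi
        )
        # Closest neighbor from another class (the nearest enemy).
        extra += min(math.dist(xi, xj) for xj, yj in zip(X, y) if yj != yi)
    return intra / (intra + extra)
```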

Error Rate of the Nearest Neighbor Classifier (N3): N3 takes the error rate of a 1-nearest neighbor (1NN) classifier, estimated using a leave-one-out procedure in the training dataset.
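
The leave-one-out estimate can be sketched as follows. The function name is hypothetical, the Euclidean distance is an assumption, and ties between equally distant neighbors are resolved by taking the first one found.

```python
import math


def n3(X, y):
    """Leave-one-out error rate of a 1-nearest-neighbor classifier."""
    errors = 0
    for i, xi in enumerate(X):
        # Nearest neighbor among all other examples.
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        errors += y[nearest] != y[i]
    return errors / len(X)
```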

Non-Linearity of the Nearest Neighbor Classifier (N4): N4 is similar to L3, but uses the NN classifier instead of the linear predictor to obtain the error rates in the new interpolated dataset.

Fraction of Hyperspheres Covering Data (T1): T1 builds hyperspheres centered at each one of the examples and their radius is progressively increased until the hypersphere reaches a hypersphere of another class. Smaller hyperspheres contained in larger hyperspheres are eliminated and T1 is the ratio between the number of the remaining hyperspheres and the total number of examples in the dataset.

Local Set Average Cardinality (LSC): The Local-Set (LS) of an example $\mathbf{x}_{i}$ is defined in [25] as the set of points from the training dataset $T$ whose distance to $\mathbf{x}_{i}$ is smaller than the distance from $\mathbf{x}_{i}$ to $\mathbf{x}_{i}$ ’s nearest enemy $ne(\mathbf{x}_{i})$ :

$\displaystyle\textit{LS}(\mathbf{x}_{i})=\{\mathbf{x}_{j}|d(\mathbf{x}_{i},\mathbf{x}_{j})<d(\mathbf{x}_{i},ne(\mathbf{x}_{i}))\}.$ (13)

The cardinality of the LS of an example is indicative of how close it is to the decision boundary and of the narrowness of the gap between the classes. The local set average cardinality measure (LSC) is calculated here as:

$\displaystyle\textit{LSC}=1-\frac{1}{n^{2}}\sum_{i=1}^{n}{|\textit{LS}(\mathbf{x}_{i})|}.$ (14)
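
Eqs. (13) and (14) can be sketched as below. The function name is hypothetical and the Euclidean distance is an assumption. Following Eq. (13) as written, each example counts as a member of its own local set, because its distance to itself is zero.

```python
import math


def lsc(X, y):
    """One minus the average local-set cardinality, normalized by n squared."""
    n = len(X)
    total = 0
    for xi, yi in zip(X, y):
        # Distance from x_i to its nearest enemy.
        enemy = min(math.dist(xi, xj) for xj, yj in zip(X, y) if yj != yi)
        # Local set: all points strictly closer than the nearest enemy.
        total += sum(1 for xj in X if math.dist(xi, xj) < enemy)
    return 1.0 - total / n ** 2
```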

2.2.4 Network-based measures

Morais and Prati [32] and Garcia et al. [16] propose to capture structural information from a dataset by modelling it as a graph. Using this representation, the following measures are based on statistical characterization of complex networks [24].

The $\epsilon$ -NN method for building a graph from a dataset in the attribute-value format is used [52], with $\epsilon$ value equal to 0.15. Next, as in [15], a post-processing step is applied to the graph, pruning edges between examples of different classes. Let $G=(V,E)$ denote the graph built by this process, with $|V|=n$ and $0\leqslant|E|\leqslant\frac{n(n-1)}{2}$ . The $i$ -th vertex of the graph will be denoted as $v_{i}$ and an edge between two vertices $v_{i}$ and $v_{j}$ is denoted as $e_{ij}$ .

Average density of the network (Density): This measure takes the number of edges that are retained in the graph, normalized by the maximum number of edges that could be formed between pairs of the $n$ data points.

$\displaystyle\textit{Density}=1-\frac{2|E|}{n(n-1)}$ (15)
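
The graph construction and Eq. (15) can be sketched as follows. The function names are hypothetical. The sketch assumes that the features are already scaled so that the threshold of 0.15 is meaningful, and it uses the Euclidean distance, which the text does not specify.

```python
import math


def build_graph(X, y, eps=0.15):
    """Edges between same-class examples whose distance is below eps."""
    n = len(X)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        # Edges between different classes are pruned, as in the post-processing step.
        if y[i] == y[j] and math.dist(X[i], X[j]) < eps
    }


def density(X, y, eps=0.15):
    """One minus the fraction of possible edges retained in the graph."""
    n = len(X)
    return 1.0 - 2 * len(build_graph(X, y, eps)) / (n * (n - 1))
```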

Clustering coefficient (ClsCoef): The clustering coefficient of a vertex $v_{i}$ is the ratio of the number of edges between its neighbors and the maximum number of edges that could possibly exist between them.

$\displaystyle\textit{ClsCoef}=1-\frac{1}{n}\sum_{i=1}^{n}{\frac{2|e_{jk}:v_{j},v_{k}\in N_{i}|}{k_{i}(k_{i}-1)}},$ (16)

where $N_{i}=\{v_{j}:e_{ij}\in E\}$ is the neighborhood of a vertex $v_{i}$ (nodes directly connected to $v_{i}$) and $k_{i}$ is the size of $N_{i}$. The clustering coefficient assesses the grouping tendency of the vertices.
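
Eq. (16) can be sketched on an edge set given as pairs of vertex indices. The function name is hypothetical, and treating vertices with fewer than two neighbors as contributing zero is an assumption, since the ratio is undefined for them.

```python
def cls_coef(n, edges):
    """One minus the average clustering coefficient of an undirected graph."""
    neighbors = {i: set() for i in range(n)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    total = 0.0
    for i in range(n):
        k = len(neighbors[i])
        if k < 2:
            continue  # assumption: undefined coefficient counted as 0
        # Number of edges that exist between the neighbors of v_i.
        links = sum(
            1
            for a in neighbors[i]
            for b in neighbors[i]
            if a < b and b in neighbors[a]
        )
        total += 2 * links / (k * (k - 1))
    return 1.0 - total / n
```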

Hub score (Hubs): The hub score of a node is given by the number of connections it has to other nodes, weighted by the number of connections these neighbors have.

$\displaystyle\textit{Hubs}=1-\frac{1}{n}\sum_{i=1}^{n}\textit{hub}(v_{i})$ (17)

The values of $\textit{hub}(v_{i})$ are given by the principal eigenvector of $A^{t}A$ , where $A$ is the adjacency matrix of the graph.

3. Methodology

The methodology adopted in the experiments performed for this study is summarized in Fig. 1. First, the STD meta-features are extracted from the datasets to construct meta-models able to obtain the PCM. In this step, the meta-examples are labeled according to the values of the CM. Afterward, the STD, the PCM, and the CM meta-features are extracted from the datasets and used in a new MtL experiment in order to predict the performance of some classifiers. This step is called stacking. Finally, the execution time required to extract each set of meta-features is compared with the others.

Figure 1.

Evaluation methodology followed in the experiments.

For the induction of the meta-regressors in the prediction step, left side of Fig. 1, three regression algorithms with distinct biases are used: Random Forests (RF) [20, 6] with 500 DTs, Support Vector Regressors (SVR) [10] with radial basis kernel and Distance Weighted $k$-Nearest Neighbor (DWNN) [29] with Gaussian weighting. We obtained the RMSE using 10-fold cross-validation. We also extracted the Pearson correlation of the PCM and the CM values. Two baselines are used in this step: (i) DF (default), which corresponds to predicting the average CM values in the meta-base; and (ii) RD (random), which predicts a random CM value among those registered in the training meta-base. For the final analysis, the STD meta-features are ranked according to their importance to the regression meta-models generated.
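
The two baselines and the RMSE can be sketched as follows. The function names and the fixed random seed are hypothetical choices made for illustration.

```python
import math
import random


def rmse(true_values, predictions):
    """Root mean squared error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def default_baseline(train_targets, n_test):
    """DF: always predict the average CM value seen in training."""
    mean = sum(train_targets) / len(train_targets)
    return [mean] * n_test


def random_baseline(train_targets, n_test, seed=0):
    """RD: predict a CM value drawn at random from the training targets."""
    rng = random.Random(seed)
    return [rng.choice(train_targets) for _ in range(n_test)]
```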

On the stacking side of Fig. 1, the different sets of meta-features and measures are used as input to the regression algorithms that will induce the meta-regressors, which will be used to predict the predictive performance of a pool of classifiers induced by supervised ML algorithms. The ML algorithms are: Artificial Neural Networks (ANN) [19] trained with backpropagation (learning rate of 0.3, momentum of 0.5 and one hidden layer); C4.5 algorithm [40] with pruning; Support Vector Machine (SVM) [10] with radial basis kernel; RF with 500 DTs; and 3-NN classifier. These algorithms were chosen because of their different inductive biases. As in the first experiment, the regression algorithms used to induce the meta-regressors are DWNN, RF and SVR. We obtained the RMSE using the same 10-fold cross-validation folds from the previous task, such that the independence of the models is maintained.

One additional analysis in the stacking level is the trade-off between the computational cost of obtaining each set of meta-features and the cost of evaluating all classifiers in a cross-validation setup. To this end, we ran these alternatives on 100 randomly selected datasets, in a single thread, on a cluster node with two Intel Xeon E5-2680v2 processors and 128 GB of DDR3 memory. These datasets were selected from 400 benchmark classification datasets from the OpenML repository [50]. They represent diverse application contexts and domains, and were limited to at most 10,000 examples, 500 features and 10 classes. None of these datasets has missing values.

The STD meta-features were extracted using the mfe package [45], whereas the CM were extracted using the ECoL package [27]. Additional details of the implementation can be found at the GitHub site (https://github.com/lpfgarcia/mock), including the benchmark classification datasets, the meta-base and the reference to the SCoL package, a library for extracting the PCM values.

4. Experimental results

This section presents the results from the previously described experimental setups. Section 4.1 reports the predictive performance of the meta-models in the CM prediction task. Section 4.2 assesses the PCM values in an MtL stacking approach and Section 4.3 compares the running cost of using CM against the cost of using PCM.

4.1 Prediction of the complexity measure values

Figure 2 shows the average RMSE of the meta-regressors in the CM prediction task, estimated using 10-fold cross-validation. Each plot represents one CM and the boxplots summarize the RMSE values of each meta-regressor. The last plot summarizes the average RMSE values for all of the CM. The boxplots for the results of the meta-regressors induced by the three regression algorithms (DWNN, RF and SVR) are shown in white, whereas for the baselines (RD and DF) they are colored in gray.

Figure 2.

The RMSE of each meta-regressor to predict the CM values.

According to the previous plots, the meta-regressors outperformed the baselines, with better predictive performance in all cases. This suggests that the combination of STD meta-features was able to capture the necessary knowledge to represent the CM. Furthermore, the meta-regressors induced by DWNN, RF and SVR presented a more stable behavior than the baselines, which have elongated boxplots and larger standard deviations. DF is a stricter baseline, since it uses the average of the CM values, whilst RD uses random values. Among the meta-regressors, in most of the cases, RF performed better, followed by SVR and DWNN. This information can be confirmed in the last plot, which takes the average performance for all CM values.

In order to analyze if there is a direct relationship between the CM and the PCM values, we calculated the Pearson’s correlation between them. This evaluation identifies the meta-regressor models that are more sensitive to the variation of the CM values. Figure 3 shows a heatmap of the ordered correlation between CM and PCM. Each column and row corresponds to the CM value and the PCM value predicted by the corresponding meta-regressor, respectively. Each box is colored according to the ranking of the meta-regressors, from white (highest ranking) to gray (lowest ranking). The correlation values are also shown inside the heatmap’s cells.

Figure 3. Heatmap of the correlation between the CM and PCM.

As expected, the PCM values obtained by the meta-regressors induced by DWNN, RF and SVR are highly correlated with the CM values. Moreover, RF performed better than SVR and DWNN, achieving correlation values close to 0.9 in most of the cases. The baselines presented a very low correlation and, in some cases, negative correlation values. The general results from Figs 2 and 3 are very similar, except for the F3 and ClsCoef measures. While in the boxplots the SVR algorithm had a better performance for those measures, the RF algorithm had a higher correlation in the heatmap. The explanation for this divergence is the low standard deviation of the RF predictions compared to those of the SVR meta-models. It is also worth noting that the prediction of the linearity measure L1 had the worst correlation, with values around 0.7. This is probably explained by the more complex concept involved in this particular measure, which takes into account the distance of incorrectly predicted examples to the linear boundary used to separate them.
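RMSE and Pearson's correlation can rank two predictors differently, which is the kind of divergence noted above for F3 and ClsCoef. The toy sketch below uses synthetic values and two hypothetical predictors: one compresses its predictions towards the mean (low spread, perfectly correlated), the other tracks the scale of the target but with added noise. It illustrates the general effect only and does not reproduce the actual RF and SVR predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
cm = rng.uniform(0, 1, size=500)  # synthetic "true" complexity values

# Predictor A: predictions shrunk towards the mean (low standard deviation).
pcm_a = cm.mean() + 0.3 * (cm - cm.mean())
# Predictor B: correct scale, but with independent noise added.
pcm_b = cm + rng.normal(scale=0.15, size=cm.size)


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def pearson(y_true, y_pred):
    return float(np.corrcoef(y_true, y_pred)[0, 1])


rmse_a, rmse_b = rmse(cm, pcm_a), rmse(cm, pcm_b)
corr_a, corr_b = pearson(cm, pcm_a), pearson(cm, pcm_b)

print(f"A: RMSE={rmse_a:.3f}  r={corr_a:.3f}  std={pcm_a.std():.3f}")
print(f"B: RMSE={rmse_b:.3f}  r={corr_b:.3f}  std={pcm_b.std():.3f}")
```

Predictor A has the higher correlation and predictor B the lower RMSE, because Pearson's correlation is invariant to the scale of the predictions whereas RMSE is not.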

In order to assess how the STD meta-features contribute to the prediction of the PCM values, Fig. 4 shows the 15 top-ranked STD meta-features selected by RF in all runs. We chose RF because it has the lowest RMSE in Fig. 2 and the highest Pearson’s correlation values in Fig. 3. In this figure, the $x$-axis represents the STD meta-features and the $y$-axis shows the average ranking of their Gini index in the RF meta-models. The information-theoretic measures are shown by dots, the landmarking measures by triangles, the model-based measures by squares and the statistical measures by crosses.

Figure 4. Top-ranked STD meta-features selected by the RF meta-regressor.

The STD meta-features regarded as the most important are those based on landmarking and information theory. The landmarking measures are mainly related to the performance of simple meta-models induced by the $k$-NN, the Naive Bayes (NB) and the Classification And Regression Tree (CART) algorithms under certain conditions to decrease the execution time. Since some CM use $k$-NN to estimate their value, the presence of the $k$-NN landmarks among the top-ranked meta-features was expected. The information-theoretic measures highlighted are: the equivalent number of attributes, noisiness of attributes, mutual information, class joint entropy and the concentration coefficient for each pair of attributes. The model-based measures selected are related to the proportion of training instances per leaf of the DT model, the number of nodes of the DT model per number of instances and the number of nodes per attribute. From the statistical group, the canonical correlation between the predictive attributes and the class is present.
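An average importance ranking of this kind can be sketched as below. The sketch makes several assumptions: it uses scikit-learn's impurity-based `feature_importances_` (for regression forests this is a variance-reduction importance, used here as a stand-in for the Gini-based ranking in the text), the runs differ only in their random seed, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def average_importance_rank(X, y, n_runs=5):
    """Rank features by RF importance in each run (1 = most important)
    and return the mean rank of each feature over the runs."""
    ranks = []
    for seed in range(n_runs):
        rf = RandomForestRegressor(n_estimators=100, random_state=seed)
        rf.fit(X, y)
        order = np.argsort(-rf.feature_importances_)  # best feature first
        rank = np.empty(len(order), dtype=float)
        rank[order] = np.arange(1, len(order) + 1)
        ranks.append(rank)
    return np.mean(ranks, axis=0)


# Synthetic meta-base: only the first two meta-features drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)

avg_rank = average_importance_rank(X, y)
top = np.argsort(avg_rank)[:2]
print("average rank per feature:", np.round(avg_rank, 2))
print("two top-ranked features:", top)
```

Averaging ranks rather than raw importances keeps runs comparable even when the absolute importance values differ between meta-models.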

4.2 Stacking the complexity measures

Since the previous meta-regressors can predict all the CM with high predictive performance, in this section we assess their use in a MtL task, namely estimating the performance of a set of classifiers. To this end, different groups of meta-features, including the STD meta-features, the CM and the PCM, are used independently or combined. The PCM used are those generated by the best meta-model (the RF) previously induced. Ideally, the PCM will be able to either maintain or exceed the performance achieved by the CM.
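The two-level setup can be sketched as follows. This is an assumption-laden illustration rather than the paper's pipeline: the meta-base is synthetic, only two hypothetical complexity measures are simulated, the PCM are obtained as out-of-fold RF predictions via `cross_val_predict` (the paper does not detail how the level-one predictions were produced), and a single RF is used as the level-two meta-regressor.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.default_rng(3)
n = 150
std_feats = rng.normal(size=(n, 8))  # synthetic STD meta-features

# Two synthetic complexity measures that depend on the STD meta-features.
cm = np.column_stack([
    np.tanh(std_feats[:, 0] + std_feats[:, 1]),
    std_feats[:, 2] ** 2,
]) + rng.normal(scale=0.05, size=(n, 2))

# Synthetic classifier accuracy that depends on the complexity measures.
accuracy = np.clip(
    0.8 - 0.1 * cm[:, 0] - 0.05 * cm[:, 1] + rng.normal(scale=0.01, size=n),
    0.0, 1.0,
)

# Level 1: one RF per CM; out-of-fold predictions play the role of the PCM.
pcm = np.column_stack([
    cross_val_predict(
        RandomForestRegressor(n_estimators=50, random_state=0),
        std_feats, cm[:, j], cv=10,
    )
    for j in range(cm.shape[1])
])


def level2_rmse(features):
    """10-fold RMSE of an RF that predicts accuracy from the given features."""
    scores = cross_val_score(
        RandomForestRegressor(n_estimators=50, random_state=0),
        features, accuracy, cv=10, scoring="neg_root_mean_squared_error",
    )
    return float(-scores.mean())


# Level 2: compare groups of meta-features, alone and combined.
rmse_by_group = {
    "STD": level2_rmse(std_feats),
    "CM": level2_rmse(cm),
    "PCM": level2_rmse(pcm),
    "STD+PCM": level2_rmse(np.hstack([std_feats, pcm])),
}
for group, value in rmse_by_group.items():
    print(f"{group}: {value:.4f}")
```

Using out-of-fold predictions at level one avoids feeding the level-two model PCM values that were fitted on the same datasets they describe.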

Figure 5 shows the performance of each meta-regressor (DWNN, RF and SVR) when different groups of meta-features and measures are used to predict the accuracy of five classifiers (ANN, C4.5, SVM, RF and 3-NN). Each plot represents the performance of one meta-regressor to predict the accuracy of one classifier and the boxplots summarize the RMSE values of each group of meta-features and measures estimated by a 10-fold cross-validation process. The baseline (RD and DF) performances are not shown in that figure because they have high RMSE values for all the classifiers.

Figure 5. The RMSE of each group of meta-features to predict the performance of the classifiers.

The RMSE, although similar across the groups of meta-features and measures, was lower for the SVR and RF algorithms. This indicates that they are more accurate in predicting the performance of the classifiers. Regarding the groups of meta-features and measures, the PCM provided the best input variables for the DWNN and SVR meta-regressors for all the classifiers, and an intermediate performance for RF when used individually. When PCM or CM were combined with the STD meta-features, they had the best performance for RF. This indicates that the PCM, even when combined with the STD meta-features, can improve the dataset descriptions for the MtL tasks.

A Friedman statistical test [11] at the 95% confidence level was applied to compare the predictive performance of the groups of meta-features and measures. Statistical differences between the PCM, CM and STD were found in some cases. For the DWNN and SVR meta-regressors, the PCM performed better than STD for 7 of 10 meta-bases. For the RF meta-regressor, there are no statistical differences between CM, PCM and STD. In general, the STD was never better than PCM and no difference was detected between PCM and CM. This indicates that the use of PCM can improve the performance for a vast group of MtL tasks.
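A Friedman test over three groups of meta-features can be run as in the sketch below. It uses `scipy.stats.friedmanchisquare` on randomly generated RMSE values (ten hypothetical blocks in which STD is slightly worse and CM and PCM are similar). The numbers are synthetic placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(7)
n_blocks = 10  # e.g. one block per cross-validation fold (assumption)

# Synthetic RMSE values: a shared per-block difficulty plus a group effect.
base = rng.uniform(0.05, 0.15, size=n_blocks)
rmse_std = base + 0.03 + rng.normal(scale=0.002, size=n_blocks)
rmse_cm = base + rng.normal(scale=0.002, size=n_blocks)
rmse_pcm = base + rng.normal(scale=0.002, size=n_blocks)

# The test ranks the three groups within each block and compares rank sums.
stat, p_value = friedmanchisquare(rmse_std, rmse_cm, rmse_pcm)
reject_null = p_value < 0.05  # 95% confidence level

print(f"Friedman statistic = {stat:.2f}, p-value = {p_value:.5f}")
print("groups differ" if reject_null else "no difference detected")
```

The Friedman test is an omnibus test: a rejection only says that at least one group differs, so a post-hoc procedure such as the Nemenyi test discussed in [11] is needed to identify which pairs differ.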

4.3 Evaluating the runtime execution

In MtL scenarios, the trade-off between the runtime of the characterization process and the evaluation of all alternative data modeling algorithms considered to solve the task under study should favor the former. Otherwise, the trial-and-error approach would be preferable. Therefore, to evaluate this trade-off for the proposed approach, we performed a runtime analysis of the MtL stacking level. This analysis exhaustively compares:

• The time taken for extracting the CM with the time taken for extracting the STD meta-features and running all classification algorithms.
• The time taken for extracting the PCM (which sums up the time for extracting the STD meta-features and running the RF meta-model) with the time taken for extracting the STD meta-features and running all classification algorithms.

These comparisons are shown in Fig. 6. To improve the visualization, the time is presented on a log-scale. Each point represents a dataset and the diagonal line indicates when both times are similar. Values above the diagonal indicate that the strategy represented in the $y$-axis spent more time to be computed than the strategy from the $x$-axis, while values below that line indicate the opposite.

Figure 6. Runtime of the CM and PCM.

According to this figure, it was clearly faster to compute the PCM than the CM. The extraction of the CM took longer than the extraction of the STD meta-features in 94% of the cases, and longer than training the classifiers in 24% of the cases. The main bottleneck of the CM is related to large numbers of examples, which add overhead to the computation of the neighborhood measures. Meanwhile, the PCM presented a lower runtime in both cases. Compared to the STD meta-features, the execution time of the PCM is similar, since computing the PCM requires extracting the STD meta-features and obtaining the predictions of the RF meta-regressor. Compared to running all of the classifiers, the PCM had a higher runtime for only 4% of the datasets. Additionally, it is important to remember that in some cases the use of PCM improved the predictive performance when compared to the use of the CM and STD meta-features in the stacking experiments. Therefore, the PCM are suitable substitutes for the CM and can be computed at a lower computational cost.
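The percentages above correspond to the share of points lying above the diagonal in each plot. A minimal sketch of that computation is given below. All timings are made-up illustrative values in seconds, and the helper name `fraction_slower` is hypothetical.

```python
import numpy as np


def fraction_slower(time_y, time_x):
    """Share of datasets for which strategy y took longer than strategy x,
    i.e. the share of points above the diagonal of the scatter plot."""
    time_y = np.asarray(time_y, dtype=float)
    time_x = np.asarray(time_x, dtype=float)
    return float(np.mean(time_y > time_x))


# Illustrative per-dataset timings (seconds); not measurements from the paper.
std_time = np.array([2.0, 10.0, 1.0, 5.0])        # extracting STD meta-features
cm_time = np.array([12.0, 300.0, 4.5, 80.0])      # extracting the CM
classifiers_time = np.array([30.0, 120.0, 9.0, 200.0])  # running all classifiers
rf_predict_time = 0.1                             # RF meta-model prediction

# PCM cost = STD extraction + prediction with the RF meta-model.
pcm_time = std_time + rf_predict_time

cm_share = fraction_slower(cm_time, classifiers_time)
pcm_share = fraction_slower(pcm_time, classifiers_time)
print(f"CM slower than running the classifiers:  {cm_share:.0%}")
print(f"PCM slower than running the classifiers: {pcm_share:.0%}")
```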
5. Conclusion

Originally proposed to estimate the expected difficulty of a classification problem by extracting descriptions of the overlap between classes, the separability and distribution of the data points and certain structural characteristics of the problem, the CM are a good alternative for characterizing datasets in MtL tasks. However, the CM have a high asymptotic computational complexity, which prevents their widespread use in MtL. This paper presented a study on how to estimate the CM values at a reduced computational cost, which is achieved by inducing meta-models able to predict the CM values by using STD meta-features, which have a lower computational cost. The results also indicate that the PCM values obtained are similar to the CM values and can be obtained at a much lower computational cost. Moreover, when the PCM are contrasted with different groups of meta-features in a MtL task, they can improve the characterization of the datasets.

To investigate this issue, a meta-base consisting of several classification datasets was created. Each dataset was described by STD meta-features and labeled according to each CM value. The meta-regressors DWNN, RF and SVR were used to predict the CM values and had their performance compared to two baseline recommenders. The experimental results showed that the RF meta-models were able to predict the CM with high predictive performance. Additional analysis indicates that the landmarking and information-theoretic measures were the essential STD meta-features to predict the CM.

To validate the PCM values, the groups of meta-features and their runtime were analyzed for a MtL regression task at the stacking level. The groups of meta-features and measures used were the STD meta-features, the CM and the PCM. The MtL task was to predict the performance of distinct classifiers. In some cases, the PCM outperformed the CM. Moreover, the time required to obtain the PCM values is largely reduced compared to that of obtaining the CM values.

Future work shall look for the PCM that best distinguish the performances of classifiers and increase the interpretability of the meta-models' results. We would also like to: (i) optimize the PCM; (ii) evaluate other MtL approaches such as ranking the classifiers; (iii) investigate hyperparameter tuning for the classification algorithms; and (iv) study further in which cases the classifier recommendation outperforms the default classification algorithm.

Acknowledgments

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES), Finance Code 001. The authors would also like to thank the São Paulo Research Foundation (FAPESP), grants 2013/07375-0 (CEPID CeMEAI), 2012/22608-8, 2016/18615-0 and 2018/14819-5, and Intel for the hardware and software server used in part of the experiments.

References

1. Barella V.H., Garcia L.P.F., de Souto M.P., Lorena A.C. and de Carvalho A.C.P.L.F., Data complexity measures for imbalanced classification tasks, In International Joint Conference on Neural Networks (IJCNN), volume 1, 2018, pp. 1–8.

2. Bensusan, Giraud-Carrier and Kennedy, A higher-order approach to meta-learning, Technical report, University of Bristol, 2000.

3. Bensusan and Kalousis, Estimating the predictive accuracy of a classifier, In 12th European Conference on Machine Learning (ECML), volume 2167, 2001, pp. 25–36.

4. Brazdil, Giraud-Carrier, Soares and Vilalta, Metalearning – Applications to Data Mining, Cognitive Technologies. Springer, 1 edition, 2009.

5. Brazdil, Soares and da Costa J.P., Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results, Machine Learning 50(3) (2003), 251–277.

6. Breiman, Random forests, Machine Learning 45(1) (2001), 5–32.

7. Castiello, Castellano and Fanelli A.M., Meta-data: Characterization of input features for meta-learning, In Modeling Decisions for Artificial Intelligence (MDAI), volume 3558, 2005, pp. 457–468.

8. Cavalcanti G.D.C., Ren T.I. and Vale B.A., Data complexity measures and nearest neighbor classifiers: a practical analysis for meta-learning, In 24th International Conference on Tools with Artificial Intelligence (ICTAI), volume 1, 2012, pp. 1065–1069.

9. Kopf C. and Iglezakis, Combination of task description strategies and case base properties for meta-learning, In Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning (IDDM), 2002, pp. 65–76.

10. Cristianini and Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, 2000.

11. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006), 1–30.

12. Dua and Graff, UCI machine learning repository, 2017. http://archive.ics.uci.edu/ml.

13. Filchenkov and Pendryak, Datasets meta-feature description for recommending feature selection algorithm, In Artificial Intelligence and Natural Language and Information Extraction, Social Media and Web Search FRUCT Conference (AINL-ISMW FRUCT), volume 7, 2015, pp. 11–18.

14. Fürnkranz and Petrak, An evaluation of landmarking variants, In Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning (IDDM), 2001, pp. 57–68.

15. Garcia L.P.F., de Carvalho A.C.P.L.F. and Lorena A.C., Effect of label noise in the complexity of classification problems, Neurocomputing 160 (2015), 108–119.

16. Garcia L.P.F., de Carvalho A.C.P.L.F. and Lorena A.C., Noise detection in the meta-learning level, Neurocomputing 176 (2016), 14–25.

17. Garcia L.P.F. and Lorena A.C., ECoL: Complexity measures for classification problems, 2018. https://CRAN.R-project.org/package=ECoL.

18. Garcia L.P.F., Lorena A.C., de Souto M.P. and Ho T.K., Classifier recommendation using data complexity measures, In 24th International Conference on Pattern Recognition (ICPR), volume 1, 2018, pp. 874–879.

19. Haykin, Neural Networks – A Comprehensive Foundation. Prentice Hall, 2 edition, 1999.

20. Ho T.K., The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8) (1998), 832–844.

21. Ho T.K. and Basu, Complexity measures of supervised classification problems, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(3) (2002), 289–300.

22. Hoekstra and Duin R.P.W., On the nonlinearity of pattern classifiers, In 13th International Conference on Pattern Recognition (ICPR), volume 4, 1996, pp. 271–275.

23. Kim S.-W. and Oommen B.J., On using prototype reduction schemes to enhance the computation of volume-based inter-class overlap measures, Pattern Recognition 42(11) (2009), 2695–2704.

24. Kolaczyk E.D., Statistical Analysis of Network Data: Methods and Models, Springer Series in Statistics. Springer, 2009.

25. Leyva, González and Pérez, A set of complexity measures designed for applying meta-learning to instance selection, IEEE Transactions on Knowledge and Data Engineering 27(2) (2014), 354–367.

26. Lorena A.C., de Carvalho A.C.P.L.F. and Gama J.M.P., A review on the combination of binary classifiers in multiclass problems, Artificial Intelligence Review 30(1-4) (2008), 19.

27. Lorena A.C., Garcia L.P.F., Lehmann, de Souto M.P. and Ho T.K., How complex is your classification problem? A survey on measuring classification complexity, ACM Computing Surveys 52(5) (2019), 107:1–107:34.

28. Malina, Two-parameter fisher criterion, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 31(4) (2001), 629–636.

29. Mitchell T.M., Machine Learning, McGraw Hill series in computer science, McGraw Hill, 1997.

30. Mollineda R.A., Sánchez J.S. and Sotoca J.M., Data characterization for effective prototype selection, In 2nd Iberian Conference on Pattern Recognition and Image Analysis, volume 3523, 2005, pp. 27–34.

31. Mollineda R.A., Sánchez J.S. and Sotoca J.M., A meta-learning framework for pattern classification by means of data complexity measures, Inteligencia Artificial 10(29) (2006), 31–38.

32. Morais and Prati R.C., Complex network measures for data set characterization, In 2nd Brazilian Conference on Intelligent Systems (BRACIS), 2013, pp. 12–18.

33. Muñoz M.A., Villanova, Baatar and Smith-Miles, Instance spaces for machine learning classification, Machine Learning 107(1) (2018), 109–147.

34. Orriols-Puig, Maciá and Ho T.K., Documentation for the data complexity library in C++, Technical report, La Salle – Universitat Ramon Llull, 2010.

35. Peng, Flach P.A., Soares and Brazdil, Improved dataset characterisation for meta-learning, In 5th International Conference on Discovery Science (DS), volume 2534, 2002, pp. 141–152.

36. Pfahringer, Bensusan and Giraud-Carrier, Meta-learning by landmarking various learning algorithms, In 17th International Conference on Machine Learning (ICML), 2000, pp. 743–750.

37. Pinto, Soares and Mendes-Moreira, Towards automatic generation of metafeatures, In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2016, pp. 215–226.

38. Prudêncio R.B.C. and Ludermir T.B., Active learning to support the generation of meta-examples, In 17th International Conference on Artificial Neural Networks (ICANN), volume 4668, 2007, pp. 817–826.

39. Prudêncio R.B.C., Soares and Ludermir T.B., Uncertainty sampling-based active selection of datasetoids for meta-learning, In 21st International Conference on Artificial Neural Networks (ICANN), volume 6792, 2011, pp. 454–461.

40. Quinlan J.R., Induction of decision trees, Machine Learning 1(1) (1986), 81–106.

41. Reif, A comprehensive dataset for evaluating approaches of various meta-learning tasks, In 1st International Conference on Pattern Recognition Applications and Methods, 2012, pp. 273–276.

42. Reif, Shafait and Dengel, Prediction of classifier training time including parameter optimization, In 34th Annual German Conference on Artificial Intelligence (KI), 2011, pp. 260–271.

43. Reif, Shafait, Goldstein, Breuel and Dengel, Automatic classifier selection for non-experts, Pattern Analysis and Applications 17(1) (2014), 83–96.

44. Rice J.R., The algorithm selection problem, Advances in Computers 15 (1976), 65–118.

45. Rivolli, Garcia L.P.F., Soares, Vanschoren and de Carvalho A.C.P.L.F., Towards reproducible empirical research in meta-learning, eprint arXiv, (1808.10406) (2019), 1–41.

46. Segrera, Pinho and Moreno M.N., Information-theoretic measures for meta-learning, In 3rd Hybrid Artificial Intelligence Systems (HAIS), 2008, pp. 458–465.

47. Smith-Miles K.A., Cross-disciplinary perspectives on meta-learning for algorithm selection, ACM Computing Surveys 41(1) (2008), 1–25.

48. Soares, Petrak and Brazdil, Sampling-based relative landmarks: Systematically test-driving algorithms before choosing, In 10th Portuguese Conference on Artificial Intelligence (EPIA), 2001, pp. 88–95.

49. Souto M.C.P., Lorena A.C., Spolaôr and Costa I.G., Complexity measures of supervised classification tasks: A case study for cancer gene expression data, In International Joint Conference on Neural Networks (IJCNN), 2010, pp. 1352–1358.

50. Vanschoren, van Rijn J.N., Bischl and Torgo, OpenML: networked science in machine learning, SIGKDD Explorations 15(2) (2013), 49–60.

51. Wolpert D.H., Stacked generalization, Neural Networks 5(2) (1992), 241–259.

52. Zhu, Lafferty and Rosenfeld, Semi-supervised learning with graphs, PhD thesis, Carnegie Mellon University, Language Technologies Institute, School of Computer Science, 2005.
