A guidance of data stream characterization for meta-learning

Abstract

The problem of selecting learning algorithms has been studied by the meta-learning community for more than two decades. One of the most important tasks for the success of a meta-learning system is gathering data about the learning process. This data is used to induce a (meta) model able to map characteristics extracted from different data sets to the performance of learning algorithms on these data sets. These systems are built under the assumption that the data are generated by a stationary distribution, i.e., a learning algorithm will perform similarly for new data from the same problem. However, many applications generate data whose characteristics can change over time. Therefore, a suitable bias at a given time may become inappropriate at another time. Although meta-learning has been used to continuously select a learning algorithm in data streams, data characterization has received less attention in this context. In this study, we provide a set of guidelines to support the proposal of characteristics able to describe non-stationary data over time. This guidance considers both the order of arrival of the examples and the type of variables involved in the base-level learning. In addition, we analyze the influence of characteristics regarding their dependence on data morphology. Experimental results using real data streams showed the effectiveness of the proposed general scheme for data characterization in supporting algorithm selection by meta-learning systems. Moreover, the dependent meta-features provided crucial information for the success of some meta-models.

Keywords

Feature extraction, data streams, algorithm selection

1. Introduction

Machine learning algorithms have been successfully employed in many domains. The successful application of these algorithms to a particular problem is largely affected by their suitability to the characteristics of the data at hand. Thus, the selection of the most promising algorithm or model for a new data set has been an important research issue [54, 2, 33, 42]. In this context, meta-learning is one of the main approaches employed for algorithm selection [7, 26, 50]. Meta-learning is concerned with understanding the learning mechanism itself by exploiting the knowledge acquired from previous experience on similar problems in order to predict the behavior of the algorithms in the future.

Usually, meta-learning algorithms are designed and applied to data which are supposed to be generated by an underlying stationary distribution. Thus, a learning algorithm is selected for each data set only once, assuming that this algorithm will perform similarly for new data of the problem considered. However, many dynamic environments generate data streams, which are produced automatically, on a large scale, and are subject to changes in their distribution over time [21]. A common strategy to deal with changes in the data distribution is to keep a model up-to-date with the most recent examples. Nonetheless, this procedure is not sufficient when the bias of the current algorithm becomes inappropriate for the new data. To overcome this issue, the data should be continuously monitored and the predictive performance of the learning system constantly analyzed to select on the fly the most suitable algorithm for the current data. Meta-learning can do so by selecting the most appropriate learning algorithm based on the characteristics extracted from the data or from the learning process itself [51, 22, 27, 37, 65].

A crucial aspect for the success of a meta-learning system is the usefulness of these characteristics. For the purpose of algorithm selection, useful characteristics are those that represent properties of the data that influence algorithm performance. The process to extract suitable measures for the characterization of stationary data sets [42, 13, 25] is not the most convenient for data streams, since it compares different data sets with potentially distinct morphologies, while in data streams the set of attributes typically does not change over time. But even though the attributes are the same, the phenomenon generating the data may change significantly. Moreover, the process of feature extraction is currently performed in an ad-hoc manner, making it difficult for meta-learning practitioners to decide which attributes and relations they should consider among all possibilities. In practice, they generally inspect a limited number of measures using a trial and error method or simply assume that measures successfully employed in the past will be adequate for the present problem.

Therefore, in this paper, we provide a guidance of data stream characterization for meta-learning. The proposed guidelines are expected to provide proper information to support the task of algorithm selection by a meta-learning based system on this type of data. They consider two key aspects of the stream. The first one is the order of arrival of the data, defined according to a sliding window. This is important because it may provide useful information about changes that may occur in the data and allow the system to react to them. The second one is related to the variables involved in the base-level learning, which are divided into predictive and target attributes, besides the model predictions. Moreover, we propose a classification of the characteristics regarding their dependence on data morphology. The directions we provide should not be seen as a precise recipe. Instead, they aim to support the expert user in constructing the meta-features for his/her own problem.

The remainder of this paper is organized as follows. In Section 2, we provide an overview of meta-learning for algorithm selection on data streams. In Section 3, we discuss the process of characterization of stationary and time-changing data. The guidance presented in this paper for data stream characterization is described in Section 4. In Section 5, we show how the proposed guidance can be employed to describe data streams in the context of three algorithm selection problems. The results are presented and discussed in Section 6. Finally, Section 7 presents the main conclusions and points out future research directions.

2. Meta-learning for algorithm selection in data streams

According to the No Free Lunch theorem for machine learning [67], no algorithm performs better than any other when their performance is averaged over all possible problems. Similarly, in time-changing environments, an algorithm that is suitable for a problem at the current moment may become inappropriate when the data characteristics change [21]. Meta-learning can provide automatic and systematic guidance on algorithm selection based on the knowledge acquired from the application of a set of algorithms on different problems in the past [7]. More generally, meta-learning is concerned with understanding the conditions that determine if a learning system is adequate or not for a particular task. This information can be used to improve its performance in future applications [7, 26]. The main difference to base-learning (traditional learning) concerns its level of adaptation. At the base-level, the bias is fixed a priori (by the choice of the algorithm and the values for its parameters). In meta-learning the most suitable bias is dynamically defined according to the characteristics of the data [60].

A few studies have investigated the use of meta-learning to select the most appropriate model in time-changing environments. For instance, a method has been proposed to detect concept drifts in data distribution from contextual clues and to store the so-called stable concepts [65]. Thus, when changes in the observed domain occur, the system can rapidly adapt to a context previously found. An alternative approach [29] uses hidden contexts to identify subsets of data that are associated with stable concepts. Since contexts can recur, several disjoint intervals from the data may be associated with the same concept. The proposed method then uses a batch decision tree induction algorithm to induce models for each of these concepts. The problem of recurring concepts has also been investigated in association with the problem of concept drift [22]. In this case, whenever a concept drift is detected, the base model together with a meta-model that predicts the contexts in which it performs well are stored in a pool for possible future use. Every time the performance of the current base model drops below a given threshold, the meta-models in the pool are retrieved and used to predict the most appropriate base model for the current data. A similar approach of reusing models has been attempted [27], where an ensemble is built based on context information. It maintains a pool of classifiers learned from previous concepts using an incremental algorithm. When a concept is identified as recurrent, a weighting function is used to determine which classifiers should be retrieved from the pool and weights are assigned to them based on their estimated error and context information. Finally, a method has been proposed to detect concept drift and to choose, at each step in time, the most promising alternative learning algorithm at base level [37]. Thus, a better adaptation of the learning system to the context is expected.

Another meta-learning based method for algorithm selection on data streams, called MetaStream, was introduced recently [51]. Conceptually, it is similar to an existing method [37]. Their main difference lies in the set of characteristics each method employs to describe data. In [37], a meta-example consists of information about the learning process related to the batches of arriving examples, such as the number of batches used for training, the most successful algorithm on the previous batch and the most successful learner over all the batches seen so far. MetaStream, on the other hand, extracts characteristics directly from the raw data used in the base-learning. By doing so, it acquires a more detailed description of the problem under analysis than the high level characteristics used in [37].

Here, MetaStream is used as a test bed for the proposed guidance for data stream characterization. Thus, in the following, a brief explanation is provided. MetaStream aims at selecting the most adequate regression algorithm for time-changing environments. To this end, it works on both the base and meta levels of learning. At the base-level, a stream of examples of the problem under analysis is used to induce and evaluate some regression models, using the interleaved test-then-train method [23] with a sliding window. As a result, models induced by different algorithms can be properly compared on incoming data. Once an arbitrary number of examples have arrived, meta-level learning can take place. It comprises three main steps in order to support algorithm selection in the context of data streams. In the first step, information about the learning process at base-level is extracted, resulting in meta-data. This meta-data consists of a stream of meta-examples, where each meta-example is a tuple of meta-attributes (i.e. data characterization information) and a corresponding target attribute (i.e. the most adequate regression algorithm), computed over a set of base-level examples. In the second step, a learning algorithm is applied on the meta-data in order to relate the characteristics extracted from the base-level data to the best regressor. The learning model induced in this step is called meta-classifier, and it is periodically rebuilt as new meta-data is created. Finally, the third step is the deployment of the meta-classifier to predict the class, i.e., the regression algorithm, for a new unlabeled meta-example. The predicted regressor is then used to forecast the target of the base-level data. In-depth details about MetaStream can be found elsewhere [51].
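The way a meta-target can be derived from base-level errors may be illustrated with a minimal sketch. It is not the MetaStream implementation [51]: the two toy regressors (`mean_regressor`, `last_value_regressor`), the helper `best_regressor_label` and the window sizes are hypothetical, the toy models use only past target values, and the horizon is ignored for brevity.

```python
from statistics import mean

def mean_regressor(y_train):
    """Toy regressor (assumption): always predicts the mean of the training targets."""
    m = mean(y_train)
    return lambda n: [m] * n

def last_value_regressor(y_train):
    """Toy regressor (assumption): always predicts the last training target."""
    last = y_train[-1]
    return lambda n: [last] * n

def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def best_regressor_label(y, start, omega_b, gamma, regressors):
    """Train each regressor on omega_b targets, evaluate it on the next gamma
    targets, and return the name of the regressor with the lowest error."""
    y_train = y[start:start + omega_b]
    y_test = y[start + omega_b:start + omega_b + gamma]
    errors = {name: mse(y_test, fit(y_train)(len(y_test)))
              for name, fit in regressors.items()}
    return min(errors, key=errors.get), errors

regressors = {"mean": mean_regressor, "last": last_value_regressor}
y = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]  # stream with an abrupt change
label, errors = best_regressor_label(y, start=1, omega_b=4, gamma=2,
                                     regressors=regressors)
```

Sliding `start` forward by `gamma` at each step would yield one label per window, i.e., the target attribute of one meta-example per step.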

3. Data characterization for meta-learning

In conventional meta-learning systems, the meta-data set employed to support algorithm selection is generated by extracting characteristics from several data sets, with distinct morphologies. Here, we consider that two data sets are morphologically different if they are not described by the same set of attributes with the same domains. Let $A$ be a data set described by a set of $p$ attributes $\mathbf{x}=\{x_{1},\ldots,x_{p}\}$ , and $B$ be a data set described by a set of $q$ attributes $\mathbf{w}=\{w_{1},\ldots,w_{q}\}$ . Each attribute $\mathbf{x_{i}}$ (or $\mathbf{w_{i}}$ ) is associated with a domain: $\textit{domain}(\mathbf{x_{i}})$ . We say the data sets $A$ and $B$ are morphologically different if $\mathbf{x_{i}}\neq\mathbf{w_{i}}$ or $\textit{domain}(\mathbf{x_{i}})\neq\textit{domain}(\mathbf{w_{i}})$ for at least one $i$ .¹

¹ For simplicity, we make the assumption that the order of the attributes must be the same in both data sets to be considered morphologically equivalent.

The term feature extraction is used with the same meaning as data characterization hereafter.

Some measures that are commonly used to characterize morphologically different data sets produce different sets of characteristics, regarding the number of characteristics and their semantics. For instance, the correlation coefficient is computed for each pair of numeric attributes, resulting in $\frac{n\times(n-1)}{2}$ values, where $n$ is the number of numeric attributes. However, these values cannot be exploited directly as meta-features when propositional learning algorithms are employed as meta-learners [49]. Therefore, in order to obtain a characterization that is uniform for data sets with different morphologies, in many cases, such as in the correlation between attributes, the meta-feature must be a summary of multiple values. Computing the average of all correlation coefficients between pairs of numeric attributes enables the creation of a meta-feature that is applicable to data sets with different morphologies based on this measure.
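The summarization described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the toy data sets `A` and `B` are assumptions introduced here, and attributes with zero variance are not handled.

```python
from itertools import combinations
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient (undefined for constant attributes)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def pairwise_correlations(columns):
    """One coefficient per pair of numeric attributes: n*(n-1)/2 values."""
    return {(a, b): pearson(columns[a], columns[b])
            for a, b in combinations(sorted(columns), 2)}

def mean_correlation(columns):
    """Single summary value, comparable across data sets of any morphology."""
    corrs = pairwise_correlations(columns)
    return sum(corrs.values()) / len(corrs)

# Two hypothetical data sets with different numbers of attributes.
A = {"x1": [1, 2, 3, 4], "x2": [2, 4, 6, 8], "x3": [4, 3, 2, 1]}
B = {"w1": [1, 2, 3, 4], "w2": [1, 3, 2, 4]}

meta_feature_A = mean_correlation(A)  # summarizes 3 coefficients
meta_feature_B = mean_correlation(B)  # summarizes 1 coefficient
```

Both data sets end up described by the same single meta-feature, even though `A` yields three coefficients and `B` only one.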

In data stream environments, meta-learning systems are usually employed for algorithm (or model) selection at different time points for only one data set, rather than many different data sets [51, 22, 37]. Since a data stream is generally described by the same set of attributes over time, the characteristics extracted from each attribute or its relations at a time point can be used directly as meta-features. The correlation coefficients between every pair of numeric attributes, for instance, may be used as a set of meta-features, without any type of summarization to transform them into a single value. Moreover, due to changes that may naturally occur over time, feature extraction must be a regular process to support algorithm selection in a data stream scenario.

Some measures usually employed in meta-learning studies for feature extraction from stationary data [55, 32, 38, 3] may also be used for data streams characterization. In this case, the data associated with a time period is treated as a separate data set. In addition, the temporal component of examples is a relevant aspect in environments that continuously generate data, which are likely to change for some time scale [6]. Although the dependence between successive examples in data streams is usually not as strong as in traditional time series problems, mainly for long-term predictions, it is generally still a relevant characteristic [6]. Therefore, measures that take the arrival order into account can also be useful for data streams. In particular, measures investigated for feature extraction of time series, such as serial correlation, should be considered [1, 47, 62, 39].

4. A guidance for data streams characterization

In order to extract useful information from data streams for algorithm selection, one should examine both the arrival order of the examples and the variables of the problem available during learning. Jointly considering those aspects leads to a better manner to build sets of suitable meta-features for the meta-learning problem at hand. In this section, the interaction between both issues will be explored. In Subsection 4.1, we present a division of data according to the aforementioned aspects and provide some guidelines to extract characteristics from data streams at the base-level. In Subsection 4.2, we consider gathering information from meta-level to enhance the process of data characterization. In Subsection 4.3, we present the concepts of characteristics dependent and independent of the data morphology and draw a connection to algorithm selection.

Figure 1. Data stream at the base-level.

4.1 Meta-features from the base-level data

In order to facilitate the feature extraction process, we divide the data stream according to the order of arrival of the examples and to the variables involved in the base-level learning, as depicted in Fig. 1. The variables are divided into predictive attributes ( $\mathbf{x}$ ), which describe the examples in the data; the target attribute ( $y$ ), which gives the real output for each example; and model predictions ( $\hat{y}$ ), which represent the output predicted by a learning algorithm. Regarding the order of arrival of the examples, the data set is divided into training, horizon and selection subsets, according to a sequence-based sliding window together with an evaluation method, such as the interleaved test-then-train (or prequential) method [24]. The first subset is used to induce a base-model, since the values of both the predictive and target attributes are known. In the horizon subset, the values of the predictive attributes are also known, but the values of the target variable are still unknown. This is a usual scenario in many applications where the true target is known only some time after the prediction. For instance, in the travel time prediction problem, the target is predicted some minutes, hours or even days in advance [44]. Finally, the selection subset contains examples for which the selected model will generate predictions.

Figure 1 illustrates how such divisions of attributes and examples take place for a data stream at a time point where the example $i$ is the last one for which the target value is known. The training subset contains $\omega_{b}$ examples with indexes $[i-\omega_{b}+1,\ldots,i]$ . The horizon subset contains $\eta_{b}$ examples with indexes $[i+1,\ldots,i+\eta_{b}]$ . The selection subset contains $\gamma$ examples with indexes $[i+\eta_{b}+1,\ldots,i+\eta_{b}+\gamma]$ . Thus, overall, the figure exhibits nine data subsets: $\mathbf{X_{train}}$ , $\mathbf{X_{horiz}}$ , $\mathbf{X_{selec}}$ , $\mathbf{y_{train}}$ , $\mathbf{y_{horiz}}$ , $\mathbf{y_{selec}}$ , $\mathbf{\hat{y}_{train}}$ , $\mathbf{\hat{y}_{horiz}}$ and $\mathbf{\hat{y}_{selec}}$ . Subsets in white are available at the current time represented in this figure while the subsets in gray are still unknown. Different characteristics of a data stream may be captured based on this division by applying measures in three ways: to a single subset, to characterize attributes individually; to groups of subsets, to characterize relations between their variables or between whole subsets; and to pairs of relations, to characterize relations of relations. In Section 4.1.1 we explain the first two cases, providing examples of meaningful meta-features, whereas the third one is discussed in Section 4.1.2. First, a note on notation. The arrow $\leftrightarrow$ represents the relation between subsets. For instance, $\mathbf{y_{train}}\leftrightarrow\mathbf{\hat{y}_{train}}$ denotes the relation between the target attribute and some model predictions, on the training subset.
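The index ranges given above translate directly into code. The following sketch uses 1-based, inclusive indexes as in the text; the function name `window_indexes` and the example values are assumptions made for illustration.

```python
def window_indexes(i, omega_b, eta_b, gamma):
    """Return the example indexes (1-based, inclusive) of the training,
    horizon and selection subsets when example i is the last labeled one."""
    train = list(range(i - omega_b + 1, i + 1))                 # [i-w+1, ..., i]
    horiz = list(range(i + 1, i + eta_b + 1))                   # [i+1, ..., i+eta]
    selec = list(range(i + eta_b + 1, i + eta_b + gamma + 1))   # [i+eta+1, ..., i+eta+gamma]
    return train, horiz, selec

train, horiz, selec = window_indexes(i=10, omega_b=5, eta_b=2, gamma=3)
```

With these example values, the training subset covers examples 6 to 10, the horizon subset examples 11 and 12, and the selection subset examples 13 to 15.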

4.1.1 Characterizing single subsets and their relations

In the following, we present all single subsets according to Fig. 1, and their relations.

$\mathbf{X_{train}}$ and $\mathbf{X_{train}}\leftrightarrow\mathbf{X_{train}}$

$\mathbf{X_{train}}$ is the subset of the attribute values of the training data. Characteristics extracted from individual variables, represented as $\mathbf{X_{train}}$ , or groups of variables, represented as $\mathbf{X_{train}}\leftrightarrow\mathbf{X_{train}}$ , may provide some clues on the behaviour of the base model induced on these data. Conventional meta-learning measures [42, 32, 55] can be used to characterize the variables in $\mathbf{X_{train}}$ and their relations. In this context, skewness or kurtosis applied on individual attributes in $\mathbf{X_{train}}$ are examples of $\mathbf{X_{train}}$ measures, while the mean correlation between all pairs of attributes in $\mathbf{X_{train}}$ is an example of a $\mathbf{X_{train}}\leftrightarrow\mathbf{X_{train}}$ measure. Other measures, that consider the arrival order of examples even inside each subset, may also be relevant for these data [1, 47, 62, 39].
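As a minimal sketch of $\mathbf{X_{train}}$ measures, skewness and excess kurtosis can be computed per attribute and used directly as meta-features. The population (biased) moment estimators, the function names and the toy data below are assumptions made for illustration; constant attributes are not handled.

```python
def central_moment(values, k):
    """k-th central moment using the population (biased) estimator."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Excess kurtosis (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

def describe_attributes(columns):
    """One skewness and one kurtosis meta-feature per attribute."""
    features = {}
    for name, values in columns.items():
        features[f"skewness_{name}"] = skewness(values)
        features[f"kurtosis_{name}"] = kurtosis(values)
    return features

# Hypothetical training window with two numeric attributes.
x_train = {"x1": [1.0, 2.0, 3.0, 4.0, 5.0], "x2": [1.0, 1.0, 1.0, 1.0, 10.0]}
features = describe_attributes(x_train)
```

Because the attribute set of a stream is fixed, the resulting dictionary has the same keys at every time point and needs no summarization across attributes.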

$\mathbf{X_{horiz}}$ and $\mathbf{X_{horiz}}\leftrightarrow\mathbf{X_{horiz}}$

$\mathbf{X_{horiz}}$ is the subset of the attribute values on the horizon data. The same measures applied to extract characteristics from $\mathbf{X_{train}}$ can be employed for the $\mathbf{X_{horiz}}$ data, yielding $\mathbf{X_{horiz}}$ and $\mathbf{X_{horiz}}\leftrightarrow\mathbf{X_{horiz}}$ based measures. Although $\mathbf{X_{horiz}}$ is not used to induce a model, it may still be important for the prediction of the most suitable learning algorithm, since it contains more recent data than $\mathbf{X_{train}}$ . A hypothetical situation that highlights the importance of $\mathbf{X_{horiz}}$ is when the best algorithm for the selection subset differs from the best algorithm for the training subset after an abrupt change of the data distribution. In this case, $\mathbf{X_{horiz}}$ may carry information about this change, which will be useful to select the best algorithm, since it is more similar to $\mathbf{X_{selec}}$ than $\mathbf{X_{train}}$ is.

$\mathbf{X_{selec}}$ and $\mathbf{X_{selec}}\leftrightarrow\mathbf{X_{selec}}$

$\mathbf{X_{selec}}$ is the subset of the attribute values on the selection data. The motivation for characterizing the attribute values of the selection data is essentially the same as for the horizon subset, but stronger, as the former is the subset for which predictions will be made. Therefore, it is expected that the meta-features obtained from this subset are more informative than those obtained from $\mathbf{X_{horiz}}$ . Although $\mathbf{X_{selec}}$ , $\mathbf{X_{train}}$ and $\mathbf{X_{horiz}}$ contain disjoint sets of examples, they share the same set of attributes. Thus, measures employed for $\mathbf{X_{train}}$ and $\mathbf{X_{horiz}}$ can also be employed for $\mathbf{X_{selec}}$ .

$\mathbf{y_{train}}$

$\mathbf{y_{train}}$ is the subset of the target attribute ( $y$ ) values on the training data. The characteristics extracted from $\mathbf{y_{train}}$ are important to guide the learning process at the meta-level, given that by definition, in data streams, some temporal dependence between data is expected. In other words, the nature of the $\mathbf{y_{selec}}$ values may depend on the values of $\mathbf{y_{train}}$ , which are known in advance. Moreover, in periods in which the phenomenon that generates the data does not change significantly, a suitable model chosen according to the distribution of $\mathbf{y_{train}}$ is also expected to be appropriate for $\mathbf{y_{selec}}$ . When handling regression problems, measures that characterize numeric data can be applied to $\mathbf{y_{train}}$ . Although the correlation between $\mathbf{y_{train}}$ values is possibly not as strong as in time series, the serial correlation test and other measures for time series characterization are also relevant to describe $\mathbf{y_{train}}$ [39, 62, 47, 1]. On the other hand, measures for categorical data can be applied when dealing with classification problems, such as impurity or mutual information [42].
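A simple order-aware measure for $\mathbf{y_{train}}$ is the sample autocorrelation coefficient at a given lag. The sketch below is only a stand-in for the serial correlation measures cited above, not a formal statistical test; the function name and the example series are assumptions.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag.
    Undefined (division by zero) for a constant series."""
    n = len(series)
    m = sum(series) / n
    denom = sum((v - m) ** 2 for v in series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom

trending = autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0])      # positive dependence
alternating = autocorrelation([1.0, -1.0, 1.0, -1.0])      # negative dependence
```

Values close to zero would suggest that the arrival order of the targets within the window carries little information.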

$\mathbf{\hat{y}_{train}}$ , $\mathbf{\hat{y}_{horiz}}$ and $\mathbf{\hat{y}_{selec}}$

$\mathbf{\hat{y}_{train}}$ , $\mathbf{\hat{y}_{horiz}}$ and $\mathbf{\hat{y}_{selec}}$ are the subsets of the prediction values for the training, horizon and selection data, respectively. Multiple sets of predictions may be available for a single subset of data, obtained with different models. However, for simplicity, here we discuss them as if a single set of predictions were available. These predictions can provide information mainly about model behavior. For instance, the presence of outliers or a high variance of these predictions indicates that employing the same algorithm that induced this “unstable” model may be riskier than choosing another with more stable predictions. The same measures used to characterize $\mathbf{y_{train}}$ may be used with these subsets. In a real scenario, the predictions for the $\mathbf{\hat{y}_{train}}$ and $\mathbf{\hat{y}_{horiz}}$ subsets were obtained in the past. Thus, these predictions are just recovered from memory at the current time. On the other hand, the predictions for the selection subset ( $\mathbf{\hat{y}_{selec}}$ ) are made at the same time that an algorithm has to be selected for the examples in this subset. This means that the characterization of $\mathbf{\hat{y}_{selec}}$ might be impractical for some learning systems, due to the time required to obtain the predictions for all the available algorithms.

$\mathbf{X_{*}}\leftrightarrow\mathbf{X_{*}}$

$\mathbf{X_{*}}\leftrightarrow\mathbf{X_{*}}$ represents the relations between predictive attributes from the different subsets $\mathbf{X_{train}}$ , $\mathbf{X_{horiz}}$ and $\mathbf{X_{selec}}$ . Such characterization is possible in a data stream scenario because, as previously mentioned, the set of attributes in this case rarely changes over time. Therefore, characteristics regarding the relations of the three subsets can be extracted. For example, the difference between the average of a numeric attribute in $\mathbf{X_{train}}$ and the average of the same attribute in $\mathbf{X_{selec}}$ may indicate a change in this attribute. This type of relation cannot be explored in conventional meta-learning systems because different data sets are described by different attributes. We can also compute the distance between subsets considering all their attributes, expecting this to be more informative than computing the distance between each attribute. For instance, the distance between $\mathbf{X_{train}}$ and $\mathbf{X_{selec}}$ provides evidence of whether a change in the data distribution occurred or not and, thus, whether the same algorithm used for the training data should be employed for the selection data. The distance between data subsets can be obtained by different measures [57], such as Relativized Discrepancy [36] and Kullback-Leibler Divergence (Relative Entropy) [16, 53].
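Two of the comparisons mentioned above can be sketched for a single numeric attribute: the difference of averages and a histogram-based estimate of the Kullback-Leibler divergence. The equal-width binning, the smoothing constant `eps` and all function names are assumptions made for illustration; a distance over all attributes jointly would require a multivariate estimator, which is not shown.

```python
from math import log

def mean_shift(a, b):
    """Average of the attribute in subset b minus its average in subset a."""
    return sum(b) / len(b) - sum(a) / len(a)

def histogram(values, lo, hi, bins, eps):
    """Smoothed relative frequencies over equal-width bins on [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]  # eps avoids log(0)

def kl_divergence(a, b, bins=5, eps=1e-9):
    """KL(P_a || P_b) estimated from histograms on a shared range."""
    lo, hi = min(a + b), max(a + b)
    if lo == hi:  # both subsets are constant and identical
        return 0.0
    p = histogram(a, lo, hi, bins, eps)
    q = histogram(b, lo, hi, bins, eps)
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
x_selec = [6.0, 7.0, 8.0, 9.0, 10.0]  # same attribute after a shift

shift = mean_shift(x_train, x_selec)
divergence = kl_divergence(x_train, x_selec)
```

The divergence estimate is sensitive to the number of bins and to `eps`, so in practice both would need to be fixed across time points for the meta-feature to be comparable.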

$\mathbf{\hat{y}_{*}}\leftrightarrow\mathbf{\hat{y}_{*}}$

$\mathbf{\hat{y}_{*}}\leftrightarrow\mathbf{\hat{y}_{*}}$ represents the relations between the predicted values for the training, horizon and selection subsets. Characterizing those relations may be done following the same approach adopted for predictive attributes, i.e., comparing different subsets. Such comparisons are important since they may reveal changes in the behavior of the models induced over time. These alterations may suggest, for instance, that the current data distribution diverges from the distribution learned earlier, probably affecting the performance of those models. During the meta-learning process, characteristics from $\mathbf{\hat{y}_{*}}\leftrightarrow\mathbf{\hat{y}_{*}}$ based measures complement those obtained from the relations between predictive attributes.

$\mathbf{X_{train}}\leftrightarrow\mathbf{y_{train}}$

$\mathbf{X_{train}}\leftrightarrow\mathbf{y_{train}}$ represents the relations between the values of the predictive attributes in $\mathbf{X_{train}}$ and the target attribute in $\mathbf{y_{train}}$ . Measures to quantify such relations have been widely investigated [32, 55]. This characterization may be important for the meta-learning process because learning algorithms are expected to be able to induce models that map the predictive attribute values to the target values. Therefore, information about the relation between $\mathbf{X_{train}}$ and $\mathbf{y_{train}}$ is expected to be relevant for algorithm selection. Concerning regression problems, the correlation between numeric attributes and the target is an example of a $\mathbf{X_{train}}\leftrightarrow\mathbf{y_{train}}$ based measure. When dealing with classification problems, the mutual information may provide a good alternative to characterize the relation between nominal attributes and the target attribute.
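For the classification case, the mutual information between a nominal attribute and the class can be estimated from relative frequencies. The following is a minimal sketch; the function name and the toy values are assumptions.

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """Mutual information (in bits) between two nominal variables,
    estimated from their empirical joint and marginal frequencies."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

attribute = ["a", "a", "b", "b"]
informative = mutual_information(attribute, ["p", "p", "n", "n"])
uninformative = mutual_information(attribute, ["p", "n", "p", "n"])
```

An attribute that fully determines a balanced binary class reaches one bit, whereas an attribute that is independent of the class yields zero.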

$\mathbf{y_{train}}\leftrightarrow\mathbf{\hat{y}_{train}}$

$\mathbf{y_{train}}\leftrightarrow\mathbf{\hat{y}_{train}}$ represents the relations between the target values $\mathbf{y_{train}}$ and model predictions $\mathbf{\hat{y}_{train}}$ for the training data. Analysing those relations is relevant for the task of algorithm selection, since they are associated with the behavior of the induced models. For instance, model performance is a piece of information that can be obtained from such relations and may be used to predict the algorithm behavior for the most recent data. Any suitable metric to assess model performance, like the mean squared error for regression problems or the accuracy for classification problems, can be used as a $\mathbf{y_{train}}\leftrightarrow\mathbf{\hat{y}_{train}}$ based measure.

4.1.2 Relations of relations

Besides the information gathered from the relations between the subsets of predictive attributes, model predictions and target values, the characterization of the relation between two relations may also be useful. For instance, consider the following two relations for regression problems: 1) the correlation between a predictive attribute and the target attribute for the training data ( $\mathbf{X_{train}}\leftrightarrow\mathbf{y_{train}}$ ); and 2) the correlation between the same predictive attribute and the predictions for the training data ( $\mathbf{X_{train}}\leftrightarrow\mathbf{\hat{y}_{train}}$ ). Characteristics that can be extracted from the relation between these two relations:

$\displaystyle(\mathbf{X_{train}}\leftrightarrow\mathbf{y_{train}})\leftrightarrow(\mathbf{X_{train}}\leftrightarrow\mathbf{\hat{y}_{train}})$

may provide information about model behavior for that specific attribute. In this case, a small difference between these correlations may be evidence that the predicted values follow the same trend as the target values with respect to the predictive attribute analyzed.
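This relation of relations can be sketched as the absolute difference between two correlation coefficients. The function names and the toy values below are assumptions, and constant inputs are not handled.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient (undefined for constant inputs)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def correlation_gap(x, y, y_hat):
    """|corr(x, y) - corr(x, y_hat)| for one predictive attribute x."""
    return abs(pearson(x, y) - pearson(x, y_hat))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
good_predictions = [2.1, 3.9, 6.2, 7.8]   # follow the target trend
bad_predictions = [8.0, 6.0, 4.0, 2.0]    # opposite trend

small_gap = correlation_gap(x, y, good_predictions)
large_gap = correlation_gap(x, y, bad_predictions)
```

Computing one such gap per predictive attribute gives a set of meta-features that is only meaningful when the attribute set stays fixed, as in a data stream.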

Similar information can be obtained from relations between different data subsets. The difference between the average of a numeric predictive attribute from $\mathbf{X_{train}}$ and the average of the same attribute from $\mathbf{X_{selec}}$ , for instance, may indicate if the attribute is changing over time or not. In order to have a better understanding of the degree and the speed of this change, a possible strategy is to compare the difference in the average of this attribute between the training and horizon data with the difference between the training and selection data: $(\mathbf{X_{train}}\leftrightarrow\mathbf{X_{horiz}})\leftrightarrow(\mathbf{X_{train}}\leftrightarrow\mathbf{X_{selec}})$ . If the two differences diverge strongly, then the attribute is probably changing fast; otherwise, the change is occurring gradually.
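One possible reading of this comparison, given here only as a sketch under that assumption, subtracts the train-to-horizon shift of the average from the train-to-selection shift. The function names and the toy windows are hypothetical.

```python
def average(values):
    return sum(values) / len(values)

def shift_of_shifts(x_train, x_horiz, x_selec):
    """(mean(selec) - mean(train)) - (mean(horiz) - mean(train))
    for one numeric attribute."""
    shift_horiz = average(x_horiz) - average(x_train)
    shift_selec = average(x_selec) - average(x_train)
    return shift_selec - shift_horiz

# Abrupt change between horizon and selection data.
abrupt = shift_of_shifts([1.0, 1.0, 1.0], [1.0, 1.0], [5.0, 5.0])
# Steady drift across the three subsets.
gradual = shift_of_shifts([1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0])
# No change at all.
stable = shift_of_shifts([1.0, 1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
```

The raw value depends on the scale of the attribute and on the subset sizes, so some normalization would likely be needed before comparing attributes with each other.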

4.2 Meta-features from the meta-level

Heretofore, we have presented a guidance for characterization of data streams focusing on the information that can be attained from the base-level. A different approach can be followed, by extracting characteristics from the meta-level learning process itself to support algorithm selection [37]. Based on this approach, we will discuss in this section which useful information for this task can be gathered from the meta-level, since a stream of meta-examples is generated over time.

According to Fig. 1, a meta-example is created for every $\gamma$ new base-level examples, i.e., for each selection subset. Thus, a meta-example $e=(\mathbf{a},c)$ is a tuple of $n$ predictive meta-features, $\mathbf{a}=\{a_{1},\cdots,a_{n}\}$ , and a target value $c$ , which may be a label identifying the best learning algorithm for the corresponding selection subset, the predictive performance of the algorithm or a ranking of algorithms. At the beginning of the data stream, the meta-features can be obtained only from the base-level data, because there is no information about the meta-level learning process yet. However, after a sufficient number of meta-examples have been generated and processed, this information can also be used to improve the meta-level learning itself. To this end, we assume that the same interleaved test-then-train procedure used in the base-level is also adopted in the meta-level to induce and evaluate a meta-model. Figure 2 shows this procedure using a training set of $\omega_{m}$ meta-examples and a single test meta-example. In this figure, $A_{\textit{train.}}$ , $c_{\textit{train.}}$ and $\hat{c}_{\textit{train.}}$ are the subsets of meta-feature values, target values and meta-model predictions for the training meta-data, respectively. Similarly, $A_{\textit{test}}$ , $c_{\textit{test}}$ and $\hat{c}_{\textit{test}}$ represent those values for the test meta-data. The data subsets in white are available at the time of the prediction of the test meta-example, whereas the actual target of this meta-example, in gray, will only be known after the evaluation of the performance of the algorithms on the base-level examples corresponding to this meta-example.

Considering that the meta-target is a categorical value that identifies the best algorithm for the corresponding base-level example(s), it is possible to suggest some meta-features that can be obtained from the available meta-data, namely: i) the class distribution of the meta-examples; ii) the error rate for each class; iii) a nominal value indicating whether the meta-example was predicted correctly or not; and iv) the last predicted class. These meta-features are always computed up to the previous time instant. Thus, the class distribution in the meta-level for the meta-example $i+1$ is computed from the meta-examples $i-\omega_{m}$ until $i$ . Other similar measures could be considered for the same procedure if the target of the meta-examples were a numeric value or a ranking of algorithms.
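A minimal sketch of how these four meta-features could be computed from a window of already labeled meta-examples is given below. The function name, the feature names and the assumption that item iii) refers to the most recent meta-example in the window are ours, for illustration only.

```python
from collections import Counter

def meta_level_features(true_labels, predicted_labels, classes):
    """Meta-features from a window of labeled meta-examples (oldest first).

    true_labels / predicted_labels: actual and predicted meta-targets for the
    meta-examples in the current meta-level window.
    """
    n = len(true_labels)
    counts = Counter(true_labels)
    features = {}
    for c in classes:
        # i) class distribution of the meta-examples
        features[f"class_dist_{c}"] = counts[c] / n
        # ii) error rate for each class
        idx = [i for i, t in enumerate(true_labels) if t == c]
        errors = sum(1 for i in idx if predicted_labels[i] != c)
        features[f"error_rate_{c}"] = errors / len(idx) if idx else 0.0
    # iii) whether the most recent meta-example was predicted correctly
    features["last_correct"] = "yes" if predicted_labels[-1] == true_labels[-1] else "no"
    # iv) the last predicted class
    features["last_predicted"] = predicted_labels[-1]
    return features

# Toy window (hypothetical) for the pair RF/SVM.
window_true = ["RF", "SVM", "RF", "RF"]
window_pred = ["RF", "RF", "SVM", "RF"]
meta_feats = meta_level_features(window_true, window_pred, classes=["RF", "SVM"])
```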

Figure 2.

Data in the meta-level.

4.3 The role of data morphology

As previously noted, traditional meta-learning studies about algorithm selection generate meta-data by extracting characteristics from several data sets, which may have different morphologies [8, 7, 54, 46]. This process results in an uneven set of meta-features, since some measures yield an output value for each attribute or relation between multiple attributes. Consequently, if the learning algorithm employed in the meta-level is propositional [49], the output values must first be somehow aggregated into a single value. Thus, all data sets can be described by the same meta-features.

A shortcoming of this approach is that this aggregation may hide important information. For instance, if a data set has some attributes that are strongly positively correlated ( $\rho\approx 1$ ) and others that are strongly negatively correlated ( $\rho\approx-1$ ), their mean correlation will be close to zero. This is the same mean correlation as that of a data set whose attributes are all uncorrelated ( $\rho\approx 0$ ), although these two situations are completely different. Some alternatives to overcome this and other similar problems have been proposed, such as histograms and others [35, 59, 34].
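The following toy example illustrates this loss of information and how a histogram preserves it. The correlation values, the bin edges and the `histogram` helper are hypothetical and were chosen only to make the effect visible.

```python
from statistics import mean

# Pairwise correlations (hypothetical values) for two different data sets.
strongly_correlated = [0.98, 0.97, -0.98, -0.97]
uncorrelated = [0.02, -0.01, 0.01, -0.02]

def histogram(values, edges=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Count values per bin [edges[b], edges[b+1]); the last bin includes 1.0."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            is_last = b == len(counts) - 1
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[-1]):
                counts[b] += 1
                break
    return counts

mean_a = mean(strongly_correlated)
mean_b = mean(uncorrelated)
hist_a = histogram(strongly_correlated)
hist_b = histogram(uncorrelated)
```

Both means are essentially zero, whereas the histograms place the values of the first data set in the outer bins and those of the second in the inner bins.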

In this paper, we propose a distinction between characteristics based on whether they depend on data morphology or not. Morphology-dependent meta-features cannot be used in meta-learning applications involving data sets with varying data morphologies. To be used in these applications, they must be transformed, e.g. by summarization. For example, the correlation between attributes discussed earlier is a typical morphology-dependent characteristic.

Morphology-independent characteristics can be obtained from data sets with different morphologies and used directly as meta-features. The number of numeric and nominal attributes, and the number of classes, in a classification problem, are examples of morphology-independent measures.

As discussed earlier, the morphology of the data is typically not an issue in meta-learning for data streams. Both morphology-dependent and -independent meta-features can be used directly. Concerning the latter, characterization can be done directly with independent measures, such as those concerning the target attribute ( $\mathbf{y}$ ) and the predicted target values ( $\mathbf{\hat{y}}$ ), like skewness for regression, entropy for classification, or any relations between them ( $\mathbf{y}\leftrightarrow\mathbf{\hat{y}}$ ).

5. Experiments

The purpose of the experiments carried out in this paper is two-fold. First, we intend to show how this guidance can assist users in constructing meta-features that describe data streams. To this end, we applied MetaStream [51] to the task of algorithm selection and compared it to a baseline method. While the former makes explicit use of the extracted meta-features, the latter is guided only by the meta-target. Second, we examine the role of dependent and independent meta-features in the performance of the meta-learning system by comparing two sets of predictive data. One set consists of only independent meta-features and the other consists of both independent and dependent meta-features. The whole experimental study was performed using three real-world problems.

In Section 5.1, the data sets used in the experiments are described. The algorithms employed to induce the base-level models and their parameter values are presented in Section 5.2. In Section 5.3, we describe the meta-level setup. Finally, in Section 5.4, the data characterization process using the proposed set of guidelines is discussed.

5.1 Time-changing data

In this paper, we evaluate MetaStream on three problems that constantly generate new data: i) Travel Time Prediction (TTP) of bus lines, which is crucial in ground public transportation systems; ii) Electricity Demand Prediction (EDP), which is relevant for operations and system planning; and iii) flight departure delays (Airline), which is important for air transportation systems. Table 1 shows some characteristics of these data sets, namely the number of examples ( $\#$ Examples), the number of predictive attributes ( $\#$ Attributes) and the period when the data were collected (Period). Next, these problems are described in detail.

Table 1
Main characteristics of the data sets investigated

	$\#$ Examples	$\#$ Attributes	Period
TTP	24 974	6	Jan/2007 to Mar/2008
EDP	27 552	6	May/1997 to Dec/1998
Airline	20 285	7	Jan/2007 to Dec/2008

5.1.1 TTP

Travel time prediction (TTP) may be useful for different purposes in public transportation, such as defining crew duties and helping users to decide the best route and departure time to arrive on time at their destination. Here, we are concerned with employing regression models to predict travel time some time in advance (minutes or hours). The data comes from a study of the TTP problem for planning of passenger transport companies using data collected from the Sociedade de Transportes Colectivos do Porto SA (STCP, www.stcp.pt), the public bus operator company in Porto, Portugal [44]. The same data, referred to as Triana, for one specific route (205-1-1), are analysed in this paper. This application does not necessarily generate data at equally spaced time intervals, and periodicity may change arbitrarily, according to the business strategy. Each example is described by 5 attributes: day of week, day of year, type of day (holiday, bridge, tolerance and normal), departure time and travel time. The target attribute is the travel time.

5.1.2 EDP

The electricity demand prediction (EDP) data were collected from the Australian New South Wales Electricity Market and are usually referred to as Elec2 [28]. Each example in the data set is described by 6 attributes: day of week, time stamp based on half hour periods (from 1 to 48), the electricity demand of the New South Wales (NSW) and Victoria (VIC) states, the scheduled electricity transfer between these states and the change of the price using a moving average of the last 24 hours (up or down). The target attribute is the NSW electricity demand. Here, we are interested in predicting the electricity demand for the next week (next 336 examples). In this paper, we removed the first 17,424 examples, referring to the period up to May 4, 1997, because they do not have values for the VIC electricity demand and the scheduled electricity transfer between states, which were included after that date.

5.1.3 Airline

The Airline data set was created by the American Statistical Association for a competition of statistical graphs in 2009 [4]. It contains examples of 120 million flights in the United States between October 1987 and December 2008. Here, we use data of flights from a single route, from Chicago O’Hare (ORD) International airport to LaGuardia airport (LGA), in 2007 and 2008. After selecting this route and cleaning the data, we ended up with 20 285 examples, which were sorted by the scheduled departure time [18]. The set of attributes used here was the same as in previous work [31]. We had to remove some attributes, like origin and destination, because data from a single route was used. In the end, the following attributes describe the Airline data: date, day of week, scheduled departure time, scheduled arrival time, flight number, and scheduled elapsed time. The target attribute is the departure delay.

5.2 Base-level setup

At the base-level, both incremental and batch learning algorithms could be employed to induce models to predict the target attribute values for the investigated tasks. However, given that there are just a few studies on incremental regression algorithms [61], we decided to use batch learning algorithms, namely:

• Random Forests (RF) [9];
• Support Vector Machines (SVM) [15];
• Classification and Regression Trees (CART) [10];
• Projection Pursuit Regression (PPR) [20];
• Multivariate Adaptive Regression Splines (MARS) [19];
• Linear Regression (LR) [66].

Except for LR, implementations of the algorithms from R packages were used, respectively: randomForest [40], e1071 [41], rpart [58], stats [48] and earth [43]. The Weka implementation of LR was used [66]. The default parameter values of these algorithms as suggested in their respective packages were used in the experiments. The exception was the number of terms of the PPR algorithm, which does not have a default value and was set to 1, based on previous work [44].

At the base-level, these learning algorithms were applied and evaluated using a sliding window of a fixed size, which allows better control over the data used to induce and test the models. For the TTP and Airline data sets, the training window size, $\omega_{b}$ , was set to 1000 examples while the test set always contained a single example, $\lambda=1$ . The horizon prediction size, $\eta_{b}$ , was defined as the smallest possible value, i.e. the smallest interval between predicting the target attribute value of an example and observing its actual value. Thus, this parameter was set to $\eta_{b}=2$ and $\eta_{b}=5$ for the TTP and Airline data sets, respectively. For the EDP data set, these parameters were set to $\omega_{b}=672$ (data of two weeks), $\eta_{b}=0$ and $\lambda=336$ (data of one week). In summary, a model is induced weekly using all the available data from the two past weeks and is applied to predict the electricity demand for the next week. Table 2 shows the parameter values for each problem. These settings were defined empirically based on previous experiments [52, 51] and on other studies [44, 18].

Table 2
Parameter values for learning in the base-level for each data set

Data set	$\omega_{b}$	$\eta_{b}$	$\lambda_{b}$
TTP	1000	2	1
EDP	672	0	336
Airline	1000	5	1
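To make the role of these three parameters concrete, the sketch below enumerates the training and test index ranges produced by a fixed-size sliding window. The exact layout (a horizon gap of $\eta_{b}$ examples between training and test data, and a step of $\lambda$ examples between consecutive windows) is our reading of the setup and should be taken as an assumption; the function name is hypothetical.

```python
def sliding_windows(n_examples, omega_b, eta_b, lam):
    """Yield (train_indices, test_indices) pairs over a stream of n_examples.

    omega_b: training window size; eta_b: horizon (examples whose targets are
    not yet observed); lam: number of test examples per step.
    """
    start = 0
    while start + omega_b + eta_b + lam <= n_examples:
        train = range(start, start + omega_b)
        test_start = start + omega_b + eta_b   # skip the horizon examples
        test = range(test_start, test_start + lam)
        yield train, test
        start += lam                            # slide by the test size

# EDP-like setting: two weeks for training, no horizon, one week for testing.
edp_windows = list(sliding_windows(2016, omega_b=672, eta_b=0, lam=336))

# TTP-like setting: 1000 training examples, horizon of 2, one test example.
ttp_windows = list(sliding_windows(1010, omega_b=1000, eta_b=2, lam=1))
```

With the EDP-like values, each step retrains on the 672 most recent examples and predicts the following 336, which matches the weekly procedure described above.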

The predictive performance of the base-level models was assessed by the Normalized Mean Squared Error (NMSE) [56]. This measure is always computed for all base-level examples that belong to the selection subset, whose size ( $\gamma$ ) is defined in the following section.

5.3 Meta-level setup

As in most meta-learning studies, the algorithm selection problem is considered here as a classification task, where each pair of algorithms constitutes a classification problem at the meta-level [32]. Thus, if $m$ learning algorithms are used at the base-level, $\binom{m}{2}$ problems will be created at the meta-level and a meta-model will be induced for each one. Therefore, 15 meta-learning problems are investigated here, since 6 algorithms were used in the base-level. This pairwise approach was preferred to others because a meta-model induced to distinguish between a pair of algorithms is expected to achieve better predictive performance than when many algorithms are considered simultaneously [32]. The major disadvantage of this approach is its computational cost, due to the number of classification tasks to be handled.
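The number of pairwise problems can be checked with a few lines of code. The list below simply reuses the abbreviations of the six base-level algorithms; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

base_learners = ["RF", "SVM", "CART", "PPR", "MARS", "LR"]

# One binary meta-level classification problem per unordered pair of algorithms.
pairwise_problems = list(combinations(base_learners, 2))
n_problems = len(pairwise_problems)
```

For $m=6$ this yields the 15 pairs mentioned above, and the number grows quadratically with $m$, which is the source of the computational cost.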

At the meta-level, we have to choose the learning algorithm (meta-learner) that will induce the meta-model. Due to the small number of meta-examples generated when an algorithm is chosen for a batch of examples, we decided to use batch learning algorithms, since they are more appropriate than incremental ones when there is a small amount of data [17, 21]. Thus, we decided to investigate RF and SVM as meta-learners. The former was used for similar purposes before [51] and obtained the best results, while the latter presented high predictive performance in many studies [45, 12, 5]. The default parameter values for the implementations of RF and SVM selected for this project were used in all experiments.

A common baseline classification method in machine learning is to assign a test example to the majority class of the training set, also known as the default class. A natural extension of this evaluation method for sliding windows is to use the majority class of the meta-examples in the training window at each time point. In this sense, if the majority class changes over time, the prediction will be automatically updated. We will refer to this baseline method as Default.
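A minimal sketch of such a sliding-window majority-class baseline is shown below. It ignores any labeling delay and breaks ties in favor of the label seen first in the window; both simplifications, as well as the function name, are assumptions made for the example.

```python
from collections import Counter

def default_predictions(meta_targets, omega_m):
    """Predict, for each meta-example after the first omega_m, the majority
    class among the omega_m meta-examples that precede it."""
    predictions = []
    for i in range(omega_m, len(meta_targets)):
        window = meta_targets[i - omega_m:i]
        majority_class = Counter(window).most_common(1)[0][0]
        predictions.append(majority_class)
    return predictions

# Toy stream of meta-targets (hypothetical) with a change in the majority class.
stream = ["A", "A", "B", "B", "B", "A"]
baseline = default_predictions(stream, omega_m=3)
```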

In order to select the appropriate size of the training window ( $\omega_{m}$ ) and the selection subset ( $\gamma$ ) for MetaStream, we carried out experiments using the beginning of the data streams. To be as fair as possible in the comparison between MetaStream and Default, the same parameter values were defined for both. These experiments were performed with $\omega_{m}=\{100,200,300\}$ for the three data sets investigated, and $\gamma=\{10,20\}$ for the TTP and Airline, and $\gamma=\{12,24\}$ for the EDP, taking the particular characteristics of each data set into account. The predictive performance of the base models was computed for each combination of $\omega_{m}$ and $\gamma$ . Afterwards, the setting that yielded the best average performance for MetaStream and Default for each data set was chosen. These parameter values are shown in Table 3.

Table 3
Parameter values of the meta-level defined experimentally for each problem

Data set	$\omega_{m}$	$\gamma$
TTP	300	20
EDP	300	24
Airline	100	10

The predictive performance of MetaStream and Default in the meta-level was assessed by the $\kappa$ statistic [14], which is a suitable measure for imbalanced data distributions. In the context of data stream classification, this statistic was used by Bifet et al. [6] to compare state-of-the-art algorithms with a baseline method that takes into account the dependence among examples.
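For reference, Cohen's $\kappa$ compares the observed agreement $p_{o}$ with the agreement expected by chance $p_{e}$, i.e. $\kappa=(p_{o}-p_{e})/(1-p_{e})$. The sketch below is a plain implementation of this standard formula; the function name and the convention of returning zero in the degenerate case $p_{e}=1$ are our own choices.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between actual and predicted labels."""
    n = len(y_true)
    # Observed agreement: fraction of identical labels.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal class frequencies.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 0.0  # assumed convention: a single class everywhere
    return (p_o - p_e) / (1 - p_e)

# Imbalanced toy labels (hypothetical): 80% "A", 20% "B".
actual = ["A"] * 8 + ["B"] * 2
kappa_majority = cohen_kappa(actual, ["A"] * 10)   # always predict the majority
kappa_perfect = cohen_kappa(actual, actual)
kappa_inverted = cohen_kappa(["A", "A", "B", "B"], ["B", "B", "A", "A"])
```

Although always predicting the majority class reaches 80% accuracy on these toy labels, its $\kappa$ is zero, which is why the statistic is better suited to imbalanced meta-data.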

5.4 Meta-data sets

The meta-features for the experiments were generated using the guidelines presented in Section 4. Table 4 shows the measures used to generate these meta-features and the subsets of data to which they were applied. Although there are some meta-features that could be extracted from relations between training and selection data, such as $X_{\textit{train}}\leftrightarrow X_{\textit{selec}}$ (Sec. 4.1.1), we decided not to use them because they would generate a large number of meta-features, which could result in a very high dimensional problem. For the same reason, we did not consider meta-features from the “relations of relations” strategy (Sec. 4.1.2) or from the meta-level (Sec. 4.2). Future studies could compare different sets of meta-features and measures for the extraction of characteristics.

In order to evaluate whether the addition of dependent meta-features is able to improve the predictive performance of a meta-learning method for algorithm selection, the meta-features extracted from the base-level were split according to whether they are dependent on or independent of the data morphology, as presented in Section 4.3. Thus, the MDInd meta-data set is composed of independent meta-features, obtained by computing the average, the maximum and the minimum of each measure shown in Table 4 for the respective subsets of data. The MDIndDep meta-data set adds dependent meta-features to the MDInd set. The dependent meta-features were generated by the same measures and data shown in Table 4, except for the actual target, $\mathbf{y}$ , the predictions, $\mathbf{\hat{y}}$ , and their relations (Section 4.3). Therefore, all characteristics extracted from the data are used as meta-features, without the need to summarize them. For instance, the entropy of each categorical attribute is used as a meta-feature together with the average, maximum and minimum values considering all categorical attributes.
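The construction of the two meta-data sets can be sketched for a single measure as follows. The function names, the feature naming scheme and the entropy values are hypothetical; only the aggregation by average, maximum and minimum comes from the description above.

```python
from statistics import mean

def independent_features(measure, per_attribute):
    """Summarize one measure over all attributes (morphology-independent)."""
    values = list(per_attribute.values())
    return {
        f"{measure}_avg": mean(values),
        f"{measure}_max": max(values),
        f"{measure}_min": min(values),
    }

def dependent_features(measure, per_attribute):
    """Keep one meta-feature per attribute (morphology-dependent)."""
    return {f"{measure}_{attr}": value for attr, value in per_attribute.items()}

# Entropy of each categorical attribute in one data subset (hypothetical values).
entropy = {"day_of_week": 2.8, "type_of_day": 0.6}

md_ind = independent_features("entropy", entropy)
md_ind_dep = {**md_ind, **dependent_features("entropy", entropy)}
```

Here `md_ind` holds three summarized values, while `md_ind_dep` additionally keeps the entropy of each individual attribute.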

Since there may be an overlap between these two sets of meta-features, only the dependent meta-features which are not highly correlated to the independent meta-features are maintained in the meta-data set MDIndDep. In these experiments, we empirically defined a threshold of 0.9 for the Pearson correlation coefficient to determine when meta-features are highly correlated.
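This filtering step can be sketched as below, where each meta-feature is represented by its column of values over the meta-examples. Comparing the absolute value of the coefficient with the threshold, treating constant columns as uncorrelated and the names used are assumptions of this sketch.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm = sqrt(sum((a - mean_u) ** 2 for a in u) * sum((b - mean_v) ** 2 for b in v))
    return cov / norm if norm > 0 else 0.0

def filter_dependent(independent, dependent, threshold=0.9):
    """Keep dependent meta-features not highly correlated with any independent one."""
    kept = {}
    for name, column in dependent.items():
        if all(abs(pearson(column, ind)) <= threshold for ind in independent.values()):
            kept[name] = column
    return kept

# Columns of meta-feature values over four meta-examples (hypothetical).
independent = {"entropy_avg": [1.0, 2.0, 3.0, 4.0]}
dependent = {
    "entropy_attr_a": [2.0, 4.0, 6.0, 8.1],    # nearly a multiple of entropy_avg
    "entropy_attr_b": [1.0, -1.0, 1.0, -1.0],  # weakly related
}
kept = filter_dependent(independent, dependent)
```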

Table 4
Measures of characterization and data to which they are applied

Ref.	Measure	Data
M1	Arithmetic mean	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M2	Trimmed mean	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M3	Standard deviation	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M4	Median	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M5	Interquartile range	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M6	Maximum	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M7	Minimum	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M8	Outliers	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M9	Skewness	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M10	Kurtosis	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M11	Coefficient of variation	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M12	Serial correlation	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M13	Ratio of turning points	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ , $\mathbf{y_{train.}}$
M14	Entropy	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$
M15	Concentration coefficient	$\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$
M16	Correlation	$\mathbf{X_{train.}}$ $\leftrightarrow$ $\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ $\leftrightarrow$ $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ $\leftrightarrow$ $\mathbf{X_{selec.}}$ , $\mathbf{X_{train.}}$ $\leftrightarrow$ $\mathbf{y_{train.}}$
M17	$p$ -value of the F-distribution	$\mathbf{X_{train.}}$ $\leftrightarrow$ $\mathbf{X_{train.}}$ , $\mathbf{X_{horiz.}}$ $\leftrightarrow$ $\mathbf{X_{horiz.}}$ , $\mathbf{X_{selec.}}$ $\leftrightarrow$ $\mathbf{X_{selec.}}$ , $\mathbf{X_{train.}}$ $\leftrightarrow$ $\mathbf{y_{train.}}$
M18	Dispersion gain	$\mathbf{X_{train.}}\leftrightarrow\mathbf{y_{train.}}$
M19	Range	$\mathbf{y_{train.}}$
M20	Heterogeneity	$\mathbf{y_{train.}}$
M21	Predictive performance rate	$\mathbf{y_{train.}}$ $\leftrightarrow$ $\mathbf{\hat{y}_{train.}}$
M22	Rate of the standard deviation of the squared error	$\mathbf{y_{train.}}$ $\leftrightarrow$ $\mathbf{\hat{y}_{train.}}$
M23	Models ranking	$\mathbf{y_{train.}}\leftrightarrow\mathbf{\hat{y}_{train.}}$
M24	Diversity I	$\mathbf{y_{train.}}$ $\leftrightarrow$ $\mathbf{\hat{y}_{train.}}$

The target at the meta-level is a categorical value that indicates the best learning algorithm at the base-level, i.e. which algorithm obtained the smallest NMSE for the selection subset. As the algorithms are evaluated in pairs, we have a binary classification problem for algorithm selection, as stated in Section 5.3. However, a meta-example can be labeled only after the actual targets of the corresponding base-level examples are observed. Thus, there may be a delay between predicting the target of a meta-example and labeling it. The meta-targets for both meta-data sets are equal, i.e. MDInd and MDIndDep differ only in their meta-features.
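The labeling of a meta-example for one pair of algorithms can be sketched as follows. The NMSE is written here in a commonly used form, the squared error normalized by the squared deviation of the targets from their mean, which is an assumption since the exact definition is not restated in this section; the function names, the tie-breaking rule and the toy values are also illustrative.

```python
from statistics import mean

def nmse(y_true, y_pred):
    """Normalized mean squared error (assumed form): sum of squared errors
    divided by the sum of squared deviations of the targets from their mean."""
    y_bar = mean(y_true)
    numerator = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    denominator = sum((t - y_bar) ** 2 for t in y_true)
    return numerator / denominator

def pairwise_meta_target(y_selec, preds_a, preds_b, name_a, name_b):
    """Label of a meta-example: the algorithm with the smallest NMSE on the
    selection subset (ties go to the first algorithm, an arbitrary choice)."""
    return name_a if nmse(y_selec, preds_a) <= nmse(y_selec, preds_b) else name_b

# Selection subset with gamma = 4 examples (hypothetical values).
y_selec = [10.0, 12.0, 14.0, 16.0]
preds_rf = [10.0, 12.0, 14.0, 17.0]
preds_lr = [13.0, 13.0, 13.0, 13.0]
label = pairwise_meta_target(y_selec, preds_rf, preds_lr, "RF", "LR")
```

The label can only be computed once `y_selec` has been observed, which is the source of the labeling delay mentioned above.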

6. Experimental results

The predictive performance of MetaStream, considering the meta-data sets MDInd and MDIndDep, and of Default is shown in Fig. 3. This performance was assessed by the $\kappa$ statistic for each pair of regressors and problem investigated. In these experiments, we employed RF as the meta-learner in MetaStream. The results are presented in increasing order of the difference between the values of $\kappa$ obtained by MDInd and MDIndDep. This means that the most interesting results (i.e. larger difference between results) are presented on the right-hand side of the plots. A $\kappa$ statistic around zero means the predictive performance of the method is similar to always predicting the majority class. A $\kappa=1$ indicates the method selected the best algorithm for all meta-examples whereas a $\kappa=-1$ indicates a completely wrong selection. The bars of some pairs of regressors using Default cannot be seen because their $\kappa$ values are around zero. This occurred because the majority class of each training window, used by Default to select an algorithm, is equal to the majority class of the whole meta-data set.

Figure 3.

MetaStream (MDInd and MDIndDep) and Default predictive performance for the data sets TTP, EDP and Airline. RF was employed as the meta-learner in the MetaStream method.

These results show the superiority of MetaStream (MDInd and MDIndDep) over Default, with $\kappa$ values significantly higher than zero, especially for the TTP and EDP data sets. One reason is that the proposed guidance supports the development of useful meta-features for the purpose of algorithm selection. Regarding MDInd and MDIndDep, the performance of MetaStream was very similar for both meta-data sets. In fact, the $\kappa$ values were slightly greater for about half the pairs of algorithms when the meta-data set MDIndDep was used and slightly smaller for the other half.

In order to analyze the influence of the independent and dependent meta-features, we computed their relative weight for the RF algorithm. The relative weight of each meta-feature was based on the Mean Decrease Accuracy (MDA) measure [9], which is calculated by the RF algorithm during the induction of the meta-model. The average relative weight of the independent and dependent meta-features is determined as follows. First, the meta-features are sorted by decreasing weight (the higher, the better), and only the first 10% are selected. Next, these meta-features are split into two sets according to their dependence on data morphology: the independent, $I$ , and the dependent, $D$ , subsets. Then, the relative average weight of each subset is computed as the sum of its weights over the sum of the weights of both subsets ( $I$ and $D$ ).
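The three steps above can be sketched as follows, given one importance weight per meta-feature. The function name, the rounding used to select the top 10% and the toy weights are assumptions; the sketch also presumes that the selected weights are positive, so that their sum is not zero.

```python
def relative_weights(weights, is_dependent, top_fraction=0.10):
    """Share of the total weight of the top-ranked meta-features that falls in
    the independent (I) and dependent (D) subsets."""
    # Step 1: sort by decreasing weight and keep the top fraction.
    ranked = sorted(weights, key=weights.get, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    top = ranked[:k]
    # Step 2: split according to the dependence on data morphology.
    weight_dep = sum(weights[f] for f in top if is_dependent[f])
    weight_ind = sum(weights[f] for f in top if not is_dependent[f])
    # Step 3: weight of each subset over the weight of both subsets.
    total = weight_ind + weight_dep
    return {"independent": weight_ind / total, "dependent": weight_dep / total}

# Twenty meta-features with hypothetical MDA-like weights.
weights = {f"m{i}": 0.1 for i in range(20)}
weights["m0"], weights["m1"] = 6.0, 2.0
is_dependent = {name: False for name in weights}
is_dependent["m1"] = True

shares = relative_weights(weights, is_dependent)
```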

Figure 4.

The relative weights of the top ranked independent and dependent meta-features for the TTP, EDP and Airline data sets.

In Fig. 4 we present the relative weights of independent and dependent meta-features for one pair of regressors of each meta-data set. These pairs are LR/PPR, RF/SVM and PPR/CART for the meta-data sets TTP, EDP and Airline, respectively. In general, the behavior observed for these pairs of regressors is similar to that of the other pairs. These plots show that RF assigned higher weights to independent meta-features (about 75%) than to dependent meta-features (about 25%) for all data sets. These plots agree with the results shown in Fig. 3, i.e., the addition of dependent meta-features did not contribute to the improvement of the MetaStream performance using RF.

Figure 5.

MetaStream (MDInd and MDIndDep) and Default predictive performance for the data sets TTP, EDP and Airline. SVM was employed as the meta-learner in the MetaStream method.

The importance of the meta-features and the MetaStream performance can also be influenced by the learning algorithm employed as meta-learner. Figure 5 shows the results of MetaStream (MDInd and MDIndDep) when SVM was used as meta-learner. According to these plots, when MetaStream used the MDIndDep meta-data set, its performance improved remarkably compared to MDInd for the TTP and EDP data sets. In order to have more evidence about the influence of the dependent meta-features, we applied the Wilcoxon statistical test at the 0.05 significance level to compare both meta-data sets across all pairs of regressors. This test indicated a significant difference, i.e., the improvement of the MetaStream method was supported by the addition of the dependent meta-features. These results disagree with those observed when RF was used as meta-learner. A possible reason is that the RF algorithm has an inner feature selection mechanism while SVM may have been hampered by irrelevant independent meta-features [63, 64, 11], whose influence was reduced by the addition of relevant dependent meta-features.

Regarding predictive performance, MetaStream achieved greater $\kappa$ values than Default for the TTP and EDP data sets. On the other hand, the performance of MetaStream was similar to that of the Default method for Airline. This result indicates that the dependent meta-features did not provide additional information to solve the problem of algorithm selection for this data set. In fact, selecting an algorithm for the Airline data set was the most difficult problem among those investigated here.

7. Conclusion

In this paper, we provide a set of guidelines to characterize data streams. Taking into account both the order of examples and the type of variables involved in the base-level, we assemble a wide range of possibilities to extract characteristics from time-changing data considering different measures for this purpose. With proper information about the data, the task of algorithm selection for data streams can take place.

In order to evaluate the suitability of this approach, we carried out experiments with three data stream problems. In this process, we considered commonly used measures for characterization and the MetaStream method was employed to select the best learning algorithm over time. Moreover, we analyzed the influence of meta-features dependent on and independent of the data morphology. According to the experimental results, the proposed guidelines could successfully be employed to discover relevant meta-features from base-level data and their relations.

As future work, we plan to investigate characteristics of relations between relations and relations from different time points (e.g., training $\rightleftarrows$ selection). In addition, we intend to analyze the influence of each meta-feature on the general performance of the meta-model. This knowledge may allow the selection of the most relevant meta-features and the proposal of other meaningful characteristics to be extracted from data streams. Many other measures have been proposed in the literature, such as complexity measures [30, 25], and can also be useful for algorithm selection in data stream scenarios.

Acknowledgments

The authors would like to thank the financial support of funding agencies FAPESP (2008/11569-6), CAPES and FCT (224/09 – BEX3231/10-0), and CNPq. This work was also partly funded by the ERDF – European Regional Development Fund through the COMPETE programme (operational programme for competitiveness) within project GNOSIS, cf. “FCOMP-01-0202-FEDER-038987”. It is also funded by the North Portugal Regional Operational Programme (ON.2 O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT) through projects “NORTE-07-0124-FEDER-000057” and “NORTE-07-0124-FEDER-000059”. The work is also financed by the ERDF European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within projects “FCOMP-01-0124-FEDER-037281”. The research leading to these results has also received funding from the ECSEL Joint Undertaking, the framework programme for research and innovation horizon 2020 (2014-2020) under grant agreement n 662189-MANTIS-2014-1.


Data set	Parameter
	$\omega_{b}$	$\eta_{b}$	$\lambda_{b}$
TTP	1000	2	1
EDP	672	0	336
Airline	1000	5	1


¹ For simplicity, we assume that the order of the attributes must be the same in both data sets for them to be considered morphologically equivalent.

$\mathbf{X_{train}}$ and $\mathbf{X_{train}}\leftrightarrow\mathbf{X_{train}}$

$\mathbf{X_{horiz}}$ and $\mathbf{X_{horiz}}\leftrightarrow\mathbf{X_{horiz}}$

$\mathbf{X_{selec}}$ and $\mathbf{X_{selec}}\leftrightarrow\mathbf{X_{selec}}$

$\mathbf{y_{train}}$

$\mathbf{\hat{y}_{train}}$ , $\mathbf{\hat{y}_{horiz}}$ and $\mathbf{\hat{y}_{selec}}$

$\mathbf{X_{*}}\leftrightarrow\mathbf{X_{*}}$

$\mathbf{\hat{y}_{*}}\leftrightarrow\mathbf{\hat{y}_{*}}$

$\mathbf{X_{train}}\leftrightarrow\mathbf{y_{train}}$

$\mathbf{y_{train}}\leftrightarrow\mathbf{\hat{y}_{train}}$

Table 1
Main characteristics of the data sets investigated

² www.stcp.pt.

Table 3
Parameter values of the meta-level defined experimentally for each problem

Table 4
Measures of characterization and the data to which they are applied