Abstract
The problem of selecting learning algorithms has been studied by the meta-learning community for more than two decades. One of the most important task for the success of a meta-learning system is gathering data about the learning process. This data is used to induce a (meta) model able to map characteristics extracted from different data sets to the performance of learning algorithms on these data sets. These systems are built under the assumption that the data are generated by a stationary distribution, i.e., a learning algorithm will perform similarly for new data from the same problem. However, many applications generate data whose characteristics can change over time. Therefore, a suitable bias at a given time may become inappropriate at another time. Although meta-learning has been used to continuously select a learning algorithm in data streams, data characterization has received less attention in this context. In this study, we provide a set of guidelines to support the proposal of characteristics able to describe non-stationary data over time. This guidance considers both the order of arrival of the examples and the type of variables involved in the base-level learning. In addition, we analyze the influence of characteristics regarding their dependence on data morphology. Experimental results using real data streams showed the effectiveness of the proposed data characterization general scheme to support algorithm selection by meta-learning systems. Moreover, the dependent meta-features provided crucial information for the success of some meta-models.
Introduction
Machine learning algorithms have been successfully employed in many domains. The successful application of these algorithms to a particular problem is largely affected by their suitability to the characteristics of the data at hand. Thus, the selection of the most promising algorithm or model for a new data set has been an important research issue [54, 2, 33, 42]. In this context, meta-learning is one of the main approaches employed for algorithm selection [7, 26, 50]. Meta-learning is concerned with understanding the learning mechanism itself by exploiting the knowledge acquired from previous experience on similar problems in order to predict the behavior of the algorithms in the future.
Usually, meta-learning algorithms are designed and applied to data which are supposed to be generated by an underlying stationary distribution. Thus, a learning algorithm is selected for each data set only once, assuming that this algorithm will perform similarly for new data of the problem considered. However, many dynamic environments generate data streams, which are produced automatically, in large scale, and subjected to change in their distribution over time [21]. A common strategy to deal with changes in the data distribution is to maintain a model up-to-date with the most recent examples. Nonetheless, this procedure is not sufficient when the bias of the current algorithm becomes inappropriate for the new data. To overcome this issue, the data should be continuously monitored and the predictive performance of the learning system constantly analyzed to select on the fly the most suitable algorithm for the current data. Meta-learning can do so by selecting the most appropriate learning algorithm based on the characteristics extracted from the data or from the learning process itself [51, 22, 27, 37, 65].
A crucial aspect for the success of a meta-learning system is the usefulness of these characteristics. For the purpose of algorithm selection, useful characteristics are those that represent properties of the data that influence algorithm performance. The process to extract suitable measures for the characterization of stationary data sets [42, 13, 25]is not the most convenient for data streams, since it compares different data sets with potentially distinct morphologies, while in data streams the set of attributes typically does not change over time. But even though the attributes are the same, the phenomenon generating the data may change significantly. Moreover, the process of feature extraction is currently performed in an ad-hoc manner, making it difficult for the meta-learning practitioners to realize which attributes and their relations they should consider over all possibilities. In practice, they generally inspects a limited number of measures using a trial and error method or simply assume measures successfully employed in the past will be adequate for the present problem.
Therefore, in this paper, we provide a guidance of data stream characterization for meta-learning. The proposed guidelines are expected to provide proper information to support the task of algorithm selection by a meta-learning based system on such type of data. They consider two key aspects of the stream. The first one is the order of arrival of the data, defined according to a sliding window. This is important because it may provide useful information about changes that may occur on data and allow the system to react to them. The second one is related to the variables involved in the base-level learning, which are divided in predictive and target attributes, besides the model predictions. Moreover, we propose a classification of the characteristics regarding their dependence on data morphology. The directions we provide should not be seen as a precise recipe. Instead, they aim at support the expert user in constructing the meta-features for his/her own problem.
The remainder of this paper is organized as follows. In Section 2, we provide an overview of meta-learning for algorithm selection on data streams. In Section 3, we discuss the process of characterization of stationary and time-changing data. The guidance presented in this paper for data stream characterization is described in Section 4. In Section 5, we show how the proposed guidance can be employed to describe data streams in the context of three algorithm selection problems. The results are presented and discussed in Section 6. Finally, Section 7 presents the main conclusions and points out future research directions.
Meta-learning for algorithm selection in data streams
According to the No Free Lunch theorem for machine learning [67], no algorithm performs better than any other when their performance are averaged over all possible problems. Similarly, in time-changing environments, an algorithm that is suitable for a problem at the current moment may become inappropriate when the data characteristics change [21]. Meta-learning can provide automatic and systematic guidance on algorithm selection based on the knowledge acquired from the application of a set of algorithms on different problems in the past [7]. More generally, meta-learning is concerned with understanding the conditions that determine if a learning system is adequate or not for a particular task. This information can be used to improve its performance in future applications [7, 26]. The main difference to base-learning (traditional learning) concerns its level of adaptation. At the base-level, the bias is fixed a priori (by the choice of the algorithm and the values for its parameters). In meta-learning the most suitable bias is dynamically defined according to the characteristics of the data [60].
A few studies have investigated the use of meta-learning to select the most appropriate model in data changing environments. For instance, a method has been proposed to detect concept drifts in data distribution from contextual clues and to store the so called stable concepts [65]. Thus, when changes in the observed domain occur, the system can rapidly adapt to a context previously found. An alternative approach [29] uses hidden contexts to identify subsets of data that are associated with stable concepts. Since contexts can recur, several disjoint intervals from the data may be associated with the same concept. The proposed method then uses a batch decision tree induction algorithm to induces models for each of these concepts. The problem of recurring concepts has also been investigated in association with the problem of concept drift [22]. In this case, whenever a concept drift is detected, the base model together with a meta-model that predicts the contexts in which it performs well are stored in a pool for possible future use. Every time the performance of the current base model drops below a given threshold, the meta-models in the pool are retrieved and used to predict the most appropriate base model for the current data. A similar approach of reusing models has been attempted [27], where an ensemble is built based on context information. It maintains a pool of classifiers learned from previous concepts using an incremental algorithm. When a concept is identified as recurrent, a weighting function is used to determine which classifiers should be retrieved from the pool and weights are assigned to them based on their estimated error and context information. Finally, a method has been proposed to detect concept drift and to choose, at each step in time, the most promising alternative learning algorithm at base level [37]. Thus, it is expected a better adaptation of the learning system according to the context.
Another meta-learning based method for algorithm selection on data streams, called MetaStream, was introduced recently [51]. Conceptually, it is similar to an existing method [37]. Their main difference resides on the set of characteristics each method employs to describe data. In [37], a meta-example consists of information about the learning process related to the batches of arriving examples, such as the number of batches used for training, the most successful algorithm on the previous batch and the most successful learner over all the batches seen so far. MetaStream, on the other hand, extracts characteristics directly from the raw data used in the base-learning. By doing so, it acquires a more detailed description of the problem under analysis than the high level characteristics used in [37].
Here, MetaStream is used as test bed for the proposed guidance for data streams characterization. Thus, in the following, a brief explanation is provided. MetaStream aims at selecting the most adequate regression algorithm for time-changing environments. For such, it works on both base and meta levels of learning. At the base-level, a stream of examples of the problem under analysis is used to induce and evaluate some regression models, using the interleaved test-then-train method [23] with a sliding window. As a result, models induced by different algorithms can be properly compared on incoming data. Once an arbitrary number of examples have arrived, meta-level learning can take place. It comprises three main steps in order to support algorithm selection in the context of data streams. In the first step information about the learning process at base-level is extracted, resulting in meta-data. This meta-data consists of a stream of meta-examples, where each meta-example is a tuple of meta-attributes (i.e. data characterization information) and a corresponding target attribute (i.e. the most adequate regression algorithm), computed over a set of base-level examples. In the second step, a learning algorithm is applied on the meta-data in order to relate the characteristics extracted from the base-level data to the best regressor. The learning model induced in this step is called meta-classifier, and it is is periodically built as new meta-data is created. Finally, the third step is the deployment of the meta-classifier to predict the class, i.e., the regression algorithm, for new unlabeled meta-example. The predicted regressor is then used to forecast the target of the base-level data. In-depth details about MetaStream can be found elsewhere [51].
Data characterization for meta-learning
In conventional meta-learning systems, the meta-data set employed to support algorithm selection is generated by extracting characteristics from several data sets, with distinct morphologies. Here, we consider that two data sets are morphologically different if they are not described by the same set of attributes with the same domains. Let
For simplicity, we make the assumption that the order of the attributes must be the same in both data sets to be considered morphologically equivalent.
Some measures that are commonly used to characterize morphologically different data sets produce different sets of characteristics, regarding the number of characteristics and their semantics. For instance, the correlation coefficient is computed for each pair of numeric attributes, resulting in
In data stream environments, meta-learning systems usually are employed for algorithm (or model) selection at different time points for only one data set, rather than many different data sets [51, 22, 37]. Since a data stream is generally described by the same set of attributes over time, the characteristics extracted from each attribute or its relations at a time point can be used directly as meta-features. The correlation coefficients between every pair of numeric attributes, for instance, may be used as a set of meta-features, without any type of summarization to transform them into a single value. Moreover, due to changes that may naturally occur along the time, feature extraction must be a regular process to support algorithm selection in a data stream scenario.
Some measures usually employed in meta-learning studies for feature extraction from stationary data [55, 32, 38, 3] may also be used for data streams characterization. In this case, the data associated with a time period is treated as a separate data set. In addition, the temporal component of examples is a relevant aspect in environments that continuously generate data, which are likely to change for some time scale [6]. Although the dependence between successive examples in data streams is usually not as strong as in traditional time series problems, mainly for long-term predictions, it is generally still a relevant characteristic [6]. Therefore, measures that take the arrival order into account can also be useful for data streams. In particular, measures investigated for feature extraction of time series, such as serial correlation, should be considered [1, 47, 62, 39].
In order to extract useful information from data streams for algorithm selection, one should examine both the arrival order of the examples and the variables of the problem available during learning. Jointly considering those aspects leads to a better manner to build sets of suitable meta-features for the meta-learning problem at hand. In this section, the interaction between both issues will be explored. In Subsection 4.1, we present a division of data according to the aforementioned aspects and provide some guidelines to extract characteristics from data streams at the base-level. In Subsection 4.2, we consider gathering information from meta-level to enhance the process of data characterization. In Subsection 4.3, we present the concepts of characteristics dependent and independent of the data morphology and draw a connection to algorithm selection.
Data stream at the base-level.
In order to facilitate the feature extraction process, we divide the data stream according to the order of arrival of the examples and to the variables involved in the base-level learning, as depicted in Fig. 1. The variables are divided into predictive attributes (
Figure 1 illustrates how such divisions of attributes and examples take place for a data stream at a time point where the example
Characterizing single subsets and their relations
In the following, we present all single subsets according to Fig. 1, and their relations.
and
and
and
,
and
Relations of relations
Besides the information gathered from the relations between the subsets of predictive attributes, model predictions and target values, the characterization of the relation between two relations may also be useful. For instance, consider the following two relations for regression problems: 1) the correlation between a predictive attribute and the target attribute for the training data (
may provide information about model behavior for that specific attribute. In this case, a small difference between these correlations may be an evidence that the predicted values presented the same tendency of the target values regarding the predictive attribute analyzed.
Similar information can be obtained from relations between different data subsets. The difference between the average of a numeric predictive attribute from
Meta-features from the meta-level
Heretofore, we have presented a guidance for characterization of data streams focusing on the information that can be attained from the base-level. A different approach can be followed, by extracting characteristics from the meta-level learning process itself to support algorithm selection [37]. Based on this approach, we will discuss in this section which useful information for this task can be gathered from the meta-level, since a stream of meta-examples is generated over time.
According to Fig. 1, a meta-example is created for each new
Considering that the meta-target is a categorical value that identifies the best algorithm for the corresponding base-level example (s), it is possible to suggest some meta-features that can be obtained from the meta-data available, namely: i) the class distribution of the meta-examples; ii) the error rate for each class; iii) a nominal value indicating if the meta-example was predicted correctly or not; and iv) the last predicted class. These meta-features are always computed up to the previous time instant. Thus, the class distribution in the meta-level for the meta-example
Data in the meta-level.
As previously noted, traditional meta-learning studies about algorithm selection generate meta-data by extracting characteristics from several data sets, which may have different morphologies [8, 7, 54, 46]. This process results in an uneven set of meta-features, since some measures yield an output value for each attribute or relation between multiple attributes. Consequently, if the learning algorithm employed in the meta-level is propositional [49], the output values must first be somehow aggregated into a single value. Thus, all data sets can be described by the same meta-features.
A shortcoming of this approach is that this aggregation may hide important information. For instance, if a data set has attributes that are strongly positively correlated (
In this paper, we propose a distinction between characteristics based on whether they dependent on data morphology or not. Morphology-dependent meta-features cannot be used in meta-learning applications involving data sets with varying data morphologies. To be used in these applications, they must be transformed, e.g. by summarization. For example, the correlation between attributes discussed earlier is a typical case of morphology-dependent characteristic.
Morphology-independent characteristics can be obtained from data sets with different morphologies and used directly as meta-features. The number of numeric and nominal attributes, and the number of classes, in a classification problem, are examples of morphology-independent measures.
As discussed earlier, the morphology of the data is typically not an issue in meta-learning for data streams. Both morphology-dependent and -independent meta-features can be used directly. Concerning the former, characterization can be done directly with independent measures, such as those concerning the target attribute (
Experiments
The purpose of the experiments carried out in this paper is two-fold. First, we intend to show how this guidance can assist users to construct meta-features that describe data streams. For such, we applied MetaStream [51] for the task of algorithm selection and compared it to a baseline method. While the former makes explicit use of the extracted meta-features, the latter is guided only by the meta-target. Second, we examine the role of dependent and independent meta-features in the performance of the meta-learning system by comparing two sets of predictive data. One set consists of only independent meta-features and the other consists of both independent and dependent meta-features. The whole experimentation study was performed using three real world problems.
In Section 5.1, the data sets used in the experiments are described. The algorithms employed to induce the base learning model and related parameters values are presented in Section 5.2. In Section 5.3, we describe the meta-level setup. Finally, in Section 5.4, the data characterization process using the proposed set of guidelines is discussed.
Time-changing data
In this paper, we evaluate MetaStream on three problems that constantly generate new data: i) Travel Time Prediction (TTP) of bus lines, which is crucial in ground public transportation systems; ii) Electricity Demand Prediction (EDP), which is relevant for operations and system planning; and iii) flight departure delays (Airline), which is important for air transportation systems. Table 1 shows some characteristics of these data sets, namely the number of examples (
Main characteristics of the data sets investigated
Main characteristics of the data sets investigated
Travel time prediction (TTP) may be useful for different purposes in public transportation, like in the definition of crew’s duties and for helping users to decide the best route and departure time to arrive on time at their destination. Here, we are concerned with employing regression models to predict travel time, sometime in advance (minutes or hours). The data comes from a study of the TTP problem for planning of passenger transport companies using data collected from the Sociedade de Transportes Colectivos do Porto SA (STCP2
www.stcp.pt.
The electricity demand prediction (EDP) data were collected by the Australian New South Wales Electricity Market and is usually referred to as Elec2 [28]. Each example on the data set is described by 6 attributes: day of week, time stamp based on half hour periods (from 1 to 48), the electricity demand of the New South Wales (NSW) and Victoria (VIC) states, the scheduled electricity transfer between these states and the change of the price using a moving average of the last 24 hours (up or down). The target attribute is the NSW electricity demand. Here, we are interested in predicting the electricity demand for the next week (next 336 examples). In this paper, we removed the first 17,424 examples referring to the period up to the 4th of May 4, 1997, because they do not have values for the VIC electricity demand and the scheduled electricity transfer between states, which were included after that date.
Airline
The Airline data set was created by the American Statistical Association for a competition of statistical graphs in 2009 [4]. It contains examples of 120 millions flights in the United States between October 1987 and December 2008. Here, we use data of flights from a single route, from Chicago O’Hare (ORD) International airport to LaGuardia airport (LGA), in 2007 and 2008. After selecting this route and cleaning the data, we ended up with 20 285 examples, which were sorted by the scheduled departure time [18]. The set of attributes used here was the same as in previous work [31]. We had to remove some attributes, like origin and destination, because data from a single route was used. At the end, the following attributes describe the Airline data: date, day of week, scheduled departure time, scheduled arrival time, flight number, and scheduled elapsed time. The target attribute is the prediction of the departure delay.
Base-level setup
At the base-level, both incremental and batch learning algorithms could be employed to induce models to predict the target attribute values for the investigated tasks. However, given that there are just a few studies on incremental regression algorithms [61], we decided to use batch learning algorithms, namely:
Except for LR, implementations of the algorithms from R packages were used, respectively: randomForest [40], e1071 [41], rpart [58], stats [48] and earth [43]. The Weka implementation of LR was used [66]. The default parameter values of these algorithms as suggested in their respective packages were used in the experiments. The exception was the number of terms of the PPR algorithm, which does not have a default value and was set to 1, based on previous work [44].
At the base-level, these learning algorithms were applied and evaluated using a sliding window of a fixed size, which allows a better control over the data used to induce and test the models. For the TTP and Airline data sets, the training window size,
Parameter values for learning in the base-level for each data set
The predictive performance of the base-level models was assessed by the Normalized Mean Squared Error (NMSE) [56]. This measure is always computed for all base-level examples that belong to the selection subset, with size (
As in most meta-learning studies, the algorithm selection problem is considered here as a classification task, where each pair of algorithms constitutes a classification problem at the meta-level [32]. Thus, if
At the meta-level, we have to choose the learning algorithm (meta-learner) that will induce the meta-model. Due to the small number of meta-examples generated when an algorithm is chosen for a batch of examples, we decided to use batch learning algorithms, since they are more appropriate than incremental ones when there is a small amount of data [17, 21]. Thus, we decided to investigate RF and SVM as meta-learners. The former was used for similar purposes before [51] and obtained the best results, while the latter presented high predictive performance in many studies [45, 12, 5]. The default parameter values for the implementations of RF and SVM selected for this project were used in all experiments.
A common baseline classification method in machine learning is to classify a test example in the majority class of the training set, also known as default class. A natural extension of this evaluation method for sliding windows is to use the majority class of the meta-examples in the training window at each time point. In this sense, if the majority class changes over time, the prediction will be automatically updated. We will refer to this baseline method as Default.
In order to select the appropriate size of the training window (
Parameter values of the meta-level defined experimentally for each problem
Parameter values of the meta-level defined experimentally for each problem
The predictive performance of MetaStream and Default in the meta-level was assessed by the
The meta-features for the experiments were generated using the guidelines presented in Section 4. Table 4 shows the measures used to generate these meta-features and the subsets of data to which they were applied. Although there are some meta-features that could be extracted from relations between training and selection data, such as
In order to evaluate if the addition of dependent meta-features is able to improve the predictive performance of a meta-learning method for algorithm selection, the meta-features extracted from the base-level were split considering whether they are dependent or independent of the data morphology, as presented in Section 4.3. Thus, the MDInd meta-data set is composed of independent meta-features, obtained by computing the average, the maximum and the minimum of each measure showed in Table 4 for the respective subsets of data. The MDIndDep meta-data set adds dependent meta-features to the MDInd set. The dependent meta-features were generated by the same measures and data showed in Table 4, except for the actual target,
Since there may be an overlap between these two sets of meta-features, only the dependent meta-features which are not highly correlated to the independent meta-features are maintained in the meta-data set MDIndDep. In these experiments, we empirically defined a threshold of 0.9 for the Pearson correlation coefficient to determine when meta-features are highly correlated.
Measures of characterization and data for which their are applied
Measures of characterization and data for which their are applied
The target at the meta-level is a categorical value that indicates the best learning algorithm at the base-level, i.e. which algorithm obtained the smallest NMSE for the selection subset. As the algorithms are evaluated in pairs, we have a binary classification problem for algorithm selection, as stated in Section 5.3. However, a meta-example can be labeled only after the actual target of the corresponding base-level examples are observed. Thus, there may be a delay between predicting the target of a meta-example and labeling it. The meta-targets for both meta-data sets are equal, i.e. MDInd and MDIndDep differ only in their meta-features.
The predictive performance of MetaStream, considering the meta-data sets MDInd and MDIndDep, and Default are shown in Fig. 3. This performance was assessed by the
MetaStream (MDInd and MDIndDep) and Default predictive performance for the data sets TTP, EDP and Airline. RF were employed as meta-learners in the MetaStream method.
These results evidence the superiority of MetaStream (MDInd and MDIndDep) compared to Default, with
In order to analyze the influence of the independent and dependent meta-features, we computed their relative weight for the RF algorithm. The relative weight of each meta-feature was based on the Mean Decrease Accuracy (MDA) measure [9], which is calculated by the RF algorithm during the induction of the meta-model. The average relative weight of the independent and dependent meta-features is determined as follow. First, the meta-features are sorted by decreasing weight (the highest, the better), and only the first 10% are selected. Next, these meta-features are split in two sets according to their dependence of data morphology: the independent,
The relative weights of the top ranked independent and dependent meta-features for the TTP, EDP and Airline data sets.
In Fig. 4 we present the relative weights of independent and dependent meta-features for one pair of regressors of each meta-data set. These pairs are LR/PPR, RF/SVM and PPR/CART for the meta-data sets TTP, EDP and Airline, respectively. In general, the behavior presented for these pairs of regressors are similar to the other pairs. These graphics show that RF assigned higher weights to independent meta-features (about 75%) in comparison with dependent meta-features (about 25%) for all data sets. These plots agree with the results shown in Fig. 3, i.e., the addition of dependent meta-features did not contributed to the improvement of the MetaStream performance using RF.
MetaStream (MDInd and MDIndDep) and Default predictive performance for the data sets TTP, EDP and Airline. SVM were employed as meta-learners in the MetaStream method.
The importance of the meta-features and the MetaStream performance can also be influenced by the learning algorithm employed as meta-learner. Figure 5 shows the results of MetaStream (MDInd and MDIndDep) when SVM was used as meta-learner. According to these graphics, when MetaStream used the MDIndDep meta-dataset, its performance remarkably improved compared to MDInd for the TTP and EDP data sets. In order to have more evidence about the influence of the dependent meta-features, we applied the Wilcoxon statistical test with 0.05 of significance to compare both meta-data sets across all pairs of regressors. This test resulted in a significant difference, i.e., the enhancement of the MetaStream method was supported by the addition of the dependent meta-features. These results disagree with those observed when RF was used as meta-learner. A possible reason is that the RF algorithm has an inner feature selection mechanism while SVM may have been leveraged by irrelevant independent meta-features [63, 64, 11], whose influence was reduced by the addition of relevant dependent meta-features.
Regarding predictive performance, MetaStream achieved greater
In this paper, we provide a set of guidelines to characterize data streams. Taking into account both the order of examples and the type of variables involved in the base-level, we assemble a wide range of possibilities to extract characteristics from time-changing data considering different measures for this purpose. With proper information about the data, the task of algorithm selection for data streams can take place.
In order to evaluate the suitability of this approach, we carried out experiments with three data streams problems. In this process, we considered commonly used measures for characterization and the MetaStream method was employed to select the best learning algorithm over time. Moreover, we analyzed the influence of meta-features dependent and independent of the data morphology. According to the experimental results, the proposed guidelines could successfully be employed to discover relevant meta-features from base-level data and their relations.
As future work, we plan to investigate characteristics of relations between relations and relations from different time points (e.g., training
Footnotes
Acknowledgments
The authors would like to thank the financial support of funding agencies FAPESP (2008/11569-6), CAPES and FCT (224/09 – BEX3231/10-0), and CNPq. This work was also partly funded by the ERDF – European Regional Development Fund through the COMPETE programme (operational programme for competitiveness) within project GNOSIS, cf. “FCOMP-01-0202-FEDER-038987”. It is also funded by the North Portugal Regional Operational Programme (ON.2 O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development Fund (ERDF), and by national funds, through the Portuguese funding agency, Fundação para a Ciência e a Tecnologia (FCT) through projects “NORTE-07-0124-FEDER-000057” and “NORTE-07-0124-FEDER-000059”. The work is also financed by the ERDF European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within projects “FCOMP-01-0124-FEDER-037281”. The research leading to these results has also received funding from the ECSEL Joint Undertaking, the framework programme for research and innovation horizon 2020 (2014-2020) under grant agreement n 662189-MANTIS-2014-1.
