An ensemble-based method for the selection of instances in the multi-target regression problem

Abstract

The multi-target regression problem comprises the prediction of multiple continuous variables at the same time using a common set of input variables. In the last few years, this problem has gained increasing attention due to the broad range of real-world applications that can be analyzed under this framework. The multi-target regression problem is more complex than the single-target one because target variables often have statistical dependencies, and these dependencies must be correctly exploited to solve the problem effectively. Consequently, additional difficulties appear when the aim is to select instances on this type of data. In this work, an ensemble-based method to perform the instance selection task in multi-target regression problems is proposed. First, a well-known instance selection method is adapted to work directly with multi-target data. Second, the proposed ensemble-based approach uses a set of these adapted methods to select the final subset of instances. The members of the ensemble select partial data subsets, where each member operates on a different input space that is expanded with target variables, thereby exploiting the underlying inter-target dependencies. Finally, the ensemble-based method aggregates all the selected partial data subsets into a final subset of relevant instances by solving an optimization problem with a simple greedy heuristic. The experimental study carried out on 18 datasets shows the effectiveness of our proposal for selecting instances in the multi-target regression problem. The results show that the size of the datasets is considerably reduced, whilst the predictive performance of the multi-target regressors is maintained or even improved. The proposed method is also observed to be robust to the presence of noise in the data.

Keywords

Instance selection; multi-target regression; ensemble learning; ensemble-based instance selection

1. Introduction

In the past few years, the scientific community has paid increasing attention to problems that comprise the prediction of multiple outputs simultaneously, mainly due to the many real-world applications that can be studied within this framework [1, 2, 3, 4, 5]. Multi-target regression (henceforth MTR) is one of these problems, and it comprises the prediction of multiple continuous variables from a common set of input variables [6]. In other words, MTR algorithms aim to learn a predictive model that, given an unseen input vector X, predicts a target vector Z of numeric variables. This type of regression problem has been successfully applied to diverse engineering applications, including automatic control [7], energy efficiency [8] and signal processing [9].

To date, many methods have been proposed to tackle the MTR problem, and these can be organized into problem transformation and algorithm adaptation methods [10]. Problem transformation methods decompose an MTR problem into several single-target regression tasks. Recent research has focused on applying some well-known multi-label learning transformation methods to solve the MTR problem, mainly motivated by the tight connection between these two learning paradigms. In this regard, Spyromitros-Xioufis et al. [6] demonstrated that several multi-label approaches, such as binary relevance [11], stacked generalization [12] and classifier chains [13], are straightforward to adapt to the MTR problem. On the other hand, the algorithm adaptation category comprises algorithms that do not decompose an MTR problem into several single-target regression tasks; i.e. they directly handle the multi-target data. In this category, many methods have been proposed, such as statistical techniques [14], support vector machines [15], kernel-based approaches [16], MTR trees [17], rule-based methods [18], and locally weighted regression methods [4].

Spyromitros-Xioufis et al. [6], Melki et al. [15], Reyes et al. [4] and many other authors have demonstrated that the MTR problem can be solved more effectively if the inter-target correlations are detected and exploited. However, the major challenges of MTR lie in how to model such inter-target dependencies correctly, and how to estimate the nonlinear relationships that may exist between the input and output spaces of the problem [19]. Moreover, when the MTR problem is studied, not all available training samples are useful to construct an accurate predictive model; it is well known that noisy, redundant and incomplete data can significantly deteriorate the performance of most learning algorithms [20]. Consequently, the acquisition of a high-quality and compact dataset, from which an algorithm can learn relevant data relationships, is also an important issue to be considered when tackling the MTR problem.

The instance selection task (henceforth IS) is an important data preprocessing step that aims to select a representative subset of an original dataset by filtering out noisy and redundant data, in such a manner that the predictive performance of the learner induced from the data subset is the same as (or even better than) if the original dataset were used [21]. Nowadays, these algorithms can bring many benefits to the scientific community, mainly due to their applications to the Big Data challenge [22]. The IS task has been widely studied for the classification problem (see, for instance, the works of Olvera-López et al. [21] and García et al. [23]); however, it has been far less studied for the regression problem [24]. The IS task in the regression problem has some difficulties that do not exist in the classification task. For instance, several IS methods assess the relevance of an instance by measuring its usefulness in predicting the correct classes of its nearest neighbors. However, the concept of class does not exist in the regression problem since the domain of the output variables is continuous. Likewise, the identification of class boundaries, an important criterion on which many IS methods for the classification problem are formulated, does not make sense in regression [24]. As for performing the IS task in the MTR problem, selecting the instances is more complex than in the single-target regression problem, mainly due to the aforementioned challenges that the MTR problem presents.

In the last two decades, ensemble-based methods have proven to be very effective techniques for improving the results in complex problems [25, 26, 27, 28, 29, 30, 31, 32]. Kocev et al. [33] and Spyromitros-Xioufis et al. [6], for example, demonstrated how a better predictive performance could be obtained in solving the MTR problem by using ensemble-based approaches. Furthermore, several authors have demonstrated that the IS task can be significantly improved by applying ensemble-based methods [34]. In this way, the relevance of the training instances is measured by considering not a single criterion but many approximations and, therefore, a more reliable estimation of the relevance of the instances is obtained.

In this work, an ensemble-based method to perform the IS task in the MTR problem is proposed. First, an error accumulation-based approach is introduced, which is an adaptation of the well-known family of Decremental Reduction Optimization Procedures (henceforth DROP) [35] to multi-target data. Second, an ensemble-based method that effectively combines the partial data subsets previously selected by each member of the ensemble is proposed. To obtain the final data subset, an aggregation process is carried out by a simple greedy heuristic that solves an optimization problem. The members of the ensemble select the partial data subsets on different input spaces which are expanded by target variables, thereby exploiting the underlying inter-target dependencies. In addition, the proposed method does not use any threshold value to decide whether an instance is selected or not, which makes it less dependent on the specific features of each problem. To the best of our knowledge, this is the first attempt to study the selection of instances in MTR, and the main motivation of this work is to analyze the benefits of the IS task for constructing better MTR models.

The effectiveness of the proposal is assessed through an extensive experimental study, where 18 datasets of varied features and different application domains are used. The results show that the proposed IS method can significantly boost the predictive performance of the multi-target regressors and, therefore, it can benefit the development of methods for solving complex problems that comprise the prediction of multiple outputs. A good trade-off between the predictive performance and reduction rate is attained; the size of the training sets is reduced without significantly deteriorating the predictive performance of the multi-target regressors. In addition, the proposed ensemble-based method proves to be robust on datasets that contain noisy samples.

The remainder of this paper is arranged as follows. Section 2 briefly describes the IS task and exposes the related works that have been proposed to perform the selection of instances in the regression problem. Section 3 presents the proposed ensemble-based method. Section 4 shows a description and discussion of the experimental results. Finally, some concluding remarks are presented in Section 5.

2. Related work

Roughly speaking, IS methods aim to reduce the size of the original training data while retaining or improving the predictive capacity of the models. The optimal outcome of an IS method is a minimum data subset from which a learning algorithm would accomplish the same task with no performance loss as if the original dataset were used [36]. However, some authors have noted that in practice, it is not always possible to maintain the performance levels as the dataset is reduced, and a loss of effectiveness may be inevitable [37]. IS methods have the following goals [36]: (I) decrease the computational cost for predicting new patterns; (II) reduce the storage requirements by removing redundant information from datasets; (III) improve the performance of learning algorithms by removing noise and outliers; and (IV) increase the efficiency when working on large-scale datasets.

Many IS methods have been proposed in the literature, and a complete description of these methods can be consulted in [36]. The IS methods can be categorized by considering the following three criteria: (I) the selection criterion used to select the instances; (II) the type of points that are removed in the IS process; and (III) the search direction used to obtain the final data subset.

The first category includes the wrapper [35] and filter methods [38]. The main difference between these two types of approaches is that the wrapper methods select the relevant instances based on the predictions made by a learning algorithm, whilst the filter methods do not rely on a classifier to determine the instances to be discarded from the training set.

Regarding the second criterion, the IS algorithms can be classified into condensation [38], edition [39] or hybrid methods [35]. The condensation methods retain the points closer to the decision boundaries (border points), preserving the training error, but at the expense of deteriorating the generalization test error. The edition methods remove the border points and maintain the internal points, obtaining smoother decision boundaries and reducing the generalization test error. The hybrid methods, on the other hand, remove both internal and border points, combining the advantages of the condensation and edition methods.

As for the third criterion, the IS algorithms can be classified into incremental [40], decremental [35], batch [41] or mixed methods [42]. The incremental methods start with an empty data subset and keep adding instances to it; in this case, the presentation order of the instances is an important issue that might affect the effectiveness of the IS algorithms. The decremental methods begin with the whole dataset and keep removing instances from it; in this case, the presentation order of the instances is still an important issue, but not as significant as in the case of the incremental methods. As for batch methods, they analyze all the instances without removing them, and at the end of the process, all the instances marked as disposable are removed; the complexity of this type of method is usually higher than that of incremental and decremental methods. Finally, the mixed methods begin with a pre-selected data subset, and the instances which satisfy a specific criterion can be added or removed; the pre-selected data subset may be constructed by either a random selection, an incremental method, or a decremental one.

Olvera-López et al. [21], García et al. [36], and many other authors have noted that the IS task for the classification problem is widely studied. However, the selection of instances in the regression problem has not followed the same path, and far less research exists in this regard [24]. Some works have proposed different evolutionary algorithms to perform the IS task in the regression problem. For example, Tolvi [43] presented a genetic algorithm that was able to detect the outliers in linear regression models, and Antonelli et al. [44] addressed the IS task through a multi-objective evolutionary learning approach. Also, there have been other efforts focused on studying the IS task in time series [45]. On the other hand, in the last few years, several works have focused on adapting to the regression problem some IS methods that were originally designed for the classification problem. For example, Kordos and Blachnik [46] adapted the Condensed Nearest Neighbor [40] and Edited Nearest Neighbor [39] methods, and Arnaiz-González et al. [47] proposed an adaptation of the DROP method [35]. Finally, another approach for performing the IS task in the regression problem comprises the discretization of the target variable [48]; in this case, any existing IS method can be used directly.

Independently, ensembles of methods have gained increasing popularity in the research community in the last few decades as a way to improve the efficiency and accuracy with which a given learning problem is solved [49, 50]. The advantages of the ensemble-based methods can be summarized as follows [50, 51]: (I) ensemble methods perform well in both scenarios, when there are very scarce data samples for learning and when a huge amount of data is available; (II) a combined classifier can have a better predictive performance than the best individual classifier; (III) combining methods trained from different samples could overcome the local optima problem; and (IV) an exact function may be impossible to model by any single hypothesis, but the combination of several hypotheses may expand the space of representable functions. Given the advantages provided by the ensemble learning paradigm, it is not surprising that ensemble-based methods have been used to perform the IS task [22, 34]. The main objective of such ensemble-based IS methods is to produce more reliable estimations of the relevance of the instances by aggregating the outputs produced by the members of the ensemble. In this regard, there are very few works that have studied the selection of instances in the regression problem following an ensemble-based approach. The most relevant works on this topic are the ones presented by Blachnik and Kordos [52] and Arnaiz-González et al. [24], who showed that a better IS process in the regression problem can be achieved by using bagging models.

Finally, it is important to note that all the aforementioned works have been designed for selecting instances on regression problems that have only one target variable, and they are not directly applicable to the MTR problem. As far as we know, an IS method for the MTR problem has not been proposed yet. In addition to the difficulties that appear when the IS process is performed on any regression problem (these were previously mentioned in the introduction of this work), the major challenges of MTR arise from modelling the inter-target correlations and complex input-output relationships. In the next section, an ensemble-based method for the selection of instances in the MTR problem is presented.

3. An ensemble-based method for the selection of instances

In this section, first, an error accumulation-based approach, which is an adaptation of the well-known DROP method to multi-target data, is introduced, and then, the ensemble-based method to perform the IS task in the MTR problem is presented.

3.1 A DROP-based extension for the MTR problem

Let $S=\{(\textbf{X}_{1},\textbf{Y}_{1}),(\textbf{X}_{2},\textbf{Y}_{2}),\ldots,(\textbf{X}_{n},\textbf{Y}_{n})\}$ denote a dataset of $n$ training instances. An instance $i\in S$ is represented as a tuple $(\textbf{X}_{i},\textbf{Y}_{i})$, where $\textbf{X}_{i}\in{\bm{X}}$ and $\textbf{Y}_{i}\in{\bm{Y}}$ are the input and target vectors of $i$, respectively. ${\bm{X}}$ represents the input space that contains $d$ input variables $\{\textbf{x}_{1},\textbf{x}_{2},\ldots,\textbf{x}_{d}\}$, whereas ${\bm{Y}}$ is the output space that comprises $q$ target variables $\{\textbf{y}_{1},\textbf{y}_{2},\ldots,\textbf{y}_{q}\}$. In addition, $x^{\ell}_{i}$ denotes the value of the ${\ell}$-th input variable for the instance $i$, whereas $y^{\ell}_{i}$ represents the value of its ${\ell}$-th target variable. An MTR algorithm aims to learn a predictive model $\Phi$ that, given an unseen input vector $\textbf{X}$, can predict a target vector $\textbf{Z}$ that best approximates the true target vector $\textbf{Y}$.

DROP is a well-known IS method that, according to the three categories portrayed in Section 2, can be classified as a wrapper, hybrid and decremental method. In this work, the DROP-based adaptation to the regression problem presented by Arnaiz-González et al. [47] is extended to multi-target data.

Let $N_{i}$ denote the set of $k$-nearest neighbours of the instance $i$ in the input space ${\bm{X}}$. Given a set of instances $S$, the set of associates of $i$ (denoted as $A_{i}$) comprises those training instances that include $i$ in their sets of $k$-nearest neighbours, i.e. $A_{i}=\{j\in S\,|\,i\in N_{j}\}$. Our proposed DROP-based method uses the following simple removal criterion: the instance $i$ can be safely removed from the training set if the target vectors of its associates can be correctly estimated without considering $i$. This rule calls for a reliability function for multi-target data that assigns proper scores according to the error levels obtained when predicting the target vectors of the associates. This reliability function is crucial for the success of the IS process, since the error associated with the elimination of a relevant instance can reinforce itself in subsequent iterations. However, developing a good reliability measure is not a simple task [53]. This issue has not been entirely resolved yet in single-target regression and classification, and even less so for the MTR problem [54].
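The neighbour and associate sets defined above can be illustrated with a minimal Python sketch. It assumes a brute-force search with the Euclidean distance, which the text does not prescribe, and the function names `k_nearest_neighbors` and `associates` are hypothetical names chosen for this illustration only.

```python
import numpy as np

def k_nearest_neighbors(X, k):
    """Row i holds the indices of the k nearest neighbours of instance i.

    Assumption: brute-force Euclidean distance in the input space;
    an instance is never counted as its own neighbour.
    """
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    # A stable sort makes ties resolve towards the lower index.
    return np.argsort(dist, axis=1, kind="stable")[:, :k]

def associates(neighbors):
    """A_i = {j : i in N_j}, returned as a list of sets."""
    n = neighbors.shape[0]
    A = [set() for _ in range(n)]
    for j in range(n):
        for i in neighbors[j]:
            A[int(i)].add(j)
    return A

# Three clustered instances and one isolated instance (index 3).
X = np.array([[0.0], [1.0], [2.0], [10.0]])
N = k_nearest_neighbors(X, k=2)
A = associates(N)
print(A)  # the isolated instance is nobody's neighbour, so A[3] is empty
```

In this toy example the isolated instance at index 3 has an empty set of associates, whereas the central instance at index 1 is an associate of every other instance.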

Different reliability scores have been proposed for single-target regression, such as the estimation based on sensitivity analysis [53], local cross-validation [53], analysis of the density of the distribution of instances [55], the variance of bagged models [56], and the estimation of the instances’ error by considering its local environment in the training set [57]. Recently, Levatić et al. [54] defined various reliability functions for the MTR problem following a semi-supervised approach. However, these last-mentioned functions are not directly applicable to our problem since they do not consider the true target vector of the instances.

In this work, given a training instance $i$, we follow a traditional approach to estimate the error made in predicting the target vectors of its associates by considering their local environments; it is similar to the approach proposed by Briesemeister et al. [57]. Given an associate $j$ of $i$, a multi-target algorithm $\Phi$ is trained on the dataset formed by the set of $k$-nearest neighbors $N_{j}$, and afterward the model predicts a target vector $\textbf{Z}_{j}$ for $j$. This procedure is performed for each associate $j\in A_{i}$, and the global error is estimated with the average relative root mean square error (aRRMSE)

$\frac{1}{q}\sum\limits_{\ell=1}^{q}{\sqrt{\frac{\sum\limits_{j\in A_{i}}{(y_{j}^{\ell}-z_{j}^{\ell})^{2}}}{\sum\limits_{j\in A_{i}}{(y_{j}^{\ell}-y_{m}^{\ell})^{2}}}}},$ (1)

where $y^{\ell}_{j}$ and $z^{\ell}_{j}$ are the values of the ${\ell}$-th target variable in the true $(\textbf{Y}_{j})$ and predicted $(\textbf{Z}_{j})$ target vectors of the associate $j$, respectively, and $y^{\ell}_{m}$ is the mean of the true values of the ${\ell}$-th target variable in the set of associates of $i$. aRRMSE is a measure widely used in the MTR literature [10]; it averages the relative RMSE values of the target variables, and the normalisation in each denominator automatically re-scales the error contribution of each target variable.
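As an illustration, Eq. (1) can be computed with a few lines of NumPy. This is only a sketch: the function name `arrmse` is hypothetical, and the guard for a zero denominator (which occurs, for instance, when the evaluated set holds a single instance) is an assumption that the text does not discuss.

```python
import numpy as np

def arrmse(Y_true, Y_pred):
    """Average relative root mean square error over the q target variables.

    Y_true, Y_pred: arrays of shape (m, q). The mean in the denominator
    is taken over the rows of Y_true, i.e. over the evaluated set.
    """
    Y_true = np.asarray(Y_true, dtype=float)
    Y_pred = np.asarray(Y_pred, dtype=float)
    num = ((Y_true - Y_pred) ** 2).sum(axis=0)
    den = ((Y_true - Y_true.mean(axis=0)) ** 2).sum(axis=0)
    # Assumption (not in the source): when a target is constant over the
    # evaluated set, fall back to the unnormalised error for that target.
    safe_den = np.where(den > 0.0, den, 1.0)
    return float(np.sqrt(num / safe_den).mean())

Y = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
mean_prediction = np.tile(Y.mean(axis=0), (3, 1))
print(arrmse(Y, mean_prediction))  # predicting the per-target mean gives 1.0
print(arrmse(Y, Y))                # a perfect prediction gives 0.0
```

The two targets in the example differ by a factor of ten in scale, yet both contribute equally to the score, which is the re-scaling effect of the denominator.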

Algorithm 1 DROPMTR algorithm
The function kNearestNeighbors $(i,S,k)$ computes the $k$ -nearest neighbors of the instance $i$ in the dataset $S$ . The function predictTargetVector $(\Phi,$ $N,i)$ trains the multi-target regressor $\Phi$ on $N$ and predicts the target vector of the instance $i$ .
Input
$S$ : training set of multi-target instances, $k$ : number of nearest neighbours, $\Phi$ : multi-target regressor
Output
$S_{S}\subseteq S$ : subset of training instances
Begin
$S_{S}\leftarrow S$
#Create an empty set of associates for each example $i\in S_{S}$
foreach $i\in S_{S}$ do
$A_{i}\leftarrow\emptyset$
end
foreach $i\in S_{S}$ do
# Find the $k$ nearest neighbours of $i$ on the input space ${\bm{X}}$
$N_{i}\leftarrow$ kNearestNeighbors( $i,S_{S},k$ )
# Add $i$ to the sets of associates
foreach $j\in N_{i}$ do
$A_{j}\leftarrow A_{j}\cup\{i\}$
end
end
foreach $i\in S_{S}$ do
# Predict the target vectors of the associate instances
foreach $j\in A_{i}$ do
predictTargetVector $(\Phi,N_{j},j)$
predictTargetVector $(\Phi,N_{j}\backslash\{i\},j)$
end
compute $\textit{Ewith}_{i}$
compute $\textit{Ewithout}_{i}$
# Check whether the instance $i$ can be removed or not
if $\textit{Ewithout}_{i}\leqslant\textit{Ewith}_{i}$ then
$S_{S}\leftarrow S_{S}\backslash\{i\}$
# Remove the instance $i$ from each set of nearest neighbours
foreach $j\in A_{i}$ do
$N_{j}\leftarrow N_{j}\backslash\{i\}$
# Find a new nearest neighbour for $j$
$p\leftarrow$ kNearestNeighbors $(j,S_{S}\backslash N_{j},1)$
# Add $p$ to the set of nearest neighbours of $j$
$N_{j}\leftarrow N_{j}\cup\{p\}$
# Add $j$ to the sets of associates of $p$
$A_{p}\leftarrow A_{p}\cup\{j\}$
end
end
end
return $S_{S}$
end

We believe that aRRMSE is a reliable estimator since it considers the actual errors made by the internal regressor. It also imposes almost no additional computational overhead, as opposed to some other estimation methods for regression. We denote by $\textit{Ewith}_{i}$ the global error in predicting the target vectors of the associates of $i$ when $i$ is kept in the sets of nearest neighbors of its associates, whereas $\textit{Ewithout}_{i}$ is the same error when $i$ is removed from those sets. Finally, the instance $i$ can be safely removed from the training set if $\textit{Ewithout}_{i}\leqslant\textit{Ewith}_{i}$. Algorithm 1 shows the steps performed by our error accumulation-based approach (hereafter referred to as DROPMTR).
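The steps of Algorithm 1 can be sketched in Python as follows. This is a simplified reading under stated assumptions, not the authors' implementation: the internal regressor $\Phi$ is replaced by a hypothetical stand-in that predicts the mean target vector of the training neighbours, distances are Euclidean, an instance without associates is treated as having zero error in both cases (so it is removed), and an instance is kept whenever its removal would leave an associate without neighbours. All function names are hypothetical.

```python
import numpy as np

def arrmse(Y, Z):
    num = ((Y - Z) ** 2).sum(axis=0)
    den = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
    # Assumption: guard against a zero denominator (e.g. a single associate).
    return float(np.sqrt(num / np.where(den > 0.0, den, 1.0)).mean())

def mean_regressor(X_train, Y_train, x):
    """Stand-in for Phi (assumption): mean target vector of the training rows."""
    return Y_train.mean(axis=0)

def nearest(X, i, candidates, k):
    """Indices of the k candidates closest to instance i (Euclidean)."""
    cand = np.array(sorted(candidates), dtype=int)
    d = np.linalg.norm(X[cand] - X[i], axis=1)
    return [int(c) for c in cand[np.argsort(d, kind="stable")[:k]]]

def drop_mtr(X, Y, k, predict=mean_regressor):
    n = len(X)
    kept = set(range(n))
    N = {i: nearest(X, i, kept - {i}, k) for i in range(n)}
    A = {i: set() for i in range(n)}
    for i in range(n):
        for j in N[i]:
            A[j].add(i)

    def error(assoc, drop=None):
        if not assoc:                      # no associates: nothing to predict
            return 0.0
        Z = []
        for j in assoc:
            nb = [p for p in N[j] if p != drop]
            if not nb:                     # j would be left without neighbours
                return float("inf")
            Z.append(predict(X[nb], Y[nb], X[j]))
        return arrmse(Y[assoc], np.array(Z))

    for i in range(n):
        assoc = sorted(A[i])
        if error(assoc, drop=i) <= error(assoc):
            kept.discard(i)
            for j in assoc:
                N[j] = [p for p in N[j] if p != i]
                cand = kept - set(N[j]) - {j}
                if cand:                   # find a replacement neighbour for j
                    p = nearest(X, j, cand, 1)[0]
                    N[j].append(p)
                    A[p].add(j)
    return sorted(kept)

# Instance 0 is isolated in the input space; the rest lie on a line.
X = np.array([[100.0], [0.0], [1.0], [2.0], [3.0], [4.0]])
Y = np.array([[5.0, -5.0], [0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
selected = drop_mtr(X, Y, k=2)
print(selected)
```

In this toy run the isolated instance 0 has no associates and is discarded first, while instance 1 is kept because removing it would worsen the prediction of its only associate. Any regressor with the same call signature could be passed through the `predict` argument.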

The proposed method is able to detect errors and outliers in data. For example, if the set of associates of the instance $i$ is empty $(A_{i}=\emptyset)$, it could mean that this instance is an outlier since $i$ is not a neighbour of any training instance. In addition, it could be the case that $\textit{Ewithout}_{i}=\textit{Ewith}_{i}$ but $A_{i}\neq\emptyset$, meaning that the instance $i$ does not contribute to a better prediction of the target vectors of its associates, and therefore it can be safely removed.

The main advantages of the proposed approach are that it can be wrapped around any existing MTR regressor (problem transformation or algorithm adaptation methods), and that it does not depend on any threshold value to decide whether an instance is selected. Therefore, the IS process can benefit significantly from the capacities of the internal regressor. The proposed IS method can implicitly exploit the inter-target dependencies for the selection of more relevant instances if the internal MTR regressor is able to model such correlations. In this sense, the challenge of modelling complex input-output relationships can also be effectively tackled, since our proposal can be applied with any linear or non-linear MTR algorithm.

Regarding the runtime complexity of our proposal, let $S$ be a dataset with $n$ instances, $d$ input variables and $q$ target variables, and let $f_{k}(n,d)$ denote the cost of determining the $k$-nearest neighbours of an instance in the input space ${\bm{X}}$. Then, $O(n\times f_{k}(n,d))$ steps are needed to compute the lists of associates of all training instances. Likewise, let $f_{\Phi}(N,i)$ denote the cost of training the multi-target regressor $\Phi$ on the dataset $N$ (a dataset formed by $k$ instances) plus the cost of predicting the target vector of the instance $i$. Then, $O(n\times a_{avg}\times f_{\Phi}(N,i))$ steps are required to compute the accumulated errors of all the training instances, where $a_{avg}$ is the average number of associates per instance. Consequently, the overall runtime complexity of the proposed IS method is $O(\max(n\times f_{k}(n,d),n\times a_{avg}\times f_{\Phi}(N,i)))$.

3.2 Ensemble-based IS method

We propose an ensemble-based method for tackling the IS task in the MTR problem since more reliable estimations on the importance of the instances could be obtained if multiple approximations were considered. Similar to Spyromitros-Xioufis et al. [6], we adopted an approach that composes the ensemble by means of adding target variables to the input space of the MTR problem.

The rationale of our proposal is as follows. Given a multi-target dataset $S$ with $q$ target variables, our ensemble is formed by $q+1$ members, where $q$ of them select a data subset from a slightly different version of the original dataset $S$. The ${\ell}$-th member $({\ell}\leqslant q)$ of the ensemble (denoted as $I_{\ell}$) selects the instances from a multi-target dataset which has an input space equal to ${\bm{X}}\cup\{\textbf{y}_{\ell}\}$, and an output space equal to ${\bm{Y}}\backslash\{\textbf{y}_{\ell}\}$. In this way, the member $I_{\ell}$ models the contribution of the ${\ell}$-th target to predicting the rest of the target variables $\textbf{y}_{p}\in{\bm{Y}}$, $p\neq{\ell}$, and therefore, $I_{\ell}$ would select a data subset formed by those relevant instances that reflect the inter-target dependencies related to the ${\ell}$-th target variable. The last member of the ensemble $(I_{q+1})$ is executed on the original dataset $S$, and the selected data subset would comprise those instances that are relevant for predicting all target variables. Finally, each member $I_{\ell}$ of the ensemble returns a subset $S_{\ell}$, ${\ell}\in\{1,2,\ldots,q,q+1\}$, and then an aggregation process determines the best subset $(S_{e})$ from these $q+1$ partial data subsets.
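The construction of the $q+1$ datasets on which the members operate can be sketched as follows. The function name `ensemble_views` is hypothetical, and the data are assumed to be stored as NumPy arrays with one row per instance; indices are zero-based, so member `l` in the code corresponds to $I_{l+1}$ in the text.

```python
import numpy as np

def ensemble_views(X, Y):
    """Return the (X_new, Y_new) pairs used by the q+1 ensemble members.

    Member l (0 <= l < q): the l-th target is appended to the inputs and
    removed from the outputs. Member q: the original dataset.
    Assumes q >= 2, so that every output space keeps at least one target.
    """
    q = Y.shape[1]
    views = []
    for l in range(q):
        X_new = np.hstack([X, Y[:, [l]]])   # input space: X plus y_l
        Y_new = np.delete(Y, l, axis=1)     # output space: Y minus y_l
        views.append((X_new, Y_new))
    views.append((X.copy(), Y.copy()))      # last member: original dataset
    return views

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # n = 5 instances, d = 3 inputs
Y = rng.normal(size=(5, 2))   # q = 2 targets
views = ensemble_views(X, Y)
for X_new, Y_new in views:
    print(X_new.shape, Y_new.shape)
```

With $d=3$ and $q=2$, the first two members work with four input variables and one target each, and the last member works with the original three inputs and two targets.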

Figure 1. Schema of the proposed ensemble-based method.

Figure 1 shows the general schema of the proposed ensemble-based method for performing the IS task in the MTR problem. It is noteworthy that the diversity of the members of the ensemble is achieved by executing each member on a different dataset. Also, note that the members must be IS methods able to work directly with multi-target data, such as the DROPMTR method. Moreover, in order to add more diversity to the ensemble, the presentation order of the instances is randomly changed for each new dataset on which a member is executed. This action also makes the ensemble method less sensitive to the presentation order of the instances, which is a limitation of any DROP-based method.

Another important issue to analyse in our approach is how to aggregate the $q+1$ partial data subsets into a final data subset of instances; this component is of major importance in all ensemble-based methods. In this work, we adopt a stacking approach [58], where $q+1$ independent members select data subsets from $S$, and one extra model then produces an optimal combination of the outputs of the members.

Let us say $c_{i}$ represents the times the instance $i\in S$ is selected by the members, and it can be calculated as

$c_{i}=\sum\limits_{\ell=1}^{q+1}{1_{S_{\ell}}}(i),$ (2)

where $1_{S_{\ell}}$ is the indicator function that states whether the instance $i$ is in the data subset $S_{\ell}$ selected by the ${\ell}$-th member or not. It is important to note that, although each member is executed on a dataset that slightly differs from the original dataset $S$ and the presentation order of the instances is changed in each dataset, a function that maps the new indexes of the instances to the original indexes in the dataset $S$ can be easily constructed. In this way, given the data subset $S_{\ell}$ selected by the ${\ell}$-th member of the ensemble, it is easy to retrieve the corresponding original instances from $S$.
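A minimal sketch of Eq. (2), including the mapping from shuffled positions back to the original indexes, is given below. It assumes that each member reports the positions it selected in its own shuffled dataset together with the permutation used for shuffling; this representation and the function name `selection_counts` are illustrative assumptions.

```python
import numpy as np

def selection_counts(n, member_selections):
    """Compute c_i, the number of members that selected instance i.

    member_selections: list of (perm, selected_positions) pairs, where
    perm[r] is the original index of the row at position r of the shuffled
    dataset and selected_positions are distinct positions in that dataset.
    """
    c = np.zeros(n, dtype=int)
    for perm, selected_positions in member_selections:
        positions = np.asarray(selected_positions, dtype=int)
        original = np.asarray(perm, dtype=int)[positions]  # back to indexes in S
        c[original] += 1
    return c

# Three members operating on a dataset with n = 5 instances.
members = [
    ([2, 0, 4, 1, 3], [0, 2]),      # selects original instances 2 and 4
    ([0, 1, 2, 3, 4], [2, 3]),      # selects original instances 2 and 3
    ([4, 3, 2, 1, 0], [0, 2, 4]),   # selects original instances 4, 2 and 0
]
c = selection_counts(5, members)
print(c.tolist())  # [1, 0, 3, 1, 2]
```

Instance 2 is selected by all three members, whereas instance 1 is selected by none and therefore does not belong to the union of the partial subsets.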

The set of instances that are selected by exactly ${\ell}$ members is denoted as $R_{\ell}=\{i\in S\,|\,c_{i}={\ell}\}$, and $T$ represents the set of instances resulting from the union of the subsets $S_{1},S_{2},\ldots,S_{q},S_{q+1}$. Therefore, our proposal aims to determine the subset of instances $S_{e}\subseteq T$ that generalises well the observations in $T$ and minimises the following cost function

$\frac{1}{q}\sum\limits_{\ell=1}^{q}{\sqrt{\frac{\sum\limits_{i\in T}{(y_{i}^{\ell}-z_{i}^{\ell})^{2}}}{\sum\limits_{i\in T}{(y_{i}^{\ell}-y_{m}^{\ell})^{2}}}}},$ (3)

where $y^{\ell}_{i}$ and $z^{\ell}_{i}$ are the values of the ${\ell}$-th target variable in the true $(\textbf{Y}_{i})$ and predicted $(\textbf{Z}_{i})$ target vectors of $i$, respectively, and $y^{\ell}_{m}$ is the mean of the true values of the ${\ell}$-th target in the set $T$. Note that this cost function is simply the aRRMSE measure, now defined over the set $T$.

Solving this optimization problem with classical methods could take a considerable runtime since the objective function requires the evaluation of a multi-target regressor $\Phi$ for predicting the target vector of each instance $i\in T$. Moreover, if we consider this formulation as a search problem, the number of feasible solutions is $2^{|T|}-1$, resulting in a huge search space. Consequently, in this work, the following simple heuristic is defined: the instances that were selected by a higher number of members are preferred over those instances with a lower selection frequency. Accordingly, the following hill climbing process is proposed to compute the final subset of instances $S_{e}$: taking as starting point the set of most frequently selected instances (denoted as $R_{m}$), keep adding to $S_{e}$ the next set of instances with the highest selection frequency $(R_{\ell}\,|\,1\leqslant{\ell}<m)$, until a degradation in estimating $T$ is obtained.
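The hill climbing process described above can be sketched in Python as follows. This is one possible reading of the heuristic under stated assumptions: the regressor $\Phi$ is replaced by a hypothetical 1-nearest-neighbour stand-in, a frequency set is accepted when the cost does not increase (ties are accepted), and the search stops at the first increase. The function names are hypothetical.

```python
import numpy as np

def arrmse(Y, Z):
    num = ((Y - Z) ** 2).sum(axis=0)
    den = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
    # Assumption: guard against a zero denominator.
    return float(np.sqrt(num / np.where(den > 0.0, den, 1.0)).mean())

def nn1_regressor(X_train, Y_train, X_test):
    """Stand-in for Phi (assumption): 1-nearest-neighbour regression."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return Y_train[np.argmin(d, axis=1)]

def aggregate(X, Y, c, predict=nn1_regressor):
    """Greedy aggregation of the partial subsets from the counts c_i."""
    c = np.asarray(c)
    T = np.flatnonzero(c > 0)                 # union of the partial subsets
    if T.size == 0:
        return T, float("nan")
    levels = sorted(set(c[T].tolist()), reverse=True)
    S_e = np.flatnonzero(c == levels[0])      # start from R_m
    best = arrmse(Y[T], predict(X[S_e], Y[S_e], X[T]))
    for level in levels[1:]:                  # R_{m-1}, R_{m-2}, ...
        candidate = np.concatenate([S_e, np.flatnonzero(c == level)])
        cost = arrmse(Y[T], predict(X[candidate], Y[candidate], X[T]))
        if cost > best:                       # degradation: stop the search
            break
        S_e, best = candidate, cost
    return np.sort(S_e), best

X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
c = np.array([3, 1, 0, 3])                    # selection counts c_i
S_e, cost = aggregate(X, Y, c)
print(S_e.tolist(), cost)                     # [0, 1, 3] 0.0
```

In the example, the starting set $R_{3}=\{0,3\}$ mispredicts instance 1, so adding $R_{1}=\{1\}$ lowers the cost and is accepted; instance 2 was selected by no member and is never considered.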

Algorithm 2 EDROPMTR algorithm
The function construct $(S,{\bm{X}}_{new},{\bm{Y}}_{new})$ constructs a new dataset from $S$ but considering the input and output spaces ${\bm{X}}_{new}$ and ${\bm{Y}}_{new}$, respectively. The function shuffle $(S)$ changes the presentation order of the instances in $S$. The function dropMTR $(S,k,\Phi)$ performs the DROPMTR algorithm on the dataset $S$, using the internal multi-target regressor $\Phi$ and considering $k$ nearest neighbours. The function retrieveOriginal $(S_{\ell},S)$ transforms all the instances $i\in S_{\ell}$, recovering their original forms as they appear in $S$. The function predictTargetVectors $(\Phi,S,T)$ trains the regressor $\Phi$ on $S$ and predicts the target vectors of all the instances $i\in T$.
Input
$S$ : training set, $k$ : number of nearest neighbours, $\Phi$ : multi-target regressor
Output
$S_{e}\subseteq S$ : subset of training instances
Begin
#Compute the frequency of selection of each instance
foreach $i\in S$ do
$c_{i}\leftarrow 0$
end
$T\leftarrow\emptyset$
# Construct the members
foreach ${\ell}\in\{\textit{1},\ldots,q+1\}$ do
$R_{\ell}\leftarrow\emptyset$
if ${\ell}\leqslant q$ then
# Remove the ${\ell}$-th target variable from ${\bm{Y}}$ and add it to ${\bm{X}}$
${\bm{Y}}_{new}\leftarrow{\bm{Y}}\backslash\{\textbf{y}_{\ell}\}$
${\bm{X}}_{new}\leftarrow{\bm{X}}\cup\{\textbf{y}_{\ell}\}$
else
${\bm{Y}}_{new}\leftarrow{\bm{Y}}$
${\bm{X}}_{new}\leftarrow{\bm{X}}$
end
# Construct a new training set from $S$
$S_{new}\leftarrow\textit{construct}(S,{\bm{X}}_{new},{\bm{Y}}_{new})$
# Change the order presentation
$S_{new}\leftarrow\textit{shuffle}(S_{new})$
# Execute DROPMTR on the training set $S_{new}$
$S_{\ell}\leftarrow$ dropMTR $(S_{new},k,\Phi)$
# Retrieve the original instances from $S$
$S_{\ell}\leftarrow$ retrieveOriginal $(S_{\ell},S)$
$T{\leftarrow T}\cup S_{\ell}$
# Increment the frequency of selection of the instances
foreach $i\in S_{\ell}$ do
$c_{i}\leftarrow c_{i}+1$
end
end
# Construct the frequency sets
foreach $i\in S$ do
if $c_{i}>0$ then
$R_{c_{i}}\leftarrow R_{c_{i}}\cup\{i\}$
end
end
# Determine the set of instances most selected
$m\leftarrow\max\{{\ell}\in\{1,\ldots,q+1\}\mid R_{\ell}\neq\emptyset\}$
# Hill climbing process
$S_{e}\leftarrow R_{m}$
predictTargetVectors $(\Phi,S_{e},T)$

$e_{best}\leftarrow$ equation 3
foreach ${\ell}\in\{m-1,m-2,\ldots,1\}$ do
if $R_{\ell}\neq\emptyset$ then
predictTargetVectors $(\Phi,S_{e}\cup R_{\ell},T)$
$e_{new}\leftarrow$ equation 3
if $e_{new}\leqslant e_{best}$ then
$S_{e}\leftarrow S_{e}\cup R_{\ell}$
$e_{best}\leftarrow e_{new}$
else
break
end
end
end
return $S_{e}$
end
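
The aggregation phase of Algorithm 2 (frequency counting followed by hill climbing) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: `aggregate` and `error_of` are hypothetical names, and `error_of` stands in for training $\Phi$ on a candidate subset and evaluating Eq. (3) on $T$.

```python
from collections import defaultdict

def aggregate(member_subsets, error_of):
    """Greedy aggregation of the partial subsets S_1..S_{q+1}.

    member_subsets: list of sets of original instance ids, one per member.
    error_of: hypothetical callable that returns the error (e.g. aRRMSE on T)
              obtained when the regressor is trained on the given subset.
    """
    # c_i: number of members that selected instance i
    counts = defaultdict(int)
    for subset in member_subsets:
        for i in subset:
            counts[i] += 1

    # R[l]: instances selected by exactly l members
    freq_sets = defaultdict(set)
    for i, c in counts.items():
        freq_sets[c].add(i)

    m = max(freq_sets)              # highest non-empty frequency level
    selected = set(freq_sets[m])
    best = error_of(selected)

    # Add lower-frequency sets while the error does not get worse
    for level in range(m - 1, 0, -1):
        if level not in freq_sets:
            continue
        candidate = selected | freq_sets[level]
        err = error_of(candidate)
        if err <= best:
            selected, best = candidate, err
        else:
            break
    return selected

# Toy run: the stand-in error is smallest for subsets of size two
subsets = [{1, 2, 3}, {2, 3}, {3, 4}]
print(sorted(aggregate(subsets, lambda s: abs(len(s) - 2))))  # [2, 3]
```

In the toy run, instance 3 is selected by all three members, instance 2 by two, and instances 1 and 4 by one; adding the last group increases the stand-in error, so the search stops.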

Algorithm 2 shows the steps of the proposed ensemble-based IS method (hereafter dubbed EDROPMTR). EDROPMTR comprises two phases: (I) the ensemble’s members select the partial data subsets; and (II) the $q+1$ selected data subsets are aggregated by the proposed greedy heuristic. Generally speaking, the first phase requires the construction of the datasets from which the members select the partial data subsets, the execution of $q+1$ IS methods, and finally, the computation of the frequency with which each instance is selected. Let $f_{\textit{DROPMTR}}$ denote the computational cost of the DROPMTR method. The overall runtime complexity of the first phase is then $O((q+1)\times f_{\textit{DROPMTR}})$, since the remaining steps of the first phase can be performed in linear time. Note that the members of the ensemble can be executed in parallel, so the efficiency can be significantly improved.
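
The first phase can be sketched as below: each of the $q+1$ members receives a shuffled copy of the data, and for the first $q$ members one target is moved into the input space. The names `build_member_datasets`, `run_members` and `select` are hypothetical; `select` stands in for DROPMTR, and the thread pool only illustrates that the members are independent of each other.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def build_member_datasets(X, Y, seed=0):
    """Return q+1 datasets, each a list of (original_index, inputs, targets)."""
    q = len(Y[0])
    rng = random.Random(seed)
    datasets = []
    for t in range(q + 1):
        rows = []
        for idx, (x, y) in enumerate(zip(X, Y)):
            if t < q:
                # Move the t-th target variable into the input space
                rows.append((idx, list(x) + [y[t]], y[:t] + y[t + 1:]))
            else:
                # The last member works on the original input/output spaces
                rows.append((idx, list(x), list(y)))
        rng.shuffle(rows)  # change the presentation order
        datasets.append(rows)
    return datasets

def run_members(datasets, select):
    """Run the IS routine on every member dataset concurrently.

    select: hypothetical IS routine that returns the rows it keeps.
    Returns one set of original instance indexes per member.
    """
    with ThreadPoolExecutor() as pool:
        kept = list(pool.map(select, datasets))
    return [{idx for idx, _, _ in rows} for rows in kept]

X = [[1.0], [2.0], [3.0]]
Y = [[10.0, 20.0], [11.0, 21.0], [12.0, 22.0]]
members = build_member_datasets(X, Y)
print(len(members))  # 3
```

Keeping the original index in every row plays the role of the mapping back to $S$. For CPU-bound selection in CPython, a process pool would likely be the more effective choice than threads.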

On the other hand, the second phase of the ensemble-based method comprises the execution of the proposed greedy heuristic, which in turn needs to train and test (at most $m$ times) the internal MTR regressor $\Phi$ to determine the data subset $S_{e}\subseteq T$ that yields the best estimation of $T$. Let $f_{\Phi}(S_{e},T)$ denote the cost of training $\Phi$ on $S_{e}$ plus the cost of testing $\Phi$ on $T$, where $S_{e}\subseteq T$. The overall complexity of the second phase is then $O(m\times f_{\Phi}(S_{e},T))$. It is noteworthy that the multi-target regressor used in the second phase is the same as the one employed by each member of the ensemble. Therefore, the overall complexity of the proposed ensemble-based method is $O(\max((q+1)\times f_{\textit{DROPMTR}},m\times f_{\Phi}(S_{e},T)))$.

4. Experimental study

In this section, the experimental study is described. First, the datasets and the other experimental settings are presented. Second, DROPMTR and EDROPMTR are run on all the datasets, with the aim of analysing whether the proposed IS methods improve or maintain the predictive performance of the MTR regressors, and of demonstrating that the best performance is attained by the proposed ensemble-based IS method.

4.1 Multi-target datasets

In this experimental study, the largest collection of MTR datasets publicly available was used [6]. The 18 datasets within this collection have a variety of features and belong to several application domains. Some of these datasets represent well-known engineering problems, for example: the dataset Electrical Discharge Machining (Edm) [59] represents a two-target regression problem, where the task is to minimize the machining time by reproducing the behaviour of a human operator that controls two variables; the dataset Energy Building (Enb) [8] concerns the prediction of the heating and cooling load requirements of buildings as a function of eight parameters; the dataset Concrete Slump (Slump) [60] comprises the prediction of three properties of concrete as a function of the content of seven concrete ingredients.

On the other hand, the Andromeda (Andro) [61] and Water Quality (Wq) [62] datasets concern the prediction of water quality parameters, whereas the Jura dataset [7] focuses on the prediction of the concentration of metals. The Solar Flare datasets [63] (Sf1 and Sf2) concern the prediction of the number of solar flares observed within one day. The River Flow datasets (Rf1 and Rf2) [6] concern the prediction of river network flows. Finally, the following datasets are associated with the business domain: Online Product Sales (Osales) [64], See Click Predict Fix (Scpf) [65], Airline Ticket Price (Atp1d and Atp7d) [6], Supply Chain Management (Scm1d and Scm20d) [6] and Occupational Employment Survey (Oes10 and Oes97) [6].

Table 1 shows a summary of the characteristics of the datasets. The datasets vary in size: from 49 up to 9,803 examples, from 7 up to 576 input variables, and from 2 up to 16 target variables. All the datasets have numeric input variables, except for Sf1 and Sf2 whose input variables are discrete. The datasets Scpf, Osales, Rf1, Rf2, Atp1d and Atp7d have missing values, which were replaced by the median values of the corresponding input variables. Finally, all the numeric variables were centred and scaled.
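
The preprocessing of a single numeric input variable can be sketched as follows. This is an assumption-laden illustration: the text does not specify the scaling used, so z-score standardization (zero mean, unit standard deviation) is assumed here, and `impute_and_standardize` is a hypothetical name.

```python
import statistics

def impute_and_standardize(column):
    """Median-impute missing values (None), then centre and scale the column.

    Assumption: "centred and scaled" is taken to mean z-score standardization.
    """
    observed = [v for v in column if v is not None]
    med = statistics.median(observed)
    filled = [med if v is None else v for v in column]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)
    if std == 0:
        # A constant column carries no information after centring
        return [0.0 for _ in filled]
    return [(v - mean) / std for v in filled]

# The missing value is replaced by the median (2.0) before scaling
print(impute_and_standardize([1.0, None, 3.0]))
```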

Table 1
Summary of the benchmark datasets

Dataset	#Instances	#Input vars.	#Target vars.
Andro	49	30	6
Atp1d	337	411	6
Atp7d	296	411	6
Edm	154	16	2
Enb	768	8	2
Jura	359	15	3
Oes10	403	298	16
Oes97	334	263	16
Osales	639	413	12
Rf1	9125	64	8
Rf2	9125	576	8
Scm1d	9803	280	16
Scm20d	966	61	16
Scpf	137	23	3
Sf1	323	10	3
Sf2	1066	10	3
Slump	103	7	3
Wq	1060	16	14

4.2 Experimental settings

Pugelj and Dzeroski [66] presented a simple adaptation of the classic $k$NN algorithm for the MTR problem. In this work, this $k$NN-based method was used as the internal MTR regressor of our IS methods (DROPMTR and EDROPMTR). The best number of nearest neighbours $(k)$ was estimated via cross-validation on the original datasets. This MTR regressor was chosen mainly for its simplicity and low computational cost. However, note that any other MTR algorithm could be used as the internal regressor, since the proposed methods follow a wrapper approach.

The parameter $k$ is also important for DROPMTR, since the lists of associates are created by computing the $k$ nearest neighbours of each instance of the training set. This parameter was set to the same value as the number of nearest neighbours used by the internal MTR regressor described before. As for the distance function used to compute the nearest neighbours of a point, the well-known Heterogeneous Euclidean Overlap Metric (HEOM) was used [67].
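
For reference, a minimal Python sketch of HEOM as it is commonly defined is given below: the overlap distance for nominal attributes, the range-normalized absolute difference for numeric attributes, and the maximal distance of 1 when a value is missing. The function signature and argument names are assumptions made for this example.

```python
import math

def heom(a, b, ranges, nominal):
    """Heterogeneous Euclidean Overlap Metric between two instances.

    a, b: attribute lists (None marks a missing value).
    ranges: ranges[j] = max - min of numeric attribute j in the training data.
    nominal: set with the indexes of the nominal attributes.
    """
    total = 0.0
    for j, (u, v) in enumerate(zip(a, b)):
        if u is None or v is None:
            d = 1.0                           # unknown values: maximal distance
        elif j in nominal:
            d = 0.0 if u == v else 1.0        # overlap distance
        else:
            # Range-normalized difference; a zero range means a constant attribute
            d = abs(u - v) / ranges[j] if ranges[j] > 0 else 0.0
        total += d * d
    return math.sqrt(total)

# Numeric (0.5), nominal mismatch (1) and missing value (1): sqrt(2.25) = 1.5
print(heom([1.0, "a", None], [3.0, "b", 5.0], [4.0, None, 10.0], {1}))  # 1.5
```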

On the other hand, Spyromitros-Xioufis et al. [6] showed that the method Ensemble of Regressor Chains (ERC) is one of the most significant state-of-the-art MTR methods. Hence, the effectiveness of the proposed IS methods was assessed by means of evaluating ERC on the selected data subsets. ERC is a problem transformation method, and therefore, it internally requires a single-target regressor. Three single-target regressors were used, namely RepTree, Linear Regression and the classic $k$ NN (the parameters proposed in [6] were used), resulting in three different combinations of ERC (dubbed as ERC-REPTree, ERC-LR and ERC- $k$ NN).

To estimate the predictive performance of the MTR models, the measure aRRMSE (previously described in Section 3) was analysed on the test sets. In all datasets, a 10-fold cross-validation was performed, and the aRRMSE values were averaged across all fold executions. In each fold execution, the following steps were conducted: (I) the IS method reduces the training set; (II) the multi-target regressor is trained on the selected data subset; and (III) the learned model is assessed on the test set. On the other hand, the effectiveness of the IS methods was also studied by means of analysing the reduction levels of the size of the training sets.
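
The evaluation protocol described above can be outlined as in the following sketch. The callables `select` and `fit_predict_error` are hypothetical placeholders for the IS method and for training plus testing the multi-target regressor; the way the folds are drawn here is also an assumption.

```python
import random

def cross_validate(n, k, select, fit_predict_error, seed=0):
    """k-fold protocol: reduce the training fold, train, then test.

    n: number of instances; k: number of folds.
    select: hypothetical IS routine, maps training indexes to the kept indexes.
    fit_predict_error: hypothetical routine that trains on the kept indexes
                       and returns the error (e.g. aRRMSE) on the test indexes.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    errors = []
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        reduced = select(train)                           # step (I)
        errors.append(fit_predict_error(reduced, test))   # steps (II) and (III)
    return sum(errors) / k

# Toy run with stand-in callables
print(cross_validate(20, 10, lambda tr: tr[:5], lambda red, te: len(red) / len(te)))  # 2.5
```

Note that the IS method only ever sees the training fold, so the test instances never influence which instances are selected.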

Table 2
Average reduction levels attained by DROPMTR and EDROPMTR

Dataset	DROPMTR	EDROPMTR
Andro	0.449	0.330
Atp1d	0.463	0.400
Atp7d	0.459	0.430
Edm	0.385	0.204
Enb	0.306	0.220
Jura	0.342	0.219
Oes10	0.494	0.524
Oes97	0.522	0.584
Osales	0.454	0.451
Rf1	0.372	0.351
Rf2	0.330	0.179
Scm1d	0.344	0.425
Scm20d	0.366	0.199
Scpf	0.327	0.192
Sf1	0.639	0.844
Sf2	0.731	0.710
Slump	0.338	0.272
Wq	0.452	0.728
Ave. reduction rate	0.432	0.403

Finally, non-parametric statistical tests were conducted to analyse and validate the obtained results, as proposed by Demsar [68]. All computational methods were implemented in the Java language and integrated into the MULAN library [69]. MULAN is built on top of the popular framework WEKA [70] and is designed for research in multi-label learning and MTR.

4.3 Reduction levels on the size of the datasets

This experiment aims to analyse whether the two proposed IS methods (DROPMTR and EDROPMTR) can significantly reduce the size of the datasets. The attained reduction rate on a dataset is calculated as $1-s_{r}/s_{o}$ , where $s_{r}$ is the number of instances in the selected data subset, and $s_{o}$ is the number of instances in the original dataset. The higher a reduction rate, the higher the percentage of instances that were removed from the training sets. Table 2 shows the reduction rates averaged across all fold executions. The best reduction rate attained in each dataset is highlighted in bold typeface.
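
As a small worked example of this definition, the rate can be computed per fold and then averaged. The function names are hypothetical, and the fold sizes below are made-up toy values rather than results from the experiments.

```python
def reduction_rate(n_selected, n_original):
    """Reduction rate 1 - s_r / s_o of a selected data subset."""
    return 1.0 - n_selected / n_original

def mean_reduction(fold_sizes):
    """fold_sizes: list of (n_selected, n_original) pairs, one per fold."""
    return sum(reduction_rate(s, o) for s, o in fold_sizes) / len(fold_sizes)

# Two toy folds: half and three quarters of the training instances removed
print(mean_reduction([(50, 100), (25, 100)]))  # 0.625
```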

Table 3
Results of the aRRMSE measure for ERC-REPTree. The Friedman’s statistic is equal to 8.333, and the null hypothesis was rejected with a $p$ -value $=$ 0.015 at the significance level $\alpha=$ 0.05

Dataset	Original	Subset ${}_{\textit{DROPMTR}}$	Subset ${}_{\textit{EDROPMTR}}$
Andro	0.595	0.744	0.674
Atp1d	0.438	0.436	0.431
Atp7d	0.605	0.699	0.657
Edm	0.923	0.906	0.862
Enb	0.133	0.155	0.140
Jura	0.689	0.687	0.675
Oes10	0.616	0.614	0.625
Oes97	0.706	0.838	0.811
Osales	0.782	0.898	0.878
Rf1	0.121	0.120	0.108
Rf2	0.147	0.134	0.105
Scm1d	0.358	0.351	0.341
Scm20d	0.476	0.509	0.496
Scpf	0.887	0.828	0.823
Sf1	1.081	0.832	0.819
Sf2	1.018	0.951	0.904
Slump	0.783	0.762	0.751
Wq	0.952	0.951	0.941
Avg. ranking	2.278	2.278	1.444

It is observed that DROPMTR attained reduction levels from 0.306 to 0.731, whereas EDROPMTR obtained reduction levels from 0.179 to 0.844. The DROPMTR method produced a large reduction (73%) in the dataset Sf2, whereas the EDROPMTR method achieved an even larger reduction (84%) in the dataset Sf1. On average, the experimental results showed that DROPMTR reduces the size of the datasets more than EDROPMTR. This behaviour was expected because EDROPMTR intends to determine the subset of instances that produces the lowest prediction error on a test set that contains all the instances selected by the $q+1$ DROPMTR members of the ensemble. Consequently, the expected tendency is that the final data subsets selected by EDROPMTR are larger than those selected by a single DROPMTR; the $q+1$ partial data subsets could be very diverse and, therefore, the aggregation process attains a lower reduction level than the one that could be obtained by a single DROPMTR method. However, it is noteworthy that, although DROPMTR obtained the best reduction levels in 13 datasets, there were no statistically significant differences between the reduction levels attained by DROPMTR and EDROPMTR; the Wilcoxon Signed-Rank test [71] did not reject the null hypothesis, with a $p$-value equal to 0.092 at the significance level $\alpha=0.05$.

Figure 2.

All pairwise comparisons conducted by Bergmann-Hommel’s test. In the diagrams, the groups of methods that are not significantly different are connected by a line.

4.4 Analyzing the predictive performance of the multi-target regressors

This experiment focusses on determining whether the application of the proposed IS methods implies a significant improvement or deterioration in the overall predictive performance of the regressors ERC-REPTree, ERC-LR and ERC-kNN. These three multi-target regressors were trained on the original training sets, and on the subsets selected by the IS methods.

Tables 3–5 show the results of the aRRMSE measure. In each row, the best error value is highlighted in bold typeface. The column named “Original” represents the predictive performance obtained on the original datasets, whereas the columns named “Subset ${}_{\textit{DROPMTR}}$ ” and “Subset ${}_{\textit{EDROPMTR}}$ ” represent the predictive performance of the MTR regressors on the data subsets selected by DROPMTR and EDROPMTR, respectively. The Friedman’s test [72] was conducted to perform multiple comparisons, and the last row of the tables shows the average ranking computed by this test.
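
The average rankings and the Friedman statistic reported in the tables can be reproduced along the lines of the sketch below. It uses the standard chi-square form of the statistic without a tie correction; the function names are hypothetical.

```python
def average_ranks(rows):
    """Average rank of each method; rows holds one list of errors per dataset.

    Lower error means a better (smaller) rank; ties share the average rank.
    """
    k = len(rows[0])
    sums = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of the ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                sums[j] += shared
            pos = end + 1
    return [s / len(rows) for s in sums]

def friedman_statistic(avg_ranks, n):
    """Friedman chi-square statistic for k methods compared on n datasets."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)

# Rounded average rankings of Table 3 (ERC-REPTree), 18 datasets
print(round(friedman_statistic([2.278, 2.278, 1.444], 18), 2))  # 8.35
```

With the rounded rankings of Table 3 the sketch gives approximately 8.35, which agrees with the reported value of 8.333 up to the rounding of the rankings.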

Table 4
Results of the aRRMSE measure for ERC-LR. The Friedman’s statistic is equal to 7.861, and the null hypothesis was rejected with a $p$ -value $=$ 0.020 at the significance level $\alpha=$ 0.05

Dataset	Original	Subset ${}_{\textit{DROPMTR}}$	Subset ${}_{\textit{EDROPMTR}}$
Andro	5.335	1.399	2.562
Atp1d	1.280	0.833	1.094
Atp7d	2.119	1.688	1.150
Edm	0.835	0.906	0.855
Enb	0.315	0.323	0.319
Jura	0.607	0.611	0.610
Oes10	0.833	0.579	0.489
Oes97	1.330	0.720	0.649
Osales	1.864	1.754	1.657
Rf1	0.522	0.562	0.533
Rf2	0.488	0.387	0.314
Scm1d	0.393	0.400	0.287
Scm20d	0.643	0.641	0.643
Scpf	0.887	0.840	0.533
Sf1	1.196	0.969	0.922
Sf2	1.545	1.348	1.291
Slump	0.683	0.682	0.677
Wq	0.959	0.976	0.964
Avg. ranking	2.361	2.167	1.472

It was observed that the predictive performance of the three MTR regressors is improved in many cases. It is also relevant to note that the predictive performance was improved even on those datasets for which the IS methods attained high reduction levels (e.g. Sf1, Sf2 and Wq). This shows that the proposed IS methods can select subsets of relevant instances, and also that these particular datasets have a considerable number of irrelevant and/or redundant instances. The average rankings computed by Friedman’s test show that, on average, the best results were obtained when the MTR regressors were trained on the data subsets selected by EDROPMTR, indicating the effectiveness of the proposed ensemble-based approach. Furthermore, Friedman’s test rejected all the null hypotheses, indicating that significant differences exist in the predictive performance of the MTR regressors.

Table 5

Results of the aRRMSE measure for ERC- $k$ NN. The Friedman’s statistic is equal to 9.194, and the null hypothesis was rejected with a p-value $=$ 0.010 at the significance level $\alpha=$ 0.05

Dataset	Original	Subset ${}_{\textit{DROPMTR}}$	Subset ${}_{\textit{EDROPMTR}}$
Andro	0.619	0.824	0.628
Atp1d	0.452	0.451	0.439
Atp7d	0.616	0.648	0.619
Edm	0.819	0.846	0.835
Enb	0.308	0.307	0.300
Jura	0.734	0.732	0.720
Oes10	0.452	0.462	0.460
Oes97	0.551	0.570	0.571
Osales	0.919	0.912	0.910
Rf1	0.180	0.120	0.120
Rf2	0.198	0.158	0.132
Scm1d	0.351	0.321	0.311
Scm20d	0.303	0.327	0.322
Scpf	1.054	0.808	0.801
Sf1	1.144	0.825	0.821
Sf2	1.705	1.163	1.115
Slump	0.750	0.749	0.730
Wq	0.932	0.941	0.921
Avg. ranking	2.278	2.305	1.417

The Bergmann and Hommel’s test [73] was conducted in order to perform all pairwise comparisons and detect particular significant differences. Figure 2 shows the results of this statistical test, highlighting two important results: (I) the predictive performance of the MTR regressors trained on the data subsets selected by DROPMTR is not significantly different from the performance attained when they were trained on the original training sets, indicating that DROPMTR can considerably reduce the size of the datasets without deteriorating the performance of the regressors; and (II) the predictive performance of the regressors trained on the data subsets selected by EDROPMTR is significantly better than the performance attained on the original training sets and on the data subsets selected by DROPMTR, showing the potential of the proposed ensemble-based method.

4.5 Noise tolerance

In general, data gathered in real-world problems include noise and, therefore, the predictive performance of learning algorithms can be significantly deteriorated [36]. In this regard, the IS methods also have the intention of eliminating the noise and outliers in data.

Having demonstrated the superiority of the proposed ensemble-based method, this section analyses its capacity to eliminate noise from data. We analysed whether EDROPMTR maintains or even increases the predictive performance levels of the regressors on datasets with different noise levels. Similar to the method proposed by Arnaiz-González et al. in [47], we added noise to the original datasets by exchanging the target vectors of randomly selected instances; the random selection was made without replacement. In this way, the sample distributions in the input and output spaces of the training sets are not modified. Three different noise levels (10%, 20% and 30%) were introduced in the training data; target vectors were swapped until these percentages of the total number of instances in the training data had been modified. A 10-fold cross-validation process was executed five times with different seeds, and finally the results were averaged.
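
One way to read this noise injection scheme is sketched below. It is an illustration under assumptions rather than the exact procedure used in the experiments: instances are drawn without replacement and their target vectors are exchanged in pairs, and the name `add_target_noise` is hypothetical.

```python
import random

def add_target_noise(Y, level, seed=0):
    """Return a copy of Y in which about `level` of the target vectors are swapped.

    Y: list of target vectors; level: fraction of instances to modify (e.g. 0.1).
    """
    rng = random.Random(seed)
    noisy = [list(y) for y in Y]
    n_modified = int(level * len(Y))
    n_modified -= n_modified % 2      # a swap always touches two instances
    chosen = rng.sample(range(len(Y)), n_modified)  # without replacement
    for a, b in zip(chosen[0::2], chosen[1::2]):
        noisy[a], noisy[b] = noisy[b], noisy[a]
    return noisy

Y = [[float(i)] for i in range(10)]
noisy = add_target_noise(Y, 0.4)
print(sum(1 for a, b in zip(Y, noisy) if a != b))  # 4
```

Because target vectors are only exchanged between instances, the multiset of target vectors, and hence their distribution, is left unchanged.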

Table 6
Average reduction rates attained at the different noise levels

Dataset	Noise level
	10%	20%	30%
Andro	0.424	0.534	0.545
Atp1d	0.400	0.455	0.519
Atp7d	0.463	0.510	0.588
Edm	0.212	0.301	0.393
Enb	0.280	0.279	0.390
Jura	0.250	0.279	0.300
Oes10	0.637	0.757	0.794
Oes97	0.625	0.632	0.653
Osales	0.467	0.580	0.688
Rf1	0.337	0.448	0.498
Rf2	0.189	0.218	0.257
Scm1d	0.245	0.313	0.399
Scm20d	0.202	0.315	0.417
Scpf	0.450	0.485	0.582
Sf1	0.833	0.855	0.871
Sf2	0.768	0.780	0.838
Slump	0.323	0.447	0.466
Wq	0.127	0.200	0.233

Table 7

The predictive performance of ERC-REPTree at the different noise levels

Dataset	10%		20%		30%
	Noisy	Red	Noisy	Red	Noisy	Red
Andro	0.771	0.851	0.939	0.977	1.051	1.037
Atp1d	0.681	0.600	0.834	0.732	0.875	0.790
Atp7d	0.657	0.739	0.895	0.943	1.003	0.994
Edm	0.929	0.880	0.964	0.883	1.019	0.900
Enb	0.567	0.554	0.812	0.808	0.904	0.895
Jura	0.844	0.808	0.926	0.907	0.987	0.942
Oes10	0.869	0.919	0.970	0.924	0.961	0.931
Oes97	0.821	0.980	0.927	0.974	0.967	0.964
Osales	0.834	0.899	0.892	0.942	0.900	0.956
Rf1	0.471	0.450	0.621	0.546	0.653	0.572
Rf2	0.158	0.112	0.164	0.115	0.175	0.120
Scm1d	0.378	0.347	0.399	0.349	0.455	0.356
Scm20d	0.646	0.638	0.729	0.711	0.803	0.747
Scpf	0.937	0.865	0.989	0.900	0.999	0.908
Sf1	1.119	0.832	1.161	0.850	1.227	0.859
Sf2	1.071	0.931	0.995	0.842	1.662	0.856
Slump	0.916	0.859	0.977	0.909	1.000	0.926
Wq	0.975	0.972	0.986	0.985	0.993	0.993
$p$ -value	0.186		0.008		0.000

Table 8

The predictive performance of ERC-LR at the different noise levels

Dataset	10%		20%		30%
	Noisy	Red	Noisy	Red	Noisy	Red
Andro	5.933	2.909	6.545	3.064	8.848	3.205
Atp1d	2.206	1.890	2.787	2.162	3.248	2.290
Atp7d	2.605	1.213	5.147	1.519	4.757	1.783
Edm	0.910	0.924	0.956	0.955	1.055	1.010
Enb	0.571	0.553	0.790	0.737	0.862	0.790
Jura	0.758	0.739	0.861	0.817	0.936	0.834
Oes10	3.079	1.075	3.185	1.169	3.964	1.103
Oes97	2.497	0.842	2.627	0.975	5.074	1.384
Osales	2.257	1.785	2.272	2.000	2.525	2.085
Rf1	0.683	0.607	0.772	0.624	0.790	0.675
Rf2	0.602	0.402	0.639	0.405	0.645	0.410
Scm1d	0.435	0.297	0.483	0.295	0.493	0.299
Scm20d	0.729	1.000	0.776	0.769	0.826	0.825
Scpf	0.963	0.643	1.134	0.681	1.160	0.687
Sf1	1.401	0.953	1.469	1.078	1.525	0.121
Sf2	1.593	1.318	1.629	1.304	1.707	1.323
Slump	0.811	0.804	0.900	0.885	0.899	0.890
Wq	0.974	0.978	0.984	0.988	0.995	1.007
$p$ -value	0.000		0.000		0.000

Table 6 shows the average reduction rates at the different noise levels. Higher reduction rates were obtained as the noise level increased, which suggests that EDROPMTR is able to detect and remove noise from data. On the other hand, Tables 7–9 show the predictive performance of the three regressors considered in the experimental study. In each row, the best aRRMSE value attained at each noise level is highlighted in bold typeface. The columns labelled “Noisy” represent the predictive performance of the regressors on the datasets with noise, whereas the columns labelled “Red” represent the predictive performance attained on the data subsets selected by EDROPMTR.

Table 9

The predictive performance of ERC- $k$ NN at the different noise levels

Dataset	10%		20%		30%
	Noisy	Red	Noisy	Red	Noisy	Red
Andro	0.808	0.803	0.855	0.882	1.090	1.032
Atp1d	0.702	0.685	0.825	0.807	0.879	0.862
Atp7d	0.677	0.708	0.967	0.959	1.032	0.989
Edm	0.859	0.852	0.861	0.851	0.933	0.865
Enb	0.603	0.607	0.848	0.851	1.089	1.012
Jura	0.832	0.827	0.916	1.012	0.980	0.980
Oes10	0.884	0.884	0.912	0.900	0.930	0.906
Oes97	0.776	0.756	0.833	0.839	0.951	0.953
Osales	0.929	0.919	0.959	0.931	0.962	0.943
Rf1	0.495	0.404	0.666	0.409	0.696	0.450
Rf2	0.218	0.142	0.243	0.152	0.256	0.169
Scm1d	0.366	0.322	0.399	0.342	0.423	0.361
Scm20d	0.531	0.537	0.641	0.649	0.845	0.814
Scpf	1.151	0.880	1.184	0.902	1.191	0.913
Sf1	1.151	0.839	1.185	0.845	1.235	0.866
Sf2	1.804	1.128	1.847	1.132	1.931	1.141
Slump	0.806	0.793	0.901	0.899	0.985	0.972
Wq	0.967	0.965	0.986	0.970	1.006	1.000
$p$ -value	0.003		0.002		0.000

It was observed that, in many cases, the predictive performance of the multi-target regressors was improved once the training sets were pre-processed with the IS method. The Wilcoxon Signed-Rank test was conducted to detect whether there were significant differences in the predictive performance attained at each noise level. The $p$-values computed by the test are shown in the last row of Tables 7–9. The statistical test rejected all the null hypotheses at the significance level $\alpha=0.05$, except the one related to the ERC-REPTree regressor at the 10% noise level. These results indicate that EDROPMTR is able to detect and eliminate noisy instances, thereby improving the generalization error of the regressors.
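
For readers who wish to rerun this kind of paired comparison, a minimal sketch of the two-sided Wilcoxon Signed-Rank test is given below. It relies on a normal approximation without continuity or tie correction, so its $p$-values may differ slightly from those of an exact implementation; the function name is hypothetical, and it is not known here which variant produced the values reported in the tables.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    Zero differences are discarded and tied absolute differences share
    their average rank. Returns (W, p) with W = min(W+, W-).
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences: nothing to test
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and abs(d[order[end + 1]]) == abs(d[order[pos]]):
            end += 1
        for i in order[pos:end + 1]:
            ranks[i] = (pos + end) / 2 + 1
        pos = end + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p

# Toy paired samples in which the second method is always worse
w, p = wilcoxon_signed_rank([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12])
print(w, p < 0.05)  # 0 True
```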

Finally, it is noteworthy that, on average, a worse predictive performance was obtained as the noise level increased. This is an expected result because EDROPMTR selects smaller training sets as the number of noisy instances in the data increases.

4.6 Discussion

Two IS methods were proposed: the first one (DROPMTR) is a DROP-based extension that removes internal and border points that do not contribute to a better prediction of the target vectors of their neighbours, whereas the second one is an ensemble-based IS method (EDROPMTR) that aggregates multiple predictions to select a final data subset of relevant instances. Neither of the two methods requires threshold values to determine whether an instance is included in the selected data subset. This is a major advantage because threshold values are usually problem dependent and, therefore, an additional analysis is required to select adequate values. On the other hand, the proposed IS methods have acceptable runtime complexities that allow their use on large-scale datasets. In the case of EDROPMTR, the members of the ensemble can be easily executed in parallel, which considerably decreases the runtime needed to select the final data subset.

Another advantage of the proposed ensemble-based method is that it can implicitly model the inter-target dependencies, which eases the selection of more relevant instances. The way the members of the ensemble are constructed has some similarities with the approach proposed by Spyromitros-Xioufis et al. in [6]. In that approach, it was demonstrated that expanding the input space with target variables is an effective way to exploit the inter-target dependencies. In this way, each member of the ensemble models the relationship of one target variable with the rest of the targets.

On the other hand, the aggregation process formulated in the last step of EDROPMTR can be seen as an artefact that tries to exploit the similarities between the structural and stochastic parts of the models. The structural parts of the models correspond to the data subsets selected by each member of the ensemble, whereas the stochastic parts are related to the errors associated with the search for the data subset that attains the best estimation. According to Dembczynski et al. [74], methods that follow an architecture similar to the one used by EDROPMTR can model the existing marginal and conditional dependencies between target variables.

Finally, it is important to highlight that the two proposed IS methods follow a wrapper approach and, therefore, they implicitly benefit from the power of the internal regressor. Note that a powerful MTR regressor could tackle not only the modelling of inter-target dependencies, but also the estimation of complex non-linear input-output relationships. Consequently, it is highly likely that the selected data subsets contain those relevant instances that reflect this type of data relationship, which has been shown to be of paramount importance for solving the MTR problem more effectively.

As main drawbacks, it is noteworthy that EDROPMTR focuses more on minimizing errors than on the size of the final subset and, therefore, the final data subset is not necessarily the most consistent subset of instances. Also, EDROPMTR is not suitable for incremental learning scenarios, since it would require recomputing the subset of relevant instances from scratch every time new samples are added.

In this work, an extensive experimental study was carried out. The first experiment showed that the proposed IS methods are able to significantly reduce the size of the datasets. Excellent reduction rates were attained on datasets with a moderate number of input variables (e.g. the datasets Atp1d, Atp7d, Oes10, Oes97 and Osales), as well as on datasets with many target variables (e.g. the datasets Oes10, Oes97, Osales and Wq).

In addition to the high reduction levels that were attained in several datasets, the second experiment showed that EDROPMTR can significantly improve the predictive performance of the regressors. This result is very promising, since several authors have noted in the past that it is not always possible to maintain the predictive performance of the learning algorithms after applying an IS method. Also, the results showed that EDROPMTR significantly outperforms DROPMTR, demonstrating the effectiveness of the proposed ensemble-based approach.

The third experiment demonstrated that the proposed ensemble-based IS method is robust on datasets that contain noise. The results indicated that EDROPMTR is able to detect the noisy instances, which limits the deterioration in the predictive performance of the regression models. Consequently, EDROPMTR is well suited to real-world engineering applications that require the elimination of noise before performing crucial tasks.

5. Conclusions

In this work, an ensemble-based method to perform the IS task in the MTR problem has been proposed. First, an error accumulation-based approach has been introduced, which is an adaptation of the well-known DROP method to multi-target data. Second, an ensemble-based method that effectively combines the partial data subsets selected by each member of the ensemble has also been presented. The major features of our approach are: (I) a wrapper approach was adopted where any MTR regressor can be used to estimate the relevance of the instances, so the IS task can benefit from the capacity of the internal regressor to model the inter-target dependencies and complex input-output relationships; (II) the way the ensemble’s members are constructed not only guarantees the diversity between them, but also the modelling of the inter-target dependencies; (III) the proposed ensemble-based method selects the final data subset by a simple greedy heuristic process, avoiding the use of complex optimization algorithms; and (IV) no threshold values are used to decide whether an instance is selected or removed, so the proposed approach is less problem dependent.

The experimental study confirmed the benefits of the IS task for solving the MTR problem, which was the main motivation of the present work. A good trade-off between the reduction levels of the size of the datasets and the predictive performance of the regressors was attained. Consequently, not only the runtime needed to construct a regression model on large-scale datasets is significantly reduced, but also its predictive performance can be even improved.

Future work will study better solutions to the optimisation problem formulated to aggregate the partial data subsets selected by the ensemble’s members. It is noteworthy that the final data subset determined by the proposed ensemble-based method is not the minimal data subset and, therefore, this is a relevant point to be studied in future work. On the other hand, it would be interesting to consider other ways of exploiting the relationships between target variables. In this regard, the design of other approaches to constructing the members of the ensemble is a possible idea to follow. Finally, it would also be important to study the benefits of combining the IS task with feature selection for constructing better MTR models.

Footnotes

In multi-label learning, the output variables (a.k.a. labels) are restricted to binary values.

The domain of the input variables can be continuous, discrete or mixed type, whereas the domain of the target variables is always continuous.

Acknowledgements

This research was supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund, project TIN2017-83445-P.

Table 1
Summary of the benchmark datasets

Table 2
Average reduction levels attained by DROPMTR and EDROPMTR

Table 3
Results of the aRRMSE measure for ERC-REPTree. Friedman’s statistic is equal to 8.333, and the null hypothesis was rejected with a $p$-value $= 0.015$ at the significance level $\alpha = 0.05$

Table 4
Results of the aRRMSE measure for ERC-LR. Friedman’s statistic is equal to 7.861, and the null hypothesis was rejected with a $p$-value $= 0.020$ at the significance level $\alpha = 0.05$

Table 6
Average reduction rates attained at the different noise levels