Abstract
Regression via Classification (RvC) is a process to solve a regression problem by using a classifier. An ensemble consists of many models, in which the final result is the combination of the results of these individual models. In this paper, two RvC ensemble methods are proposed. In the first ensemble method, the output of the ensemble method is modified to achieve the final output. A formula is derived in this paper for this purpose. In the second method, a new approach is proposed to compute the output of each model of an ensemble. It is shown that an accurate binary classifier can be transformed into an accurate regression method with the proposed methods. It is also shown experimentally, by using popular Random Forests as a classifier in the proposed ensemble methods against Random Forests as a regression method, the effectiveness of the proposed RvC ensemble methods.
Introduction
Regression and classification are two areas of supervised learning, [5, 27]. Both estimate the relationship between a dependent variable and one or more independent variables. The difference between these two problems is that a regression problem has continuous dependent variable, whereas in a classification problem, the dependent variable has categories. There are many methods like Naive Bayes that can handle classification problems easily. Therefore, attempts have been made to solve regression problems by using classifiers [32, 33]. A regression problem can be solved by converting it into a classification problem by using discretization process, the process is called Regression via Classification process (RvC) [32, 33]. These methods have not become very popular as they don’t have any well-defined theory that can show that the particular RvC method will perform well. A very important parameter, the number of bins used to convert a continuous target to a categorical target is generally user defined. Hence, the results may change by changing this parameter.
Ensembles consist of many models [25]; each model contributes to the final result. Halawani et al. [20] suggest an ensemble technique for RvC problems. They show that these ensembles perform better than single model for a regression problem y = x when the number of bins is more than 2. However, their method cannot learn y = x problem accurately with a finite number of bins. Ahmad et al. [2] show that weighting schemes can improve the performance but the zero training error was not achieved with the weighting scheme. However, any solution has not been suggested that shows zero training error for RvC problems. Ahmad et al. [3] show that the model can also be used to show that regression tree ensembles that are created by total randomization process cannot learn y = x problem. In this paper, RvC ensembles are proposed that can learn all kinds of regression functions with accurate binary classifiers. Firstly, two ensemble methods for RvC problems for y = x regression functions are proposed. It is shown theoretically that with the two bins accurate results (zero training error) can be achieved. The proposed RvC ensembles can work with binary classifiers that can learn the decision boundaries created by the discretization process. It is also shown that these results are applicable to all the monotonically increasing or decreasing functions with one variable. It is proposed that these ensemble methods can work for all kinds of regression problems with two bins by using binary classifiers that can learn nonlinear decision boundaries. It is demonstrated that the proposed ensemble methods with Random Forests [7] as a classifier perform similar to or better than Random Forests as a regression method on a number of regression datasets. The similarity between RvC ensemble methods and ensembles of univariate regression trees (split the data only on the basis of a single attribute [9, 30]) is used to study the performance of ensembles of univariate regression trees.
The paper has following contributions; Two novel RvC ensemble methods are proposed. The effectiveness of these methods is shown theoretically and experimentally. It is shown that an accurate binary classifier method can be used to create an accurate regression method. In existing RvC ensemble methods, the number of bin is a parameter. The performance of these methods is strongly dependent on the number of bins. The proposed ensemble methods use two bins. In other words, the proposed ensemble methods do not use the number of bins as a parameter. The proposed RvC ensemble methods can be used to study the performance of ensembles of finite sized univariate regression trees for monotonically increasing or decreasing functions of single variable.
In other words, there are two motivations of the proposed approach, first the proposed approach can have better performance accuracy than related regression method and the second that the proposed models can be used to study the performance of ensembles of finite sized univariate regression trees. The rest of the paper is organized as follows. Section 2 describes the related work. In Section 3, RvC ensemble methods are proposed for monotonically increasing or decreasing functions with one variable which use linear classifiers, like decision stumps as classifiers. Experimental results are also presented in this section. Section 4 describes our proposed RvC ensemble methods for all kinds of regression functions. Results are presented in Section 5. Section 6 has conclusion and future work.
Related work
A regression problem has the real-valued target variable, whereas in a classification problem the target variable has categories. A regression problem can be solved by using classifiers [32]. This process is called Regression via Classification (RvC).
The whole process of RvC comprises of two important steps: Discretization divides the feature values or target values into different bins depending upon intervals they fall into. In the first step, the discretization of the numeric target variable of a regression problem is done to convert the output variable into bins that are treated as classes. Hence, a classifier can learn on the new discretized dataset. Various discretization methods e.g. equal-width, equal-frequency etc. [14] are available in the literature. The reverse process transforms the class output of the classifier into a continuous value.
Torgo and Gama [33] present a search based approach for discretization step in RvC problems. Indurkhya and Weiss [22] use k-means clustering algorithm to discretize the output variable and solve the resultant classification problem. They use ensembles of decision-rule solutions for the classification problem. Frank and Bouckaert [15] compute conditional density estimates by using a class probability estimator, this estimator is applied to the discretized target variable. The number of bins is an important parameter in RvC problems [22, 23]. If the number of bin is small, it creates an easier classification problem, however, the final output (e.g. the mean of the points in the bins) has more error. The large number of bins make the classification problem very difficult. Hence, the performance of RvC problems is strongly dependent on this parameter. Ensembles of models are used to improve the accuracy of single model [25, 34]. Accurate ensembles consist of many accurate and diverse models [25]. Halawani et al. [2, 20] propose ensemble methods for RvC problems. In these methods, the discretization process is done randomly, hence, different datasets are created. Models trained on these different datasets are diverse. The results of these models are combined to get the final result. Halawani et al. [2, 20] show that these ensemble methods outperform RvC method with equal width discretization method. In RvC ensemble methods proposed by Halawani et al. [2, 20] the number of bin is a parameter. The performance of RvC ensemble methods changes with this parameter.
In this paper, the RvC ensemble methods of Halawani et al. [2, 20] are extended, hence, these methods are discussed in detail.
RvC with extreme randomized discretization
Ahmad [1] present a discretization method, Extreme Randomized Discretization (ERD), for creating ensembles of decision trees. In ERD, bin boundaries for the discretization are created randomly. Ahmad et al. [2, 20] use ERD to propose an ensemble method for RvC. They use ERD in stage (i) of RvC. As it creates diverse datasets, diverse classifiers are created. These diverse datasets, in turn, produce diverse classifiers. The combination of these classifiers creates an ensemble. Ahmad et al. [2, 20] present the theoretical study of these ensembles. Ahmad et al. show that RvC ensembles with ERD cannot learn y = x regression problem with a finite number of bins. In y = x regression problem, a value of y is predicted by using a value of x and the value of y is equal to the value of x. However, they don’t present any solution to this problem. This work [2, 20] will be extended to present a solution to this problem. For the sake of completeness, the results of the papers [2, 20] are dicussed.
Assumptions
It is assumed that a training data D
n
= (x1, y1) , . . , (x
n
, y
n
) is given for a regression function r where n goes to ∞ . The regression function is learned by using the data D
n
. Following conditions are used for the calculations; There are ∞ training points. All the output values between y
min
and y
max
are present once in the training data. The classification error is 0 for each model. The mean of target variable values of data points in the predicted bin is the output value of a model. As the target value is uniformly distributed, the center of the bin is the predicted value. y
p
is the actual target value and The ensemble has ∞ models and each model is created by using different bin boundary. The mean of all the predictions (by single models) is the final result of an ensemble. The training error is calculated. In other words, the point which is predicted is available in the training data.
Each model of an ensemble is a trained classifier. Each classifier is trained to learn a decision boundary for a binary classification problem. It is assumed that the trained model can predict all the data points correctly (0 classification error). Depending upon the decision boundaries, an appropriate classifier should be selected. For example, if the decision boundaries are linear (Section 3.3), classifiers which can learn linear can be used, whereas if the decision boundaries are nonlinear (Section 5) only those classifier should be used that can learn nonlinear decision boundaries.
The important point in this calculation is that the center point of a bin is the predicted value of a model. Generally, the mean target values of all points present in the predicted bin is taken as the output of the model, hence, the uniformly distribution assumption is made so that the output of the model is the center of the bin. This assumption is presented as theoretical results by Ahmad et al. [2, 20] are shown by using this assumption. However this assumption will not be used in the proposed methods (please seeSection 4).
Theoretical results of RvC ensembles with ERD
In this section, theoretical results of RvC ensembles with ERD [2, 20] are discussed. In RvC ensembles, ERD is used to produce different bin boundaries. The prediction for each model is dependent on the bin boundary. Hence, the predictions are different for different models of ensembles. These predictions are combined by taking the mean of these values to get the final result.
The bin boundary (B1) is between the minimum value (y
min
) and the maximum value (y
max
) of the target variable (Fig. 1). If the target value we want to predict is y
p
, in case y
p
at the right side of the bin boundary, the predicted value is (y
max
+ B1)/2. Whereas, if y
p
is at the left side of the bin boundary, the predicted value is (y
min
+ B1)/2. The mean value of all the results (∞ runs) is the final prediction. The predicted value is

In the top figure the target point, the bin boundary, B1, is at the left of the target data point, y p , whereas in the bottom figure, the bin boundary B1, is the right side of the target point, y p .
Only at y
p
= (y
min
+ y
max
)/2, the predicted value,
Hence, these RvC ensembles with two bins cannot learn y = x regression problem accurately. They also perform the calculation for more than two bins and find out that with the finite number of bins, y = x regression problem cannot be studied accurately.
Ahmad et al. present [3] the results for RvC ensembles with two bins for y = x problem, however, it can be shown that these results are also true for uniformly monotonically increasing or decreasing functions with one variable like y = x2 (x ≥ 0). In RvC ensembles, the results for y = x problems are true for all the functions for which the mean of the target value of points in a bin is the center of the bin. This condition is true for all the monotonically increasing or decreasing functions with one variable (with infinite points of the functions and no duplicate points). Hence, all the results for RvC ensembles with two bins for y = x function are also true for uniformly monotonically increasing or decreasing functions with one variable.
The interesting point is that the predicted output is independent of type of a function. For example, we have different datasets for the different monotonically increasing or decreasing functions with one variable with the same minimum and maximum values of the target value, the prediction of data points with the same target values will be the same for both datasets. In other words, the predicted output is independent of the type of the function. It is only dependent on the target value, and the minimum and the maximum target values of the training dataset.
These results will be extended to show that some modifications in this algorithm can lead to development of more accurate RvC ensembles.
In this section, two RvC ensemble methods with ERD will be presented, which use linear classifiers like decision stumps. It will be shown that RvC ensembles with ERD with two bins can accurately learn a uniformly monotonically increasing or decreasing function with one variable. Ahmad el al. [2] show that when the number of bins is increased, the relationship between predicted and actual values becomes complicated, hence, applying the following methods becomes difficult when the number of bins is more than two.
Modified final output of RvC ensembles with ERD with two bins
Our first solution is based on the modified output to get the correct prediction. As discussed in the previous section, for RvC ensembles with ERD with two bins, the predicted value
By rearranging the equation
Hence, if the y
min
and y
max
are known, the actual value y
p
can be calculated by the predicted value
In RvC ensembles with ERD, the center of a bin is the output of each model. It is proposed that instead of taking the center of the bin as the output of the model, we take the extreme point of the bin as the output of the model. In the case of RvC with two bins (Fig. 1), the output of the left bin will be y min and the output of the right bin will be y max . In this case, the average of outputs of all the models will be
Hence, the predicted value
In the experiments, two bins were created for each RvC model. Hence, each model had one bin boundary. We wanted to create uniform bin boundaries. Hence, the bin boundary of i
th
model was created by using the formula
Weka [21] implementation of decision stumps was used in the experiments. For the comparative study, we also did the experiments with two popular regression tree ensemble methods, Decision stump regression ensembles with Bagging and Random Forests [8]. As there is no WEKA implementation of Random Forests (for regression), the R [31] implementation the Random Forests was used in our experiments. In our proposed ensemble methods, all the models of an ensemble are created independently. Whereas, in Boosting, the training of a new member is dependent on the already trained members. Hence, the boosting [18] like methods are not included in the study. As the base classifier for the proposed ensemble method was decision stump, for fair comparative study the experiments were carried out with decision tree ensemble methods.
Three different datasets used in the experiments were created from three different functions; y = x (0≤x≤100), y = x2 (0≤x≤10) and y = 100sinx (0≤x≤1.57). The range of target variable, y, was between 0 to 100. The datasets were created in such a way that each point in one dataset had one corresponding point in other dataset such that these points had same y values. It other words, y value was created first and the corresponding x value was created from the y values. For n data points, the y values of data points were created by using a formula, (i/n + 1)(y max - y min ) (1< = i < = n). There were two goals of selecting these datasets, first these datasets came from monotonically increasing functions, hence, our methods should work on these datasets, the second goal was to show that for the proposed RvC ensemble methods, the predicted output of a test point was independent of the type of datasets (training dataset and testing dataset) generated from uniformly monotonically increasing or decreasing functions. Root mean square error (RMSE) measure was used to compute prediction error. A dataset with 1000 data points was created for each function. Training error was calculated in the experiments.
Results are presented in Table 1–3. The results suggest the proposed ensemble methods performed much better than decision stump ensembles with Bagging. It is interesting to note that similar to decision stump ensembles with Bagging (it uses decision stumps as a regression method), our proposed methods use decision stumps (it uses decision stumps as a classification method), however, the ways final results are achieved are different. Both proposed RvC ensemble methods gave almost the same results.
Training RMSE for different ensembles methods for the dataset with 1000 data points generated from y=x function
Training RMSE for different ensembles methods for the dataset with 1000 data points generated from y=x function
Training RMSE for different ensembles methods for the dataset with 1000 data points generated from y = x2 function
Training RMSE for different ensembles methods for the dataset with 1000 data points generated from y=100sinx function
Table 1–3. show that results of RvC are independent of the function. This verifies our theoretical results that for monotonically increasing or decreasing functions with one variable, the predicted results are independent of the function.
Experimental results suggest that with our proposed ensemble methods, as the size of the ensemble is increasing the training error is decreasing. For Random Forests, the error doesn’t change much for the size of the ensembles greater than 500. This indicates that the proposed ensemble methods are able to create diverse models. As a decision stump (a decision tree with two leaves) is used as a base classifier for an ensemble, the individual models are not as accurate as random forests models, hence for smaller sized ensembles (up to 1000), random forests performed better. However, the performance of our methods with the ensemble size 5000 is better than Random Forests. This suggests that the proposed ensemble methods create diverse models, the combination of these diverse models produce excellent results. These results are interesting as the size of the individual tree in Random Forests was between 265-300, whereas in the proposed methods, decision stumps which have only two leaves were used. A question may be asked how 5000 different models were created with 1000 data points with the proposed methods. The answer is that bin boundaries are different for all the models (please see the formula in the first paragraph of this section), hence the output which is the center of the predicted bin is different for all the models for Model-1. Similarly with different bin boundaries, different models are created in Model-2.
Ahmad et al. [2, 3] show that RvC ensembles and infinite sized ensembles, consisting of finite sized univariate regression trees, created by a pure randomized method are similar. Random Forests and Bagging methods represent these kinds of ensemble methods. Results suggest that the performance of these ensembles are almost the same for datasets created by using different types of monotonically increasing or decreasing functions of single variable. This is an interesting observation. It shows that the proposed method can be used to study regression tree ensembles.
It is to be noted that datasets created by monotonically increasing or decreasing functions of single variable belong to a very small subset of regression datasets. The learning of a diagonal decision boundary by univariate tree ensembles has been an important research problem [3, 12]. Ahmad et al. [3] Ahmad et al. [2, 3] show that RvC ensembles and infinite sized ensembles, consisting of finite sized univariate regression trees cannot learn a diagonal decision boundary (y = x). It this section, it is shown that the proposed RvC ensemble methods with decision stumps (trees with two leaves) can outperform Random Forests ensembles consisting of trees having size between 265-300. It is also shown that these results are valid for all monotonically increasing or decreasing functions of single variable.
These are interesting observations, however, for the practical purpose, RvC ensemble methods are required that can learn all kinds of regression functions. In the next section, RvC ensemble methods will be proposed that can learn all kinds of regression functions by using binary classifiers that can learn nonlinear decision boundaries.
It will be shown that that proposed RvC ensemble methods can be used to learn all kinds of regression functions provided binary classifiers are available that can learn all kinds of decision boundaries.
In RvC ensembles with ERD with two bins (Section 2), two bins are created for each model with a bin boundary and a classifier is trained on this discretized data by using two bins as two classes. The zero classification error is the main condition that is required to get the theoretical results presented in Section 3. The other condition which is important for the Method-1 is that the final prediction of each model should be the center of each bin. This condition can be achieved easily by taking the center of bin as the predicted value. We are not taking the mean of target variable values of data points in the predicted bin as the output value of a model, hence uniform distribution assumption is not required to achieve the bin center as the predicted output.
The first job of the model is to predict the class (bin) of the testing point then the class (bin) can be converted into a continuous output by using the minimum or maximum of the training data points and the bin boundary. Generally, testing data points and training data points have similar distribution, hence, minimum and maximum of the training data points should be useful for predicting the testing data points. All models in an ensemble give continuous outputs which are combined to get the final result. This suggests that if each model predicts the bins correctly, the minimum and maximum of the function are known, this framework can be applied to any regression function. The problem of RvC ensembles can be considered as deciding in each model that whether the point we want to predict is in the left bin or the right bin, each model will give one output which can be used as proposed in Section 3 to get the final results. The accuracy of the final prediction depends upon the accuracy of the classifiers. The decision boundaries in this case could be non-linear and the number of classes are two, Hence, those binary classifiers are required that can learn non-linear decision boundaries. The detailed algorithm is illustrated in Algorithm 1.
The algorithm for RvC ensembles.
The algorithm for RvC ensembles.
This method is similar to Error-correcting codes ensembles [13], both methods covert a learning problem into many binary classification problems, however Error-correcting codes ensembles are for multi-class classification problems whereas the proposed methods are for regression problems.
Method-2 has similarity with the method proposed by Langford and Zandrozny [26] that uses a combination of binary classifiers, created by using different class boundaries for different classifiers, to estimate the class probabilities. In the proposed methods, the binary classifiers are used with different binboundaries.
The essential step in the proposed RvC ensemble methods is to find the maximum and the minimum of the target values of the training dataset. If the number of data points in the training data is n, the complexity of this step will be O (n). For RvC ensembles of size m, m classifiers are trained; one for each model. If the number of steps required to train a classifier and get the classification result for a given point is s. The complexity of this step for the ensemble is O (ms). Hence, the complexity of the proposed RvC ensemble methods for a testing data point is O (n + ms) which is dependent on the size of the training data, the size of the ensemble and the type of the classifier.
Experiments with regression datasets
Experiments were carried out with a sinusoidal univariate function, y = 1 + x2 - 50xsin (x/2), with x is a real number [16]. This is a highly nonlinear function. 10000 points were generated uniformly with 0 ≤ x ≤ 100. We also did the experiments with different popular regression datasets available withDelve datasets (www.cs.toronto.edu/∼delve/data/datasets.html), UCI data repository [17] and www.dcc.fc.up.pt/∼ltorgo/Regression/DataSets.html. The information about these datasets is presented in Table 4. Random Forests are very accurate ensemble method for classification and regression and do not have a large number of parameters. Hence, in these experiments, we selected Random Forests as classifiers for our methods. Results are compared with Random Forests as regression method. R [31] implementation of Random forests (for regression and for classification) were used in our experiments. The target values of the training data is discretized (two bins) by the method presented in Algorithm 1 to create different datasets. Classifiers (Random forests) were trained on each of these datasets. For each testing point, the classification result (bin) for each classifier was computed. The maximum and minimum values of the target values of the training dataset were used to convert the classification results into a predicted value by using the Method-1 or Method-2 (please see Algorithm 1). The average error measure of testing points is presented as the error measure of the testing data. Following [4], 5x2 fold cross-validation was used in the experiments. Five replications of a two-fold cross-validation were performed. In each replication, the dataset was divided into two random equal-sized sets. Each learning algorithm was trained on one set at a time and its error was estimated on the other set. The average results of these 10 runs were calculated and presented. We carried out with different parameters; the size of RvC ensembles and the size of the Random forests classifier and found that the results with 200 as RvC ensembles size and 50 as size of Random forests classifier were good representation of the proposed ensemble methods. Hence, the size of the ensembles was set to 200, whereas, the size of the Random Forests as classifiers in our methods was set to 50. It is to be noted that in the proposed methods, 200*50=10000 trees are generated against 200 trees in Random Forests as a regression method. To check the effect of more trees, the results for Random Forests as a regression method were also presented with 500 trees. The purpose of using two values for the size of the Random Forests was that we wanted to show that the further increase of the size of the Random Forests did not improve the regression results much. Hence, the results of Random forests with 500 trees were taken as the best results that was achieved with Random Forests as a regression method. These results are used as the benchmark to check how the proposed methods performed against the benchmark.
Details of datasets used in the experiments
Details of datasets used in the experiments
In the proposed ensemble methods, we solve binary classification problems. We predict the bin, as either left bin or right bin. All the theoretical results are achieved by using this assumption, however, classifiers also give the probability of two classes. In one of the versions of the proposed method, these probabilities are used to get the final results. For the modified version of Method-2, the following equation was used.
Where p i is the probability of predicting the bin for which the y max is one of the bin boundary. In our theoretical calculation, it is assumed that p i is either one or zero, however, in the modified version, p i can take any value between zero and one. Generally, we do not get the zero classification error, so we are taking the output from the wrong bin. To minimize the effect of wrong prediction the proposed modification may be useful. The similar modification was done in Method-1. In other words, the predicted value for each model is calculated from the classification probabilities for each bin and the mid points of the each bin. As both methods (Method-1 and Method-2) were giving the same results, we present the results with the Method-2 and its modification.
The results for nonlinear function are presented in Table 5. When the size of ensemble was small (100), Random Forests performed much better than RvC ensembles. However, as we increased the sizes of ensembles, the results did not improve much for Random Forests, however, there was a large improvement for RvC ensembles (Method-2 and its modified version). Results shows that the best results was achieved with 1000 trees with the proposed method (with probability). In this problem, generally the classification error was small. When, this is the case, one can get a large improvement with large sizes of RvC ensembles, because this condition is near to the condition of the theoretical result (0 classification error) and with this condition, RvC ensembles can achieve a very small error with large sizes.
Average RMSE for different ensemble methods for the dataset created from y = 1 + x2- 50xsin (x/2) function, standard deviation is given in brackets. The bold number shows the best result
The average RMSE for other datasets with different methods is presented in Table 6. It is to be noted that the performance of an ensemble method depends upon its ability to produce accurate and diverse models. It has been shown [19] that generally the performance improvement of an ensemble method occurs for first few models. As the size of an ensemble increases, the new added models are not very diverse, hence they don’t provide large improvement. In the experiments, we carried out the classification process in each model with Random Forests of size 50. Then the results of 200 models were combined. Though 10000 trees are trained, they were not combined together in one step.
Average RMSE for different ensemble methods for the different datasets, standard deviation is given in brackets. The bold number shows the best result.‘+/-’ shows that the performance of the algorithm is statistically better/worse than that of Random Forests with 500 trees for that dataset.
The results for Random Forests as a regression method suggest that there is not much difference between with 200 trees and 500 trees so we can conclude that the results with 500 trees are the almost the best results that can be achieved with Random Forests as regression method. We agree that there is no guarantee that further increase of the size will not improve the performance, however, the comparative study of results suggests (the improvement is less than 1% for different datasets as compared to the performance of ensembles of size 200) that there is a small possibility that the further increase of the size of the ensemble will have the big improvement in the performance of Random Forests as a regression method.
Unpaired t-test with 95% was used to compare the performance of two classifiers [6]. Out of 17 datasets, on 6 datasets (Airfoil, Bank8FM, CT slices,Housing(Boston) Kin8mm and Puma32H) the proposed methods show clear advantage over the other methods, whereas for the other datasets, the proposed methods performed similar to other methods. This shows that the proposed methods are useful for different kinds of regression problems.
In this paper, ensemble methods for RvC problems are proposed. It is shown theoretically that these methods give accurate results with accurate binary classifiers. Experiments with different kinds of regression datasets show the effectiveness of the proposed approach. In this paper, most of the experiments were done with Random Forests as classifiers, however, in future, we will do more experiments with different classifiers including deep learning approaches to assess the effectiveness of the proposed ensemble with other classifiers. We will also study for which types of datasets the proposed methods are more useful.
Ensemble pruning [10, 28] selects a subset of individual classifiers in an ensemble so that the combination of these classifiers produces similar or better performance against the combination of all the classifiers in the ensemble. Ensemble pruning has the advantage of low storage resources. Ensemble pruning is another direction of the future research.
