Abstract
The concept of a large margin is central to support vector machines and it has recently been adapted and applied for nearest neighbour classification. In this paper, a modification of this method is proposed in order to be used for regression problems. This model also allows the use of a set of prototypes with different distance metrics, which can increase the flexibility of the method especially for problems with large number of instances. The learning of the distance metrics is performed by two optimization methods, namely an evolutionary algorithm and an approximate differential approach. A real world problem, i.e. the prediction of the corrosion resistance of some alloys containing titanium and molybdenum is considered as a case study. It is shown that the suggested method provides very good results compared to other well-known regression algorithms.
Keywords
Introduction
Regression analysis includes any technique for modelling different kinds of processes with the goal of finding a relationship between a dependent variable and one or more independent variables, given a set of training instances or vectors in the form of (
Presently, there are many methods that can be used for regression. Beside analytical models, where the task is to find adequate values for their coefficients, usually by means of differential optimization, several machine learning techniques can be mentioned, such as neural networks, support vector machines (ɛ-SVR, ν-SVR), decision trees (M5P, Random Forest, REPTree) or rules (M5, Decision Table) etc.
k-Nearest Neighbour (kNN) [6] is a simple, efficient way to estimate the value of the unknown function in a given point using its values in other points. Let S ={
Another more elaborate version of the method considers a weighted average, where the weight of each neighbour depends on its proximity to the query point:
A comprehensive review of locally weighted regression and classification methods is given in [3].
Even if kNN is a very simple method, it usually performs very well for a wide range of problems, especially when the number of instances is large enough and there is little noise in the data. One of the most important aspects of the method is the choice of the distance function. While Euclidian distance is the most commonly encountered, other particularisations of the general Minkowski distance can be used, e.g. the Manhattan distance. Some experiments in cognitive psychology suggest that humans use an exponential negative distance function for certain types of classification tasks [1].
However, the classical approach does not take into account any specific information about the problem. Beside the common practice of normalizing the instance values independently on all dimensions, there is little domain knowledge incorporated into the method.
This article investigates the use of the concept of a large margin, best known in the context of support vector machines, for regression problems, by adapting its existing formulation for classification problems.
We organize our paper as follows. Section 2 presents the large-margin nearest neighbour method and some of its extensions. Section 3 describes the proposed method, including the use of prototypes and two optimization approaches: an evolutionary algorithm and an approximate differential technique. Section 4 focuses on a case study, and Section 5 contains the conclusions.
The present work is an extended version of [12], which introduces an approximate differential method for the optimization of weights and prototype positions, in addition to the initial evolutionary approach. This method is much faster and allows the study of more complex scenarios, e.g. the simultaneous search for weights and prototype positions. Both optimization techniques are evaluated using original software implementations.
Because of the importance of the distance metric, researchers sought to find ways to adapt it to the problem at hand in order to yield better performance. This is the issue of distance metric learning [5, 18,]. The idea of a large margin, one of the fundamental ideas of support vector machines, was transferred to the kNN method for classification tasks [19–21], resulting in the large-margin nearest neighbour method (LMNN). In this case, learning involves the optimization of a convex problem using semidefinite programming. The LMNN technique was also extended to incorporate invariance to multivariate polynomial transformations [11]. Since our regression method builds on the LMNN method for classification [21], we will present it with more details as follows.
In general, distance metric learning aims at finding a linear transformation
Since all the operations in k-nearest neighbour classification or regression can be expressed in terms of square distances, an alternative way of stating the transformation is by means of the square matrix:
For a classification problem, the choice of
Here, the value of 1 is arbitrary; the idea is to have some minimum value for the margin that separates the classes. However, it was proven that other minimum values for the margin would not change the nature of the optimization problem, but will only result in the scaling of matrix
Overall, the optimization problem is defined as follows:
The two objectives, i.e. minimizing the distance to targets and maximizing the distance to impostors, are conflicting. Following an analogy to attraction and repulsion forces in physics, Weinberger and Saul [21] introduce two forces, ɛ pull and ɛ push , whose balance is reached by setting the value of λ h . In their work, the authors consider the importance of these forces to be equal, i.e. λ h = 1.
Other researchers, e.g. [11], suggest the use of regularization to avoid the overfitting caused by large values of the elements of
Although there are several studies concerned with LMNN classification, so far not many researchers have addressed the issue of regression. One recent paper that presents some modifications of the LMNN metric learning for regression is [2].
A binary classification problem can be considered as a special case of a regression problem where the desired function only takes two values: 0 and 1. In this section, we will generalize the concepts of the LMNN method and adapt it for regression. As optimization tools we will consider both an evolutionary algorithm and an approximated differential method based on the central difference formulation of the derivative. They can provide more flexibility in situations when several distance metrics should be learned instead of a single one.
Distance metric learning using prototypes
In our work,
By using the
In this formulation, there is a single, global matrix
The objective function F, which is to be minimized, takes into account 3 criteria:
In order to simplify the expressions of the F
i
functions, let us make the following notations, where d
M
means the weighted square distance function using the weights we search for: d
ij
= d
M
(
Thus, the first criterion is:
The second criterion is expressed as follows:
It takes into account a pair of neighbours, j and l, by analogy to a target and an impostor. However, for our regression problem we do not have these notions because we do not have a class which could be the same or different. We can only take into account the real values of the instance outputs. The reasoning is the same as for the first criterion, but now we try to minimize the distance to the neighbours with close values (the positive term), while simultaneously trying to maximize the distance to the neighbours with distant values (the negative term). By analogy to Equation (8), we consider that a margin of at least 1 should be present between an instance with a close value and another with a distant value. The max function is used by analogy to the expression of the hinge loss. If we consider that j has a close value and l has a distant one, the corresponding hinge loss will be 0 only when d il ≥ 1 + d ij . The condition j ≠ l is implicit because when the terms are equal they cancel each other out.
The third criterion is used for regularization, to avoid large values of the weights:
However, for our case study presented in Section 4, the weights were already very small and this term was not needed. Overall, the following values for the weights of the criteria were used: and .
The first optimization method used to solve the problem defined in Equation (8) is an evolutionary algorithm. The following parameters were used in the experiments: 40 chromosomes in the population, tournament selection with 2 candidates, arithmetic crossover with a probability of 95% and mutation with a probability of 5% in which a gene value is reinitialized to a random value in its corresponding domain of definition. Elitism was also used such that the best solution in a generation is never lost. As a stopping criterion, a maximum number of generations, 500 in our case, was used. The number of genes in a chromosome depends on the number of prototypes and the dimensionality of the problem, namely: n g = n p · n i , where n g is the number of genes, n p is the number of prototypes and n i is the number of inputs.
The domain of the genes, which represent the elements of
The advantage of using an evolutionary algorithm for optimization (as well as the method described in Section 3.3) is that prototypes can be used, with different weight values, instead of a single, global set of weights. As mentioned above, when computing the distance from a certain training instance, the distance is weighted by the values corresponding to the nearest prototype of the training instance.
We compute the positions of the prototypes using k-means, a simple clustering algorithm which tends to favour (hyper)-spherical clusters. This is why it is very appropriate for our problem, where distances are computed with variations of the Euclidian metric.
Approximate differential optimization
An alternative to evolutionary algorithms is the optimization based on gradient descent. If the current estimate is in a neighbourhood of a solution, differential optimization guarantees that the exact solution will be found, unlike the evolutionary-based approach, which often provides an approximation of the actual solution. On the other hand, the initial value in a differential approach is very important, because the algorithm can only converge to its basin of attraction, and that can represent the difference between a local and a global optimum. The evolutionaryalgorithm, as a population-based method which conducts a parallel search through the problem space, has more chances to find the global optimum, or at least an acceptable approximation of it.
For our particular problem, there are several issues that make standard gradient-based approach difficult to apply. First, the objective function is non-linear, because of the max function in Equation (13). Moreover, even if this could be circumvented, the expressions of the components of the objective functions themselves, i.e. Equations (12) and (13), are quite complex, and this means that finding the gradient expression analytically is challenging. Third, it is convenient to also optimize the positions of the prototypes in the same process.
When using the evolutionary approach, the problem space proved to be large enough that it was not feasible to also search for the positions of the prototypes. That is why a compromise solution was attempted, namely to use a clustering method to pre-compute the position of the prototypes and then only optimize their corresponding weights.
However, with a much faster gradient-based approach, it becomes feasible, but this problem has an additional level of non-linearity. When the position of a prototype changes, the instances that are closer to it also change. Therefore, the objective function is computed in a different way, by considering different instances or the same instances with all of a sudden different weights, especially since the weights of different prototypes can be quite dissimilar. This is also difficult to express in an analytical formulation of the gradients.
That is why it was decided to use an approximate differential method instead, following the central difference definition of the derivative [10]. That is, for a very small ɛ, the following relation holds:
An alternative expression is more complex but has higher theoretical accuracy:
In our case, both equations were tried, but the actual results were quite similar, therefore Equation (15) was chosen for the case studies presented in Section 4, because it is simpler and can be computedfaster.
The solution is found iteratively. After computing the gradient of the current estimate ∇F (
In order to mitigate the problem of convergence into local optima, the algorithm was applied several times, with different initial estimates. Despite an increased execution time, this proved to be a very effective way to find good quality solutions.
In this section, we present a simple example regarding the effect of altering the distance metric. Let us consider a simple problem with 3 instances, enumerated in the form (x1, x2, y): {0, 0.4, 0}, {1, 0.4, 0}, {0.5, 0.6, 1}. Although the output is 0 and 1, like in a classification problem, we used the regression algorithm described above. A very small weight is obtained for x1 and a very large one for x2, meaning that the distance between the first two instances becomes virtually 0 and the distance between each of them and the third one is maximized.
The actual distances between these 3 instances, both in the original space and in the transformed space, are presented in Table 1, where d x i stands for the square distance on the x i dimension, and d is the total square distance.
Figure 1a presents the initial nearest neighbour regression with 3 neighbours, where the shades of grey correspond to the value of the function from 0 (black) to 1 (white). Figure 1b represents the same decision area with a threshold of 0.5, transforming the problem into a binary classification one. Figure 1c represents the decision area after distance metric learning. Since the space is now stretched on the x2 axis (the ordinate), the margin is seen as a condensed area in the middle of the original space. Figure 1d turns again the regression problem into a corresponding binary classification one. It can be seen that the large margin imposed between the instances with different values splits the original space into two approximately equal areas, since the problem is symmetrical.
Case studies
In this section we will present the results of applying the regression method to a practical problem, namely the prediction of the corrosion of some alloys containing titanium and molybdenum (TiMo).
Description of data
Titanium (Ti) and its alloys are widely used in dental applications due to the excellent corrosion resistance and mechanical properties. However, it has been reported that Ti is sensitive to fluoride (F–) and lactic acid. Consequently, the samples were examined using electrochemical impedance spectroscopy (EIS) in acidic artificial saliva with NaF and/or caffeine. The material corrosion was quantified by the polarization resistance (R p ) of the TiMo alloys (the output of the model) which was recorded depending on the immersion time, caffeine concentration, NaF concentration, the type of alloy, i.e. Ti content, and solution pH (the inputs of the model). The experimental data covered a large domain, corresponding to the following conditions: 12, 20 and 40 wt.% of Mo, 0.1, 0.3 and 0.5 wt.% NaF, 0, 0.5, 1 and 1.5 mg/mL caffeine and 3–8 for solution pH [4].
Results and discussion
In order to compare different algorithms and models, two separate problems are considered. The first is to assess the performance on the testing set, after randomizing the dataset and selecting 2/3 of the instances as the training set and 1/3 as the testing set. The second one is to perform cross-validation with 10 folds.
The coefficient of determination (R2), the squared coefficient of correlation, is used as a metric to compare the performance of different models.
Table 2 presents the results obtained with various algorithms implemented in the popular collection of machine learning algorithms Weka [8], for the training-testing split and for cross-validation. The results are sorted in decreasing order of the cross-validation results. On each column, the maximum value is emphasised with bold characters.
The abbreviations in the table are as follows: ν-SVR and ɛ-SVR stand for ν-Support Vector Regression and ɛ-Support Vector Regression, respectively, RBF stands for Radial Basis Function kernel, P2 stands for polynomial kernel of the second degree, kNN stands for k-Nearest Neighbours and NN stands for simple Nearest Neighbour.
Evolutionary approach
When using the evolutionary algorithm, 10 neighbours were considered when computing the output, because this value gave the best results for the kNN method in Weka. It was considered that fewer neighbours do not provide enough information to compute a precise value, while if more neighbours are used, noise can be added and often the influence of the more distant ones is negligible, since the neighbour weights are proportional to the inverse of the square distance.
In the computation of the fitness function, 3 reference neighbours were used. This number should be at least 2, in order to provide some contrast between an instance with a close value and another with a distant value, by analogy with the target and impostor instances for classification. With 2 reference neighbours, the results were far worse than with 3, with the best coefficient of determination of 0.8984. With more reference neighbours, the computation time greatly increased, while the results were no better than in case of 3 reference neighbours.
Table 3 shows the average and best results for the case of the training set – testing set split and also for cross-validation with 10 folds, while considering 4 different configurations: with 1, 2, 5 and 10 prototypes. Theoretically, each instance could have its own metric, but in our case, with a dataset of 1152 instances, the computation time would be too high for any practical purposes.
The positions of the clusters were obtained with the k-means algorithm taking into account only the inputs of the instances.
The results presented were obtained after 50 runs. The results with R2 < 0.5 were eliminated because they were considered to be outliers. The results on the training set are omitted, because they are always 1.
While cross-validation is a way to compare a set of algorithms, it cannot provide a unique set of “good” values for the internal parameters, because it generates 10 different models. Therefore, the single split into training and testing sets is useful in this respect.
The best performance for the testing set is given by the configuration with 1 prototype, i.e. a global set of weights for all the instances of the problem. The actual values of the weights are presented in Table 4.
In order to be more intuitive, the values are normalized, such that their sum is equal to 1. By normalization, i.e. multiplication with a constant, the results of the regression procedure do not change. It should be reminded that the weights m
i
i are in fact the squares of the gene values representing the elements of the
Table 5 presents the weights for the best fit on the training-testing problem, corresponding to 2 prototypes.
Beside the variation in performance between different runs, which is normal especially since the number of generations (500) and individuals in the population (40) may be a little too small for our problem, we hypothesise that there may be another reason why 1 prototype provides the best results for the training-testing case and 2 prototypes provide the best results for cross-validation. In one fold of cross-validation, 90% of the data are used, compared to 67% for training in the first case. Therefore, the data diversity is larger and 2 prototypes provide more flexibility. Also the positions of the prototypes are currently fixed, pre-computed, taking into account the whole dataset, and thus the 90% situation better matches the distribution of the full data. However, when the number of prototypes increases, the problem space becomes too finely partitioned, because the evolutionary algorithm evolves the weights independently and it is almost certain that the sets of weights corresponding to different prototypes will be different, although the actual topology of the problem may not require such distinctions in the distance metric of the instances.
Differential approach
An important issue concerning the speed of the algorithm is the step size γ in Equation (17). That is why several methods were tried to find the best value, both statically and dynamically. If the step size is too small, the algorithm converges very slowly. If the step size is too big, the algorithm may “jump over” the solution and fail to provide a good result.
The static approach is to have a constant value for γ, in our case 0.1 and 0.5. The dynamic approach is to consider a decreasing rate for γ, i.e. larger at the beginning and getting smaller as the algorithm approaches the solution, similar to a fine tuning procedure.
Two empirically found models were evaluated. It was considered that the step size should start around 1 and then it should decrease with the number of iterations, but not below 0.1.
The first model is the reciprocal quadratic one, with the following general expression:
In our case, the value of the step size is:
The second model is the Farazdaghi-Harris one, with the following general expression:
In our case, the value of the step size is:
Figure 2 presents a comparison between the evolution of the step sizes following the proposed models, for a typical scenario with the minimum number of parameters: 1 prototype and 3 neighbours both for regression predictions and the optimization of the objective function.
One can see that the smallest number of iterations appears when using the reciprocal quadratic model. For different configurations, and even for different runs of the same configuration, the actual number of iterations differs. However, the reciprocal quadratic model provided the quickest convergence times for the tests we made. Therefore, it was used for all the following case studies.
When using a constant step size, one can see that a value of 0.1 leads to a number of iterations approximately four times greater than for a value of 0.5.
In Fig. 3, one can see the convergence of the objective function. The algorithm was stopped when the difference between two consecutive values of the objective function was below 10–6. The reciprocal quadratic model for the step size also leads to good optimization results. In fact, all 4 models succeed in solving the optimization problem, but the main difference is the number of steps needed to achieve the goal.
Since 3 optimization neighbours were used, the objective function converges to 9. In general, the objective function converges to the square of the number of optimization neighbours, because criterion F1 in Equation (12) tends to converge to 0 and criterion F2 in Equation (13) tends to converge to the square of the number of instances in N (i).
For all the following configurations, the algorithm was applied 10 times and the best result wasrecorded.
In Table 6, one can see the cross-validation results when varying the parameters of the configuration, i.e. the number of prototypes, the number of optimization neighbours and the number of regression neighbours.
It must be mentioned that the number of optimization neighbours was only increased in the approximate differential approach. For the evolutionary approach, the execution time greatly increased when using more than 3 optimization neighbours. The regression neighbours are those instances actually used to compute the value of the test instance, independent of the distance metrics.
The coefficient of determination R2 was again used to measure the performance of the algorithm.
Unlike the case of the evolutionary approach, with the approximate differential optimization good results were obtained when using a smaller number of regression neighbours, e.g. 3, while 10 were used for the evolutionary algorithm. One explanation for this difference is that the evolutionary algorithm does not provide exact values for the optimal weights and therefore more regression neighbours must be used to compensate for this lack of precision.
Figure 4 presents the correlation between the desired values and the values provided by the model, where the provided values are the predictions of the cross-validation models for the testing groups alone. Beside the high value of the coefficient of determination, one can also graphically observe the good fit of the data.
One can also see that there is a single instance with the value of 1, for which the algorithm fails to provide a good prediction. The rest of the instances have a maximum output value around 0.52. In this case, the model treats the single instance as an outlier. It cannot reach a value of 1, based on the composition of the other values which are below 0.52.
There are also 2 instances whose values should be around 0.1, but the prediction is around 0.4. This is because they lie in a region of the problem space where the closest neighbours are those with higher output values. When using a larger number of regression neighbours, their values are reduced, but in turn, the values of other instances are affected.
Overall, despite those exceptions, a small number of regression neighbours provide the best results for the considered problem.
Table 7 presents the best performance of the algorithm in case of the training-testing data split.
Figure 5 presents the correlation between the desired values and the values provided by the model for the training-testing data split. One can see that the distribution of data is similar to the one in Fig. 4, but, generally, the results are better. The instance with the output value of 1 is missing here, because it was randomly included in the training set. Also, there is only one instance whose predicted value is clearly different from its desired output value. Even in this case, the predicted value is below 0.3, which is closer to the desired output than the two similar instances in Fig. 4.
Tables 8–11 present the prototype positions and weights for the best solutions obtained with a different number of prototypes: 1, 2, 3 and 5. Of course, with only 1 prototype, only the weight values matter.
It can be seen that there is no unique combination of weights that are found by the algorithm, because it is very likely that the objective function has a large number of local optima. The dimension corresponding to input x3 has small weights in most situations. However, it is difficult to obtain a general interpretation of the results, because different prototypes characterize different regions in the problem space, and in each region there may be different rules regarding the importance of the inputs for the regression process.
Concerning the optimization methodology, some additional tests have been made to compare the optimization of the large margin nearest neighbour model with the direct differential optimization based on the mean square error criterion. The latter method was used for a division of the dataset of 50% for training, 17% for validation and 33% for testing (the same testing set as in the training-testing data split presented above). It was found that the former method consistently offered better results.
It must also be reported that sometimes, with both evolutionary and differential approaches, the minimization of the objective function does not lead to good regression results. Thus, it is possible to have situations where the objective function is very small, but the coefficient of determination between the desired and the predicted output values is small as well, e.g. about 0.8. Basically, the same value for the objective function can lead to both very good and less than good results. While the objective function expressed in Equation (11) is nevertheless useful, a future direction of investigation needs to search for an alternative objective function which should be more appropriate for regression than the one used here, designed by analogy with the function proposed for classification.
The results obtained for a rather difficult problem using a large margin nearest neighbour regression method are quite promising. The weights are obtained using two optimization methods, based on an evolutionary algorithm and on an approximate differential technique, which provides simplicity and flexibility and allows the use of several distance metrics in different regions of the problem space, corresponding to different prototypes. As a future direction of research, one should also find a way to automatically determine the optimal number of prototypes and their positions.
The presented method allows the user to change the number of neighbours that are considered for distance calculation and in the criteria of the composite fitness function in order to adapt these parameters to the particular characteristics of the problem.
Even if the optimization methods presented here require repeated experiments for the same configuration, the results are very good and actually better than the results provided by well-established machine learning algorithms implemented in Weka. This is a very encouraging fact and it shows that the model based on the combination of the k-nearest neighbour paradigm with the idea of the large margin has a great potential for regression problems as well as for classification ones.
The modelling procedure applied for corrosion resistance evaluation as a function of process conditions contributes to a better understanding of the process and can partially replace a series of experiments that usually require great quantities of materials, time and energy.
Footnotes
Acknowledgments
This work was supported by the “Partnership in priority areas – PN-II” programme, financed by ANCS, CNDI - UEFISCDI, project PN-II-PT-PCCA-2011-3.2-0732, no. 23/2012.
