On robust twin support vector regression in primal using squared pinball loss

Abstract

Construction of robust regression learning models to fit training data corrupted by noise is an important and challenging research problem in machine learning. It is well known that loss functions play an important role in reducing the effect of noise present in the input data. With the objective of obtaining a robust regression model, and motivated by the link between the pinball loss and quantile regression, a novel squared pinball loss twin support vector machine for regression (SPTSVR) is proposed in this work. Furthermore, with the introduction of a regularization term, the proposed model solves a pair of strongly convex minimization problems having unique solutions by a simple functional iterative method. Experiments were performed on synthetic datasets with different noise models and on real world datasets, and the results were compared with support vector regression (SVR), least squares support vector regression (LS-SVR) and twin support vector regression (TSVR) methods. The comparative results show that the proposed SPTSVR is an effective and useful addition to the machine learning literature.

Keywords

Kernel methods, pinball loss, robust support vector regression

1 Introduction

In recent years, support vector machines (SVMs) introduced by Vapnik [20] have emerged as one of the most powerful methods for solving classification and regression problems. They have been widely studied and successfully applied in various pattern recognition areas of research, such as image processing, bioinformatics and economics [2, 6]. Although SVMs are powerful machine learning tools, the training of SVMs leads to solving time-consuming quadratic programming problems (QPPs). In fact, the training time complexity of SVM is O(m³), where m is the size of the training set, which makes SVM intractable once m becomes large.

With the aim of reducing training cost, following the work on nonparallel SVM called Proximal SVM via Generalized Eigenvalues (GEPSVM) introduced in [10], twin SVM (TWSVM) for binary classification was proposed recently by Jayadeva et al. [4].

TWSVM seeks two nonparallel hyperplanes with the property that each one of them is as close as possible to the inputs of one of the two classes and at the same time at least one unit distance away from the inputs of the other class. This method has attracted a lot of interest in recent years and the interested reader is referred to [8, 17].

Prediction by regression is an important field of study in machine learning. With the introduction of the ε-insensitive error loss function by Vapnik [20], SVM has been successfully extended to regression. Like SVM for classification, the training of support vector regression (SVR) solves a QPP with linear inequality constraints. In the spirit of TWSVM, Peng [15] developed twin support vector regression (TSVR) for function approximation and regression. TSVR generates a pair of ε-insensitive down-bound and up-bound functions such that (i) both the functions lie as close as possible to the training data; (ii) all the training data are required to lie above the down-bound function but below the up-bound function; (iii) for each training point, its distance from the down-bound function and similarly from the up-bound function should be at least ε. This strategy results in solving two smaller sized QPPs rather than a single QPP of large size as in the case of SVR [9, 20]. The advantage of this methodology is that the time complexity of TSVR becomes significantly smaller than that of SVR [15].

In many real world applications, observed data are very often subject to unknown noise distributions. It is well known that loss functions play a crucial role in reducing the effect of noise present in the training data. In fact, the generalization ability of a learning method depends largely on choosing a loss function that represents the characteristics of the noise present in the training data. The popularly used loss functions for regression are the quadratic, 1-norm and ε-insensitive functions, defined for x ∈ R by (i) L (x) = x²; (ii) L (x) = |x| and (iii) L (x) = max {|x| - ε, 0} respectively [9, 20]. Among them, the quadratic loss function is smooth and hence attractive, but it is less robust because it is sensitive to large errors. Unlike the quadratic loss function, both the 1-norm and ε-insensitive loss functions reduce the sensitivity to noise and hence are more robust for learning. However, they are not smooth, which precludes the application of well known gradient-based numerical minimization methods.
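The three losses above can be compared numerically. The following is a minimal NumPy sketch; the function names and the sample residuals are illustrative assumptions, not part of the paper. It shows how the quadratic loss grows much faster than the other two on a large residual, which is the sensitivity to large errors mentioned above.

```python
import numpy as np

def quadratic_loss(x):
    """L(x) = x^2."""
    return np.asarray(x, dtype=float) ** 2

def one_norm_loss(x):
    """L(x) = |x|."""
    return np.abs(np.asarray(x, dtype=float))

def eps_insensitive_loss(x, eps):
    """L(x) = max{|x| - eps, 0}."""
    return np.maximum(np.abs(np.asarray(x, dtype=float)) - eps, 0.0)

# Illustrative residuals: two small errors and one large (outlier-like) error.
residuals = np.array([0.05, -0.2, 5.0])
print("quadratic      :", quadratic_loss(residuals))
print("1-norm         :", one_norm_loss(residuals))
print("eps-insensitive:", eps_insensitive_loss(residuals, eps=0.1))
```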

Recently, with the aim of reducing the sensitivity to noise and further improving the stability under re-sampling, an asymmetric pinball SVM classifier based on the pinball loss has been proposed in [13]. As its extension to the squared pinball loss SVM for classification, an asymmetric least squares SVM is studied in [14]. In this work, it is proposed to use the asymmetric squared pinball loss for TSVR and to study its effectiveness and applicability on a few synthetic noisy datasets and on a few well known benchmark datasets. For recent work on pinball loss SVM, we refer the reader to [12, 19].

In this study, all vectors will be column vectors. For any vector x = (x₁, …, x_n)^t ∈ Rⁿ, let its transpose and 2-norm be denoted by x^t and ||x|| respectively, and define the plus function x₊ by (x₊)_i = max {0, x_i} for i = 1, …, n. The m dimensional vector of zeros and the vector of ones will be denoted by 0 and e respectively, and the identity matrix of appropriate size is denoted by I.

The rest of the paper is organized as follows. In the next section, the formulations of the standard SVR, LS-SVR and TSVR are briefly reviewed. In Section 3, we formulate our novel robust squared pinball loss TSVR in primal (SPTSVR) and solve it by a functional iterative method. In Section 4, numerical experiments are performed on synthetic datasets having different types of noise and on real world datasets, and the results are compared with SVR, LS-SVR and TSVR. The conclusion is drawn in Section 5.

2 Related work

In this section, we briefly review the formulations of support vector regression (SVR), least squares support vector regression (LS-SVR) and a popular variant of SVR known as twin support vector regression (TSVR) proposed by Peng [15].

Let a set {(x_i, y_i)}_i=1,2,...,m of training examples be given, where for each input x_i ∈ Rⁿ the corresponding observed value is y_i ∈ R. Let A ∈ R^m×n be the input matrix whose i-th row is $x_{i}^{t}$ and let y = (y₁, …, y_m)^t be the vector of observed values. For a given kernel function k (. , .), let K (A, A^t) be the kernel matrix of order m whose ij-th entry is (K (A, A^t))_ij = k (x_i, x_j). Further, for any x ∈ Rⁿ, let K (x, A^t) = (k (x, x₁), …, k (x, x_m)) be a row vector.

2.1 Support vector regression

To obtain a nonlinear regression function for a given training set, the training data are mapped into a higher dimensional feature space via a nonlinear mapping φ (.) [9, 20] and a linear regressor is constructed in the feature space such that the resulting regression function is made as flat as possible.

The standard SVR learning method determines the regression function f (.) by solving the following QPP [9, 20]: $\min_{w, b, ξ_{1}, ξ_{2}} \frac{1}{2} w^{t} w + C \sum_{i = 1}^{m} (ξ_{1 i} + ξ_{2 i})$ subject to $\begin{matrix} y_{i} - w^{t} φ (x_{i}) - b \leq ε + ξ_{1 i}, \\ w^{t} φ (x_{i}) + b - y_{i} \leq ε + ξ_{2 i} \end{matrix}$ (2.1) and $ξ_{1 i}, ξ_{2 i} \geq 0$ for i = 1, 2, …, m, where $ξ_{1 i}$ and $ξ_{2 i}$ are slack variables, C > 0 is the regularization parameter and ε > 0 is an input parameter.

The solution of (2.1) can be obtained by solving its dual of the form $\begin{matrix} \min_{u_{1}, u_{2} \in R^{m}} \frac{1}{2} \sum_{i, j = 1}^{m} (u_{1 i} - u_{2 i}) k (x_{i}, x_{j}) (u_{1 j} - u_{2 j}) \\ + ε \sum_{i = 1}^{m} (u_{1 i} + u_{2 i}) - \sum_{i = 1}^{m} y_{i} (u_{1 i} - u_{2 i}) \end{matrix}$ subject to $\sum_{i = 1}^{m} (u_{1 i} - u_{2 i}) = 0$ and $0 \leq u_{1}, u_{2} \leq C e$, and the nonlinear regression function is given by $f (x) = \sum_{i = 1}^{m} (u_{1 i} - u_{2 i}) k (x, x_{i}) + b,$ where $u_{1} = (u_{1 1}, \dots, u_{1 m})^{t}$, $u_{2} = (u_{2 1}, \dots, u_{2 m})^{t} \in R^{m}$ are the Lagrange multipliers. Using a kernel function k (x, y) = φ (x)^t φ (y) avoids the explicit construction of the nonlinear mapping φ (.). For more details on the formulation and the derivations, we refer the reader to [9, 20].

2.2 Least squares support vector regression

For the given set of training examples, the least squares support vector regression (LS-SVR) method solves a QPP based on the quadratic loss function subject to equality constraints, defined as $\begin{matrix} \min_{w, b, ξ} \frac{1}{2} w^{t} w + \frac{C}{2} \sum_{i = 1}^{m} ξ_{i}^{2} \\ \text{subject to} \\ y_{i} = w^{t} φ (x_{i}) + b + ξ_{i}, \quad i = 1, 2, \dots, m, \end{matrix}$ where the mapping φ (.) takes the input data into a higher dimensional feature space; the vector w from the feature space and the bias b ∈ R are unknowns; ξ = (ξ₁, …, ξ_m)^t is the residual vector and C > 0 is the regularization parameter.

The dual of the above problem can be obtained leading to solving the following matrix equation $[\begin{matrix} 0 & e^{t} \\ e & (I / C + K (A, A^{t})) \end{matrix}] [\begin{matrix} b \\ u \end{matrix}] = [\begin{matrix} 0 \\ y \end{matrix}],$ where u is the Lagrangian multiplier vector.

Using the solution of the above matrix equation, the nonlinear prediction function f (.) is obtained as $f (x) = \sum_{i = 1}^{m} u_{i} k (x, x_{i}) + b.$
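The LS-SVR training step therefore reduces to a single linear solve. The following NumPy sketch builds the block system above and evaluates the prediction function; the helper names, the toy data and the parameter values are assumptions for illustration, and the Gaussian kernel is the one used later in the experiments.

```python
import numpy as np

def gaussian_kernel(X, Z, mu):
    """Kernel matrix with entries exp(-mu * ||x_i - z_j||^2)."""
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-mu * np.sum(diff ** 2, axis=2))

def lssvr_fit(A, y, C, mu):
    """Solve [[0, e^t], [e, I/C + K(A, A^t)]] [b; u] = [0; y] for (b, u)."""
    m = A.shape[0]
    M = np.zeros((m + 1, m + 1))
    M[0, 1:] = 1.0                      # e^t
    M[1:, 0] = 1.0                      # e
    M[1:, 1:] = np.eye(m) / C + gaussian_kernel(A, A, mu)
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(M, rhs)
    return sol[0], sol[1:]

def lssvr_predict(X_new, A, b, u, mu):
    """f(x) = sum_i u_i k(x, x_i) + b, evaluated for each row of X_new."""
    return gaussian_kernel(X_new, A, mu) @ u + b

# Toy data (illustrative only).
A = np.linspace(-3.0, 3.0, 30).reshape(-1, 1)
y = np.sin(A[:, 0])
b, u = lssvr_fit(A, y, C=100.0, mu=1.0)
print(lssvr_predict(np.array([[0.5]]), A, b, u, mu=1.0))
```

The first row of the system enforces e^t u = 0, so the multipliers sum to zero at the solution.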

2.3 Twin support vector regression

Motivated by the study of twin support vector machines (TWSVM) for classification proposed by Jayadeva et al. [4], a new learning method for regression called twin support vector regression (TSVR) was developed by Peng in [15]. TSVR determines two nonparallel functions f₁ (x) and f₂ (x), namely the ε-insensitive down-bound and up-bound functions, by solving two smaller SVR type QPPs rather than a single QPP of larger size. Owing to this strategy of solving two smaller QPPs, the training time of TSVR becomes shorter than that of SVR [15].

In this section, we briefly state the TSVR problem formulation. For a detailed discussion on the problem formulation, the method of solving and its advantages, see [15].

Assume that the kernel generated ε-insensitive down- and up-bound regression functions are $\begin{matrix} f_{1} (x) = K (x, A^{t}) w_{1} + b_{1} \text{ and} \\ f_{2} (x) = K (x, A^{t}) w_{2} + b_{2} \quad \forall x \in R^{n} \end{matrix}$ (2.2) respectively. They will be determined using the solutions of the following pair of minimization problems [15]: $\begin{matrix} \min_{(w_{1}, b_{1}, ξ_{1}) \in R^{m + 1 + m}} \frac{1}{2} \| y - ε_{1} e - (K (A, A^{t}) w_{1} + b_{1} e) \|^{2} + C_{1} e^{t} ξ_{1} \\ \text{subject to} \\ y - (K (A, A^{t}) w_{1} + b_{1} e) \geq ε_{1} e - ξ_{1}, \quad ξ_{1} \geq 0 \end{matrix}$ (2.3) and $\begin{matrix} \min_{(w_{2}, b_{2}, ξ_{2}) \in R^{m + 1 + m}} \frac{1}{2} \| y + ε_{2} e - (K (A, A^{t}) w_{2} + b_{2} e) \|^{2} + C_{2} e^{t} ξ_{2} \\ \text{subject to} \\ (K (A, A^{t}) w_{2} + b_{2} e) - y \geq ε_{2} e - ξ_{2}, \quad ξ_{2} \geq 0 \end{matrix}$ (2.4) respectively, where C₁, C₂ > 0 are regularization parameters; ε₁, ε₂ > 0 are input parameters and ξ₁, ξ₂ are vectors of slack variables.

The pair of constrained optimization problems (2.3) and (2.4) is equivalent to the following pair of unconstrained optimization problems: $\begin{matrix} \min_{(w_{1}, b_{1}) \in R^{m + 1}} \frac{1}{2} \| y - ε_{1} e - (K (A, A^{t}) w_{1} + b_{1} e) \|^{2} \\ + C_{1} \sum_{i = 1}^{m} L_{H} ((K (x_{i}, A^{t}) w_{1} + b_{1}) - (y_{i} - ε_{1})) \end{matrix}$ (2.5) and $\begin{matrix} \min_{(w_{2}, b_{2}) \in R^{m + 1}} \frac{1}{2} \| y + ε_{2} e - (K (A, A^{t}) w_{2} + b_{2} e) \|^{2} \\ + C_{2} \sum_{i = 1}^{m} L_{H} ((y_{i} + ε_{2}) - (K (x_{i}, A^{t}) w_{2} + b_{2})) \end{matrix}$ (2.6) where for any t ∈ R, L_H (t) = t₊ = max {0, t} is the hinge loss function. Hence one can solve (2.5) and (2.6) to determine the bound functions (2.2).
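A short worked step, not spelled out in the source, shows why the two forms agree. For fixed (w₁, b₁), the constraints of (2.3) require $ξ_{1} \geq (K (A, A^{t}) w_{1} + b_{1} e) - (y - ε_{1} e)$ and $ξ_{1} \geq 0$. Since C₁ > 0, the objective is smallest when every slack takes its least feasible value, that is $ξ_{1} = ((K (A, A^{t}) w_{1} + b_{1} e) - (y - ε_{1} e))_{+}$, so that $C_{1} e^{t} ξ_{1} = C_{1} \sum_{i = 1}^{m} L_{H} ((K (x_{i}, A^{t}) w_{1} + b_{1}) - (y_{i} - ε_{1}))$, which is the penalty term of (2.5). The same argument applied to (2.4) gives $ξ_{2} = ((y + ε_{2} e) - (K (A, A^{t}) w_{2} + b_{2} e))_{+}$ and hence the penalty term of (2.6).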

For a new test sample, its prediction value using TSVR will be obtained as the mean of its down- and up-bound regressors, i.e. $f (x) = \frac{1}{2} (f_{1} (x) + f_{2} (x)) \quad \text{for all } x \in R^{n}.$ (2.7)

3 Squared pinball loss twin support vector regression (SPTSVR)

In this section, we propose a novel twin support vector regression algorithm using the squared pinball loss function (SPTSVR) as an extension of the pinball SVM for classification proposed in [13], with the purpose of reducing sensitivity to noise present in the training data. Our formulation leads to solving a pair of unconstrained minimization problems, which we solve by a functional iterative method.

Consider the squared pinball loss function defined as follows: for the input parameter 0 ≤ p ≤ 1, $L_{p} (t) = \begin{cases} p t^{2} & \text{for } t \geq 0 \\ (1 - p) t^{2} & \text{otherwise} \end{cases}$

Clearly, L_p (t) = p max {0, t}² + (1 - p) max {0, -t}² or equivalently $L_{p} (t) = p t_{+}^{2} + (1 - p) (- t)_{+}^{2} .$ (3.1)

The graphical representation of the squared pinball loss function for various values of p is shown in Fig. 1. It is clear from Fig. 1 that the squared pinball loss function is asymmetric whenever p ≠ 1/2. Note that this function is more suitable for dealing with asymmetric noise [13]. Again, one can observe that (3.1) becomes the symmetric quadratic loss function, scaled by 1/2, when p = 1/2.
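As a quick numerical check of (3.1), the sketch below evaluates the squared pinball loss with NumPy; the function name and the sample values of t and p are illustrative assumptions. It prints the loss at t = ±2 for a few values of p, showing the asymmetry for p ≠ 1/2 and the symmetry at p = 1/2.

```python
import numpy as np

def squared_pinball_loss(t, p):
    """L_p(t) = p * (t)_+^2 + (1 - p) * (-t)_+^2, as in (3.1)."""
    t = np.asarray(t, dtype=float)
    return p * np.maximum(t, 0.0) ** 2 + (1.0 - p) * np.maximum(-t, 0.0) ** 2

for p in (0.1, 0.5, 0.9):
    left = float(squared_pinball_loss(-2.0, p))
    right = float(squared_pinball_loss(2.0, p))
    print(f"p = {p}: L_p(-2) = {left:.2f}, L_p(2) = {right:.2f}")
```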

Fig. 1

Illustration of squared pinball loss functions for different values of p.

Now we describe our proposed squared pinball TSVR (SPTSVR) learning method. Like TSVR, our SPTSVR method constructs a pair of nonparallel regressors of the form (2.2) by solving the pair of unconstrained minimization problems $\begin{matrix} \min_{(w_{1}, b_{1}) \in R^{m + 1}} (w_{1}^{t} w_{1} + b_{1}^{2}) \\ + D_{1} \| y - ε_{1} e - (K (A, A^{t}) w_{1} + b_{1} e) \|^{2} \\ + C_{1} \sum_{i = 1}^{m} L_{p} ((K (x_{i}, A^{t}) w_{1} + b_{1}) - (y_{i} - ε_{1})) \end{matrix}$ (3.2) and $\begin{matrix} \min_{(w_{2}, b_{2}) \in R^{m + 1}} (w_{2}^{t} w_{2} + b_{2}^{2}) \\ + D_{2} \| y + ε_{2} e - (K (A, A^{t}) w_{2} + b_{2} e) \|^{2} \\ + C_{2} \sum_{i = 1}^{m} L_{p} ((y_{i} + ε_{2}) - (K (x_{i}, A^{t}) w_{2} + b_{2})) \end{matrix}$ (3.3) where C₁, D₁ > 0 and C₂, D₂ > 0 are regularization parameters. Note that the hinge loss in TSVR has been replaced by the squared pinball loss and, further, an extra regularization term has been added in our SPTSVR model to avoid over-fitting. This makes the objective functions of (3.2) and (3.3) strongly convex, which implies that the minimization problems have unique solutions.

Now, by taking the vectors $u_{1} = [\begin{matrix} w_{1} \\ b_{1} \end{matrix}]$, $u_{2} = [\begin{matrix} w_{2} \\ b_{2} \end{matrix}]$ and using (3.1), the pair of problems (3.2) and (3.3) can be rewritten as $\begin{matrix} \min_{u_{1} \in R^{m + 1}} L_{1} (u_{1}) = u_{1}^{t} u_{1} + D_{1} \| y - ε_{1} e - G u_{1} \|^{2} \\ + C_{1} [p \| (ε_{1} e + G u_{1} - y)_{+} \|^{2} \\ + (1 - p) \| (y - ε_{1} e - G u_{1})_{+} \|^{2}] \end{matrix}$ (3.4) and $\begin{matrix} \min_{u_{2} \in R^{m + 1}} L_{2} (u_{2}) = u_{2}^{t} u_{2} + D_{2} \| y + ε_{2} e - G u_{2} \|^{2} \\ + C_{2} [p \| (y + ε_{2} e - G u_{2})_{+} \|^{2} \\ + (1 - p) \| (G u_{2} - y - ε_{2} e)_{+} \|^{2}] \end{matrix}$ (3.5) where G = [K (A, A^t) e] is an augmented matrix.

In this work, it is proposed to find the solutions of the problems (3.4) and (3.5) by obtaining their critical points using functional iterative method.

Consider the problem (3.4). Its critical point can be obtained by setting its gradient to zero. Let ∇L₁ (u₁) be the gradient of L₁ (u₁). Then, after dividing by 2, the condition ∇L₁ (u₁) = 0 becomes

$\begin{matrix} u_{1} + D_{1} [G^{t} G u_{1} - G^{t} (y - ε_{1} e)] \\ + C_{1} [p G^{t} (ε_{1} e - y + G u_{1})_{+} \\ - (1 - p) G^{t} (y - ε_{1} e - G u_{1})_{+}] = 0 \end{matrix}$

from which we get $\begin{matrix} (\frac{I}{D_{1}} + G^{t} G) u_{1} = G^{t} [(y - ε_{1} e) \\ + \frac{C_{1}}{D_{1}} \{ (1 - p) (y - ε_{1} e - G u_{1})_{+} - p (ε_{1} e - y + G u_{1})_{+} \}] \end{matrix}$ which further leads to a simple functional iterative procedure: for i = 0, 1, 2, … $\begin{matrix} u_{1}^{i + 1} = (\frac{I}{D_{1}} + G^{t} G)^{- 1} G^{t} [(y - ε_{1} e) \\ + \frac{C_{1}}{D_{1}} \{ (1 - p) (y - ε_{1} e - G u_{1}^{i})_{+} - p (ε_{1} e - y + G u_{1}^{i})_{+} \}] \end{matrix}$ (3.6) where $(\frac{I}{D_{1}} + G^{t} G)^{- 1}$ is the inverse of $(\frac{I}{D_{1}} + G^{t} G)$.

Similarly, the critical point for the problem (3.5) will be determined by $\begin{matrix} u_{2}^{i + 1} = (\frac{I}{D_{2}} + G^{t} G)^{- 1} G^{t} [(y + ε_{2} e) \\ + \frac{C_{2}}{D_{2}} \{ p (y + ε_{2} e - G u_{2}^{i})_{+} - (1 - p) (G u_{2}^{i} - y - ε_{2} e)_{+} \}] \end{matrix}$ (3.7)

Finally, as in TSVR, the predicted value for a test sample will be computed using (2.7), or more precisely $f (x) = [K (x, A^{t}) \; 1] (\frac{u_{1} + u_{2}}{2})$ where u₁, u₂ are the limits of the iterations (3.6), (3.7) and [K (x, A^t) 1] is a 1 × (m + 1) augmented row vector.

Remark. For simplicity, numerical experiments in the following section were performed by taking C₁ = D₁ and C₂ = D₂.
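The iterations (3.6) and (3.7) and the prediction rule can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: the starting point, the stopping rule (`tol`, `max_iter`), the helper names and the toy data are all assumptions, since this section does not specify them. The Gaussian kernel is the one used in the experiments.

```python
import numpy as np

def gaussian_kernel(X, Z, mu):
    """Kernel matrix with entries exp(-mu * ||x_i - z_j||^2)."""
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-mu * np.sum(diff ** 2, axis=2))

def plus(v):
    """Plus function: componentwise max{0, v_i}."""
    return np.maximum(v, 0.0)

def augmented(X, A, mu):
    """Augmented matrix [K(X, A^t)  e]; equals G when X = A."""
    return np.hstack([gaussian_kernel(X, A, mu), np.ones((X.shape[0], 1))])

def functional_iteration(G, target, C, D, correction, tol, max_iter):
    """Iterate u <- (I/D + G^t G)^{-1} G^t [target + (C/D) * correction(G u)]."""
    M = np.eye(G.shape[1]) / D + G.T @ G
    # Assumed starting point: the solution with the plus-function terms dropped.
    u = np.linalg.solve(M, G.T @ target)
    for _ in range(max_iter):
        u_new = np.linalg.solve(M, G.T @ (target + (C / D) * correction(G @ u)))
        if np.linalg.norm(u_new - u) <= tol:   # assumed stopping rule
            return u_new
        u = u_new
    return u

def sptsvr_fit(A, y, C1, C2, D1, D2, eps1, eps2, p, mu, tol=1e-6, max_iter=1000):
    G = augmented(A, A, mu)
    t1 = y - eps1          # shifted targets of the down-bound problem
    t2 = y + eps2          # shifted targets of the up-bound problem
    # Bracketed terms of (3.6) and (3.7) as functions of f = G u.
    corr1 = lambda f: (1.0 - p) * plus(t1 - f) - p * plus(f - t1)
    corr2 = lambda f: p * plus(t2 - f) - (1.0 - p) * plus(f - t2)
    u1 = functional_iteration(G, t1, C1, D1, corr1, tol, max_iter)
    u2 = functional_iteration(G, t2, C2, D2, corr2, tol, max_iter)
    return u1, u2

def sptsvr_predict(X_new, A, u1, u2, mu):
    """f(x) = [K(x, A^t)  1] (u1 + u2) / 2."""
    return augmented(X_new, A, mu) @ (0.5 * (u1 + u2))

# Toy usage with C1 = D1 and C2 = D2, as in the Remark (data are illustrative).
rng = np.random.default_rng(0)
A = np.linspace(-3.0, 3.0, 40).reshape(-1, 1)
y = np.sin(A[:, 0]) + 0.1 * rng.standard_normal(40)
u1, u2 = sptsvr_fit(A, y, 1.0, 1.0, 1.0, 1.0, 0.01, 0.01, p=0.7, mu=1.0)
print(sptsvr_predict(np.array([[0.0], [1.5]]), A, u1, u2, mu=1.0))
```

Because the matrix I/D + G^t G does not change between iterations, an implementation could factor it once and reuse the factorization; the sketch calls `np.linalg.solve` each time only for brevity.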

4 Experimental results

Now, we demonstrate the performance of SPTSVR in comparison with SVR, LS-SVR and TSVR. We implemented the algorithms in MATLAB R2008b running on a Microsoft Windows 7 PC with an Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz and 4 GB of memory. SVR and TSVR were implemented using the MOSEK (http://www.mosek.com) optimization toolbox for MATLAB. No external optimizer was used for LS-SVR and SPTSVR. The Gaussian kernel of the form K (x, y) = exp(-μ||x - y||²) is used, where μ > 0 is a parameter. The test accuracy is measured by the root mean square error (RMSE).
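For concreteness, the Gaussian kernel matrix and the RMSE can be computed as in the NumPy sketch below. The function names and sample inputs are illustrative assumptions, and the RMSE is taken in its usual form, the square root of the mean squared prediction error, which the text does not write out.

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, mu):
    """Matrix with entries K(x_i, z_j) = exp(-mu * ||x_i - z_j||^2)."""
    diff = X[:, None, :] - Z[None, :, :]
    return np.exp(-mu * np.sum(diff ** 2, axis=2))

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel_matrix(X, X, mu=0.5)
print(K)
print(rmse([0.0, 0.0], [3.0, 4.0]))
```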

To limit the number of free parameters, we assumed ε₁ = ε₂ = ε for TSVR, whereas for SPTSVR we fixed ε₁ = ε₂ = 0.01. The optimal values of the kernel parameter μ, the regularization parameter C₁ = C₂ = C, and the parameters ε and p were selected by ten-fold cross validation, searching over the sets {2⁻⁵, 2⁻⁴, …, 2⁴, 2⁵}, {10⁻⁵, 10⁻⁴, …, 10⁴, 10⁵}, {10⁻¹, 10⁻², 10⁻³} and {0.1, 0.2, 0.4, 0.7, 0.9, 1} respectively.
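The search sets above can be written out programmatically. The sketch below assumes that the ellipses denote consecutive integer exponents, and the fold-splitting helper (a random permutation cut into ten parts) is an assumption, since the text does not say how the folds were formed.

```python
import itertools
import numpy as np

# Candidate values, assuming consecutive integer exponents in the elided ranges.
mu_grid = [2.0 ** k for k in range(-5, 6)]
C_grid = [10.0 ** k for k in range(-5, 6)]
eps_grid = [1e-1, 1e-2, 1e-3]
p_grid = [0.1, 0.2, 0.4, 0.7, 0.9, 1.0]

# SPTSVR searches over (C, mu, p) with eps fixed; TSVR over (C, mu, eps).
sptsvr_grid = list(itertools.product(C_grid, mu_grid, p_grid))
tsvr_grid = list(itertools.product(C_grid, mu_grid, eps_grid))

def ten_fold_indices(m, seed=0):
    """Split the indices 0..m-1 into ten folds (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(m), 10)

folds = ten_fold_indices(200)
print(len(sptsvr_grid), len(tsvr_grid), [len(f) for f in folds])
```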

4.1 Synthetic datasets

In this subsection, it is proposed to examine the robustness of our SPTSVR in comparison with SVR, LS-SVR and TSVR. For this purpose, the observed values of the training examples, generated as function values, will be purposefully corrupted by uniform or Gaussian additive noise having heteroscedastic error structure. In our study, we considered three functions and their training samples are constructed as follows

Function 1. $f_{1} (x) = \frac{\sin (x)}{x}$ such that $y_{i} = f_{1} (x_{i}) + (0.5 - \frac{| x_{i} |}{8 π}) ζ_{i}$ and x_i ∈ U (-4π, 4π), i = 1, 2, …, 200;

Table 1
Accuracy comparison of SPTSVR with SVR, LS-SVR and TSVR on synthetic datasets with noise Type A and Type B. Test accuracy is reported in terms of RMSE. Gaussian kernel was used

Dataset	Types of noise	SVR RMSE (C, μ, ε)	LS-SVR RMSE (C, μ)	TSVR RMSE (C₁ = C₂, μ, ε)	SPTSVR RMSE (C₁ = C₂, μ, p)
Function1	Type A	0.0552	0.0638	0.0427	0.0362
(200×2,500×2)		(10⁰, 2⁻³, 10⁻²)	(10², 2⁻²)	(10⁻⁵, 2⁻¹, 10⁻¹)	(10⁴, 2⁻⁵, 1)
	Type B	0.0210	0.0297	0.0216	0.0200
		(10², 2⁻⁵, 10⁻²)	(10², 2⁻²)	(10⁻¹, 2⁻⁴, 10⁻³)	(10¹, 2⁻⁴, 1)
Function2	Type A	0.0147	0.0245	0.0465	0.0195
(200×2,500×2)		(10¹, 2¹, 10⁻²)	(10⁻¹, 2⁻³)	(10⁻², 2³, 10⁻³)	(10³, 2¹, 1)
	Type B	0.0208	0.0194	0.0106	0.0099
		(10³, 2⁰, 10⁻²)	(10⁰, 2⁻³)	(10⁻⁵, 2¹, 10⁻¹)	(10⁴, 1, 0.7)
Function3	Type A	0.0961	0.0864	0.0406	0.0722
(200×2,500×2)		(10⁰, 2³, 10⁻²)	(10¹, 2¹)	(10¹, 2³, 10⁻³)	(10¹, 2³, 1)
	Type B	0.0933	0.0756	0.0626	0.0332
		(10¹, 2³, 10⁻²)	(10⁰, 2¹)	(10⁻⁵, 2³, 10⁻³)	(10¹, 2³, 1)

Function 2. f₂ (x) = 0.2 sin(2πx) + 0.2x² + 0.3 such that $y_{i} = f_{2} (x_{i}) + (0.1 x_{i}^{2} + 0.05) ζ_{i}$ and x_i = 0.01 (i - 1), i = 1, 2, …, 200;

Function 3. f₃ (x) = x + 2 exp(-16x²) such that $y_{i} = f_{3} (x_{i}) + (x_{i}^{2} + 0.5) ζ_{i}$ and x_i = 0.01 (i - 1) - 1, i = 1, 2, …, 200,

where U (a, b) denotes the uniform probability distribution in (a, b) and the noise ζ is drawn from the distributions:

Type A: ζ ∈ U (-1, 1), Type B: ζ ∈ N (0, 0.5²).

Here, N (0, 0.5²) denotes the Gaussian distribution with mean zero and standard deviation equal to 0.5.
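The three noisy training sets can be generated along the following lines. This NumPy sketch follows the definitions above; the function names, the random seed and the use of `numpy.sinc` to evaluate sin(x)/x safely at x = 0 are assumptions made for illustration.

```python
import numpy as np

def true_function(name, x):
    if name == "Function1":
        return np.sinc(x / np.pi)          # sin(x)/x, with limit value 1 at x = 0
    if name == "Function2":
        return 0.2 * np.sin(2 * np.pi * x) + 0.2 * x ** 2 + 0.3
    if name == "Function3":
        return x + 2 * np.exp(-16 * x ** 2)
    raise ValueError(name)

def noise_scale(name, x):
    """Input-dependent factor multiplying the noise (heteroscedastic structure)."""
    if name == "Function1":
        return 0.5 - np.abs(x) / (8 * np.pi)
    if name == "Function2":
        return 0.1 * x ** 2 + 0.05
    if name == "Function3":
        return x ** 2 + 0.5
    raise ValueError(name)

def training_inputs(name, rng, m=200):
    if name == "Function1":
        return rng.uniform(-4 * np.pi, 4 * np.pi, size=m)
    i = np.arange(1, m + 1)
    x = 0.01 * (i - 1)
    return x if name == "Function2" else x - 1.0

def make_training_set(name, noise_type, rng, m=200):
    """noise_type 'A': zeta ~ U(-1, 1); noise_type 'B': zeta ~ N(0, 0.5^2)."""
    x = training_inputs(name, rng, m)
    if noise_type == "A":
        zeta = rng.uniform(-1.0, 1.0, size=m)
    else:
        zeta = rng.normal(0.0, 0.5, size=m)
    return x, true_function(name, x) + noise_scale(name, x) * zeta

rng = np.random.default_rng(0)
x1, y1 = make_training_set("Function1", "A", rng)
x2, y2 = make_training_set("Function2", "B", rng)
x3, y3 = make_training_set("Function3", "A", rng)
print(x1.shape, y2.shape, x3[0], x3[-1])
```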

For testing, 500 noise-free samples were randomly selected. To avoid bias due to the random selection of samples, each experiment was repeated ten times and the averaged accuracy is reported. The test accuracies of the proposed SPTSVR along with SVR, LS-SVR and TSVR for the Gaussian kernel are summarized in Table 1. From the table, one can observe that all the methods show comparable generalization performance. SPTSVR attains the lowest RMSE for both types of noise, except for Function2 and Function3 in the case of uniform noise, where SVR and TSVR respectively perform best. To further illustrate the advantage of SPTSVR, the prediction plots and the prediction error plots of one sample trial on Function1 by all the methods, for both types of noise considered, are shown in Figs. 2–5. From the error plots we can see that SPTSVR performs better than the other methods considered. In summary, we conclude that SPTSVR is a robust and efficient method for noisy datasets.

Fig. 2

Prediction plot of SVR, LS-SVR, TSVR and SPTSVR on Function1 with uniform noise.

Fig. 3

Prediction error plot of SVR, LS-SVR, TSVR and SPTSVR on Function1 with uniform noise.

Fig. 4

Prediction plot of SVR, LS-SVR, TSVR and SPTSVR on Function1 with Gaussian noise.

Fig. 5

Prediction error plot of SVR, LS-SVR, TSVR and SPTSVR on Function1 with Gaussian noise.

4.2 Real world datasets

To investigate the effectiveness of SPTSVR, numerical experiments were conducted on fifteen real world datasets: the Pollution, Bodyfat, Quake, Balloon and NO2 datasets from the StatLib library (http://lib.stat.cmu.edu/datasets); the Auto price, Triazines, Auto-original, Forest fires, ConcreteCS, Abalone and Wine quality datasets from the UCI repository [11]; the SantaFe-A time series dataset (http://www-psych.stanford.edu/~andreas/Time-Series/SantaFe.html); the well known hydraulic actuator dataset used in nonlinear system identification [1, 7]; and the Demo dataset (http://www.cs.toronto.edu/~delve/data). The results are compared with those of SVR, LS-SVR and TSVR.

Table 2
Accuracy comparison of SPTSVR with SVR, LS-SVR and TSVR on real world datasets. Test accuracy is reported in terms of RMSE. Gaussian kernel was used

Dataset	SVR RMSE (C, μ, ε)	LS-SVR RMSE (C, μ)	TSVR RMSE (C₁ = C₂, μ, ε)	SPTSVR RMSE (C₁ = C₂, μ, p)
Pollution	0.0967	0.0958	0.0927	0.0911
(60×15)	(10⁰, 2⁻³, 10⁻²)	(10¹, 2⁻²)	(10⁻³, 2⁻⁵, 10⁻¹)	(10¹, 2⁻², 1)
Auto price	0.0660	0.0736	0.0740	0.0744
(159×15)	(10¹, 2⁻⁴, 10⁻²)	(10¹, 2⁻⁴)	(10³, 2⁻⁵, 10⁻¹)	(10², 2⁻⁵, 0.4)
Triazines	0.1514	0.1354	0.1215	0.1206
(186×19)	(10⁰, 2⁻⁴, 10⁻¹)	(10¹, 2⁻³)	(10⁻¹, 2⁻⁴, 10⁻¹)	(10³, 2⁻⁴, 1)
Bodyfat	0.0260	0.0259	0.0258	0.0253
(252×14)	(10², 2⁻⁵, 10⁻³)	(10², 2⁻⁵)	(10⁻¹, 2⁻⁴, 10⁻¹)	(10⁴, 2⁻⁵, 0.4)
Auto-original	0.0607	0.0604	0.0619	0.0590
(392×7)	(10⁰, 2⁰, 10⁻²)	(10², 2⁻²)	(10⁻¹, 2⁻¹, 10⁻¹)	(10¹, 2¹, 0.2)
NO2	0.0848	0.0947	0.0864	0.0810
(500×7)	(10⁰, 2⁰, 10⁻²)	(10¹, 2⁰)	(10⁻², 2⁻¹, 10⁻³)	(10¹, 2¹, 1)
Forest fires	0.0592	0.0515	0.0576	0.0593
(517×12)	(10⁻⁵, 2⁻⁵, 10⁻³)	(10⁻², 2⁻²)	(10⁻¹, 2⁻³, 10⁻¹)	(10⁻⁴, 2⁵, 0.1)
SantafeA	0.0262	0.0239	0.0237	0.0241
(995×5)	(10², 2¹, 10⁻³)	(10⁻¹, 2²)	(10⁻², 2², 10⁻³)	(10⁴, 2², 0.4)
Hydraulic actuator	0.0120	0.0124	0.0112	0.0108
(1021×5)	(10⁵, 2⁻³, 10⁻³)	(10², 2²)	(10⁻², 2², 10⁻³)	(10⁴, 2², 0.9)
ConcreteCS	0.0707	0.0784	0.0780	0.0734
(1030×8)	(10¹, 2⁻¹, 10⁻²)	(10², 2⁻¹)	(10⁻¹, 2⁻¹, 10⁻¹)	(10², 2⁰, 0.4)
Wine quality red	0.1192	0.1252	0.1246	0.1197
(1599×11)	(10⁰, 2⁰, 10⁻²)	(10⁰, 2¹)	(10⁻¹, 2⁻³, 10⁻³)	(10⁰, 2¹, 0.1)
Balloon	0.0447	0.0436	0.0453	0.0436
(2001×1)	(10⁰, 2⁰, 10⁻¹)	(10¹, 2⁰)	(10⁰, 2⁻², 10⁻¹)	(10⁻², 2¹, 1)
Demo	0.0837	0.0854	0.0841	0.0810
(2048×4)	(10⁰, 2⁴, 10⁻²)	(10⁰, 2⁴)	(10⁻², 2¹, 10⁻¹)	(10⁻¹, 2⁵, 0.4)
Quake	0.1730	0.1716	0.1714	0.1715
(2178×3)	(10⁴, 2⁻¹, 10⁻¹)	(10⁻¹, 2⁰)	(10⁻², 2⁻⁵, 10⁻³)	(10⁻¹, 2⁻², 0.4)
Abalone	0.0743	0.0753	0.0731	0.0726
(4177×8)	(10⁰, 2¹, 10⁻²)	(10², 2⁰)	(10⁻¹, 2⁰, 10⁻¹)	(10¹, 2², 0.4)

The hydraulic actuator dataset contains 1024 pairs of samples, with u (t) being the input gas flow rate and its output y (t) being the oil pressure. In our experimental study, we predict y (t) from the input vector x (t) = (y (t - 1), y (t - 2), y (t - 3), u (t - 1), u (t - 2)). With this choice, a set of 1021 samples {(x (t), y (t))} is obtained. Note that all experiments were performed after normalizing the data so that each attribute value lies in the interval [0, 1].
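The construction of the lagged inputs and the normalization to [0, 1] can be sketched as follows. The placeholder series stand in for the actual actuator recordings, which are not loaded here; the function names are assumptions, and so is the choice to apply the min-max scaling column by column after forming the lagged inputs, since the text does not state the order.

```python
import numpy as np

def lag_embed(u, y):
    """Build x(t) = (y(t-1), y(t-2), y(t-3), u(t-1), u(t-2)) with target y(t)."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    t = np.arange(3, len(y))            # first index at which y(t-3) exists
    X = np.column_stack([y[t - 1], y[t - 2], y[t - 3], u[t - 1], u[t - 2]])
    return X, y[t]

def min_max_normalize(M):
    """Scale each column to [0, 1]; constant columns are mapped to 0."""
    lo = M.min(axis=0)
    hi = M.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (M - lo) / span

# Placeholder series of length 1024 (not the real actuator data).
u_series = 0.5 * np.arange(1024.0)
y_series = np.arange(1024.0)
X, target = lag_embed(u_series, y_series)
X_scaled = min_max_normalize(X)
print(X.shape, target.shape)
```

With 1024 recorded pairs, the three-step lag on y leaves 1021 usable samples, in agreement with the count stated above.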

As in the case of the synthetic datasets, the experimental results of SVR, LS-SVR, TSVR and our proposed method SPTSVR, along with their optimal parameter values for the Gaussian kernel, are shown in Table 2. Also, using Table 2, the ranking of the learning algorithms on RMSE for each of the fifteen datasets is summarized in Table 3. From Table 3, we see that (i) the average rank of SPTSVR is the minimum; (ii) among the fifteen datasets, SPTSVR obtains the best accuracy on nine of them (including one tie with LS-SVR). This suggests that our proposed SPTSVR solved by the functional iterative method is more accurate than the popular SVR, LS-SVR and TSVR methods. Also, the average rank of TSVR is smaller than those of SVR and LS-SVR, because of which TSVR may be preferred over SVR and LS-SVR. However, for the statistical comparative analysis of the four algorithms, as suggested in [5], the nonparametric Friedman test with the corresponding post-hoc tests may be applied. Assuming the null hypothesis that all four algorithms are equivalent, let us compute the following:

Table 3

Average rank for SVR, LS-SVR, TSVR and SPTSVR with Gaussian kernel for real world datasets.

Dataset	SVR	LS-SVR	TSVR	SPTSVR
Pollution	4	3	2	1
Auto price	1	2	3	4
Triazines	4	3	2	1
Bodyfat	4	3	2	1
Auto-original	3	2	4	1
NO2	2	4	3	1
Forest fires	3	1	2	4
SantafeA	4	2	1	3
Hydraulic actuator	3	4	2	1
ConcreteCS	1	4	3	2
Wine quality red	1	4	3	2
Balloon	3	1.5	4	1.5
Demo	2	4	3	1
Quake	4	3	1	2
Abalone	3	4	2	1
Average rank	2.8	2.9667	2.4667	1.7667

$χ_{F}^{2} = \frac{12 \times 15}{4 \times 5} [2.8^{2} + 2.9667^{2} + 2.4667^{2} + 1.7667^{2} - \frac{4 \times 5^{2}}{4}] \cong 7.6239$ and $F_{F} = \frac{14 \times 7.6239}{15 \times 3 - 7.6239} \cong 2.8557$, where F_F is distributed according to the F-distribution with (3, 3 × 14) = (3, 42) degrees of freedom. From the statistical table, the critical value of F (3, 42) at the level of significance α = 0.05 is 2.8270. Since the computed F_F value on RMSE is greater than 2.8270, we reject the null hypothesis. So, we proceed with the Nemenyi post-hoc test [5] and perform pair-wise comparisons of the algorithms.
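The Friedman statistics above can be recomputed from the ranks in Table 3 with a few lines of NumPy. In this sketch the rank matrix is transcribed from Table 3 and the variable names are assumptions; because it uses unrounded average ranks, the results agree with the reported values only up to small rounding differences.

```python
import numpy as np

# Ranks from Table 3; columns: SVR, LS-SVR, TSVR, SPTSVR.
ranks = np.array([
    [4, 3, 2, 1],      # Pollution
    [1, 2, 3, 4],      # Auto price
    [4, 3, 2, 1],      # Triazines
    [4, 3, 2, 1],      # Bodyfat
    [3, 2, 4, 1],      # Auto-original
    [2, 4, 3, 1],      # NO2
    [3, 1, 2, 4],      # Forest fires
    [4, 2, 1, 3],      # SantafeA
    [3, 4, 2, 1],      # Hydraulic actuator
    [1, 4, 3, 2],      # ConcreteCS
    [1, 4, 3, 2],      # Wine quality red
    [3, 1.5, 4, 1.5],  # Balloon
    [2, 4, 3, 1],      # Demo
    [4, 3, 1, 2],      # Quake
    [3, 4, 2, 1],      # Abalone
])

N, k = ranks.shape                      # 15 datasets, 4 algorithms
avg_ranks = ranks.mean(axis=0)

# Friedman statistic and its F-distributed variant, as in [5].
chi2_F = 12.0 * N / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4.0)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)

print("average ranks:", np.round(avg_ranks, 4))
print("chi2_F:", round(float(chi2_F), 4), "F_F:", round(float(F_F), 4))
print("reject null at alpha = 0.05:", bool(F_F > 2.8270))
```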

From [5], the critical value for four algorithms at the significance level α = 0.10 is 2.291 and its corresponding critical difference (CD) is: $2.291 \sqrt{\frac{4 \times 5}{6 \times 15}} \approx 1.08 .$

Since the difference between the best and the worst average ranks of SVR, TSVR and SPTSVR is 2.8 - 1.7667 = 1.0333 < 1.08, the post-hoc test does not detect any significant difference among them.

However, since the difference between the average ranks of LS-SVR and SPTSVR is 2.9667 - 1.7667 = 1.2 > 1.08, we conclude that SPTSVR performs significantly better than LS-SVR.

Similarly, since the difference between the best and the worst average ranks of SVR, LS-SVR and TSVR is 2.9667 - 2.4667 = 0.5 < 1.08, the post-hoc test does not detect any significant difference among them.
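The pair-wise comparisons above can be checked with the short sketch below, which computes the critical difference and lists the pairs whose average ranks differ by more than it. The average ranks are those of Table 3 and the critical value 2.291 is the one quoted from [5]; the variable names are assumptions.

```python
import itertools
import math

avg_rank = {"SVR": 2.8, "LS-SVR": 2.9667, "TSVR": 2.4667, "SPTSVR": 1.7667}
k, N = 4, 15            # number of algorithms and of datasets
q_alpha = 2.291         # critical value quoted from [5]

# Critical difference of the Nemenyi test.
cd = q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))

significant = []
for a, b in itertools.combinations(avg_rank, 2):
    diff = abs(avg_rank[a] - avg_rank[b])
    if diff > cd:
        significant.append((a, b))
    print(f"{a} vs {b}: difference = {diff:.4f}")

print("CD =", round(cd, 4))
print("pairs exceeding CD:", significant)
```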

In conclusion, SPTSVR shows better generalization performance than the popular SVR, LS-SVR and TSVR methods both when the input samples are subject to noise, as in the case of the synthetic datasets, and on real world data with no added noise, which shows the usefulness and applicability of SPTSVR.

5 Conclusion

In this work, a novel formulation of TSVR in primal with the asymmetric squared pinball loss function (SPTSVR), whose solution is obtained using a functional iterative algorithm, was proposed. Numerical experiments were conducted on synthetic and real world benchmark datasets. Empirical results obtained for datasets corrupted by added noise having a heteroscedastic error structure show that SPTSVR is both robust and efficient. In addition, the better performance of SPTSVR on the majority of the fifteen benchmark datasets considered further shows the effectiveness and usefulness of SPTSVR. As our method can be extended to classification, our future work will study its effectiveness for classification, more specifically when the input samples are corrupted by noise.

Acknowledgments

The authors are extremely thankful to the referees for their valuable comments.

References

1. Gretton, Doucet, Herbrich, Rayner P.J.W. and Scholkopf, Support vector regression for black-box system identification, in: Proceedings of the 11th IEEE Workshop on Statistical Signal Processing, 2001.

2. Osuna, Freund and Girosi, Training support vector machines: An application to face detection, in: Proceedings of Computer Vision and Pattern Recognition (1997), 130–136.

3. Guyon, Weston, Barnhill and Vapnik, Gene selection for cancer classification using support vector machine, Machine Learning 46 (2002), 389–422.

4. Jayadeva, Khemchandani and Chandra, Twin support vector machines for pattern classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (5) (2007), 905–910.

5. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006), 1–30.

6. Min J.E. and Lee Y.C., Bankruptcy prediction using optimal choice of kernel function parameters, Expert Systems with Applications 28 (4) (2005), 603–614.

7. Sjoberg, Zhang, Ljung, Benveniste, Delyon, Glorennec, Hjalmarsson and Juditsky, Nonlinear black-box modeling in system identification: A unified overview, Automatica 31 (1995), 1691–1724.

8. Kumar M.A. and Gopal, Least squares twin support vector machines for pattern classification, Expert Systems with Applications 36 (2009), 7535–7543.

9. Cristianini and Shawe-Taylor, An introduction to support vector machines and other kernel based learning methods, Cambridge University Press, Cambridge, 2000.

10. Mangasarian O.L. and Wild E.W., Multisurface proximal support vector classification via generalized eigenvalues, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (1) (2006), 69–74.

11. Murphy P.M. and Aha D.W., UCI repository of machine learning databases, University of California, Irvine, 1992, http://www.ics.uci.edu/~mlearn.

12. Leng, Liu and Qin, Binary tree construction of multiclass pinball SVM via farthest centroid selection, in: Advances in Intelligent Systems and Interactive Applications, Xhafa (ed.), Springer, 2017, pp. 323–329.

13. Huang, Shi and Suykens J.A.K., Support vector machine classifier with pinball loss, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (5) (2014), 984–997.

14. Huang, Shi and Suykens J.A.K., Asymmetric least squares support vector machine classifiers, Computational Statistics and Data Analysis 77 (2014), 371–382.

15. Peng, TSVR: An efficient twin support vector machine for regression, Neural Networks 23 (3) (2010), 365–372.

16. Shen, Niu, Qi and Tian, Support vector machine classifier with truncated pinball loss, Pattern Recognition 68 (2017), 199–210.

17. Shao, Deng and Yang, Least squares recursive projection twin support vector machine for classification, Pattern Recognition 45 (6) (2012), 2299–2307.

18. Xu, Wang, Pang and Tian, Maximum margin of twin spheres machine with pinball loss for imbalanced data classification, Applied Intelligence 48 (2018), 23–34.

19. Xu, Yang and Pan, A novel twin support-vector machine with pinball loss, IEEE Transactions on Neural Networks and Learning Systems 28 (5) (2017), 359–370.

20. Vapnik V.N., The nature of statistical learning theory, 2nd ed., Springer, New York, 2000.