Least squares support vector machine with parametric margin for binary classification 1

Abstract

In this paper, we propose a least squares support vector machine with parametric margin (Par-LSSVM) for binary classification, which only needs to solve a system of linear equation. Par-LSSVM is able to handle the datasets with heteroscedastic noise. And the closer hyperplane to the test data point gives the class label, and this makes Par-LSSVM capable of dealing with “Cross Planes” datasets. The experimental results on several artificial, benchmark and USPS datasets indicate that our proposed algorithm outperforms Par-ν-SVM for binary classification problem.

Keywords

Support vector machine classification least squares parametric margin

1 Introduction

Support vector machine(SVM) is a novel kernel-based tool for binary classification and regression problems [1 –3]. SVM is based on the statistical learning theory and was introduced by Vapnik et al[3]. The standard SVM aims at minimizing an upper bound on the generalization error through maximizing the margin between two parallel hyperplanes for binary classification problem, resulting in a constrained optimization criterion using quadratic programming (QP) problem. This QP problem leads to higher computational cost. Suykens and Vandewalle established least squares support vector machine(LS-SVM)[4] by replacing inequality constraints with equality constraints in formulation of the standard SVM, which tries to avoid the above shortcoming and obtain an analytical solution directly from solving a set of linear equations.

The standard SVM generates a margin with tube shape by parallel transporting the decision hyperplane, however, a margin with arbitrary shape is more accord with the distribution of data points in most cases. Based upon the idea, Pei-Yi Hao[5] introduced a parametric margin ν-support vector machine (Par-ν-SVM), in which a parametric margin model of arbitrary shape by incorporating a change from the functional margin ρ in ν-support vector machine(ν-SVM)[6]. This model is more effective in many cases, especially when the noise is heteroscedastic, that is, the noise strongly depends on the input value. The main point of Par-ν-SVM is the utilization of nonparallel hyperplanes, which is similar to the the generalized eigenvalues in proximal support vector machine(7)[7], the twin support vector machine(8)[8], and the nonparallel hyperplane support vector machine(9)[9]. These algorithms need to solve one (9) or two (8) quadratic programming problems to obtain the two nonparallel hyperplanes. Experimental results presented in [7 , 9] underscore their efficacy and their special abilities in dealing with the “Cross Planes” datasets.

Inspired by Pei-Yi Hao’s Par-ν-SVM and LS-SVM, in this paper, we introduce a parametric margin model into LS-SVM and keep the advantages of both Par-ν-SVM and LS-SVM, and finally develop a new algorithm, called least squares support vector machine with parametric margin, Par-LSSVM for short. Our Par-LSSVM uses the square of slack variable ξ_i with weight $\frac{C}{2 N}$ instead of the slack variable ξ_i with weight $\frac{C}{N}$ , i = 1, …, N . More precisely, for the binary classification problem, our algorithm aims at a pair of nonparallel hyperplanes such that each one is close to one of the two classes. These two hyperplanes are obtained by rotating and transporting the f (x) with h (x) and expressed as $f (x) \pm h (x) = 0,$ (1) where f (x) = 〈w · Φ (x) 〉 + b and h (x) = 〈z · Φ (x) 〉 + d. It should be pointed out that our Par-LSSVM need only to solve a system of linear equations instead of a quadratic programming problem. And our Par-LSSVM is also able to deal with heteroscedastic noise. It is worth mentioning that a new data point x ∈ Rⁿ is assigned to class, depending on which of the two hyperplanes given by (1) it lies closer to. This decision function is more profitable for dealing with “Cross Planes” dataset.

The rest of this paper is organized as follows. A brief introduction of the Par-ν-SVM is given in Section 2. Section 3 develops our Par-LSSVM for binary classification problem. Section 4 deals with the experimental results of toy examples, several UCI benchmark datasets and USPS datasets. Section 5 concludes this paper.

2 ν-Support vector machine with parametric margin(Par-ν-SVM)

The ν-support vector machine with parametric margin model(Par-ν-SVM) [5] is the extensions of ν-SVM for binary classification problem. This section gives a overview of the Par-ν-SVM for classification problem. Interested readers may consult Pei-Yi Hao [5] for details.

Consider a binary classification problem with training set $T = {(x_{1}, y_{1}), \cdot \cdot \cdot, (x_{N}, y_{N})},$ (2) where x_i ∈ Rⁿ is the input and y_i ∈ {-1, + 1} is the output, i = 1, ·· · , N. The Par-ν-SVM was established by using a parametric margin model of arbitrary shape h (x) = 〈z · Φ (x) 〉 + d instead of the fixed margin ρ in ν-SVM [6], where Φ (x) is a mapping from Rⁿ to Hilbert space. Its prime problem is the following constrained optimization problem $\begin{matrix} min_{w, b, z, d, ξ_{i}} & \frac{1}{2} {∥ w ∥}^{2} + C (- ν (\frac{1}{2} ({∥ z ∥}^{2} + d)) + \frac{1}{N} \sum_{i = 1}^{N} ξ_{i}), \\ s . t . & y_{i} (〈 w \cdot Φ (x_{i}) 〉 + b) ⩾ (〈 z \cdot Φ (x_{i}) 〉 + d) - ξ_{i}, \\ d ⩾ 0, ξ_{i} ⩾ 0, for i = 1, \cdot \cdot \cdot, N, \end{matrix}$ (3) where ξ_i ∈ R is the slack variable, i = 1, ⋯ , N and C is a positive penalty parameter. The solution of the prime problem (3) is obtained by finding the saddle point of its Lagrangian function, which leads to the following dual problem $\begin{matrix} max_{α} & \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} y_{i} y_{j} α_{i} α_{j} K (x_{i}, x_{j}) \\ + \frac{1}{2 C ν} \sum_{i = 1}^{N} \sum_{j = 1}^{N} α_{i} α_{j} K (x_{i}, x_{j}), \\ s . t . & \sum_{i = 1}^{N} y_{i} α_{i} = 0, \end{matrix}$ (4) $\begin{matrix} \sum_{i = 1}^{N} α_{i} = C ν, \\ 0 ⩽ α_{i} ⩽ \frac{C}{N}, for i = 1, \dots, N, \end{matrix}$ where α = (α₁, . . . , α_N) ^T is Lagrange multiple vector.

By exploiting the KKT conditions, the solution of the prime problem (3) and the solution of the dual problem (4) have the following relation $w = \sum_{i = 1}^{N} y_{i} α_{i} Φ (x_{i}),$ (5) $z = \frac{1}{C ν} \sum_{i = 1}^{N} α_{i} Φ (x_{i}),$ (6) and the bias terms b and d are

$\begin{matrix} b & = - \frac{1}{2} [〈 w \cdot Φ (x_{i}) 〉 + 〈 w \cdot Φ (x_{j}) 〉 - 〈 z \cdot Φ (x_{i}) 〉 \\ + 〈 z \cdot Φ (x_{j}) 〉], \end{matrix}$ (7)

$\begin{matrix} d & = \frac{1}{2} [〈 w \cdot Φ (x_{i}) 〉 - 〈 w \cdot Φ (x_{j}) 〉 - 〈 z \cdot Φ (x_{i}) 〉 \\ - 〈 z \cdot Φ (x_{j}) 〉], \end{matrix}$ (8) for some i, j such that $α_{i} \in (0, \frac{C}{N})$ , y_i = +1 and $α_{j} \in (0, \frac{C}{N})$ , y_j = -1. Therefore, the decision function f (x) and the parametric margin function h (x) of the resulting Par-ν-SVM can be represented as the following form $f (x) = sgn (\sum_{i = 1}^{N} y_{i} α_{i} K (x_{i}, x) + b),$ (9) $h (x) = \frac{1}{C ν} \sum_{i = 1}^{N} α_{i} K (x_{i}, x) + d,$ (10) where K (· , ·) is a kernel function.

3 Least squares support vector machine with parametric margin (Par-LSSVM)

In this section, we propose the least squares version of SVM with parametric margin for binary classification problem in details. Our Par-LSSVM is a fast and simple algorithm and only needs to solve a system of linear equations for generating both linear and nonlinear classifiers. As earlier mentioned, the hyperplanes f (x) + h (x) =0 and f (x) - h (x) =0 are margins in Par-ν-SVM. While in our Par-LSSVM, which is inspired by clustering, these two hyperplanes are positive and negative hyperplanes which are closer to the positive and negative class, respectively. This lead to the following primal problem:

$\begin{matrix} min_{w, z, b, d, ξ_{i}} & \frac{1}{2} {∥ w ∥}^{2} - C_{1} (\frac{1}{2} {∥ z ∥}^{2} + d) + \frac{C_{2}}{2} \sum_{i = 1}^{N} {ξ_{i}}^{2}, \\ s . t . & y_{i} (〈 w \cdot Φ (x_{i}) 〉 + b) + (〈 z \cdot Φ (x_{i}) 〉 + d) = ξ_{i}, \\ i = 1, \dots, N, \end{matrix}$ (11) where ξ = (ξ₁, ⋯ , ξ_N) ^T is slack variable. Compared with Par-ν-SVM, we make three improvements. Firstly, the optimization problem (11) uses the square of slack variable ξ_i instead of the linear of slack variable ξ_i as usual in the problem (3). Secondly, maximizing the size between f (x) + h (x) =0 and f (x) - h (x) =0, which is characterized by $\frac{1}{2} {∥ z ∥}^{2} + d$ [5], means the positive and negative hyperplanes are far away from the other class to some extent. Thirdly, we use C₁ and C₂ to trade off the model complexity, the size between the two hyperplanes and the clustering effect. An intuitive geometric interpretation is shown in Figure 1, where the two lines f (x) ± h (x) =0 can be thought of as “proximal” lines around which the points of each class are clustered and which may be nonparallel. Based on the above idea, the motivation of our proposed Par-LSSVM is similar to that of 7 and 8. So they have similar motivations and different formulations and algorithms.

In order to get the solution of the prime problem (11), we introduce the Lagrangian function as follows $\begin{matrix} L (w, b, z, d, ξ_{i}, α) = \frac{1}{2} {∥ w ∥}^{2} - C_{1} (\frac{1}{2} {∥ z ∥}^{2} + d) \\ + \frac{C_{2}}{2} \sum_{i = 1}^{N} ξ_{i}^{2} \\ - \sum_{i = 1}^{N} α_{i} [y_{i} (〈 w \cdot Φ (x_{i}) 〉 + b) + (〈 z \cdot Φ (x_{i}) 〉 + d) - ξ_{i}], \end{matrix}$ (12) where α_i is the Lagrangian multiplier, i = 1, …, N. According to the KKT optimality conditions, we have $\begin{matrix} \frac{\partial L}{\partial w} = w - \sum_{i = 1}^{N} α_{i} y_{i} Φ (x_{i}) = 0 \\ \Rightarrow w = \sum_{i = 1}^{N} y_{i} α_{i} Φ (x_{i}), \end{matrix}$ (13a) $\frac{\partial L}{\partial b} = - \sum_{i = 1}^{N} α_{i} y_{i} = 0 \Rightarrow \sum_{i = 1}^{N} y_{i} α_{i} = 0,$ (13b) $\begin{matrix} \frac{\partial L}{\partial z} = - C_{1} z - \sum_{i = 1}^{N} α_{i} Φ (x_{i}) = 0 \\ \Rightarrow z = - \frac{1}{C_{1}} \sum_{i = 1}^{N} α_{i} Φ (x_{i}), \end{matrix}$ (13c) $\frac{\partial L}{\partial d} = - C_{1} - \sum_{i = 1}^{N} α_{i} = 0 \Rightarrow \sum_{i = 1}^{N} α_{i} = - C_{1},$ (13d) $\frac{\partial L}{\partial ξ_{i}} = C_{2} ξ_{i} + α_{i} = 0 \Rightarrow ξ_{i} = - \frac{1}{C_{2}} α_{i},$ (13e) $\begin{matrix} \frac{\partial L}{\partial α_{i}} = y_{i} (〈 w \cdot Φ (x_{i}) 〉 + b) + (〈 z \cdot Φ (x_{i}) 〉 + d) \\ - ξ_{i} = 0 . \end{matrix}$ (13f) Substituting (13a), (13c) and (13e) into (13f), we have

$\begin{matrix} \sum_{i = 1}^{N} y_{i} y_{j} α_{i} 〈 Φ (x_{i}) \cdot Φ (x_{j}) 〉 + y_{j} b + \frac{1}{C_{2}} α_{i} \\ - \frac{1}{C_{1}} \sum_{i = 1}^{N} α_{i} 〈 Φ (x_{i}) \cdot Φ (x_{j}) 〉 + d = 0, \end{matrix}$ (14) arranging (13b), (13d) and (14) in matrix form, we have the linear system of equations as below $(\begin{matrix} 0 & 0 & Y^{T} \\ 0 & 0 & e^{T} \\ Y & e & {ZZ}^{T} - \frac{1}{C_{1}} {XX}^{T} + \frac{1}{C_{2}} I \end{matrix}) (\begin{matrix} b \\ d \\ α \end{matrix}) = (\begin{matrix} 0 \\ C_{1} \\ 0 \end{matrix}),$ (15) where X = (Φ (x₁) , ⋯ , Φ (x_N)) ^T, Y = (y₁, ⋯ , y_N) ^T,Z = (y₁ Φ (x₁) , ⋯ , y_N Φ (x_N)) ^T, e = (1, ⋯ , 1) ^T, 0 = (0, ⋯ , 0) ^T, α = (α₁, ⋯ , α_N) ^T,ξ = (ξ₁, ⋯ , ξ_N) ^T and I is an identity matrix.

After choosing the proper kernel function, and the parameters C₁, C₂, the Lagrangian multiplier vector α and biases b, d can be get by solving the system of linear equations (15). Here, the Matlab command x = A \ b is used to solve the system of linear equations (15). And the functions f (x) = 〈w · Φ (x) 〉 + b and h (x) = 〈z · Φ (x) 〉 + d can be rewrote into a kernel function form $f (x) = \sum_{i = 1}^{N} y_{i} α_{i} K (x_{i}, x) + b,$ (16) $h (x) = - \frac{1}{C_{1}} \sum_{i = 1}^{N} α_{i} K (x_{i}, x) + d,$ (17) where K (· , ·) is a kernel function.

Note that the class of a test point x ∈ Rⁿ is assigned by the following $label (x) = arg min_{k = \pm 1} | f (x) + kh (x) | .$ (18) where | · | is absolute value, that is, depending on which of the two hyperplanes f (x) ± h (x) =0 it lies closer to. This decision function is more profitable for dealing with “Cross Planes” dataset.

This leads to the following algorithm:

Algorithm 1. Par-LSSVM

Given the training set (2);

Select the appropriate kernel function and the parameters C₁, C₂;

Solve the system of linear equations (15) and get its solution b^*, d^* and α^*;

Construct the decision function $label (x) = arg min_{k = \pm 1} | f (x) + kh (x) | .$ (19) where $f (x) = \sum_{i = 1}^{N} y_{i} {α^{*}}_{i} K (x_{i}, x) + b^{*}$ and $h (x) = \frac{1}{C_{1}} \sum_{i = 1}^{N} {α^{*}}_{i} K (x_{i}, x) + d^{*}$ .

4 Numerical Experiments

In this section, we give the experimental results about our Par-LSSVM on toy examples, some UCI benchmark datasets[10] and USPS dataset, which are commonly used in testing machine learning algorithms. We compare our implementation with state of the standard SVM, ν-SVM and Par-ν-SVM. All numerical experiments are carried out on a personal notebook with Inter(R) Core^TMi5 2450M CPU@2.50GHz, 4.00GB memory, and Windows 7 operation system such that the same platform is provided for simulations. The standard SVM and ν-SVM are implemented by LibSVM [14]. And the dual QP problems arising in Par-ν-SVM is solved by optimization toolbox for Matlab 2010. For the value of parameters in these algorithms, we select C and ν from the sets of values {2ⁱ|i = -15, - 14, ⋯ , 15} and {0.1, 0.2, ⋯ , 0.9} by employing the standard 10-fold cross validation techique[11] respectively. In nonlinear case, we employ RBF kernel K (x, x′) = exp(- γ||x - x′||²) and the kernel parameter γ is selected from the set {2ⁱ|i = -15, - 14, ⋯ , 15}. Note that all data sets are normalized such that the features locate in [-1, 1] before learning.

4.1 Toy examples

In this subsection, we first construct two examples of two-dimensional datasets to compare our Par-LSSVM with the other SVMs.

Example 1 is a simple two-dimensional “Cross Planes” dataset, which was also tested in [7 , 12]. The “Cross Planes” dataset are generated by perturbing points lying on two intersecting lines. Figure 2(a) shows the result of SVM, ν-SVM and Par-ν-SVM in linear case, these algorithms are all single plane decision method; Figure 2(b) shows the result of SVM, ν-SVM and Par-ν-SVM in nonlinear case with RBF kernel; Figure 2(c) shows the result of our Par-LSSVM algorithm in linear case. It is ease to see that our Par-LSSVM can get a much better accuracy than the other algorithms even in linear case. And in order to get a comparable accuracy, we only need to utilize the linear kernel, which reduce the compute complexity much more.

Example 2, also utilized in [13], is a artificial dataset with heteroscedastic noise, where the positive class of training points consists of uniform points satisfying x₁ ∈ U [- π/2, 2π], x₂ ∈ U [0.6sinx₁ - 0.25, 0.6sinx₁ + 0.25], while the negative class of training points consists of uniform points satisfying x₁ ∈ U [- π/2, 2π], x₂ ∈ U [0.6sin(x₁/1.05 + 0.4) -1.35, 0.6sin(x₁/1.05 - 0.85)], where the input data X = (x₁, x₂) ^T (see Fig. 3). The width of the margin between the two classes of the dataset depends on the input location. Fig. 3(a–d) describe the results of SVM, ν-SVM, Par-ν-SVM and our Par-LSSVM with the RBF kernels on this data set. It can be seen that the separating hyperplanes of these classifications have similar results. However, our proposed Par-LSSVM gives a pair of more suitable hyperplanes which best describe the tendency of these two classes.

4.2 Benchmark datasets

In order to further investigate our proposed Par-LSSVM with SVM, ν-SVM and Par-ν-SVM, we choose eleven datasets from the UCI machine learning repository [10].

Table 1 is concerned with Par-LSSVM, SVM, and par-v-SVM and Par-ν-SVM with linear kernel. We report the accuary and the training time of these algorithms. We computed the means and standard errors of the accuracies through ten 10-fold cross-validation that generates then disjoint validation sets. We can see that Par-LSSVM is more than ten times faster than Par-v-SVM because it solves a linear equation in Par-LSSVM with same size. The classification accuracy of linear Par-LSSVM is the best on four UCI datasets:Australian, Haberman, Heart and Sonar. Although the performance of Par-LSSVM is not the best on the rest of UCI datasets, it is comparable with of SVM, ν-SVM and Par-ν-SVM.

Table 2 shows the comparison of classification accuracies and the training times of SVM, ν-SVM, Par-ν-SVM and Par-LSSVM with the nonlinear RBF kernel. We also give the means and standard errors of the accuracies through ten 10-fold cross-validation. Table 2 reveals that the accuracies of Par-LSSVM are better than that of Par-ν-SVM. And it is to be noted that computing time of Par-LSSVM is more than ten times faster than Par-ν-SVM. The classification accuracies of Par-LSSVM are the best on six of eleven UCI datasets. And the performance of Par-LSSVM are similar to other algorithms on the rest five UCI datasets.

By the way, in fact, our proposed Par-LSSVM requires solving a linear system of equations. The training time of Par-LSSVM should be faster than SVM, ν-SVM and Par-ν-SVM. However, in Table 1 and Table 2, the computing times of our Par-LSSVM are less compared to those of SVM and ν-SVM with both linear and nonlinear cases. This is because that SVM and ν-SVM are implemented by LibSVM. It is well known that LibSVM implements an SMO-type algorithm and its source code was wrote by C++.

In addition, our proposed Par-LSSVM is an approach based on nonparallel hyperplanes idea, so we give another set of numerical experiment results including nine UCI benchmark datasets, which compare with two classical nonparallel hyperplanes approaches: 7 and 8. Table 3 give the detail results. From the Table 3, we can see that the performance of Par-LSSVM is better than those of 7 and 8 in six datasets. And the classification accuracies of Par-LSSVM have greater difference than the rest two algorithms in the following datasets: BUPA liver, Australian and Pima-Indian.

4.3 USPS dataset

The USPS dataset is the well-known United States Postal Service(USPS) handwritten digits recognition corpus. Figure 4 shows the appearance of USPS dataset. We compute ten pairwise digits of varying difficulty for digit recognition. Table 4 shows the results of experiments for SVM, ν-SVM, Par-ν-SVM and Par-LSSVM with RBF kernel function. Similarly, we also give the means and standard errors of the accuracies through ten 10-fold cross-validation. Tables 3 discloses that the accuracies of Par-LSSVM are similar or better comparing to SVM, ν-SVM and Par-ν-SVM.

5 Conclusion

In this paper, we have proposed the least squares support vector machine with parametric margin model for the binary classification problem(Par-LSSVM). Compared with the previous Par-ν-SVM, its computational cost reduced since it needs only solving system of linear equation instead of solving quadratic programming problem. Par-LSSVM uses a pair of nonparallel hyperplanes to close each class respectively. The closer hyperplane to the test data point gives the class label, and this makes our algorithm capable for dealing with “Cross Planes” data sets. The experimental results on toy, UCI benchmark datasets and USPS dataset indicated that Par-LSSVM had a comparable generalization performance with SVM, ν-SVM and Par-ν-SVM, and faster learning speed than Par-ν-SVM.

References

Cortes

and Vapnik

, Support-vector networks, Machine Learning20(3) (1995), 273–297.

Deng

, Tian

and Zhang

, Support vector machines: Optimization based theory, algorithms, and extensions(Vol, 29). CRC Press, 2012.

Vapnik

, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.

Suykens

J.A.

and Vandewalle

, Least squares support vector machine classifiers, Neural Processing Letters9(3) (1999), 293–300.

Hao

P.Y.

, New support vector algorithms with parametric insensitive/margin model, Neural Networks23(1) (2010), 60–73.

Chang

C.C.

and Lin

C.J.

, Training v-support vector classifiers: Theory and algorithms, Neural Computation13(9) (2001), 2119–2147.

Mangasarian

O.L.

and Wild

E.W.

, Multisurface proximal support vector machine classification via generalized eigenvalues, Pattern Analysis and Machine Intelligence, IEEE Transactions on28(1) (2006), 69–74.

Jayadeva

Khemchandani R.

and Chandra

, Twin support vector machines for pattern classification, Pattern Analysis and Machine Intelligence, IEEE Transactions on29(5) (2007), 905–910.

Shao

Y.H.

, Chen

W.J.

and Deng

N.Y.

, Nonparallel hyperplane support vector machine for binary classification problems, Information Sciences263 (2014), 22–35.

10.

Blake

C.L.

and Merz

C.J.

, UCI repository of machine learning databases, Univ. Irivine. CA: California, Dept. Inform. Comptut. Sci., [Online]. Available, 1998, http://kdd.ics.uci.edu/.

11.

Duda

R.O.

, Hart

P.E.

and Stork

D.G.

, Pattern Classification, 2nd Edition. John Wiley and Sons, 2001.

12.

Shao

Y.H.

, Zhang

C.H.

, Wang

X.B.

and Deng

N.Y.

, Improvements on twin support vector machines, Neural Networks, IEEE Transactions on22(6) (2011), 962–968.

13.

Peng

, TPMSVM: A novel twin parametricmargin support vector machine for pattern recognition, Pattern Recognition44(10) (2011), 2678–2692.