Support vector classification with fuzzy hyperplane

Abstract

In this paper, we develop a novel support vector algorithm with a fuzzy hyperplane for pattern classification. We first introduce the concepts of the fuzzy hyperplane and fuzzy linear separability. The proposed approach then seeks a fuzzy hyperplane that best separates the positive class from the negative class with the widest margin in the feature space. Further, the decision function of the proposed approach is generalized so that the values assigned to the individuals fall within a specified range and indicate the membership degree of these individuals in a given category. This integration preserves the benefits of both fuzzy set theory and SVM theory: the fuzzy hyperplane gives the SVM an effective means of capturing the approximate, imprecise nature of the real world, while the SVM minimizes the structural risk and generalizes effectively to unseen data. Experimental results are presented to indicate the performance of the proposed approach.

Keywords

Support vector machine (SVM); maximal-margin classifier; fuzzy set theory; fuzzy classifier

1 Introduction

In many real applications, the observed information is often imprecise, uncertain, and incomplete, and can be represented by fuzzy sets. Fuzzy set theory provides a strict mathematical framework in which vague conceptual phenomena can be precisely and rigorously studied [9, 28]. Since its inception in 1965, fuzzy set theory has advanced in a variety of ways and in many disciplines. It lets us work at a high level of abstraction and offers a means of dealing with imprecise data measurements. Most importantly, fuzzy set theory allows us to deal with the vague or inexact nature of the real world.

The Support Vector Machine (SVM) is a promising kernel-based learning algorithm for data classification and regression. It was first introduced by Vapnik et al. in 1995 as an approximate implementation of structural risk minimization [8, 25]. To achieve the best generalization performance, the SVM finds the best trade-off between model complexity and learning ability according to the principles of statistical learning theory. In many applications, the SVM has been shown to outperform most traditional learning machines and has become a powerful tool for solving classification and regression problems [3]. Since both SVM and fuzzy set theory have been very successful in machine learning problems, several studies have combined them. Lin et al. [17, 18] introduced a fuzzy membership as a weighting factor to reduce the influence of noise or outliers on the training procedure and proposed the first fuzzy SVM. Chiang et al. [7] applied SVM theory to fuzzy rule-based modeling. Lin et al. [19] proposed a support-vector-based fuzzy neural network for pattern classification. Hao et al. [10] applied fuzzy functions to the support vector regression machine. Ji et al. [Ji, 2014] incorporated the concepts of the possibility measure and fuzzy numbers into a support-vector-based fuzzy classifier.

In this paper, we incorporate the concept of fuzzy set theory into the support vector classification model. We first introduce the concepts of fuzzy hyperplane and fuzzy linear separability. Then, we derive a new support vector classification algorithm with the fuzzy hyperplane. The proposed approach seeks an optimal fuzzy hyperplane in the feature space to separate the positive class from the negative class with the widest margin. Further, the decision function of the proposed approach is generalized so the values assigned to the individuals fall within a specified range and represent the membership degree of these individuals in a given category. The proposed approach combines the superior classification power of support vector machine in high dimensional data spaces and the efficient means of fuzzy set theory in handling imprecise information. The experimental results show the proposed approach can yield a satisfactory generalization performance and meanwhile can estimate the membership grade that an individual belongs to a given class.

The rest of this paper is organized as follows. A brief review of the theory of the support vector classification machine is described in Section 2. The proposed support vector classification with the fuzzy hyperplane is derived in Section 3. Experiments are presented in Section 4, and some concluding remarks are given in Section 5.

2 Review of support vector classification machine

Suppose we are given a set of labeled training data vectors {x_i, y_i}, i = 1, …, N. Each training data vector x_i ∈ Rⁿ belongs to either of two classes and is given a label y_i ∈ {-1, 1}. The support vector classification machine (for brevity, we call it C-SVC) attempts to find the hyperplane 〈w · Φ (x) 〉 + b = 0 that best separates two classes of data vectors with the widest margin. The optimal separating hyperplane is obtained by solving the following constrained optimization problem:

$\begin{aligned} \underset{w, b, ξ_{i}}{\text{minimize}} \quad & \frac{1}{2} {∥ w ∥}^{2} + C \sum_{i = 1}^{N} ξ_{i} \\ \text{subject to} \quad & y_{i} (〈 w \cdot Φ (x_{i}) 〉 + b) \geq 1 - ξ_{i}, \quad ξ_{i} \geq 0 \quad \forall i . \end{aligned}$ (1)

The regularization parameter C controls the trade-off between the maximization of the margin (the width of the margin in C-SVC is 2/∥ w ∥) and the minimization of the training error. By introducing the Lagrange multiplier technique, we obtain the following dual optimization problem:

$\begin{aligned} \underset{α_{i}}{\text{maximize}} \quad & - \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} y_{i} y_{j} α_{i} α_{j} 〈 Φ (x_{i}) \cdot Φ (x_{j}) 〉 + \sum_{i = 1}^{N} α_{i} \\ \text{subject to} \quad & \sum_{i = 1}^{N} y_{i} α_{i} = 0, \quad α_{i} \in [0, C] \quad \forall i, \end{aligned}$ (2)

where α_i are the Lagrange multipliers. After optimizing the dual problem, the vector w has the form

$w = \sum_{i = 1}^{N} y_{i} α_{i} Φ (x_{i})$ (3)

By introducing the Mercer kernel k (x_i, x_j) =〈 Φ (x_i) · Φ (x_j) 〉, the corresponding decision function (or classifier) is then given by $f (x) = sgn (\sum_{i = 1}^{N} y_{i} α_{i} k (x_{i}, x) + b) .$ (4)
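As a minimal sketch of how the decision function in Equation (4) is evaluated, the snippet below sums the kernel expansion over the training points and takes the sign. The RBF kernel, the value of `q`, the toy multipliers, and the convention that a zero score maps to +1 are all assumptions made for illustration; they are not taken from a trained model.

```python
import math

def rbf_kernel(x, z, q=0.5):
    """RBF kernel k(x, z) = exp(-q * ||x - z||^2); q = 0.5 is an arbitrary choice."""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, z)))

def svc_decision(x, train_x, train_y, alphas, b, kernel=rbf_kernel):
    """Equation (4): sign of sum_i y_i * alpha_i * k(x_i, x) + b."""
    score = sum(y_i * a_i * kernel(x_i, x)
                for x_i, y_i, a_i in zip(train_x, train_y, alphas))
    # Assumption: a score of exactly zero is assigned to the positive class.
    return 1 if score + b >= 0 else -1

# Hand-picked toy values (hypothetical, not the result of training).
train_x = [(0.0, 0.0), (2.0, 2.0)]
train_y = [-1, 1]
alphas = [1.0, 1.0]

label_near_pos = svc_decision((1.8, 1.9), train_x, train_y, alphas, b=0.0)
label_near_neg = svc_decision((0.1, 0.2), train_x, train_y, alphas, b=0.0)
```

Only points with α_i > 0 contribute to the sum, which is why the classification cost grows with the number of support vectors.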

It is easy to check that all training points are treated as equally important during the learning procedure of the SVM. This leads to a higher sensitivity to some special cases, such as noise or outliers. In addition, the decision function f (x) of the SVM takes a value of either +1 or –1, i.e., each data point belongs to exactly one of two classes, and an unambiguous, sharp distinction exists between the positive class and the negative class. However, many classification concepts that we commonly employ and express in natural language describe sets that do not exhibit this characteristic. Examples are the set of tall people, highly contagious diseases, or business news. We perceive these sets as having imprecise boundaries that allow a gradual transition from membership to non-membership and vice versa. The decision function of the SVM should therefore be generalized so that the value assigned to an individual falls within a specified range and represents the degree to which that individual is similar to or compatible with the classification concept represented by the fuzzy set. An individual may thus belong to the fuzzy set to a greater or lesser degree, as indicated by a larger or smaller membership grade.

3 Support vector classification with fuzzy hyperplane

In this section, we incorporate the concept of fuzzy set theory into the support vector classification model. The proposed approach attempts to find a “fuzzy hyperplane” that best separates the positive class from the negative class. The parameters to be identified in the fuzzy hyperplane, such as the components of the weight vector and the bias term, are fuzzy numbers. The fuzzy parameters studied in this work are restricted to the class of symmetric “triangular” fuzzy numbers.

3.1 The quadratic programming problem

To seek a fuzzy hyperplane that best separates the positive class from the negative class with the widest margin, we need the following preliminaries.

Preliminary 1. [16] For any fuzzy numbers A and B and any α ∈ (0, 1], let A^α = [a₁, a₂] and B^α = [b₁, b₂] denote the α-cuts of A and B, respectively. If we define the partial ordering of closed intervals in the usual way, that is,

[a₁, a₂] ≥ [b₁, b₂] iff a₁ ≥ b₁ and a₂ ≥ b₂,

then for any fuzzy numbers A and B, we have $A \underset{f}{\geq} B \text{ iff } A^{α} \geq B^{α}$ (5) for all α ∈ (0, 1], where “ $\underset{f}{\geq}$ ” denotes the “fuzzy larger than” relation.

Let X = (m, c) be a symmetric triangular fuzzy number where m is the center and c is the width. From Preliminary 1, for any two symmetric triangular fuzzy numbers A = (m_A, c_A) and B = (m_B, c_B), we have

$A \underset{f}{\geq} B \quad \text{iff} \quad m_{A} + c_{A} \geq m_{B} + c_{B} \ \text{ and } \ m_{A} - c_{A} \geq m_{B} - c_{B} .$ (6)
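The ordering in Equation (6) can be sketched in a few lines. The helper names `alpha_cut` and `fuzzy_geq` are hypothetical, and a symmetric triangular fuzzy number is represented as a plain `(center, width)` tuple, which is an assumption about representation rather than anything prescribed by the text.

```python
def alpha_cut(m, c, alpha):
    """Closed interval [lo, hi] of the alpha-cut of the symmetric
    triangular fuzzy number with center m and width c, for alpha in (0, 1]."""
    half = (1.0 - alpha) * c
    return (m - half, m + half)

def fuzzy_geq(A, B):
    """Equation (6): A >=_f B iff both end points of the support of A
    dominate the corresponding end points of the support of B."""
    (m_a, c_a), (m_b, c_b) = A, B
    return (m_a + c_a >= m_b + c_b) and (m_a - c_a >= m_b - c_b)

# Both end points of A dominate those of B, so the relation holds.
holds = fuzzy_geq((3.0, 1.0), (1.5, 0.5))
# The upper end point of A (3.5) is below that of B (4.0), so it fails.
fails = fuzzy_geq((3.0, 0.5), (2.0, 2.0))
cut = alpha_cut(3.0, 1.0, 0.5)
```

Because the end points of every α-cut are convex combinations of the center and the support end points, checking the two support end points is enough for symmetric triangular numbers.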

Preliminary 2. [24] Let w = (w, c) be the fuzzy weight vector and B = (b, d) the fuzzy bias term. Each component of the fuzzy weight vector, w_i = (w_i, c_i), is a fuzzy number. In vector form, w = [w₁, …, w_n] ^t and c = [c₁, …, c_n] ^t, so the fuzzy weight vector means “approximately w”, described by the center w and the width c. Similarly, B = (b, d) is the fuzzy bias term, meaning “approximately b”, described by the center b and the width d. The fuzzy hyperplane, $Y = w_{1} x_{1} + \dots + w_{n} x_{n} + B = 〈 w \cdot x 〉 + B,$ (7) is defined by the following membership function: $μ_{Y} (y) = \begin{cases} 1 - \frac{| y - (〈 w \cdot x 〉 + b) |}{〈 c \cdot | x | 〉 + d} & x \neq 0, \\ 1 & x = 0, \ y = 0, \\ 0 & x = 0, \ y \neq 0, \end{cases}$ (8) where μ_Y (y) = 0 when 〈c · |x| 〉 + d ≤ |y - (〈w · x〉 + b) |.
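To make the membership function in Equation (8) concrete, the following sketch computes the center and width of the fuzzy output for one input and evaluates its triangular membership. It covers only the x ≠ 0 branch; treating a zero width as a crisp number is an assumption added here, and the function names and numeric values are illustrative.

```python
def fuzzy_output(w, c, b, d, x):
    """Center <w, x> + b and width <c, |x|> + d of Y = <w, x> + B."""
    center = sum(wi * xi for wi, xi in zip(w, x)) + b
    width = sum(ci * abs(xi) for ci, xi in zip(c, x)) + d
    return center, width

def membership(y, center, width):
    """Triangular membership of Equation (8) for x != 0, set to zero
    whenever |y - center| is at least the width."""
    if width <= 0:
        # Assumption: a zero-width number is treated as crisp.
        return 1.0 if y == center else 0.0
    return max(0.0, 1.0 - abs(y - center) / width)

# Illustrative fuzzy weights (centers w, widths c) and bias (b, d).
center, width = fuzzy_output(w=(1.0, -2.0), c=(0.5, 0.25), b=0.5, d=0.25,
                             x=(2.0, -1.0))
at_center = membership(4.5, center, width)
halfway = membership(5.25, center, width)
outside = membership(7.0, center, width)
```

Here the center is 4.5 and the width is 1.5, so the membership falls linearly from 1 at y = 4.5 to 0 at y = 3.0 and y = 6.0.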

Consider a set of N data vectors {x_i, y_i, μ_i}, i = 1, …, N, y_i ∈ {-1, 1}, μ_i ∈ (0, 1], x_i ∈ Rⁿ, where x_i is the ith data vector belonging to a binary class y_i. Each training point x_i is given a fuzzy membership μ_i indicating the attitude of the point x_i toward the corresponding class [17]. The proposed approach seeks a fuzzy hyperplane 〈w · x 〉 + B = Θ that best separates the two classes with the widest margin, where w = (w, c) is the fuzzy weight vector and B = (b, d) is the fuzzy bias term. Θ denotes the “fuzzy zero”, a triangular fuzzy number with center zero and width O_w. The set of labeled training patterns is said to be “fuzzy linearly separable” if the inequalities $\begin{aligned} 〈 w \cdot x_{i} 〉 + B & \underset{f}{\geq} I_{F} && \text{if } y_{i} = 1, \\ 〈 w \cdot x_{i} 〉 + B & \underset{f}{\leq} - I_{F} && \text{if } y_{i} = - 1 \end{aligned}$ are valid for all elements of the training set, where I_F denotes the “fuzzy one”, a triangular fuzzy number with center one and width I_w. It was shown in [25] that the optimal hyperplane is the one giving the largest margin of separation between the classes. The margin of the separating fuzzy hyperplane is simply 2/∥ w ∥. Figure 1 depicts the situation graphically. According to the maximum margin and minimum fuzziness principle, our fuzzy SV classification task is therefore to

$\begin{aligned} \underset{w, b, c, d, ξ_{i}}{\text{minimize}} \quad & J = \frac{1}{2} {∥ w ∥}^{2} + C \left( v \left( \frac{1}{2} {∥ c ∥}^{2} + d \right) + \frac{1}{N} \sum_{i = 1}^{N} μ_{i} ξ_{i} \right) \\ \text{subject to} \quad & y_{i} (〈 w \cdot x_{i} 〉 + B) \underset{f}{\geq} I_{F} - ξ_{i}, \\ & ξ_{i} \geq 0 \quad \text{for all } i = 1, \dots, N . \end{aligned}$ (9)

The minimization of ∥w∥² is equivalent to the maximization of the margin of the fuzzy hyperplane. The term $\frac{1}{2} {∥ c ∥}^{2} + d$ characterizes the vagueness of the model, and v > 0 is the vagueness parameter chosen by the user. Since the fuzzy membership μ_i is the attitude of the point x_i toward the corresponding class and ξ_i is the slack variable measuring the violation of the separability constraints, the term μ_i ξ_i is a measure of error with different weighting. Note that a smaller μ_i reduces the effect of the slack variable ξ_i, so the corresponding point x_i is treated as less important. C > 0 can be regarded as a regularization parameter.

Specifically, from the above preliminaries, our goal is to identify the fuzzy weight vector w^* = (w, c) and the fuzzy bias term B^* = (b, d) that solve the following constrained optimization problem: $\begin{aligned} \underset{w, c, b, d, ξ_{1 i}, ξ_{2 i}}{\text{minimize}} \quad & \frac{1}{2} {∥ w ∥}^{2} + C \left( v \left( \frac{1}{2} {∥ c ∥}^{2} + d \right) + \frac{1}{N} \sum_{i = 1}^{N} μ_{i} (ξ_{1 i} + ξ_{2 i}) \right) \\ \text{subject to} \quad & y_{i} (〈 w \cdot x_{i} 〉 + b) + (〈 c \cdot | x_{i} | 〉 + d) \geq 1 + I_{w} - ξ_{1 i}, \\ & y_{i} (〈 w \cdot x_{i} 〉 + b) - (〈 c \cdot | x_{i} | 〉 + d) \geq 1 - I_{w} - ξ_{2 i}, \\ & d \geq 0, \quad ξ_{1 i}, ξ_{2 i} \geq 0, \quad \text{for } i = 1, \dots, N . \end{aligned}$ (10)
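As a rough illustration of Equation (10), the sketch below computes, for fixed (w, c, b, d), the smallest slack values that satisfy the two crisp constraints and then evaluates the objective. It does not solve the optimization problem; the helper names and the toy data are assumptions.

```python
def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, c, b, d, x, y, I_w):
    """Smallest xi_1, xi_2 >= 0 satisfying the two constraints of (10)."""
    center = y * (dot(w, x) + b)
    width = dot(c, [abs(t) for t in x]) + d
    xi1 = max(0.0, 1.0 + I_w - (center + width))
    xi2 = max(0.0, 1.0 - I_w - (center - width))
    return xi1, xi2

def objective(w, c, b, d, X, Y, mu, C, v, I_w):
    """Value of the objective in (10) with the minimal feasible slacks."""
    N = len(X)
    penalty = sum(m * sum(slacks(w, c, b, d, x, y, I_w))
                  for x, y, m in zip(X, Y, mu)) / N
    return 0.5 * dot(w, w) + C * (v * (0.5 * dot(c, c) + d) + penalty)

# One-dimensional toy problem (hypothetical values).
w, c, b, d, I_w = (1.0,), (0.0,), 0.0, 0.5, 0.5
X, Y, mu = [(2.0,), (-0.5,)], [1, -1], [1.0, 1.0]

slack_far = slacks(w, c, b, d, X[0], Y[0], I_w)    # satisfies both constraints
slack_near = slacks(w, c, b, d, X[1], Y[1], I_w)   # violates both constraints
J = objective(w, c, b, d, X, Y, mu, C=1.0, v=1.0, I_w=I_w)
```

The second point lies close to the boundary, so both of its slack variables are positive and contribute to the weighted error term.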

We can find the solution of the optimization problem given by Equation (10) in dual variables by finding the saddle point of the Lagrangian:

$\begin{aligned} L = {} & \frac{1}{2} {∥ w ∥}^{2} + C \left( v \left( \frac{1}{2} {∥ c ∥}^{2} + d \right) + \frac{1}{N} \sum_{i = 1}^{N} μ_{i} (ξ_{1 i} + ξ_{2 i}) \right) \\ & - \sum_{i = 1}^{N} α_{1 i} \left( y_{i} (〈 w \cdot x_{i} 〉 + b) + 〈 c \cdot | x_{i} | 〉 + d - 1 - I_{w} + ξ_{1 i} \right) \\ & - \sum_{i = 1}^{N} α_{2 i} \left( y_{i} (〈 w \cdot x_{i} 〉 + b) - 〈 c \cdot | x_{i} | 〉 - d - 1 + I_{w} + ξ_{2 i} \right) \\ & - \sum_{i = 1}^{N} ρ_{1 i} ξ_{1 i} - \sum_{i = 1}^{N} ρ_{2 i} ξ_{2 i} - γ d \end{aligned}$ (11)

where α_1i, α_2i, ρ_1i, ρ_2i and γ are the nonnegative Lagrange multipliers. Differentiating L with respect to w, c, b, d, ξ_1i and ξ_2i and setting the results to zero, we obtain:

$\partial L / \partial w = 0 \Rightarrow w = \sum_{i = 1}^{N} y_{i} (α_{1 i} + α_{2 i}) x_{i}$ (12)

$\partial L / \partial c = 0 \Rightarrow c = \frac{1}{Cv} \sum_{i = 1}^{N} (α_{1 i} - α_{2 i}) | x_{i} |$ (13)

$\partial L / \partial b = 0 \Rightarrow \sum_{i = 1}^{N} y_{i} (α_{1 i} + α_{2 i}) = 0$ (14)

$\partial L / \partial d = 0 \Rightarrow \sum_{i = 1}^{N} (α_{1 i} - α_{2 i}) = Cv - γ$ (15)

$\partial L / \partial ξ_{1 i} = 0 \Rightarrow α_{1 i} \leq \frac{C}{N} μ_{i}$ (16)

$\partial L / \partial ξ_{2 i} = 0 \Rightarrow α_{2 i} \leq \frac{C}{N} μ_{i} .$ (17)

Substituting Equations (12)–(17) into (11), we obtain the dual problem as

$\begin{aligned} \max_{α_{1 i}, α_{2 i}} \quad & - \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} y_{i} y_{j} (α_{1 i} + α_{2 i}) (α_{1 j} + α_{2 j}) 〈 x_{i} \cdot x_{j} 〉 \\ & - \frac{1}{2 Cv} \sum_{i = 1}^{N} \sum_{j = 1}^{N} (α_{1 i} - α_{2 i}) (α_{1 j} - α_{2 j}) 〈 | x_{i} | \cdot | x_{j} | 〉 \\ & + \sum_{i = 1}^{N} (α_{1 i} - α_{2 i}) I_{w} + \sum_{i = 1}^{N} (α_{1 i} + α_{2 i}) \\ \text{subject to} \quad & \sum_{i = 1}^{N} y_{i} (α_{1 i} + α_{2 i}) = 0, \quad \sum_{i = 1}^{N} (α_{1 i} - α_{2 i}) \leq Cv, \\ & α_{1 i}, α_{2 i} \in \left[ 0, \frac{C μ_{i}}{N} \right], \quad i = 1, \dots, N . \end{aligned}$ (18)

As for the parameters (w, c, b, d), w and c follow directly from Equations (12) and (13): $w = \sum_{i = 1}^{N} y_{i} (α_{1 i} + α_{2 i}) x_{i} \quad \text{and} \quad c = \frac{1}{Cv} \sum_{i = 1}^{N} (α_{1 i} - α_{2 i}) | x_{i} | .$
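A minimal sketch of this recovery step is given below, assuming the dual variables have already been obtained from some quadratic programming solver (no solver is shown). The function name and the toy multipliers are hypothetical; the multipliers were chosen by hand so that the equality and inequality constraints of Equation (18) hold.

```python
def recover_w_c(X, Y, a1, a2, C, v):
    """Centers w and widths c of the fuzzy weight vector from the dual
    variables, following Equations (12) and (13)."""
    n = len(X[0])
    w = [sum(y * (p + q) * x[k] for x, y, p, q in zip(X, Y, a1, a2))
         for k in range(n)]
    c = [sum((p - q) * abs(x[k]) for x, p, q in zip(X, a1, a2)) / (C * v)
         for k in range(n)]
    return w, c

# Toy data and hand-picked multipliers (not a solver output).
X = [(1.0, 2.0), (-1.0, 0.5)]
Y = [1, -1]
a1 = [0.5, 0.5]
a2 = [0.25, 0.25]
C, v = 1.0, 0.5

# Constraints of (18): sum_i y_i (a1_i + a2_i) = 0 and sum_i (a1_i - a2_i) <= C v.
balance = sum(y * (p + q) for y, p, q in zip(Y, a1, a2))
spread = sum(p - q for p, q in zip(a1, a2))

w, c = recover_w_c(X, Y, a1, a2, C, v)
```

Note that the widths c depend only on the differences α_1i - α_2i, whereas the centers w depend only on the sums α_1i + α_2i.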

The parameters b and d can be determined from the following Karush-Kuhn-Tucker (KKT) conditions:

$α_{1 i} \left( y_{i} (〈 w \cdot x_{i} 〉 + b) + 〈 c \cdot | x_{i} | 〉 + d - 1 - I_{w} + ξ_{1 i} \right) = 0$ (19)

$α_{2 i} \left( y_{i} (〈 w \cdot x_{i} 〉 + b) - 〈 c \cdot | x_{i} | 〉 - d - 1 + I_{w} + ξ_{2 i} \right) = 0$ (20)

$\left( \frac{C}{N} μ_{i} - α_{1 i} \right) ξ_{1 i} = 0$ (21)

$\left( \frac{C}{N} μ_{i} - α_{2 i} \right) ξ_{2 i} = 0$ (22)

For some $α_{1 i} \in (0, \frac{C μ_{i}}{N})$ and $α_{2 j} \in (0, \frac{C μ_{j}}{N})$ , we have ξ_1i = ξ_2j = 0, and moreover the second factors in Equations (19) and (20) have to vanish. Hence, b and d can be computed as follows:

$b = \frac{- 1}{y_{i} + y_{j}} \left( y_{i} 〈 w \cdot x_{i} 〉 + y_{j} 〈 w \cdot x_{j} 〉 + 〈 c \cdot | x_{i} | 〉 - 〈 c \cdot | x_{j} | 〉 - 2 \right)$ (23)

$d = \frac{- 1}{2} \left( y_{i} 〈 w \cdot x_{i} 〉 - y_{j} 〈 w \cdot x_{j} 〉 + 〈 c \cdot | x_{i} | 〉 + 〈 c \cdot | x_{j} | 〉 - 2 I_{w} \right)$ (24)

for some i, j such that $α_{1 i} \in (0, \frac{C μ_{i}}{N}),$ $α_{2 j} \in (0, \frac{C μ_{j}}{N})$ , and y_i · y_j = 1.
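Equations (23) and (24) can be checked numerically with a small sketch. The function name `bias_terms` is hypothetical, and the two points below were constructed by hand so that the first and second KKT equalities hold exactly for known values b = 0.25 and d = 0.5; they are not outputs of a solver.

```python
def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def bias_terms(w, c, x_i, y_i, x_j, y_j, I_w):
    """Center b and width d of the fuzzy bias from Equations (23) and (24).

    x_i must have alpha_1i strictly inside its box, x_j must have alpha_2j
    strictly inside its box, and both points must share the same label."""
    if y_i != y_j:
        raise ValueError("x_i and x_j must have the same label (y_i * y_j = 1)")
    wi, wj = dot(w, x_i), dot(w, x_j)
    ci = dot(c, [abs(t) for t in x_i])
    cj = dot(c, [abs(t) for t in x_j])
    b = -(y_i * wi + y_j * wj + ci - cj - 2.0) / (y_i + y_j)
    d = -0.5 * (y_i * wi - y_j * wj + ci + cj - 2.0 * I_w)
    return b, d

# Hand-built one-dimensional example with w = 1.0, c = 0.5, I_w = 0.5.
# x_i = 0.5 meets the first constraint with equality, x_j = 1.5 the second.
b, d = bias_terms(w=(1.0,), c=(0.5,), x_i=(0.5,), y_i=1,
                  x_j=(1.5,), y_j=1, I_w=0.5)
```

Adding the two active constraints eliminates d and yields b, while subtracting them eliminates b (because y_i = y_j) and yields d.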

The fuzzy hyperplane $Y_{i}^{*} = 〈 w^{*} \cdot x_{i} 〉 + B^{*}$ is defined by the following membership function:

$μ_{Y_{i}^{*}} (y) = 1 - \frac{\left| y - \left( \sum_{k = 1}^{N} y_{k} (α_{1 k} + α_{2 k}) 〈 x_{i} \cdot x_{k} 〉 + b \right) \right|}{\left( \frac{1}{Cv} \sum_{k = 1}^{N} (α_{1 k} - α_{2 k}) 〈 | x_{i} | \cdot | x_{k} | 〉 \right) + d} .$ (25)

For any x_i, $Y_{i}^{*} = 〈 w^{*} \cdot x_{i} 〉 + B^{*}$ is a symmetric triangular fuzzy number with center 〈w · x_i 〉 + b and width 〈c · |x_i| 〉 + d. The fuzzy origin, Θ, is also defined as a symmetric triangular fuzzy number with center zero and width O_w. For a new point x, we determine the degree to which $Y^{*} = 〈 w^{*} \cdot x 〉 + B^{*}$ is fuzzy larger than Θ by applying the following fuzzy partial ordering relation. For any two symmetric triangular fuzzy numbers A = (m_A, c_A) and B = (m_B, c_B), the degree to which A is larger than B is defined by the following membership function: $R_{\geq B} (A) = \begin{cases} 1 & \text{if } α > 0 \text{ and } β > 0, \\ 0 & \text{if } α < 0 \text{ and } β < 0, \\ 0.5 \left( 1 + \frac{α + β}{\max (| α |, | β |)} \right) & \text{otherwise,} \end{cases}$ (26)

where α = (m_A + c_A) - (m_B + c_B) and β = (m_A - c_A) - (m_B - c_B).

Note that R_≥B (A) = 0.5 if m_A = m_B, R_≥B (A) < 0.5 if m_A < m_B, and R_≥B (A) > 0.5 if m_A > m_B. The decision function of the proposed fuzzy SV classification model is $f (x) = R_{\geq Θ} (〈 w^{*} \cdot x 〉 + B^{*}) .$ (27)

This decision function takes a value between 0.0 and 1.0 and represents the degree to which the new point x belongs to the positive class. The boundary between the positive class and the negative class is vague and imprecise.
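The ordering degree of Equation (26) and the decision function of Equation (27) can be sketched as follows. Fuzzy numbers are again represented as `(center, width)` tuples; returning 0.5 when the two numbers coincide (so that both differences are zero) is an assumption added here to avoid a division by zero, since that case is not spelled out in the text.

```python
def degree_geq(A, B):
    """Degree R_{>=B}(A) to which A is larger than B, Equation (26)."""
    (m_a, c_a), (m_b, c_b) = A, B
    alpha = (m_a + c_a) - (m_b + c_b)   # difference of upper end points
    beta = (m_a - c_a) - (m_b - c_b)    # difference of lower end points
    if alpha > 0 and beta > 0:
        return 1.0
    if alpha < 0 and beta < 0:
        return 0.0
    top = max(abs(alpha), abs(beta))
    if top == 0.0:
        return 0.5                      # assumption: identical fuzzy numbers
    return 0.5 * (1.0 + (alpha + beta) / top)

def fuzzy_decision(center, width, O_w):
    """Equation (27): degree to which Y* = (center, width) exceeds the
    fuzzy zero Theta = (0, O_w)."""
    return degree_geq((center, width), (0.0, O_w))

clearly_positive = fuzzy_decision(2.0, 0.5, O_w=0.5)
on_boundary = fuzzy_decision(0.25, 1.0, O_w=0.5)
centered = fuzzy_decision(0.0, 1.0, O_w=0.5)
clearly_negative = fuzzy_decision(-2.0, 0.5, O_w=0.5)
```

A point whose fuzzy output straddles the fuzzy zero receives an intermediate grade, here 5/6 for the second example, instead of a hard label.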

3.2 Extension to the nonlinear fuzzy SV classification model

The decision boundary given by (25) is a hyperplane in Rⁿ. More complex decision surfaces can be obtained by employing a nonlinear mapping Φ : Rⁿ → F to map the data points into a higher dimensional feature space F, and finding the maximal-margin fuzzy separating hyperplane in the feature space. Note that the data point x_i never appears in isolation in the algorithm but always in the form of the inner products 〈x_i · x_j 〉 and 〈|x_i| · |x_j| 〉. The algorithm therefore depends on the data only through inner products in F, i.e., through functions of the form 〈Φ (x_i) · Φ (x_j) 〉 and 〈Φ (|x_i|) · Φ (|x_j|) 〉. Hence, it suffices to know and use k (x_i, x_j) = 〈 Φ (x_i) · Φ (x_j) 〉 and k (|x_i|, |x_j|) = 〈 Φ (|x_i|) · Φ (|x_j|) 〉 instead of Φ (•) explicitly [10]. Replacing 〈x_i · x_j 〉 and 〈|x_i| · |x_j| 〉 with k (x_i, x_j) and k (|x_i|, |x_j|), respectively, leads to the following dual optimization problem:

$\begin{aligned} \max_{α_{1 i}, α_{2 i}} \quad & - \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} y_{i} y_{j} (α_{1 i} + α_{2 i}) (α_{1 j} + α_{2 j}) k (x_{i}, x_{j}) \\ & - \frac{1}{2 Cv} \sum_{i = 1}^{N} \sum_{j = 1}^{N} (α_{1 i} - α_{2 i}) (α_{1 j} - α_{2 j}) k (| x_{i} |, | x_{j} |) \\ & + \sum_{i = 1}^{N} (α_{1 i} - α_{2 i}) I_{w} + \sum_{i = 1}^{N} (α_{1 i} + α_{2 i}) \\ \text{subject to} \quad & \sum_{i = 1}^{N} y_{i} (α_{1 i} + α_{2 i}) = 0, \quad \sum_{i = 1}^{N} (α_{1 i} - α_{2 i}) \leq Cv, \\ & 0 \leq α_{1 i} \leq \frac{C μ_{i}}{N}, \quad 0 \leq α_{2 i} \leq \frac{C μ_{i}}{N}, \quad i = 1, \dots, N . \end{aligned}$ (28)

The fuzzy surface is defined by the following membership function: $μ_{Y_{i}^{*}} (y) = 1 - \frac{| y - (\sum_{k = 1}^{N} y_{k} (α_{1 k} + α_{2 k}) k (x_{i}, x_{k}) + b) |}{(\frac{1}{Cv} \sum_{k = 1}^{N} (α_{1 k} - α_{2 k}) k (| x_{i} |, | x_{k} |)) + d} .$ (29)
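As a sketch of how the center and width in Equation (29) are evaluated for a query point, the function below takes the kernel as an argument. With a linear kernel it reduces to the explicit form 〈w · x〉 + b and 〈c · |x|〉 + d, which gives a simple consistency check. All names and numeric values are illustrative assumptions, and the dual variables are hand-picked rather than produced by a solver.

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, q=0.5):
    """RBF kernel exp(-q * ||x - z||^2); q = 0.5 is an arbitrary choice."""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, z)))

def kernel_center_width(x, X, Y, a1, a2, b, d, C, v, kernel):
    """Center and width of the fuzzy output at x, as used in Equation (29)."""
    abs_x = [abs(t) for t in x]
    center = sum(y * (p + q) * kernel(xk, x)
                 for xk, y, p, q in zip(X, Y, a1, a2)) + b
    width = sum((p - q) * kernel([abs(t) for t in xk], abs_x)
                for xk, p, q in zip(X, a1, a2)) / (C * v) + d
    return center, width

# Toy values (hypothetical).
X = [(1.0, 2.0), (-1.0, 0.5)]
Y = [1, -1]
a1, a2 = [0.5, 0.5], [0.25, 0.25]
query = (1.0, -1.0)

lin_center, lin_width = kernel_center_width(
    query, X, Y, a1, a2, b=0.0, d=0.5, C=1.0, v=0.5, kernel=linear_kernel)
rbf_center, rbf_width = kernel_center_width(
    query, X, Y, a1, a2, b=0.0, d=0.5, C=1.0, v=0.5, kernel=rbf_kernel)
```

For these values the linear kernel gives a center of 0.375 and a width of 2.75, which matches w = (1.5, 1.125) and c = (1.0, 1.25) computed explicitly from the same multipliers.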

3.3 Support vectors and margin-error vectors

The set of labeled training patterns is said to be fuzzy linearly separable if the inequality $y_{i} (〈 w \cdot x_{i} 〉 + B) \underset{f}{\geq} I_{F}$ (30) is valid for all data vectors of the training set. Here, I_F denotes the “fuzzy one”, a triangular fuzzy number with center one and width I_w. Specifically, from Preliminary 1, the above inequality is equivalent to $\begin{aligned} y_{i} (〈 w \cdot x_{i} 〉 + b) + 〈 c \cdot | x_{i} | 〉 + d & \geq 1 + I_{w}, \\ y_{i} (〈 w \cdot x_{i} 〉 + b) - 〈 c \cdot | x_{i} | 〉 - d & \geq 1 - I_{w} . \end{aligned}$

According to the Karush-Kuhn-Tucker (KKT) conditions given by Equations (19–22), the training data points can be classified into the following categories.

Normal Vectors (NVs): The data points x_i for which α_1i = α_2i = 0 are called Normal Vectors. According to Equations (19–22) and (26), the degree that y_i (〈 w^* · x_i 〉 + B^*) is larger than I_F is 1 (i.e., the membership degree that x_i satisfies the fuzzy separable constraint Equation (30) is R_{≥I_F} (y_i (〈 w^* · x_i 〉 + B^*)) = 1) for those NVs. Namely, those NVs completely satisfy the fuzzy separable constraint Equation (30).

Support Vectors (SVs): The data points x_i for which α_1i > 0 or α_2i > 0 are called Support Vectors. According to Equations (19–22), all training points with ξ_1i > 0 or ξ_2i > 0 certainly satisfy α_1i > 0 or α_2i > 0. By the definition of the fuzzy partial ordering relation, the degree to which y_i (〈 w^* · x_i 〉 + B^*) is larger than I_F is in the range (0, 1] (i.e., the membership degree with which x_i satisfies the fuzzy separable constraint in Equation (30) is R_{≥I_F} (y_i (〈 w^* · x_i 〉 + B^*)) ∈ (0, 1]) for those SVs. It follows that those SVs partially satisfy/violate the fuzzy separable constraint in Equation (30).

Margin-Error Vectors (MEVs): The data points x_i for which α_1i = Cμ_i/ N and α_2i = Cμ_i/ N are termed Margin-Error Vectors. Combining Equations (19–22) shows ξ_1i > 0 and ξ_2i > 0 if α_1i = Cμ_i/ N and α_2i = Cμ_i/ N, respectively. By the definition of fuzzy partial ordering relation, the degree that y_i (〈 w^* · x_i 〉 + B^*) is larger than I_F is 0 (i.e., the membership degree that x_i satisfies the fuzzy separable constraint Equation (30) is R_{≥I_F} (y_i (〈 w^* · x_i 〉 + B^*)) = 0) for those MEVs. Namely, those MEVs completely violate the fuzzy separable constraint Equation (30). Note, those MEVs could still lie on the correct side of the fuzzy decision boundary.
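The three categories above can be read off directly from the dual variables. The sketch below is one possible reading, under two assumptions that are not stated in the text: a small numerical tolerance `tol` is used when comparing against the bounds, and a point that meets the MEV definition is reported as "MEV" even though it also satisfies the SV definition.

```python
def categorize(a1, a2, mu, C, N, tol=1e-8):
    """Label a training point as 'NV', 'SV', or 'MEV' from its multipliers.

    a1, a2 : dual variables alpha_1i and alpha_2i of the point
    mu     : fuzzy membership of the point
    C, N   : regularization parameter and number of training points
    """
    upper = C * mu / N                      # box bound C * mu_i / N
    if a1 <= tol and a2 <= tol:
        return "NV"                         # both multipliers are zero
    if abs(a1 - upper) <= tol and abs(a2 - upper) <= tol:
        return "MEV"                        # both multipliers at the bound
    return "SV"                             # at least one multiplier positive

# With C = 10, N = 5 and mu = 1 the upper bound is 2.0.
labels = [
    categorize(0.0, 0.0, mu=1.0, C=10.0, N=5),
    categorize(0.7, 0.0, mu=1.0, C=10.0, N=5),
    categorize(2.0, 0.5, mu=1.0, C=10.0, N=5),
    categorize(2.0, 2.0, mu=1.0, C=10.0, N=5),
]
```

Counting the labels over a training set gives the numbers of SVs and MEVs of the kind reported in the experiments.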

We will discuss the influences of model parameters on the number of SVs and MEVs in the experimental section.

4 Experiments

In this section, several examples are used to verify the effectiveness of the proposed fuzzy SV classification algorithm. We apply the proposed method to benchmark datasets, a handwritten digits problem, and a stock market trend prediction problem. In these simulations, we only consider the radial basis function (RBF) kernel k (x, y) = exp(- q ∥ x - y ∥ ²) for all algorithms. The model parameters C, v, q, I_w, and O_w were tuned using a grid search mechanism. For simplicity, we set I_w = O_w in the following simulations.

4.1 Benchmark dataset

We evaluate the classification performance of the proposed fuzzy SV classification algorithm on several well-known benchmark datasets from PROBEN1 [22], the UCI repository [2], and the Statlog collection [20]. We scale all data to lie in [–1, 1]. Some of those datasets have more than two classes. For simplicity, we treat all data not in the first class as belonging to the negative class. We compare the proposed fuzzy SV classification approach with the original C-SVC [25], v-SVC [23], par-v-SVC [11], and WCS-FSVM [1]. For a fair comparison with C-SVC and v-SVC, we set the fuzzy membership μ_i = 1, ∀ i, for all datasets. The most important criterion for evaluating the performance of those algorithms is their accuracy rate. Note that the generalized accuracy of the SVM classifier depends on the values of the regularization parameter C, the kernel parameter q, the vagueness parameter v, I_w, and O_w. For simplicity, we set I_w = O_w for all problems. Although there exist many model-parameter selection methods for SVMs [4], the most popular method for choosing the model parameters of SVMs is still the exhaustive search [12]. In the following experiments, we estimate the generalized accuracy using different regularization parameters C = [10⁻¹, 10⁰, 10¹, …, 10⁶], kernel parameters q = [2⁴, 2³, …, 2⁻¹⁰], vagueness parameters v = [0.1, 0.2, 0.3, …, 0.8], and I_w = O_w = [0.1, 0.2, 0.3, …, 0.9] for each algorithm. We apply the ten-fold cross-validation method to the whole training data to select the model parameters and estimate the generalized accuracy.
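The exhaustive search described above can be sketched by enumerating the candidate parameter combinations. The snippet assumes that the C grid runs over the powers of ten from 10⁻¹ to 10⁶, and the interleaved fold assignment in `ten_fold_indices` is a simplification chosen for illustration; the text does not say how the folds were formed, and no classifier is trained here.

```python
from itertools import product

# Candidate values, following the ranges quoted in the text.
C_grid = [10.0 ** e for e in range(-1, 7)]           # 10^-1, ..., 10^6
q_grid = [2.0 ** e for e in range(4, -11, -1)]       # 2^4, 2^3, ..., 2^-10
v_grid = [round(0.1 * k, 1) for k in range(1, 9)]    # 0.1, ..., 0.8
Iw_grid = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, ..., 0.9 (I_w = O_w)

# Every (C, q, v, I_w) combination to be scored by cross-validation.
grid = list(product(C_grid, q_grid, v_grid, Iw_grid))

def ten_fold_indices(n, k=10):
    """Split indices 0..n-1 into k interleaved folds (assumed scheme)."""
    return [list(range(i, n, k)) for i in range(k)]

folds = ten_fold_indices(25)
```

With these ranges the search covers 8 × 15 × 8 × 9 = 8640 combinations per dataset, each of which requires ten training runs under ten-fold cross-validation.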

Table 1 reports the comparison of C-SVC, v-SVC, par-v-SVC, WCS-FSVM, and the proposed approach in terms of accuracy rate, training time, and number of SVs. The training time complexity for solving the SVM-type quadratic optimization problem with N variables is O(N³). Because the number of variables to be solved in the proposed quadratic optimization problem is double that of the original SVM, the training of the proposed fuzzy SVC approach is approximately eight times slower than that of a classical SVM (as shown in Table 1).
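The factor of eight follows from the cubic cost model stated above. As a rough worked step, assuming the same O(N³) scaling applies to the doubled problem (the symbols T_fuzzy and T_SVM are introduced here only for illustration):

```latex
\[
\frac{T_{\text{fuzzy}}}{T_{\text{SVM}}}
\approx \frac{(2N)^{3}}{N^{3}} = 2^{3} = 8 .
\]
```

This is an order-of-magnitude estimate; the measured ratio depends on the solver and the dataset.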

As for the test time complexity, the classification time of an SVM model increases linearly with the number of support vectors. Although the number of variables in the proposed quadratic optimization problem is double that of the original SVM, the number of support vectors does not increase significantly in our approach. The proposed fuzzy SVC algorithm achieves a better accuracy rate with fewer support vectors on the wdbc and new-thyroid datasets, indicating that the proposed approach is more accurate and faster at test time on those datasets. In many real-time applications, the generalization ability and classification speed are more important than the training speed. Hence, it is worth noting that our fuzzy SVC yields accuracy and numbers of SVs comparable to those of the classical SVM approaches, which suggests that it is suitable for real-world classification problems. The experimental results demonstrate that our approach performs fairly well on those benchmark datasets.

4.2 Analysis of the influence of model parameters

In this part, we analyze how the model parameters influence the results of the proposed fuzzy SV classification model. We consider the following measures: the number of SVs/MEVs, the training/test accuracy rate, and the width of the fuzzy hyperplane. The width of the fuzzy hyperplane is defined as $FuzzyWidth := \frac{1}{N} \sum_{i = 1}^{N} | 〈 c \cdot | x_{i} | 〉 + d |$ . Figure 2(a) illustrates the influence of the fuzzy width on the degree of vagueness of the fuzzy boundary. As seen from Fig. 2(a), $Y_{i}^{*}$ is completely larger than Θ (i.e., $R_{\geq Θ} (Y_{i}^{*}) = 1$ ) when the width of $Y_{i}^{*}$ is small. As the value of 〈c · |x_i| 〉 + d increases, $Y_{i}^{*}$ becomes partially larger than Θ (i.e., $R_{\geq Θ} (Y_{i}^{*}) \in (0.5, 1)$ ). Namely, point x_i would be located on the fuzzy boundary, and the classification model gets vaguer as the fuzzy width increases.
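The FuzzyWidth measure defined above is a plain average over the training points. A minimal sketch for the linear case, with hypothetical width parameters c and d, is:

```python
def fuzzy_width(X, c, d):
    """Average width |<c, |x_i|> + d| of the fuzzy hyperplane over X."""
    total = 0.0
    for x in X:
        width = sum(ck * abs(xk) for ck, xk in zip(c, x)) + d
        total += abs(width)
    return total / len(X)

# Illustrative values: widths 4.0 and 2.125 average to 3.0625.
X = [(1.0, 2.0), (-1.0, 0.5)]
value = fuzzy_width(X, c=(1.0, 1.25), d=0.5)
```

In the kernelized model the inner product 〈c · |x_i|〉 would be replaced by the corresponding kernel expansion from Equation (29).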

Figures 3 through 6 illustrate the results of the proposed fuzzy SVC algorithm on the iris dataset with different settings of the following parameters: v (vagueness parameter), I_w (width of symmetric triangular fuzzy numbers one), C (regularization parameter), and q (RBF kernel parameter). The arithmetic means and standard deviation error bars are estimated by employing the ten-fold cross-validation mechanism. Figure 3 illustrates the results of the proposed fuzzy SV classification model for different values of the vagueness parameter v. It can be seen the numbers of SVs/MEVs and training/test accuracy rates are insensitive toward changes in v. Moreover, the width of the fuzzy hyperplane increases when the vagueness parameter v is decreased. Namely, more data points would be located on the fuzzy boundary and the resulting classification model gets vaguer when the vagueness parameter v is decreased.

Figure 4 shows how the setting of the parameter I_w (= O_w) influences the proposed fuzzy SVC model. The parameter I_w (= O_w) also controls the vagueness of the resulting fuzzy boundary. As illustrated in Fig. 2(b), the membership degree with which $Y_{i}^{*} = 〈 w^{*} \cdot x_{i} 〉 + B^{*} \underset{f}{\geq} Θ$ is satisfied is $R_{\geq Θ} (Y_{i}^{*}) = 1$ when O_w is small. Namely, $Y_{i}^{*}$ is completely larger than Θ. As the parameter O_w increases, $Y_{i}^{*}$ becomes partially larger than Θ (i.e., $R_{\geq Θ} (Y_{i}^{*}) \in (0.5, 1)$ ). In other words, point x_i would be located on the fuzzy boundary, and the resulting classification model gets vaguer as O_w increases. As shown in Fig. 4, the number of SVs increases and the number of MEVs decreases as I_w increases, largely because NVs (completely satisfying the fuzzy separable constraint) and MEVs (completely violating the fuzzy separable constraint) turn into SVs (partially satisfying/violating the fuzzy separable constraint) as I_w increases. Moreover, the training/test accuracy rate and FuzzyWidth are independent of I_w.

Figure 5 illustrates the proposed fuzzy SVC model for different values of the regularization parameter C (the horizontal axis represents the logarithm of C to the base 10). The chosen C mainly controls the cost that one is willing to pay for violations of the fuzzy separable constraints in Equation (10). As shown in Fig. 5, the numbers of SVs/MEVs decrease when the regularization parameter C is increased, largely because more training points completely satisfy the fuzzy separable constraint given in Equation (30) as the penalty assigned to the misclassified patterns is increased. The accuracy rate first increases and then decreases as the parameter C increases, largely because penalizing misclassification errors too heavily leads to overfitting. Moreover, the FuzzyWidth varies independently of C.

Figure 6 shows how the RBF kernel parameter q influences the proposed fuzzy SVC model (the horizontal axis represents the logarithm of q to the base 2). As q increases, the decision boundary fits the data more tightly; conversely, the decision boundary gets smoother as q decreases. As shown in Fig. 6, the number of SVs first decreases and then increases as q increases, while the number of MEVs decreases as q increases. This is because, as the decision boundary fits the data more tightly, the number of training patterns that partially or completely violate the fuzzy separable constraint (i.e., SVs or MEVs) is reduced. However, as the decision boundary becomes more complex later on, more SVs are required to construct the fuzzy hyperplane. In addition, the accuracy rate on the training patterns increases as q increases, whereas the test accuracy rate first increases and then decreases. Fitting the data more tightly initially improves the learning ability, but it eventually leads to overfitting because the model memorizes the errors and noise in the training data. Moreover, the FuzzyWidth is insensitive to changes in q. In summary, parameters v and I_w mainly control the vagueness of the resulting model, while parameters C and q mainly control the accuracy rate of the proposed approach.

4.3 Handwritten digits dataset

We now evaluate the classification performance of the proposed fuzzy SV classification approach on the handwritten digits problem. This dataset consists of isolated binary handwritten digits partially extracted from the BR digit set of the SUNY CDROM-1 [13] and the ITRI database [5].

We use two categories that form a confusion group, class “2” and class “7”, with 600 samples per class. Each digit has a different size, and a sample of some of the digits is shown in Fig. 7. In the handwritten digit problem, most of the digit classes contain different writing styles. In addition, noisy patterns and outliers make handwritten digit recognition even more difficult. Since the optimal classifier constructed by the SVM depends on only a small part of the training patterns, it may become sensitive to outliers or noise in the training set [17, 18]. To overcome this problem, we can assign a lower fuzzy membership to those outliers or noisy patterns and treat them as less important.

In this experiment, we apply the multi-sphere fuzzy SV clustering algorithm [6] to estimate the fuzzy membership μ_i for each digit pattern. The multi-sphere fuzzy SV clustering approach first maps data points into a higher dimensional feature space through a desired kernel function. It then applies an adaptive cell growing model to identify dense regions in the original space by finding their corresponding hyperspheres with minimal radius in the feature space. Unlike the original support vector clustering, this approach regards each hypersphere in the feature space as representing a single cluster of arbitrary shape in the original space. To determine the membership grade of a point belonging to a certain cluster, the multi-sphere fuzzy SV clustering approach considers not only the distance to the corresponding spherical center, but also the radius of the hypersphere in the feature space. It uses the following fuzzy membership function to compute the degree to which a given point x belongs to cluster j:

$μ_{j}(x) = \begin{cases} 0.5 \cdot \left( \frac{1 - \frac{1}{R_{j}} D_{j}(x)}{1 + λ_{1} \frac{1}{R_{j}} D_{j}(x)} \right) + 0.5 & \text{if } D_{j}(x) \leq R_{j} \\ 0.5 \cdot \left( \frac{1}{1 + λ_{2} (D_{j}(x) - R_{j})} \right) & \text{otherwise} \end{cases}$ (31)

where R_j is the spherical radius in feature space corresponding to cluster j; D_j (x) is the distance between x and the spherical center in feature space; and the parameters λ₁ and λ₂ satisfy $λ_{2} = \frac{1}{R_{j} (1 + λ_{1})}$ (32), which makes μ_j (x) differentiable at D_j (x) = R_j. In this experiment, we set λ₁ = 0 and choose λ₂ according to Equation (32), i.e., λ₂ = 1/R_j.
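The membership function of Equations (31) and (32) can be sketched in a few lines. This is only an illustrative sketch: the function name `cluster_membership` and its argument names are hypothetical, and the distance D_j (x) and radius R_j are assumed to have been computed beforehand by the clustering step.

```python
def cluster_membership(dist, radius, lam1=0.0):
    """Membership of a point in cluster j, following Eqs. (31) and (32).

    dist   -- D_j(x), distance from x to the sphere center in feature space
    radius -- R_j, radius of the hypersphere for cluster j (must be > 0)
    lam1   -- lambda_1; the experiment in the text uses lambda_1 = 0
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    # Eq. (32): lambda_2 is tied to lambda_1 so the two branches have the
    # same slope at dist == radius.
    lam2 = 1.0 / (radius * (1.0 + lam1))
    if dist <= radius:
        ratio = dist / radius
        return 0.5 * (1.0 - ratio) / (1.0 + lam1 * ratio) + 0.5
    return 0.5 / (1.0 + lam2 * (dist - radius))


# Illustrative values for a sphere of radius 2 with lambda_1 = 0.
for d in (0.0, 1.0, 2.0, 4.0):
    print(d, cluster_membership(d, radius=2.0))
```

Under these definitions a point at the sphere center receives membership 1, a point on the sphere surface receives 0.5, and the membership decays toward 0 outside the sphere, so outliers far from every dense region are weighted down.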

For classes “2” and “7”, we randomly select 300 samples as training instances and use the remaining samples as testing instances. We employ ten-fold cross-validation over the training set to obtain the optimal model parameters. We then train on the whole training set using the model parameters that achieve the best validation rate and predict the test set. Table 2 compares the original C-SVC, par-v-SVC [11], Lin’s FSVM [18], WCS-FSVM [1], and the proposed fuzzy SV classification approach (without and with assigning a fuzzy membership to each data point). We report the optimal model parameters, the corresponding cross-validation rates on the training set, the accuracy rates on the test set, and the numbers of support vectors. As seen from Table 2, associating a fuzzy membership with each digit pattern can effectively reduce the influence of noise, and incorporating the concept of a fuzzy hyperplane into the SVM might be very useful for dealing with the vague and inexact nature of the real world. As a whole, the experimental results demonstrate that the proposed method performs fairly well on the handwritten digit dataset.
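The ten-fold cross-validation protocol described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the experiments: `k_fold_indices`, `cross_validation_rate`, and the callback `evaluate_fold` are hypothetical names, the callback stands in for training a classifier on one split and counting correct validation predictions, and the training-set size of 600 assumes 300 samples from each of the two classes.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_rate(n_samples, evaluate_fold, k=10, seed=0):
    """Return the fraction of validation patterns classified correctly.

    evaluate_fold(train_idx, val_idx) is a user-supplied (hypothetical)
    callback that trains on train_idx and returns the number of correctly
    classified patterns in val_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    correct = 0
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        correct += evaluate_fold(train_idx, val_idx)
    return correct / n_samples


# Dummy evaluator that pretends every validation pattern is classified
# correctly; a real one would train a classifier for each candidate
# parameter setting.
rate = cross_validation_rate(600, lambda train_idx, val_idx: len(val_idx))
print(rate)
```

In a parameter search, this rate would be computed once per candidate setting, and the setting with the highest rate would be used to retrain on the whole training set before predicting the test set.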

Although this experiment achieves satisfactory results, accuracy alone does not address the actual goal the fuzzy SV classification algorithm was designed for. The main advantage of the proposed approach is that it provides an imprecise boundary between the positive and the negative classes to capture the uncertain nature of the real world. Figure 8 shows patterns located in the fuzzy boundary constructed by the proposed approach. Namely, the values of the fuzzy decision function f (x) given in Equation (27) are in the range (0, 1) for all of these patterns, representing the maximum degree of vagueness between the two classes. As seen from Fig. 8, there are more “fuzzy digit patterns” in class “2” than in class “7”. This is caused by the various writing styles in the training set of class “2”. As shown in Fig. 8, digit patterns located in the fuzzy boundary are the most confusing and difficult to classify. Rather than imposing a precise cutoff between categories for those digit patterns, tolerating a reasonable amount of imprecision, vagueness, and uncertainty during the modeling phase makes it possible to handle the complex handwritten digits problem in an appropriate way. The experimental results show that the fuzzy hyperplane provides an effective means of capturing the inexact, approximate nature of the real world.

4.4 Stock market trend prediction problem

In this section, we apply the proposed fuzzy SVC approach to the stock market trend prediction problem. The financial market is a complex, evolutionary, and non-linear dynamical system. Predicting the next-day stock trend (either increase or decrease) is therefore a difficult task. However, it is important because it provides tangible information for investment decisions, and it is a highly attractive issue to researchers from different fields. Numerous studies in financial economics have reported that technical indicators can be used to predict the direction of daily change of the stock price index. We select 17 technical indicators to construct the initial attributes, as determined by the review of domain experts and prior research [15], and classify each day point into one of two classes: increase and decrease.

The stock trend prediction problem is an inherently ambiguous classification problem. Clearly, a day point whose stock price changes by +5% has a higher membership degree in the “increase” class than a day point whose stock price changes by +0.05%. Moreover, a day point whose stock price changes by +0.001% seems to be located on the neutral and ambiguous boundary between the “increase” and “decrease” classes. We assign the fuzzy membership to each day point by using the following S-shaped membership function:

$μ_{i}(Δp_{i}, a, b) = \begin{cases} 0 & |Δp_{i}| \leq a \\ 2 \left( \frac{|Δp_{i}| - a}{b - a} \right)^{2} & a \leq |Δp_{i}| \leq \frac{a + b}{2} \\ 1 - 2 \left( \frac{|Δp_{i}| - b}{b - a} \right)^{2} & \frac{a + b}{2} \leq |Δp_{i}| \leq b \\ 1 & |Δp_{i}| \geq b \end{cases}$ (33)

where Δp_i is the daily change of the stock price index at the ith day point, a = 0, and b = median (|Δp_i|) is the median of |Δp_i| over i = 1, …, N.
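A minimal sketch of Equation (33) with a = 0 and b set to the median absolute change is given below. The function name `s_membership` and the list `changes` of daily changes are hypothetical; the sample values are made up purely for illustration and are not taken from the data set.

```python
import statistics


def s_membership(delta_p, a, b):
    """S-shaped fuzzy membership of a day point, following Eq. (33)."""
    x = abs(delta_p)
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    mid = (a + b) / 2.0
    if x <= mid:
        return 2.0 * ((x - a) / (b - a)) ** 2
    return 1.0 - 2.0 * ((x - b) / (b - a)) ** 2


# Hypothetical daily changes of the stock price index, in percent.
changes = [-2.0, -0.5, 0.1, 0.4, 1.0, 3.0]

a = 0.0
b = statistics.median(abs(c) for c in changes)  # median of |delta_p|

for c in changes:
    # The sign of the change gives the class; the magnitude gives the weight.
    label = "increase" if c > 0 else "decrease"
    print(f"{c:+.1f}%  {label:8s}  membership = {s_membership(c, a, b):.3f}")
```

With this choice of b, every day point whose absolute change is at least the median receives full membership, while day points with very small changes receive memberships close to 0 and therefore contribute little to the training of either class.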

The stock price index data set is collected from the TEJ+ database (http://www.finasia.biz/ensite/). We randomly select four companies: China Steel Co., Ltd. (CSC), Taiwan Semiconductor Manufacturing Co., Ltd. (TSMC), United Microelectronics Co., Ltd. (UMC), and Winbond Electronics Co., Ltd. (WEC). The daily stock price data cover the period from 1-Jan-2011 to 1-Jan-2013. We compare the proposed fuzzy SVC approach with the original C-SVC, par-v-SVC [11], Lin’s FSVM [18], and WCS-FSVM [1]. We apply ten-fold cross-validation on the whole training data to select the model parameters and estimate the generalization accuracy. Table 3 compares these methods in terms of cross-validation rate and number of SVs. As seen from Table 3, the proposed fuzzy SVC yields results comparable to those of the traditional SVM and FSVM approaches, suggesting that incorporating the concept of a fuzzy hyperplane into the SVM might be very useful for dealing with the ambiguous boundary in the stock market trend prediction problem. In summary, rather than forcing every day point into either the “increase” class or the “decrease” class, tolerating a reasonable amount of imprecision, vagueness, and uncertainty during the estimation of the optimal separating hyperplane provides an effective way of solving the complex stock market trend prediction problem (for some day points, neither “increase” nor “decrease” is a meaningful label). The experimental results show that the fuzzy hyperplane provides an effective means of capturing the inexact, approximate nature of the real world.

5 Conclusion

In this paper, by combining fuzzy set theory with the SVM, we propose a novel support vector algorithm with a fuzzy hyperplane for pattern classification. We first introduce the concepts of fuzzy linear separability and fuzzy hyperplane. Then, the proposed approach constructs a fuzzy hyperplane that best separates the positive class from the negative class with the widest margin in the feature space. Further, the decision function of the proposed approach is generalized so that the values assigned to the individuals fall within a specified range and represent the membership degree of these individuals in a given category. This integration preserves the benefits of both fuzzy set theory and SVM theory: the fuzzy hyperplane provides the SVM with an effective means of handling the approximate, imprecise nature of the real world, while the SVM minimizes the structural risk and generalizes effectively to unseen data. The experimental results show that the proposed approach can yield satisfactory generalization performance while also estimating the membership grade with which an individual belongs to a given class.

Acknowledgments

This research work was supported in part by the Ministry of Science and Technology Research Grant MOST 103-2221-E-151-032.

References

1. and Liang, Fuzzy support vector machine based on within-class scatter for classification problems with outliers or noises, Neurocomputing 110 (2013), 101–110.

2. Blake C.L. and Merz C.J., UCI Repository of Machine Learning Databases, Univ California, Dept Inform Comput Sci, Irvine, CA, 1998. [Online]. Available: http://kdd.ics.uci.edu/

3. Burges C.J.C., A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2(2) (1998), 955–974.

4. Chapelle, Vapnik, Bousquet and Mukherjee, Choosing multiple parameters for support vector machines, Machine Learning 46 (2002), 131–159.

5. Chiang J.-H. and Gader, Recognition of handprinted numerals in VISA card application forms, Machine Vision and Applications 10 (1997), 144–149.

6. Chiang J.-H. and Hao P.-Y., A new kernel-based fuzzy clustering approach: Support vector clustering with cell growing, IEEE Trans on Fuzzy Systems 11(4) (2003), 518–527.

7. Chiang J.-H. and Hao P.-Y., Support vector learning mechanism for fuzzy rule-based modeling: A new approach, IEEE Trans on Fuzzy Systems 12(1) (2004), 1–12.

8. Cortes and Vapnik V.N., Support vector network, Machine Learning 20 (1995), 1–25.

9. Dubois and Prade, Operations on fuzzy number, Int J Syst Sci 9 (1978), 613–626.

10. Hao P.-Y. and Chiang J.-H., Fuzzy regression analysis by support vector learning approach, IEEE Trans on Fuzzy Systems 16(2) (2008), 428–441.

11. Hao P.-Y., New support vector algorithms with parametric insensitive/margin model, Neural Networks 23(1) (2010), 60–73.

12. Hsu C.W. and Lin C.J., A comparison of methods for multiclass support vector machines, IEEE Trans on Neural Networks 13 (2002), 415–425.

13. Hull J.J., A database for handwritten text recognition research, IEEE Trans on Pattern Analysis and Machine Intelligence 16 (1994), 550–554.

14. Ji A.-B., Chen and Hua, Fuzzy classifier based on fuzzy support vector machine, Journal of Intelligent and Fuzzy Systems 26(1) (2013), 421–430.

15. Kim K.-J., Financial time series forecasting using support vector machines, Neurocomputing 55 (2003), 307–319.

16. Klir G.J. and Yuan, Fuzzy Sets and Fuzzy Logic: Theory and Applications, Prentice-Hall: New Jersey, 1995.

17. Lin C.-F. and Wang S.-D., Fuzzy support vector machines, IEEE Trans on Neural Networks 13(2) (2002), 464–471.

18. Lin C.-F. and Wang S.-D., Training algorithms for fuzzy support vector machines with noisy data, Pattern Recognition Letters 25(14) (2004), 1647–1656.

19. Lin C.-T., Yeh C.-M., Liang S.-F., Chung J.-F. and Kumar, Support vector based fuzzy neural network for pattern classification, IEEE Trans on Fuzzy Systems 14(1) (2006), 31–41.

20. Michie, Spiegelhalter D.J. and Taylor C.C., Machine Learning, Neural and Statistical Classification, Ellis Horwood, 1994. Online: http://www.maths.leeds.ac.uk/charles/statlog/

21. Negoita C.V. and Ralescu D.A., Application of Fuzzy Sets to Systems Analysis, Birkhauser Verlag, 1975, pp. 12–24.

22. Prechelt, PROBEN 1 – A set of neural network benchmark problems and benchmarking rules, Technical Report 21/94, Fakultät für Informatik, Universität Karlsruhe, D-76128 Karlsruhe, Germany, 1994.

23. Schölkopf, Smola A.J., Williamson and Bartlett P.L., New support vector algorithms, Neural Computation 12(5) (2000), 1207–1245.

24. Tanaka, Uejima and Asai, Linear regression analysis with fuzzy model, IEEE Trans on Syst, Man, and Cyber 12(6) (1982), 903–907.

25. Vapnik V.N., The Nature of Statistical Learning Theory, Springer-Verlag: New York, 1995.

26. Vapnik V.N., Statistical Learning Theory, Wiley, 1998.

27. Yager R.R., On solving fuzzy mathematical relationships, Inform Contr 41 (1979), 29–55.

28. Zadeh L.A., The concept of linguistic variable and its application to approximate reasoning – I, Inform Sci 8 (1975), 199–249.