A twin-hyperellipsoidal support vector classifier

Abstract

This paper presents a novel binary classifier based on two best-fitting hyperellipsoids in the feature space, called the twin-hyperellipsoidal support vector machine (TESVM). TESVM is inspired by the minimum volume covering ellipsoid and by the twin-hypersphere support vector machine (THSVM), a variant of the well-known support vector data description (SVDD). Following the concept of THSVM, TESVM constructs two hyperellipsoids, each of which is as close as possible to one class and as far as possible from the other, in order to form a decision boundary. The construction of hyperellipsoids in the feature space is enabled through the use of empirical feature mapping. Experimental results on several artificial and standard real-world datasets are provided to demonstrate the performance of TESVM. In particular, TESVM outperforms its spherical counterpart in terms of classification accuracy.

Keywords

Kernel · Minimum volume covering ellipsoid · Twin-hyperellipsoid · Twin hypersphere · Empirical feature mapping

1 Introduction

Performing classification tasks is of paramount importance to distinguish one element from another. Supervised learning machines [32] help achieve such a goal by trying to reveal hidden statistical relationships among different groups of data. Since their emergence, a number of applications have been developed which benefit human beings; such examples include fingerprint recognition, handwriting recognition, and so forth. Indeed, they generate an enormous impact on a variety of fields spanning from medicine to engineering, from economics to justice systems, just to name a few.

Conceptually, a learning machine is equipped with a predefined set of hypotheses, or feasible decision functions. During the training process, it selects the optimal one under some criterion according to its past experience, i.e., the training samples. After the best function is selected, the machine uses it to recognize unknown data during the testing process. In the research literature, the predefined functions given to a learning machine are usually linear, as in the cases of support vector machines (SVM) [6], twin support vector machines (TWSVM) [14], and the best fitting hyperplane classifier [4]. Some studies have also worked on decision functions which are quadratic by nature; for example, the twin-hypersphere support vector machine (THSVM) [20] utilizes two hyperspheres to solve a classification task.

Employing a hypersphere as a data descriptor has long been an active area of research. Fundamentally, it is strongly related to the minimum enclosing ball problem (MEB). In its simplest form, finding the smallest possible circle around m given points was posed as a problem more than 150 years ago [25]. Since the boom of SVM and kernel methods, MEB was also reformulated with kernel tricks and soft margins, where it becomes known as support vector data description (SVDD) [27]. Although SVDD is primarily designed for one-class classification problems, it is also extended to support multiclass cases such as that in the work by Mu and Nandi [19]. Another formulation to implement two hyperspheres to solve a binary classification is from Peng and Xu [20] where it is named THSVM. Although their approach is quite similar to SVDD, the core of THSVM is built around the spirit of TWSVM. Instead of trying to find the best fitting hyperplane around one class in the feature space, THSVM looks for the best fitting hypersphere which is closest to one class but as far as possible from the other class. Peng and Xu [20] also argue that the nonparallel hyperplanes of TWSVM cannot efficiently describe two classes, for example, when the samples are drawn from two distinct Gaussian distributions. As a result, they experimentally show that THSVM is superior to TWSVM in such a case.

In this paper, we are interested in designing a learning machine which is formulated based on hyperellipsoids. An ellipsoid is a tempting candidate since it is the next simplest convex, smooth geometrical shape after a plane and a sphere. The decision boundary generated by two ellipsoids is also quadratic, but with more degrees of freedom than the one offered by two spheres. One of the earliest studies of the minimum ellipsoid covering a set of points is by Fritz John [15] in 1948, where he determined significant properties of the minimum ellipsoid. Since then, the minimum ellipsoid has become an active area of research [1, 30].

The idea of using ellipsoids in pattern classification can be traced back to the work by Rosen [21] where the minimum ellipsoid defined by the trace of matrix is considered as a data descriptor. Many researchers also worked on using a single ellipsoid for pattern separation. For example, Titterington [29] proposed the use of the minimum volume covering ellipsoid (MVCE) as trimming devices which can be used for outlier rejection. Glineur [9] presented a method for pattern separation using two concentric ellipsoids with maximal separation ratio. Calafiore [3] also proposed a robust ellipsoidal fitting scheme to construct an ellipsoidal primitive to represent a class of data based on the difference-of-squares geometric error criterion. The possibility of formulating a soft-margin version of MVCE, like in SVM, is also suggested by Sun and Freund [24]. Gotoh and Takeda [10] later show that the soft-margin MVCE can be formulated from the perspective of conditional value at risk (CVaR) minimization.

The majority of works on MVCE focus either on improving numerical optimization or on performing one-class classification. Following the concept of SVDD as an extension of MEB, kernel methods have also been introduced to MVCE in order to allow classification on more complex data [7, 34]. However, the formulation of MVCE involves the computation of outer products among samples. As a result, the regular kernel trick often used in SVM, which replaces an inner product with a kernel function, is not directly applicable. Instead, the formulation of kernelized MVCE can be achieved through empirical feature mapping or kernel principal component analysis [28].

Although many studies have evaluated the performance of the kernelized MEB, the formulation of MVCE with kernel methods has not yet been compared with its spherical counterpart, especially on binary classification. Therefore, in this paper, our objective is to design a supervised learning algorithm which constructs a decision boundary using two kernelized hyperellipsoids, following in the footsteps of THSVM. Precisely, one hyperellipsoid with soft margins is used as a data descriptor for one class and is also located as far as possible from the other class. In addition, the hyperellipsoid is kernelized with the help of empirical feature mapping. The proposed method is called the twin-hyperellipsoidal support vector machine (TESVM), and its performance is evaluated on both artificial and publicly available standard benchmark datasets. Our experimental results show that TESVM outperforms THSVM on many datasets in terms of classification accuracy. Moreover, in terms of average results, TESVM is also an alternative to TWSVM and SVM.

This paper is organized as follows. Section 2 reviews two classifiers which are fundamentally built on hyperspheres, namely SVDD and THSVM. In Section 3, the formulation of the proposed algorithm for binary classification is described in detail, followed by experimental results in Section 4. Concluding remarks are provided in Section 5.

2 Background

This section briefly reviews two learning techniques whose concepts are considered the crucial motivation behind the development of the proposed hyperellipsoid-based classifier.

2.1 Support vector data description

The SVDD uses a hypersphere as its underlying shape for data description. Since the solution of SVDD is always a closed spherical boundary, SVDD is inherently suitable for one-class classification. Initially, it was proposed for domain description [26] for samples without labels, and was later extended to the case where each sample has either a positive (target) or a negative (outlier) label [27]. Given a set of m samples, $\{x_{i}\}_{i = 1}^{m}$, where each sample is labeled with $y_{i} \in \{1, -1\}$, the goal of SVDD is to find the hypersphere with radius r > 0 and center $c \in ℝ^{n}$ that has the smallest volume while covering the positive samples and excluding the negative ones, with some misclassification tolerated. The formulation of SVDD with negative samples is as follows. $\begin{matrix} \min_{r, c, ξ} & r^{2} + \frac{C}{m} \sum_{i = 1}^{m} ξ_{i} \\ \text{s.t.} & y_{i} ∥ x_{i} - c ∥^{2} \leq y_{i} r^{2} + ξ_{i}, \\ & ξ_{i} \geq 0, \quad i = 1, 2, \ldots, m \end{matrix}$ (1) where $ξ \in ℝ^{m}$ is the vector of slack variables that form the so-called "soft margins", which are one improvement of SVDD over the traditional MEB. The given C > 0 is a hyperparameter that controls the trade-off between the hypersphere's volume and the misclassification.

By using the method of Lagrange multipliers, the dual formulation of Equation (1) can be obtained as $\begin{matrix} \min_{α} & \sum_{i = 1}^{m} \sum_{j = 1}^{m} α_{i} α_{j} y_{i} y_{j} x_{i}^{T} x_{j} - \sum_{i = 1}^{m} α_{i} y_{i} x_{i}^{T} x_{i} \\ \text{s.t.} & \sum_{i = 1}^{m} α_{i} y_{i} = 1, \\ & 0 \leq α_{i} \leq \frac{C}{m}, \quad i = 1, 2, \ldots, m \end{matrix}$ (2) where $α \in ℝ^{m}$ is the vector of Lagrange multipliers $α_{i}$. Any sample with $α_{i} \in (0, \frac{C}{m}]$ is called a "support vector".

Given a test sample $x \in ℝ^{n}$, the decision rule of SVDD can be derived with the use of the KKT conditions as $f (x) = \operatorname{sign} (2 \sum_{i = 1}^{m} α_{i} y_{i} x_{i}^{T} x - x^{T} x + h)$ where $h = r^{2} - \sum_{i = 1}^{m} \sum_{j = 1}^{m} α_{i} α_{j} y_{i} y_{j} x_{i}^{T} x_{j}$. Here, r is the distance from the center to any support vector with $0 < α_{i} < \frac{C}{m}$. In fact, it is not necessary to explicitly compute the center of the optimal hypersphere. Suppose $x_{k}$ is a support vector with $0 < α_{k} < \frac{C}{m}$; then r² can be calculated by $r^{2} = x_{k}^{T} x_{k} - 2 \sum_{i = 1}^{m} α_{i} y_{i} x_{i}^{T} x_{k} + \sum_{i = 1}^{m} \sum_{j = 1}^{m} α_{i} α_{j} y_{i} y_{j} x_{i}^{T} x_{j}$.
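The decision rule and the radius formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the multipliers `alpha` are taken as already computed by some solver for Equation (2) (not shown), and the function names `dot`, `svdd_radius_sq`, and `svdd_decide` are hypothetical, not from the original work.

```python
def dot(u, v):
    """Plain inner product; a kernel function could be substituted here."""
    return sum(a * b for a, b in zip(u, v))


def _quad_term(X, y, alpha):
    """sum_i sum_j alpha_i alpha_j y_i y_j x_i^T x_j (squared norm of the center)."""
    m = len(X)
    return sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(m) for j in range(m))


def svdd_radius_sq(X, y, alpha, k):
    """r^2 computed from a support vector x_k with 0 < alpha_k < C/m."""
    cross = sum(alpha[i] * y[i] * dot(X[i], X[k]) for i in range(len(X)))
    return dot(X[k], X[k]) - 2.0 * cross + _quad_term(X, y, alpha)


def svdd_decide(x, X, y, alpha, r_sq):
    """Return +1 if x lies inside (or on) the hypersphere, otherwise -1."""
    h = r_sq - _quad_term(X, y, alpha)
    score = 2.0 * sum(alpha[i] * y[i] * dot(X[i], x)
                      for i in range(len(X))) - dot(x, x) + h
    return 1 if score >= 0 else -1  # a score of exactly 0 is treated as inside


# Hand-picked toy values (assumed, not solver output): two target points
# whose enclosing sphere is centered at the origin.
X = [(-1.0, 0.0), (1.0, 0.0)]
y = [1, 1]
alpha = [0.5, 0.5]
r_sq = svdd_radius_sq(X, y, alpha, k=0)
```

Note that the center never appears explicitly; only inner products between samples are used, which is what makes the kernelized variant possible.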

It can be observed that the dual formulation of SVDD in Equation (2) closely resembles the dual of SVM, including the box constraints as well as the inner products among samples. As a result, SVDD can be naturally extended to handle more complex data patterns with the kernel trick, as in the case of SVM, by simply substituting the inner product with the desired kernel function.

2.2 Twin-hypersphere support vector machine

One formulation closely related to SVDD, which forms two hyperspheres to perform binary classification, is THSVM [20]. Although some studies, such as [13, 19], extend SVDD to binary classification, THSVM was motivated differently: it was built upon the spirit of TWSVM. Instead of finding the best fitting hyperplane for each class of data, where each hyperplane is also located as far as possible from the other class, THSVM replaces the pair of hyperplanes with hyperspheres. In other words, TWSVM uses a hyperplane as a data descriptor, while THSVM reformulates the problem with hyperspheres. As a result, the formulation of THSVM is rather close to SVDD, as both methods use a hypersphere to describe one class of data.

In a two-class scenario, suppose we are given a set of m samples, $\{x_{i}\}_{i = 1}^{m}$, where each sample belongs to either class $I_{1}$ or class $I_{2}$, and the numbers of samples in the two classes are m₁ and m₂, respectively. The concept of THSVM is to solve a pair of optimization problems in order to find two hyperspheres in $ℝ^{n}$ where each hypersphere acts as a data descriptor for one class. The hypersphere of linear THSVM with radius r₁ and center c₁ that represents class $I_{1}$ can be formulated as $\begin{matrix} \min_{r_{1}, c_{1}, ξ} & r_{1}^{2} - \frac{ν_{1}}{m_{2}} \sum_{j \in I_{2}} ∥ x_{j} - c_{1} ∥^{2} + \frac{C_{1}}{m_{1}} \sum_{i \in I_{1}} ξ_{i} \\ \text{s.t.} & ∥ x_{i} - c_{1} ∥^{2} \leq r_{1}^{2} + ξ_{i}, \\ & ξ_{i} \geq 0, \quad i \in I_{1} \end{matrix}$ (3) where C₁ > 0 and ν₁ > 0 are predefined trade-off parameters. We include the subscript 1 to indicate that class $I_{1}$ is the target of the description. THSVM is similar to SVDD in that it uses slack variables ξ to form soft margins and also aims at minimizing the size of the hypersphere. The constraints of THSVM are defined such that the optimal hypersphere should cover the samples in class $I_{1}$ while allowing as few of them as possible to lie outside the hypersphere. It can be observed that the primal formulation of THSVM in Equation (3) is a slightly modified version of SVDD in Equation (1), obtained by moving the constraints which try to exclude the negative samples (i.e., class $I_{2}$ in this case) into the second term of the objective of Equation (3). As a result, THSVM also tries to place the covering hypersphere far from the opposite class, to a degree controlled by the parameter ν₁.
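To make the roles of the three terms in Equation (3) concrete, the following sketch evaluates the primal objective for a candidate center and radius. It is an illustration only: the function name `thsvm_primal_objective` is hypothetical, and each slack variable is set to the smallest value that satisfies the constraints, which is what an optimal solution would choose for fixed c₁ and r₁.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def thsvm_primal_objective(X1, X2, c, r, C1, nu1):
    """Objective of Equation (3) for a candidate center c and radius r.

    X1: samples of the target class I_1, X2: samples of the other class I_2.
    """
    m1, m2 = len(X1), len(X2)
    # Smallest feasible slack: xi_i = max(0, ||x_i - c||^2 - r^2).
    slack = [max(0.0, sq_dist(x, c) - r * r) for x in X1]
    # Mean squared distance of the other class from the center; the
    # objective rewards large values through the negative sign.
    push = sum(sq_dist(x, c) for x in X2) / m2
    return r * r - nu1 * push + C1 * sum(slack) / m1


X1 = [(0.0, 0.0), (2.0, 0.0)]
X2 = [(5.0, 0.0)]
tight = thsvm_primal_objective(X1, X2, c=(1.0, 0.0), r=1.0, C1=2.0, nu1=0.1)
small = thsvm_primal_objective(X1, X2, c=(1.0, 0.0), r=0.5, C1=2.0, nu1=0.1)
```

With these toy values, shrinking the radius below the distance to the target samples lowers the first term but is penalized through the slack term, so the sphere that just covers both target samples has the lower objective.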

Let $α \in ℝ^{m_{1}}$ be the vector of Lagrange multipliers. The dual formulation of THSVM can be derived from Equation (3) as

$\begin{matrix} \max_{α} & - \sum_{i_{1} \in I_{1}} \sum_{i_{2} \in I_{1}} α_{i_{1}} α_{i_{2}} x_{i_{1}}^{T} x_{i_{2}} + \sum_{i \in I_{1}} α_{i} \left[\frac{2 ν_{1}}{m_{2}} \sum_{j \in I_{2}} x_{i}^{T} x_{j} + (1 - ν_{1}) x_{i}^{T} x_{i}\right] \\ \text{s.t.} & \sum_{i \in I_{1}} α_{i} = 1, \quad 0 \leq α_{i} \leq \frac{C_{1}}{m_{1}}, \quad i \in I_{1}. \end{matrix}$ It is worth noting that when ν₁ is zero, the formulation of THSVM reduces to the SVDD formulation with only positive samples (class $I_{1}$), i.e., all negative samples (class $I_{2}$) are entirely ignored. In addition, each hypersphere of THSVM requires solving a smaller optimization problem than SVDD, since it is defined by fewer constraints.

The center of the optimal hypersphere can be computed from $c_{1} = \frac{1}{1 - ν_{1}} (\sum_{i \in I_{1}} α_{i} x_{i} - \frac{ν_{1}}{m_{2}} \sum_{j \in I_{2}} x_{j})$ while the optimal radius r₁ is the distance between the center of the optimal hypersphere and a support vector with $0 < α_{i} < \frac{C_{1}}{m_{1}}$ .

After the optimal hypersphere representing class $I_{1}$ is obtained, the optimal hypersphere for class $I_{2}$ can be computed in the same fashion by swapping the roles of $I_{1}$ and $I_{2}$. Given a test sample $x \in ℝ^{n}$, the classification rule for THSVM becomes $\text{class } k = \arg \min_{k = 1, 2} \frac{∥ x - c_{k} ∥}{r_{k}}.$ (4)
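The center formula and the rule in Equation (4) can be sketched as follows. The multipliers `alpha` are assumed to come from a solver for the dual problem (not shown), and the function names are hypothetical.

```python
import math


def thsvm_center(X1, X2, alpha, nu):
    """c_1 = (sum_i alpha_i x_i - (nu / m2) * sum_j x_j) / (1 - nu).

    X1: target-class samples (one multiplier each), X2: other-class samples.
    """
    n, m2 = len(X1[0]), len(X2)
    center = []
    for d in range(n):
        pos = sum(a * x[d] for a, x in zip(alpha, X1))
        neg = (nu / m2) * sum(x[d] for x in X2)
        center.append((pos - neg) / (1.0 - nu))
    return center


def thsvm_classify(x, centers, radii):
    """Equation (4): return 1 or 2, the class with the smaller ||x - c_k|| / r_k."""
    scores = [math.dist(x, c) / r for c, r in zip(centers, radii)]
    return 1 + scores.index(min(scores))


# With nu = 0 the center reduces to the alpha-weighted mean of the target class.
c1 = thsvm_center([(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0)], [0.5, 0.5], nu=0.0)
```

Dividing by the radius makes the rule relative: a point can be closer in absolute distance to one center and still be assigned to the other class if that class has the larger sphere.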

Furthermore, similar to SVDD, we can apply kernel tricks to THSVM by replacing the inner products with the desired kernel function. The radii and centers of the hyperspheres are not required to be explicitly computed as Equation (4) can also be rewritten in terms of inner products between samples.

3 Twin-hyperellipsoidal support vector machine

The SVDD and THSVM create a classification rule by finding the best fitting hypersphere around one class. In practice, the nature of spherical shapes, however, is likely to be too conservative for data descriptions. Therefore, in this section, we present a novel binary classifier which is based on hyperellipsoids to provide more relaxation to the decision boundary. The proposed method is named “twin-hyperellipsoidal support vector machine (TESVM)” as its idea is built upon THSVM.

3.1 TESVM Formulation

Let $E_{E, c}$ denote the ellipsoid whose shape and orientation are described by the matrix $E \in S_{+ +}^{n}$ and whose center is at $E^{- 1} c \in ℝ^{n}$. An ellipsoid is a closed convex set defined by $E_{E, c} = \{x \in ℝ^{n} : ∥ E x - c ∥ \leq 1\}.$ The volume of $E_{E, c}$ is equal to $\frac{V_{n}}{\det E}$, where $V_{n} = \frac{π^{n / 2}}{Γ (n / 2 + 1)}$ is the volume of the unit ball in $ℝ^{n}$ and Γ is the gamma function [11].
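As a small numerical illustration of the volume formula, the sketch below uses the standard expression for the unit-ball volume, π^(n/2) / Γ(n/2 + 1), and restricts E to a diagonal matrix. The diagonal restriction is an assumption made here only to keep the example free of linear algebra libraries; the function names are hypothetical.

```python
import math


def unit_ball_volume(n):
    """Volume of the unit ball in R^n: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def ellipsoid_volume_diag(diag_E):
    """Volume V_n / det(E) of {x : ||E x - c|| <= 1} for diagonal, positive E."""
    return unit_ball_volume(len(diag_E)) / math.prod(diag_E)


def in_ellipsoid_diag(x, diag_E, c):
    """Membership test ||E x - c|| <= 1 for diagonal E."""
    return sum((e * xi - ci) ** 2 for e, xi, ci in zip(diag_E, x, c)) <= 1.0


# E = diag(0.5, 1) describes an ellipse with semi-axes 2 and 1 (area 2 * pi).
area = ellipsoid_volume_diag([0.5, 1.0])
```

A larger determinant of E means a smaller ellipsoid, which is why the formulation later minimizes the log-determinant of the inverse of E.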

For a given set of m training samples, $\{x_{i}\}_{i = 1}^{m}$, where each sample belongs to either class $I_{1}$ or class $I_{2}$, let $m_{1} = | I_{1} |$ and $m_{2} = | I_{2} |$. TESVM solves a pair of optimization problems in order to find two minimum volume hyperellipsoids, $E_{E_{1}, c_{1}}$ and $E_{E_{2}, c_{2}}$, where each hyperellipsoid is as close as possible to one class and as far as possible from the other class. An illustration of the TESVM ellipsoids and the resulting decision boundary between two classes is shown in Fig. 1, and the formulation is as follows.

$\begin{matrix} \min_{E_{1}, c_{1}, r, ξ} & r^{2} + 2 \log \det (E_{1}^{- 1}) - \frac{ν_{1}}{m_{2}} \sum_{j \in I_{2}} ∥ E_{1} x_{j} - c_{1} ∥^{2} + \frac{C_{1}}{m_{1}} \sum_{i \in I_{1}} ξ_{i} \\ \text{s.t.} & ∥ E_{1} x_{i} - c_{1} ∥^{2} \leq r^{2} + ξ_{i}, \quad ξ_{i} \geq 0, \quad i \in I_{1} \end{matrix}$ (5)

$\begin{matrix} \min_{E_{2}, c_{2}, r, ξ} & r^{2} + 2 \log \det (E_{2}^{- 1}) - \frac{ν_{2}}{m_{1}} \sum_{i \in I_{1}} ∥ E_{2} x_{i} - c_{2} ∥^{2} + \frac{C_{2}}{m_{2}} \sum_{j \in I_{2}} ξ_{j} \\ \text{s.t.} & ∥ E_{2} x_{j} - c_{2} ∥^{2} \leq r^{2} + ξ_{j}, \quad ξ_{j} \geq 0, \quad j \in I_{2} \end{matrix}$ (6)

Fig. 1

Decision boundary of TESVM (solid line).

In Equations (5) and (6), ν₁, C₁, ν₂, and C₂ are hyperparameters. We also assume that the affine hull of the training samples spans $ℝ^{n}$ in order to avoid a degenerate case, i.e., a hyperellipsoid with zero volume along some dimension [24].

Since Equations (5) and (6) differ only in which class is the target, Equation (5) is the main focus of this section. The first two terms in the objective represent the volume of $E_{E_{1}, c_{1}}$ to be minimized. It is worth noting that r² is included in order to make the formulation similar to THSVM; it also yields a more succinct expression in the dual formulation. The constraints are defined such that $E_{E_{1}, c_{1}}$ has to cover only the target class while allowing some samples to lie outside. The optimal hyperellipsoid should also be located as far as possible from the other class, as controlled by the third term in the objective.

From Equation (5), the Lagrangian can be formed as $\begin{matrix} L (E_{1}, c_{1}, r, ξ, α, β) = r^{2} + 2 \log \det (E_{1}^{- 1}) - \frac{ν_{1}}{m_{2}} \sum_{j \in I_{2}} ∥ E_{1} x_{j} - c_{1} ∥^{2} + \frac{C_{1}}{m_{1}} \sum_{i \in I_{1}} ξ_{i} \\ + \sum_{i \in I_{1}} α_{i} [∥ E_{1} x_{i} - c_{1} ∥^{2} - r^{2} - ξ_{i}] - \sum_{i \in I_{1}} β_{i} ξ_{i} \end{matrix}$ (7) where $α_{i} \geq 0$ and $β_{i} \geq 0$ for i = 1, 2, …, m₁ are Lagrange multipliers, and $α, β \in ℝ^{m_{1}}$ are the corresponding column vectors.

In order to make the formulation more compact, we assign the index for each sample in class $I_{1}$ to be from 1 to m₁ and class $I_{2}$ to be from m₁ + 1 to m. We also extend α from $ℝ^{m_{1}}$ to $ℝ^{m}$ with $α_{m_{1} + 1} = α_{m_{1} + 2} = . . . = α_{m} = - \frac{ν_{1}}{m_{2}}$ . Hence, Equation (7) can be rewritten as

$\begin{matrix} L (E_{1}, c_{1}, r, ξ, α, β) = 2 \log \det (E_{1}^{- 1}) \\ + \sum_{i = 1}^{m} α_{i} ∥ E_{1} x_{i} - c_{1} ∥^{2} + r^{2} (1 - \sum_{i = 1}^{m_{1}} α_{i}) \\ + \sum_{i = 1}^{m_{1}} (\frac{C_{1}}{m_{1}} - α_{i} - β_{i}) ξ_{i}. \end{matrix}$ (8)

The first derivatives of Equation (8) are

$\frac{\partial L}{\partial r} = 2 r (1 - \sum_{i = 1}^{m_{1}} α_{i})$ (9)

$\frac{\partial L}{\partial ξ_{i}} = \frac{C_{1}}{m_{1}} - α_{i} - β_{i}, \quad i = 1, 2, \ldots, m_{1}$ (10)

$\frac{\partial L}{\partial c_{1}} = 2 (c_{1} \sum_{i = 1}^{m} α_{i} - E_{1} \sum_{i = 1}^{m} α_{i} x_{i})$ (11)

$\frac{\partial L}{\partial E_{1}} = - 2 E_{1}^{- 1} + \sum_{i = 1}^{m} α_{i} [(E_{1} x_{i} - c_{1}) x_{i}^{T} + x_{i} (E_{1} x_{i} - c_{1})^{T}].$ (12)

Under the first-order condition of optimality, Equation (9) yields $\sum_{i = 1}^{m_{1}} α_{i} = 1$, and Equation (10), together with $β_{i} \geq 0$, yields $0 \leq α_{i} \leq \frac{C_{1}}{m_{1}}$ for i = 1, 2, …, m₁. To further simplify the above expressions, we represent the problem in matrix form. Let A = diag(α₁, α₂, …, α_m) and X = [x₁, x₂, …, x_m]. From Equation (11), we obtain $c_{1} = E_{1} X α.$ (13) By substituting Equation (13) into Equation (12), we also have $E_{1}^{- 1} = \frac{1}{2} (E_{1} S + S E_{1})$ (14) where $S = X A X^{T} - X α α^{T} X^{T}$. The solution to Equation (14) is $E_{1}^{2} = S^{- 1}$. Although E₁ may not be uniquely determined, depending on the definiteness of S, the log-determinant term in the objective of Equation (5) acts as a natural barrier function that drives E₁ to be positive definite. When $S \in S_{+ +}^{n}$, we have $E_{1} = S^{- \frac{1}{2}}$ [36].
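As a quick check of the last step, substituting $E_{1} = S^{- \frac{1}{2}}$, i.e., $S = E_{1}^{- 2}$, into the right-hand side of Equation (14) recovers the left-hand side, because $E_{1}$ commutes with its own powers:

```latex
\frac{1}{2}\left(E_{1} S + S E_{1}\right)
  = \frac{1}{2}\left(E_{1} E_{1}^{-2} + E_{1}^{-2} E_{1}\right)
  = \frac{1}{2}\left(E_{1}^{-1} + E_{1}^{-1}\right)
  = E_{1}^{-1}.
```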

By rewriting Equation (8) with the KKT conditions, the corresponding dual formulation of Equation (5) becomes $\begin{matrix} \max_{α_{1}, \ldots, α_{m_{1}}} & \log \det (X A X^{T} - X α α^{T} X^{T}) \\ \text{s.t.} & \sum_{i = 1}^{m_{1}} α_{i} = 1, \\ & 0 \leq α_{i} \leq \frac{C_{1}}{m_{1}}, \quad i = 1, 2, \ldots, m_{1}. \end{matrix}$ (15)

The optimization problem in Equation (15) can be solved using standard semidefinite programming solvers such as SDPT3 version 4.0 [31], which supports log-determinant terms in the objective function.
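The sketch below only evaluates the objective of Equation (15) for a given feasible α; it does not solve the problem, for which the text relies on an SDP solver. It assumes samples are stored as rows (so the sums over samples replace the matrix products with X), extends α with the constant entries −ν₁/m₂ for the other class as described above, and computes the log-determinant through a hand-written Cholesky factorization. All function names are hypothetical.

```python
import math


def weighted_scatter(X, alpha):
    """S = sum_i a_i x_i x_i^T - (sum_i a_i x_i)(sum_i a_i x_i)^T."""
    n = len(X[0])
    wsum = [sum(a * x[d] for a, x in zip(alpha, X)) for d in range(n)]
    return [[sum(a * x[p] * x[q] for a, x in zip(alpha, X)) - wsum[p] * wsum[q]
             for q in range(n)] for p in range(n)]


def log_det_spd(S):
    """log det of a symmetric positive definite matrix via Cholesky (S = L L^T)."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(n))


def tesvm_dual_objective(X1, X2, alpha1, nu1):
    """log det(X A X^T - X a a^T X^T) with a extended by -nu1/m2 on class I_2."""
    alpha = list(alpha1) + [-nu1 / len(X2)] * len(X2)
    return log_det_spd(weighted_scatter(list(X1) + list(X2), alpha))


X1 = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
X2 = [(5.0, 5.0)]
value = tesvm_dual_objective(X1, X2, [0.25] * 4, nu1=0.0)
```

For these symmetric toy points with uniform weights and ν₁ = 0, S is diag(0.5, 0.5), so the objective equals log(0.25).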

At optimality, complementary slackness also holds for i = 1, 2, …, m₁: $0 = β_{i} ξ_{i}$ (16) $0 = α_{i} [∥ E_{1} x_{i} - c_{1} ∥^{2} - r^{2} - ξ_{i}].$ (17) For a sample with $0 < α_{i} < \frac{C_{1}}{m_{1}}$, setting Equation (10) to zero gives $β_{i} = \frac{C_{1}}{m_{1}} - α_{i} > 0$, so ξ_i = 0 follows from Equation (16). Since α_i > 0, Equation (17) then gives $r^{2} = ∥ E_{1} x_{i} - c_{1} ∥^{2}$, from which r can be calculated.

After $E_{E_{1}, c_{1}}$ is found, $E_{E_{2}, c_{2}}$ can be computed similarly from Equation (6). Given a test sample $x \in ℝ^{n}$, the decision rule for the TESVM classifier is defined by $\text{class } k = \arg \min_{k = 1, 2} \frac{∥ E_{k} x - c_{k} ∥}{r_{k}}.$ (18)
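Once the pair (E_k, c_k) and the radius r_k are available for both classes, the rule in Equation (18) is straightforward to apply. The sketch below assumes matrices are given as nested lists and that the parameters come from solving the two subproblems (not shown); the function names are hypothetical.

```python
import math


def mat_vec(E, x):
    """Matrix-vector product E x for a matrix stored as a list of rows."""
    return [sum(e * xi for e, xi in zip(row, x)) for row in E]


def ellipsoid_distance(E, c, x):
    """||E x - c||: distance of x from the center in the transformed space."""
    return math.dist(mat_vec(E, x), c)


def radius_from_support_vector(E, c, x_sv):
    """r from Equation (17) for a sample with 0 < alpha_i < C_1/m_1 (xi_i = 0)."""
    return ellipsoid_distance(E, c, x_sv)


def tesvm_classify(x, ellipsoids):
    """Equation (18). ellipsoids = [(E_1, c_1, r_1), (E_2, c_2, r_2)]; returns 1 or 2."""
    scores = [ellipsoid_distance(E, c, x) / r for E, c, r in ellipsoids]
    return 1 + scores.index(min(scores))


# Toy descriptors (assumed values): class 1 is elongated along the first
# axis, class 2 along the second, and both share the same center.
E1, c1 = [[1.0, 0.0], [0.0, 4.0]], [0.0, 0.0]
E2, c2 = [[4.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
r1 = radius_from_support_vector(E1, c1, (1.0, 0.0))
r2 = radius_from_support_vector(E2, c2, (0.0, 1.0))
ellipsoids = [(E1, c1, r1), (E2, c2, r2)]
```

In this toy configuration two hyperspheres with a common center could not separate the classes by Equation (4) except through their radii, whereas the two ellipsoids assign points according to the direction in which they lie.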

3.2 Empirical feature map

The concept of TESVM is formulated based on constructing hyperellipsoids for class descriptions; however, in practice, data are often unstructured or drawn from unknown distributions. Many linear classification algorithms overcome such a limitation with the help of kernel methods. By replacing all inner products between two samples with a kernel function, which defines an alternative similarity between the two samples, methods such as SVM, TWSVM, THSVM, and SVDD are extended to work seamlessly in the feature space. Since the formulation of TESVM is not expressed in terms of inner products, the same kernel trick cannot be applied directly. One possible approach to extend TESVM to the feature space is through the use of the empirical feature map [22].

Suppose we are given the training samples $x_{1}, x_{2}, \ldots, x_{m} \in ℝ^{n}$ and a kernel function $k : ℝ^{n} \times ℝ^{n} \to ℝ$, and let the image of a sample $x \in ℝ^{n}$ in the finite-dimensional empirical feature space be $φ (x) \in ℍ$. The empirical feature map from $ℝ^{n}$ to $ℍ$ is defined as $φ : x \mapsto (Q^{+})^{T} k (x)$ (19) where $k (x) = [k (x, x_{1}), k (x, x_{2}), \ldots, k (x, x_{m})]^{T}$. The matrix Q can be obtained by factorizing the kernel matrix as $K = Q^{T} Q$, e.g., by eigendecomposition, and $(\cdot)^{+}$ is the Moore-Penrose pseudoinverse operator. In this paper, in addition to the linear kernel, the RBF kernel defined as $k (x, y) = \exp (- \frac{∥ x - y ∥^{2}}{σ^{2}})$ is also used, where σ is the kernel parameter.
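One way to see why Equation (19) is a sensible feature map is to check that it reproduces the kernel on the training samples. The short derivation below assumes that K is a symmetric positive semidefinite kernel matrix with $K = Q^{T} Q$, and writes $q_{i}$ for the i-th column of Q (a symbol introduced here for illustration), so that $k (x_{i}) = Q^{T} q_{i}$. It uses two standard properties of the pseudoinverse: $Q Q^{+}$ is symmetric, and $Q Q^{+} Q = Q$.

```latex
\begin{aligned}
φ(x_{i}) &= (Q^{+})^{T} k(x_{i}) = (Q^{+})^{T} Q^{T} q_{i}
          = (Q Q^{+})^{T} q_{i} = Q Q^{+} q_{i} = q_{i}, \\
φ(x_{i})^{T} φ(x_{j}) &= q_{i}^{T} q_{j} = K_{ij} = k(x_{i}, x_{j}).
\end{aligned}
```

Inner products between mapped training samples therefore equal the kernel values, and the mapped samples can be passed to the linear formulation of Equation (5) as ordinary vectors.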

3.3 Connection to MVCE with negative samples

Although we consider TESVM to be an extension of THSVM, the formulation of TESVM is also closely related to the MVCE problem with negative samples [28]. This connection is similar to the way THSVM can be viewed as a pair of SVDDs with negative samples. In general, given $I_{1}$ and $I_{2}$ as the positive and negative classes, respectively, a soft-margin MVCE problem with negative samples can be formulated as $\begin{matrix} \min_{E, c, r, ξ} & r^{2} + 2 \log \det (E^{- 1}) + \frac{C}{m} \sum_{i \in I_{1} \cup I_{2}} ξ_{i} \\ \text{s.t.} & y_{i} ∥ E x_{i} - c ∥^{2} \leq y_{i} r^{2} + ξ_{i}, \\ & ξ_{i} \geq 0, \quad i \in I_{1} \cup I_{2} \end{matrix}$ (20) where $y_{i} = 1$ for $i \in I_{1}$ and $y_{i} = -1$ for $i \in I_{2}$. The solution to Equation (20) is the minimum volume covering hyperellipsoid $E_{E, c}$ which covers the positive class. It can be seen that Equation (20) differs from Equation (5) only in that the constraints excluding the negative samples in Equation (20) are moved into the objective function of Equation (5).

The dual formulation of Equation (20) has the same form as Equation (15), with a small modification: the dual problem has m Lagrange multipliers, and $α_{m_{1} + 1}, α_{m_{1} + 2}, \ldots, α_{m}$ are optimization variables rather than constants. The m Lagrange multipliers are also required to sum to 1 and are confined within the box constraints $0 \leq α_{i} \leq \frac{C}{m}$. As a result, one subproblem of TESVM has fewer optimization variables in the dual formulation than one MVCE problem.

4 Experimental results

In order to evaluate the performance of the proposed TESVM, we conduct experiments on several artificial and standard datasets and compare the methods in terms of accuracy. Since TESVM is considered an improvement of THSVM, head-to-head comparisons between the two are the main focus. However, the results from SVM and TWSVM are also presented for further validation. All experiments are performed using 10-fold cross-validation, and the cross-validation is repeated 10 times with randomly shuffled samples. All methods are implemented and evaluated using MATLAB 2017a under Ubuntu 17.04 on a 2.8 GHz Intel Core i5-6402P with 8 GB of RAM. We use SDPT3 [31] through YALMIP [18] as the semidefinite programming solver for TESVM because of its support for the log-determinant objective function. The quadratic programs in THSVM and TWSVM are solved using MATLAB's quadprog solver, and SVM is implemented using LibSVM [5].
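The evaluation protocol (ten rounds of 10-fold cross-validation on reshuffled samples) can be sketched as an index generator. This is a generic illustration, not the authors' MATLAB code: the function name `repeated_kfold` and the `seed` argument are hypothetical, and the splits are not stratified by class, since the text does not say whether stratification was used.

```python
import random


def repeated_kfold(m, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for shuffled k-fold splits.

    m is the number of samples; each repeat reshuffles the sample indices.
    """
    rng = random.Random(seed)
    for rep in range(repeats):
        idx = list(range(m))
        rng.shuffle(idx)
        # Fold f takes every k-th shuffled index starting at position f.
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield rep, f, train, test


# Small usage example: 2 repeats of 5-fold splits over 25 samples.
splits = list(repeated_kfold(25, k=5, repeats=2, seed=1))
```

The accuracy reported for one round would be the mean over its folds, and the tables then summarize the mean and standard deviation over the rounds.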

4.1 Toy Examples

Three 2-dimensional artificial datasets are presented in this section in order to visually show the strength of TESVM. The details of the data are summarized in Table 1. We manually created the first two datasets, called Toy 1 and Toy 2, to test the performance of the linear classifiers; each dataset was generated from two distinct Gaussian distributions. For the third dataset, Ho & Kleinberg's checkerboard dataset [12] is used to show the performance of the nonlinear classifiers. It contains two classes of samples generated from uniform distributions to form a 4×4 checkerboard. The hyperparameters are chosen by performing a grid search over ranges of parameters. The best set of parameters is the one which provides the best 10-fold cross-validation accuracy. In the case of THSVM, we set C₁ = C₂ = C and ν₁ = ν₂ = ν, where $C \in \{10^{i} \mid i = -5, -4, \ldots, 5\}$ and $ν \in \{0.0, 0.1, 0.2, \ldots, 0.9\}$. In the case of TESVM, we also set C₁ = C₂ = C and ν₁ = ν₂ = ν, with $C \in \{2^{i} \mid i = -4, -3, \ldots, 4\}$ and $ν \in \{10^{-9}, 10^{-12}\}$. The parameter σ of the RBF kernel for both methods is searched over $\{2^{i} \mid i = -4, -3, \ldots, 5\}$. For SVM and TWSVM, all parameters, including C and σ, are searched over $\{2^{i} \mid i = -5, -4, \ldots, 5\}$.
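The search ranges listed above can be written out explicitly as follows. The grid values are taken from the text; the helper names `grid` and `best_setting`, and the idea of passing a scoring callback that returns a cross-validation accuracy, are assumptions made for illustration.

```python
from itertools import product

# THSVM: C in {10^i | i = -5..5}, nu in {0.0, 0.1, ..., 0.9}
C_THSVM = [10.0 ** i for i in range(-5, 6)]
NU_THSVM = [i / 10 for i in range(10)]

# TESVM: C in {2^i | i = -4..4}, nu in {1e-9, 1e-12}
C_TESVM = [2.0 ** i for i in range(-4, 5)]
NU_TESVM = [1e-9, 1e-12]

# RBF width for THSVM and TESVM: sigma in {2^i | i = -4..5}
SIGMA_RBF = [2.0 ** i for i in range(-4, 6)]

# SVM and TWSVM: every parameter in {2^i | i = -5..5}
PARAM_SVM = [2.0 ** i for i in range(-5, 6)]


def grid(*axes):
    """All combinations of the given parameter axes."""
    return list(product(*axes))


def best_setting(candidates, cv_accuracy):
    """Pick the candidate with the highest score from the given callback."""
    return max(candidates, key=cv_accuracy)


tesvm_rbf_grid = grid(C_TESVM, NU_TESVM, SIGMA_RBF)
```

With the RBF kernel, TESVM is therefore evaluated on 9 × 2 × 10 = 180 parameter combinations in this setting, compared with 11 × 10 × 10 = 1100 for THSVM.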

Table 1
Accuracy in % (mean (standard deviation)) on the toy examples over ten rounds of 10-fold cross-validation

Dataset	m₁	m₂	Kernel	TESVM	THSVM	TWSVM	SVM
Toy 1	100	100	linear	92.25 (0.42)	79.40 (0.51)	92.40 (0.31)	78.15 (1.08)
Toy 2	50	200	linear	98.76 (0.35)	80.44 (0.54)	97.28 (0.25)	97.36 (0.20)
Checkerboard	514	486	RBF	96.15 (0.37)	96.45 (0.31)	95.99 (0.30)	95.73 (0.55)

In the first toy dataset, each class contains 100 samples. The two classes are randomly generated from two Gaussian distributions with the same covariance matrix but with different centers and rotations, as shown in Fig. 2. Precisely, the distributions are $N ([0, 0], diag (1, 10^{- \frac{3}{2}}))$ and $N ([1, 0], diag (1, 10^{- \frac{3}{2}}))$ with clockwise rotation angles of -45 and 45 degrees, respectively. It is important to note that the classifiers are trained on the normalized data, but Fig. 2 plots the denormalized data. Consequently, the two circles of THSVM in Fig. 2b appear to be scaled.

Fig. 2

Classification on the Toy 1 dataset. The dashed and dash-dotted lines represent the class descriptors, and the solid curves show the decision boundaries. The accuracies on the training samples are (a) 92.5% (b) 80.5% (c) 92.0% (d) 78.5%.

The purpose of the Toy 1 dataset is to illustrate a circumstance in which THSVM loses the spirit of TWSVM, even though it is said to be a successor. One benefit of linear TWSVM over linear SVM is that linear TWSVM is inherently able to deal with the so-called "cross-planes" dataset [23]. However, THSVM has no such ability, as presented in Fig. 2. When the samples from two classes form a cross shape, the accuracy obtained from THSVM on the training data can be as low as 50%. This is not the case for TESVM, as it offers less conservative class descriptors. According to Fig. 2, TESVM significantly outperforms THSVM and SVM, while being on par with TWSVM. Specifically, TESVM, THSVM, TWSVM, and SVM achieve accuracies on the training set of 92.5%, 80.5%, 92.0%, and 78.5%, respectively. As a remark, the decision rules in Fig. 2 are tested on the entire set of training samples. Therefore, to assess generalization, we also provide the test accuracy from the 10-fold cross-validation in Table 1.

For the second toy dataset, depicted in Fig. 3, the samples in each class are generated from two different Gaussian distributions with unbalanced numbers of samples. The first class has 50 samples randomly drawn from $N ([9, 0], diag (10^{\frac{3}{2}}, 10^{- \frac{3}{2}}))$, while the second class has 200 samples drawn from $N ([10, 0], diag (1, 10^{- \frac{3}{2}}))$. This dataset demonstrates another weakness of THSVM. Despite the claim that THSVM is superior to TWSVM because the nonparallel hyperplanes of TWSVM cannot efficiently describe two classes when the samples are drawn from two distinct Gaussian distributions [20], Fig. 3 shows that THSVM almost fails on this dataset. From the figure, the accuracy of THSVM on the training samples is 80.0%, which is exactly the proportion of the second class (200 of 250 samples). This implies that THSVM assigns all the training samples to the second class. On the other hand, TESVM, TWSVM, and SVM achieve accuracies of 99.2%, 97.6%, and 97.2%, respectively, and show no such weakness. The accuracy from 10-fold cross-validation in Table 1 further validates the result.

Fig. 3

Classification on the Toy 2 dataset. The dashed and dash-dotted lines represent the class descriptors, and the solid curves show the decision boundaries. The accuracies on the training samples are (a) 99.2% (b) 80.0% (c) 97.6% (d) 97.2%.

Next, the checkerboard dataset is chosen to show the performance of the algorithms in a nonlinearly separable case. We use the RBF kernel due to its popularity and success with real-world data. Figure 4 displays the decision boundaries from the four algorithms. It can be observed that the decision boundary obtained from TESVM is rather complex, with more curvature than that of THSVM. Although the figure reports that TESVM provides better accuracy on the training data than THSVM, i.e., 99.30% compared with 98.80%, TESVM gives slightly lower accuracy under 10-fold cross-validation. In our view, both methods, when used with the RBF kernel, are on par with each other in terms of performance on this dataset. According to Table 1, all the classifiers provide accuracies of similar magnitude: 96.15%, 96.45%, 95.99%, and 95.73% for TESVM, THSVM, TWSVM, and SVM, respectively. Therefore, additional datasets are required to further evaluate their performance.

Fig. 4

Classification on the checkerboard dataset. The dashed and dash-dotted lines represent the class descriptors, and the solid curves show the decision boundaries. The test accuracies on the training samples are (a) 99.3% (b) 98.8% (c) 98.5% (d) 98.2%.

4.2 Standard benchmark datasets

In this section, publicly available standard datasets from the well-known UCI Machine Learning Repository [17] are used to compare the performance of TESVM, THSVM, TWSVM, and SVM. The details of the datasets are shown in Table 2. All the datasets contain two classes, except for Iris and Thyroid, which have three classes. The Iris and Thyroid datasets are therefore converted into two-class problems using the one-against-all strategy.
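The one-against-all conversion can be sketched as follows. The helper name `one_against_all` and the ±1 label encoding are assumptions made for illustration. The class sizes of 50 samples each correspond to the Iris dataset, for which every resulting binary problem has 50 samples against 100.

```python
import numpy as np

def one_against_all(labels, positive_class):
    """Return +1 for samples of `positive_class` and -1 for all other samples."""
    labels = np.asarray(labels)
    return np.where(labels == positive_class, 1, -1)

# Three classes with 50 samples each, as in the Iris dataset.
labels = np.repeat([1, 2, 3], 50)

sizes = {}
for k in (1, 2, 3):
    binary = one_against_all(labels, k)
    sizes[k] = (int(np.sum(binary == 1)), int(np.sum(binary == -1)))
    print(f"Iris ({k}): {sizes[k][0]} samples vs {sizes[k][1]} samples")
```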

Table 2
Standard benchmark datasets used in the experiments (n: number of features; m: number of samples; m₁, m₂: numbers of samples in the two classes)

Dataset	n	m	m ₁	m ₂	Dataset	n	m	m ₁	m ₂
Bupa	6	345	145	200	Iris (3)	4	150	50	100
Diabetes	8	768	500	268	Postop	8	86	62	24
Haberman	3	306	225	81	Sonar	60	208	111	97
Heart	13	270	150	120	Thyroid (1)	5	215	150	65
Hepatitis	19	155	32	123	Thyroid (2)	5	215	180	35
Ionosphere	33	351	126	225	Thyroid (3)	5	215	185	30
Iris (1)	4	150	50	100	Transfusion	4	748	570	178
Iris (2)	4	150	50	100	WDBC	30	569	357	212

Since THSVM finds two separate hyperspheres describing the two classes, and TESVM likewise finds two separate hyperellipsoids, we consider two scenarios of hyperparameter selection for TESVM and THSVM. In Scenario I, the hyperparameters of the two optimization subproblems are set to be the same, while in Scenario II, they are allowed to differ. As a result, the second scenario is expected to give the decision boundary more flexibility than the first. The experimental results for each scenario with linear and RBF kernels are shown in Table 3. The best performer between TESVM and THSVM for each dataset, each scenario, and each type of kernel is shown in boldface.

Table 3

The comparisons between TESVM and THSVM on the two scenarios with linear and RBF kernels on the standard datasets

Dataset	Linear kernel				RBF kernel
	Scenario I		Scenario II		Scenario I		Scenario II
	TESVM	THSVM	TESVM	THSVM	TESVM	THSVM	TESVM	THSVM
Bupa	67.73 (1.23)	56.95 (1.02)	68.69 (1.05)	57.39 (1.16)	69.79 (1.28)	69.73 (1.19)	69.79 (1.28)	69.73 (1.40)
Diabetes	72.18 (0.61)	73.17 (1.53)	75.84 (0.74)	74.27 (0.54)	76.18 (0.54)	76.34 (0.57)	76.83 (0.31)	76.34 (0.57)
Haberman	69.05 (1.12)	71.17 (2.36)	75.32 (0.51)	73.03 (0.97)	76.14 (0.37)	74.11 (2.39)	76.20 (0.68)	71.47 (4.01)
Heart	81.29 (1.10)	79.96 (1.64)	81.85 (1.15)	80.40 (2.29)	83.92 (0.60)	83.00 (0.91)	83.92 (0.60)	83.66 (0.53)
Hepatitis	82.90 (1.90)	79.03 (1.02)	82.96 (1.80)	80.19 (1.94)	83.48 (1.26)	81.61 (2.56)	84.70 (1.25)	85.22 (0.71)
Ionosphere	68.74 (0.76)	37.74 (0.45)	92.19 (0.33)	87.72 (0.63)	94.30 (0.32)	95.15 (0.65)	94.44 (0.30)	95.15 (0.65)
Iris (1)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)
Iris (2)	82.93 (0.64)	66.80 (0.42)	88.06 (1.45)	86.66 (1.53)	97.46 (0.28)	96.93 (0.56)	97.66 (0.64)	97.60 (0.71)
Iris (3)	96.66 (0.00)	93.40 (0.66)	96.73 (0.66)	96.80 (0.81)	98.06 (0.49)	96.73 (0.73)	98.06 (0.49)	96.33 (1.05)
Postop	61.39 (2.03)	55.46 (3.46)	69.30 (1.99)	69.18 (1.57)	71.74 (1.23)	69.18 (1.83)	72.79 (0.98)	70.34 (1.47)
Sonar	75.86 (1.44)	69.32 (1.29)	77.16 (1.71)	69.56 (1.84)	89.66 (1.32)	87.21 (1.28)	89.66 (1.32)	88.17 (0.91)
Thyroid (1)	83.67 (1.08)	34.51 (1.70)	92.18 (0.78)	87.95 (3.11)	96.93 (0.44)	96.46 (0.90)	97.20 (0.53)	97.25 (0.40)
Thyroid (2)	98.04 (0.19)	98.37 (0.82)	98.46 (0.44)	98.93 (0.49)	99.48 (0.14)	98.83 (0.32)	99.48 (0.14)	98.51 (0.75)
Thyroid (3)	97.62 (0.26)	97.39 (1.61)	98.04 (0.19)	98.69 (0.75)	98.65 (0.34)	98.51 (0.89)	98.65 (0.34)	98.37 (1.03)
Transfusion	74.15 (0.51)	75.93 (0.73)	76.59 (0.49)	75.58 (0.47)	77.31 (0.70)	76.79 (1.09)	77.43 (0.13)	77.39 (0.56)
WDBC	95.44 (0.40)	94.21 (0.47)	95.78 (0.34)	94.71 (0.87)	97.18 (0.11)	96.50 (0.39)	97.18 (0.20)	96.80 (0.38)

In the first scenario, the search ranges for the best hyperparameters are identical to the ranges specified in Section 4.1. For the second scenario, however, we allow c₁ ≠ c₂ and ν₁ ≠ ν₂ for both THSVM and TESVM, where c₁, c₂ ∈ {10ⁱ | i = -3, -2, ..., 3} and ν₁, ν₂ ∈ {0.0, 0.1, 0.2, ..., 0.9} for THSVM, and c₁, c₂ ∈ {2ⁱ | i = -4, -3, ..., 4} and ν₁, ν₂ ∈ {10^-9, 10^-12} for TESVM. The kernel parameter σ is always set to the same value for both optimizations in a pair and is searched from the set {2ⁱ | i = -4, -3, ..., 5}. In addition, to avoid degenerate cases in TESVM where the matrix inside the log-determinant objective function of Equation (15) has some zero eigenvalues, we add a small diagonal matrix γI to it, where γ is a small scalar and I is an identity matrix of appropriate dimensions. We treat γ as another hyperparameter and search for its optimal value in the set {10ⁱ | i = -1, -2, -3, -4}.
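To give a sense of the size of the Scenario II search with the RBF kernel, the grids listed above can be enumerated as in the sketch below. It assumes that the search is a full Cartesian product over the listed sets (equal values for the two subproblems included) and that the ν set for TESVM contains exactly the two values printed above. The variable names are illustrative.

```python
from itertools import product

# Search sets as listed in the text.
c_thsvm = [10.0 ** i for i in range(-3, 4)]    # 10^-3, ..., 10^3
nu_thsvm = [i / 10 for i in range(10)]         # 0.0, 0.1, ..., 0.9
c_tesvm = [2.0 ** i for i in range(-4, 5)]     # 2^-4, ..., 2^4
nu_tesvm = [1e-9, 1e-12]                       # taken exactly as printed
sigma = [2.0 ** i for i in range(-4, 6)]       # 2^-4, ..., 2^5 (shared by both subproblems)
gamma = [10.0 ** -i for i in range(1, 5)]      # 10^-1, ..., 10^-4 (TESVM only)

# Assumption: exhaustive grid over (c1, c2, nu1, nu2, sigma[, gamma]).
thsvm_grid = list(product(c_thsvm, c_thsvm, nu_thsvm, nu_thsvm, sigma))
tesvm_grid = list(product(c_tesvm, c_tesvm, nu_tesvm, nu_tesvm, sigma, gamma))

print(len(thsvm_grid))  # 7 * 7 * 10 * 10 * 10 = 49000
print(len(tesvm_grid))  # 9 * 9 * 2 * 2 * 10 * 4 = 12960
```

Under these assumptions each candidate requires a full 10-fold cross-validation, which illustrates why the hyperparameter search dominates the training time.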

From Table 3, the results indicate that the second scenario tends to provide better classification accuracy since the decision boundary is formed by searching a larger hyperparameter space. However, this is not always true, as can be seen, for example, in the case of linear THSVM on the Transfusion dataset, where Scenario II has lower accuracy than Scenario I (75.58% versus 75.93%). This is because the best hyperparameters are determined from only one run of 10-fold cross-validation, while the results are reported from ten independent runs with randomly shuffled data.
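The evaluation protocol described above can be sketched as follows. This is only an illustration under several assumptions: the classifier is a hypothetical nearest-centroid stand-in (not TESVM or THSVM), the data are synthetic, accuracy is pooled over the folds, and the spread is summarized by the standard deviation over the runs.

```python
import numpy as np

def kfold_accuracy(X, y, fit_predict, k=10, seed=0):
    """Accuracy (%) of one run of k-fold cross-validation on shuffled data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    correct = 0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_predict(X[train], y[train], X[test])
        correct += int(np.sum(pred == y[test]))
    return 100.0 * correct / len(y)

def nearest_centroid(X_train, y_train, X_test):
    """Hypothetical stand-in classifier used only to make the sketch runnable."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dist, axis=1)]

# Synthetic two-class data (illustrative only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(1.5, 1.0, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

# Ten independent runs, each with a different random shuffle of the data.
runs = np.array([kfold_accuracy(X, y, nearest_centroid, seed=s) for s in range(10)])
print(f"single run (as used for selection): {runs[0]:.2f}")
print(f"ten runs (as reported): {runs.mean():.2f} ({runs.std():.2f})")
```

Since a single run can deviate from the ten-run mean, hyperparameters that look best in the selection run need not remain best in the reported average.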

Additionally, similar to the results in Section 4.1, linear TESVM also performs better than linear THSVM on most of the standard datasets in both scenarios, as reported in Table 3: on 11 and 12 of the 15 datasets without a tie (both methods reach 100% on Iris (1)) for Scenarios I and II, respectively. While linear TESVM performs worse on some datasets, the differences are rather small. As a result, we conclude that linear TESVM provides better prediction results than linear THSVM.

The classification accuracy of THSVM and TESVM with the RBF kernel on the standard datasets further shows the advantage of using hyperellipsoids over hyperspheres. According to Table 3, TESVM with the RBF kernel provides better accuracy on 13 and 12 of the 15 datasets in the first and second scenarios, respectively. Although the accuracies do not improve drastically from Scenario I to Scenario II, or from THSVM to TESVM, in the case of the RBF kernel, the slight improvements on many datasets still indicate that the less conservative hyperellipsoidal class descriptors can squeeze out additional performance over the less flexible spherical shape of THSVM, even together with the RBF kernel.
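The win counts quoted for the linear and RBF kernels can be checked directly against the mean accuracies in Table 3. The sketch below copies those means in table order (Bupa through WDBC) and counts strict wins, leaving out ties. The variable and function names are illustrative.

```python
# Mean accuracies from Table 3, in table order (Bupa, ..., WDBC).
table3 = {
    "linear, Scenario I": (
        [67.73, 72.18, 69.05, 81.29, 82.90, 68.74, 100.00, 82.93,
         96.66, 61.39, 75.86, 83.67, 98.04, 97.62, 74.15, 95.44],  # TESVM
        [56.95, 73.17, 71.17, 79.96, 79.03, 37.74, 100.00, 66.80,
         93.40, 55.46, 69.32, 34.51, 98.37, 97.39, 75.93, 94.21],  # THSVM
    ),
    "linear, Scenario II": (
        [68.69, 75.84, 75.32, 81.85, 82.96, 92.19, 100.00, 88.06,
         96.73, 69.30, 77.16, 92.18, 98.46, 98.04, 76.59, 95.78],
        [57.39, 74.27, 73.03, 80.40, 80.19, 87.72, 100.00, 86.66,
         96.80, 69.18, 69.56, 87.95, 98.93, 98.69, 75.58, 94.71],
    ),
    "RBF, Scenario I": (
        [69.79, 76.18, 76.14, 83.92, 83.48, 94.30, 100.00, 97.46,
         98.06, 71.74, 89.66, 96.93, 99.48, 98.65, 77.31, 97.18],
        [69.73, 76.34, 74.11, 83.00, 81.61, 95.15, 100.00, 96.93,
         96.73, 69.18, 87.21, 96.46, 98.83, 98.51, 76.79, 96.50],
    ),
    "RBF, Scenario II": (
        [69.79, 76.83, 76.20, 83.92, 84.70, 94.44, 100.00, 97.66,
         98.06, 72.79, 89.66, 97.20, 99.48, 98.65, 77.43, 97.18],
        [69.73, 76.34, 71.47, 83.66, 85.22, 95.15, 100.00, 97.60,
         96.33, 70.34, 88.17, 97.25, 98.51, 98.37, 77.39, 96.80],
    ),
}

def count_wins(tesvm, thsvm):
    """Return (TESVM wins, THSVM wins, ties) over paired accuracies."""
    wins = sum(a > b for a, b in zip(tesvm, thsvm))
    losses = sum(a < b for a, b in zip(tesvm, thsvm))
    return wins, losses, len(tesvm) - wins - losses

results = {name: count_wins(te, th) for name, (te, th) in table3.items()}
for name, (wins, losses, ties) in results.items():
    print(f"{name}: TESVM {wins}, THSVM {losses}, ties {ties}")
```

In every setting the only tie is Iris (1), where both methods reach 100%, which is why the counts are stated out of 15 rather than 16 datasets.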

Moreover, we further compare THSVM and TESVM with SVM and TWSVM. The setups of SVM and TWSVM are the same as described in Section 4.1. The comparisons of all four classifiers are provided in Table 4 for both linear and RBF kernels, where the best accuracies for each type of kernel are highlighted in bold. The accuracy values for TESVM and THSVM in Table 4 are taken from the better scenario in Table 3 for each dataset and each type of kernel. According to Table 4, TESVM and SVM perform better than THSVM and TWSVM in terms of the number of datasets for both linear and RBF kernels. Overall, TESVM has the highest average accuracy across all 16 standard datasets (85.57% with the linear kernel and 88.38% with the RBF kernel). In fact, its performance is close to SVM's and even better on many datasets.

Table 4

Accuracy of the TESVM, THSVM, TWSVM, and SVM with linear and RBF kernels on the standard datasets

Dataset	Linear kernel				RBF kernel
	TESVM	THSVM	TWSVM	SVM	TESVM	THSVM	TWSVM	SVM
Bupa	68.69 (1.05)	57.39 (1.16)	66.66 (0.61)	69.10 (0.83)	69.79 (1.28)	69.73 (1.40)	73.21 (0.58)	72.60 (0.85)
Diabetes	75.84 (0.74)	74.27 (0.54)	76.94 (0.38)	76.92 (0.31)	76.83 (0.31)	76.34 (0.57)	77.09 (0.77)	77.39 (0.33)
Haberman	75.32 (0.51)	73.03 (0.97)	75.45 (0.58)	73.52 (0.00)	76.20 (0.68)	74.11 (2.39)	73.20 (1.40)	74.24 (0.71)
Heart	81.85 (1.15)	80.40 (2.29)	83.85 (0.58)	84.14 (0.51)	83.92 (0.60)	83.66 (0.53)	83.77 (0.62)	84.37 (0.45)
Hepatitis	82.96 (1.80)	80.19 (1.94)	80.12 (2.31)	79.48 (1.51)	84.70 (1.25)	85.22 (0.71)	84.38 (1.57)	83.48 (1.95)
Ionosphere	92.19 (0.33)	87.72 (0.63)	82.07 (0.72)	87.54 (0.87)	94.44 (0.30)	95.15 (0.65)	93.90 (0.81)	95.21 (0.39)
Iris (1)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)	100.00 (0.00)
Iris (2)	88.06 (1.45)	86.66 (1.53)	74.86 (1.37)	72.00 (1.40)	97.66 (0.64)	97.60 (0.71)	97.26 (0.58)	97.26 (0.21)
Iris (3)	96.73 (0.66)	96.80 (0.81)	94.80 (0.87)	95.80 (0.63)	98.06 (0.49)	96.73 (0.73)	97.20 (0.52)	95.86 (0.52)
Postop	69.30 (1.99)	69.18 (1.57)	70.58 (1.23)	72.09 (0.00)	72.79 (0.98)	70.34 (1.47)	69.06 (3.29)	72.67 (1.25)
Sonar	77.16 (1.71)	69.56 (1.84)	75.52 (1.99)	78.26 (2.06)	89.66 (1.32)	88.17 (0.91)	88.79 (1.22)	89.75 (1.28)
Thyroid (1)	92.18 (0.78)	87.95 (3.11)	81.34 (0.46)	86.65 (0.44)	97.20 (0.53)	97.25 (0.40)	95.16 (0.96)	96.04 (0.54)
Thyroid (2)	98.46 (0.44)	98.93 (0.49)	92.27 (0.82)	98.60 (0.00)	99.48 (0.14)	98.83 (0.32)	99.48 (0.14)	98.97 (0.19)
Thyroid (3)	98.04 (0.19)	98.69 (0.75)	97.34 (0.22)	98.04 (0.42)	98.65 (0.34)	98.51 (0.89)	97.86 (0.24)	97.81 (0.69)
Transfusion	76.59 (0.49)	75.93 (0.73)	77.92 (0.62)	76.20 (0.00)	77.43 (0.13)	77.39 (0.56)	79.21 (0.43)	78.97 (0.63)
WDBC	95.78 (0.34)	94.71 (0.87)	94.67 (0.32)	97.69 (0.19)	97.18 (0.20)	96.80 (0.38)	98.04 (0.19)	97.83 (0.18)
Average	85.57	83.21	82.78	84.13	88.38	87.86	87.98	88.28
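The "Average" row can be recomputed from the per-dataset means in Table 4 as a sanity check. The sketch below copies the means in table order. Since the table entries are themselves rounded to two decimals, the recomputed averages are compared with the printed ones only up to a tolerance of 0.01.

```python
# Per-dataset mean accuracies from Table 4, in table order (Bupa, ..., WDBC).
table4 = {
    ("linear", "TESVM"): [68.69, 75.84, 75.32, 81.85, 82.96, 92.19, 100.00, 88.06,
                          96.73, 69.30, 77.16, 92.18, 98.46, 98.04, 76.59, 95.78],
    ("linear", "THSVM"): [57.39, 74.27, 73.03, 80.40, 80.19, 87.72, 100.00, 86.66,
                          96.80, 69.18, 69.56, 87.95, 98.93, 98.69, 75.93, 94.71],
    ("linear", "TWSVM"): [66.66, 76.94, 75.45, 83.85, 80.12, 82.07, 100.00, 74.86,
                          94.80, 70.58, 75.52, 81.34, 92.27, 97.34, 77.92, 94.67],
    ("linear", "SVM"):   [69.10, 76.92, 73.52, 84.14, 79.48, 87.54, 100.00, 72.00,
                          95.80, 72.09, 78.26, 86.65, 98.60, 98.04, 76.20, 97.69],
    ("RBF", "TESVM"):    [69.79, 76.83, 76.20, 83.92, 84.70, 94.44, 100.00, 97.66,
                          98.06, 72.79, 89.66, 97.20, 99.48, 98.65, 77.43, 97.18],
    ("RBF", "THSVM"):    [69.73, 76.34, 74.11, 83.66, 85.22, 95.15, 100.00, 97.60,
                          96.73, 70.34, 88.17, 97.25, 98.83, 98.51, 77.39, 96.80],
    ("RBF", "TWSVM"):    [73.21, 77.09, 73.20, 83.77, 84.38, 93.90, 100.00, 97.26,
                          97.20, 69.06, 88.79, 95.16, 99.48, 97.86, 79.21, 98.04],
    ("RBF", "SVM"):      [72.60, 77.39, 74.24, 84.37, 83.48, 95.21, 100.00, 97.26,
                          95.86, 72.67, 89.75, 96.04, 98.97, 97.81, 78.97, 97.83],
}

# Averages as printed in the last row of Table 4.
printed = {
    ("linear", "TESVM"): 85.57, ("linear", "THSVM"): 83.21,
    ("linear", "TWSVM"): 82.78, ("linear", "SVM"): 84.13,
    ("RBF", "TESVM"): 88.38, ("RBF", "THSVM"): 87.86,
    ("RBF", "TWSVM"): 87.98, ("RBF", "SVM"): 88.28,
}

averages = {key: sum(vals) / len(vals) for key, vals in table4.items()}
for key, avg in averages.items():
    print(f"{key[0]:6s} {key[1]:5s} recomputed {avg:.4f}  printed {printed[key]:.2f}")

best = {kernel: max((m for k, m in averages if k == kernel),
                    key=lambda m: averages[(kernel, m)])
        for kernel in ("linear", "RBF")}
print(best)
```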

One drawback of TESVM that we observed during the experiments is the difficulty of choosing ν₁ and ν₂ properly. According to Equation (15), the term XAX^T is equivalent to $\sum_{i \in I_{1}} α_{i} x_{i} x_{i}^{T} - \frac{ν_{1}}{m_{2}} \sum_{j \in I_{2}} x_{j} x_{j}^{T}$, which implies that if ν₁ is too large, the overall term becomes an indefinite matrix, and the optimization cannot be solved. As a result, the values of ν₁ and ν₂ in this paper are kept rather small to avoid this pitfall, and a proper selection method is subject to further research.
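The effect of ν₁ on definiteness can be illustrated numerically with a small sketch. The data points and the uniform weights α below are hypothetical and chosen only so that the eigenvalues are easy to follow; they are not solutions of the TESVM optimization.

```python
import numpy as np

def weighted_scatter(X1, alpha, X2, nu1):
    """sum_i alpha_i x_i x_i^T - (nu1 / m2) * sum_j x_j x_j^T."""
    m2 = X2.shape[0]
    positive = (X1.T * alpha) @ X1        # weighted scatter of the first class
    negative = (nu1 / m2) * (X2.T @ X2)   # scaled scatter of the second class
    return positive - negative

# Hypothetical samples (rows) of the two classes and uniform weights.
X1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X2 = np.array([[2.0, 0.0], [0.0, 2.0]])
alpha = np.full(3, 1.0 / 3.0)

for nu1 in (1e-9, 0.3):
    eig = np.linalg.eigvalsh(weighted_scatter(X1, alpha, X2, nu1))
    status = "positive definite" if eig.min() > 0 else "not positive definite"
    print(f"nu1 = {nu1:g}: eigenvalues = {eig}, {status}")
```

With the small ν₁ the matrix stays positive definite, whereas with ν₁ = 0.3 one eigenvalue becomes negative while the other remains positive, so the matrix is indefinite and its log-determinant is no longer well defined.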

It is also important to note that the computational complexity of solving TESVM's semidefinite program is unfortunately much higher than that of the quadratic program of THSVM. Furthermore, the complexity of both algorithms also depends on the number of training samples. Therefore, in this paper, we deliberately chose test datasets with rather small numbers of samples. In addition, for the sake of simplicity, TESVM was implemented using a generic semidefinite programming solver. An important direction for future work is therefore to improve the computational efficiency of TESVM. For example, it is possible to apply the dual reduced Newton algorithm [24], which exploits the structure of MVCE, to solve the optimization. Another possibility is to apply active-set strategies [24] or a scheme like that in [33] to select only the subsets of data that are important for the creation of the hyperellipsoids.

6 Conclusions

In this paper, we proposed a novel TESVM binary classifier which constructs two hyperellipsoids, where each hyperellipsoid acts as a data descriptor for one class and is also as far as possible from the other class. TESVM can be considered an extension of THSVM as well as a modified version of the soft-margin MVCE problem with negative samples. Although TESVM cannot directly apply the kernel trick as in the case of SVM, TESVM with kernel methods can be achieved with the help of empirical feature mapping, which maps all samples into an estimated empirical feature space. We evaluated the performance of the proposed method against THSVM, TWSVM, and SVM on both artificial and standard real-world datasets, and the experimental results support that TESVM provides better prediction accuracy than the other methods on most of the datasets in terms of 10-fold cross-validation accuracy. In particular, as TESVM is a direct improvement of THSVM, the results also show that the ellipsoidal boundary of TESVM is superior to the spherical boundary of THSVM.

Future work may involve applying the proposed method to real-world multiclass classification problems. As a precursor step, it is necessary to find a technique that enables TESVM to handle larger datasets. Since most of the training time is attributed to the search for the best hyperparameters, a metaheuristic approach such as the Cuckoo search algorithm [35] may be one way to reduce the overall training time.

References

1. Ahipaşaoğlu S.D., Fast algorithms for the minimum volume estimator, Journal of Global Optimization 62(2) (2015), 351–370.
2. Bedrintsev A.A. and Chepyzhov V.V., Description of the design space by extremal ellipsoids in data representation problems, Journal of Communications Technology & Electronics 61(6) (2016), 688–694.
3. Calafiore, Approximation of n-dimensional data using spherical and ellipsoidal primitives, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 32(2) (2002), 269–278.
4. Cevikalp, Best fitting hyperplanes for classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6) (2017), 1076–1088.
5. Chang C.-C. and Lin C.-J., LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011), 27:1–27:27, http://www.csie.ntu.edu.tw/~cjlin/libsvm.
6. Cortes and Vapnik, Support-vector networks, Machine Learning 20(3) (1995), 273–297.
7. Dolia, Bie, Harris, Shawe-Taylor, Titterington D.M., The Minimum Volume Covering Ellipsoid Estimation in Kernel-Defined Feature Spaces, In Machine Learning: ECML, Lecture Notes in Computer Science, Vol. 4212, Springer Berlin Heidelberg, 2006, pp. 630–637.
8. Dolia A.N., Harris C.J., Shawe-Taylor J.S. and Titterington D.M., Kernel ellipsoidal trimming, Computational Statistics & Data Analysis 52(1) (2007), 309–324.
9. Glineur, Pattern Separation via Ellipsoids and Conic Programming, Master’s thesis, Faculté Polytechnique de Mons, Mons, Belgium, 1998.
10. Gotoh J.-Y. and Takeda, Conditional minimum volume ellipsoid with application to multiclass discrimination, Computational Optimization and Applications 41(1) (2008), 27–51.
11. Grötschel, Lovász, Schrijver, Geometric Algorithms and Combinatorial Optimization, Algorithms and Combinatorics, Vol. 2, Springer, Berlin Heidelberg, 1993.
12. Ho T.K. and Kleinberg E.M., Checkerboard dataset, http://research.cs.wisc.edu/math-prog/mpml.html.
13. Huang, Chen, Zhou, Yin and Guo, Two-class support vector data description, Pattern Recognition 44(2) (2011), 320–329.
14. Jayadeva, Khemchandani R. and Chandra, Twin support vector machines for pattern classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 29(5) (2007), 905–910.
15. John, Extremum Problems with Inequalities as Subsidiary Conditions, In Traces and Emergence of Nonlinear Programming, Giorgi and Kjeldsen T.H., eds, Springer Basel, Basel, 2014, pp. 197–215.
16. Kumar and Yildirim, Minimum-volume enclosing ellipsoids and core sets, Journal of Optimization Theory and Applications 126(1) (2005), 1–21.
17. Lichman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml.
18. Löfberg, YALMIP: A Toolbox for Modeling and Optimization in MATLAB, In 2004 IEEE International Symposium on Computer Aided Control Systems Design, 2004, pp. 284–289.
19. Mu and Nandi A.K., Multiclass classification based on extended support vector data description, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39(5) (2009), 1206–1216.
20. Peng and Xu, A twin-hypersphere support vector machine classifier and the fast learning algorithm, Information Sciences 221 (2013), 12–27.
21. Rosen J.B., Pattern separation by convex programming, Journal of Mathematical Analysis and Applications 10(1) (1965), 123–134.
22. Schölkopf, Mika, Burges C.J.C., Knirsch, Müller K.R., Rätsch and Smola A.J., Input space versus feature space in kernel-based methods, IEEE Transactions on Neural Networks 10(5) (1999), 1000–1017.
23. Shao Y.-H., Zhang C.-H., Wang X.-B. and Deng N.-Y., Improvements on twin support vector machines, IEEE Transactions on Neural Networks 22(6) (2011), 962–968.
24. Sun and Freund R.M., Computation of minimum-volume covering ellipsoids, Operations Research 52(5) (2004), 690–706.
25. Sylvester J.J., A question in the geometry of situation, Quarterly Journal of Pure and Applied Mathematics 1 (1857), 79.
26. Tax D.M.J. and Duin R.P.W., Support vector domain description, Pattern Recognition Letters 20(11-13) (1999), 1191–1199.
27. Tax D.M.J. and Duin R.P.W., Support vector data description, Machine Learning 54(1) (2004), 45–66.
28. Teeyapan, Theera-Umpon and Auephanwiriyakul, Ellipsoidal support vector data description, Neural Computing and Applications 28(Suppl 1) (2017), 337–347.
29. Titterington D.M., Estimation of correlation coefficients by ellipsoidal trimming, Applied Statistics 27(3) (1978), 227–234.
30. Todd M.J. and Yildirim E.A., On Khachiyan’s algorithm for the computation of minimum-volume enclosing ellipsoids, Discrete Applied Mathematics 155(13) (2007), 1731–1744.
31. Tütüncü R.H., Toh K.C. and Todd M.J., Solving semi-definite-quadratic-linear programs using SDPT3, Mathematical Programming 95(2) (2003), 189–217.
32. Vapnik V.N., Statistical Learning Theory, Wiley, New York, 1998.
33. Wang and Xiao, Ellipsoidal data description, Neurocomputing 238 (2017), 328–339.
34. Wei X.-K., Li Y.-H., Li Y.-F. and Zhang D.-F., Enclosing machine learning: concepts and algorithms, Neural Computing and Applications 17(3) (2008), 237–243.
35. Yang X.S. and Deb, Cuckoo Search via Lévy Flights, In World Congress on Nature & Biologically Inspired Computing (NaBIC), 2009, pp. 210–214.
36. Zhang and Gao, On numerical solution of the maximum volume ellipsoid problem, SIAM Journal on Optimization 14(1) (2003), 53–76.