Abstract
This paper presents a novel binary classifier based on two best fitting hyperellipsoids in the feature space, called twin-hyperellipsoidal support vector machine (TESVM). The idea of TESVM is inspired by the minimum volume covering ellipsoid together with twin-hypersphere support vector machine (THSVM) which is a variant of the well-known support vector data description (SVDD). Following the concept of THSVM, TESVM constructs two hyperellipsoids where each hyperellipsoid is closest to one class but also as far as possible from the other class in order to form a decision boundary. The construction of hyperellipsoids in the feature space is also enabled through the use of empirical feature mapping. The experimental results on several artificial as well as standard real-world datasets are provided to demonstrate the performance of TESVM. Particularly, TESVM outperforms its spherical counterpart in term of classification accuracy.
Keywords
Introduction
Performing classification tasks is of paramount importance to distinguish one element from another. Supervised learning machines [32] help achieve such a goal by trying to reveal hidden statistical relationships among different groups of data. Since their emergence, a number of applications have been developed which benefit human beings; such examples include fingerprint recognition, handwriting recognition, and so forth. Indeed, they generate an enormous impact on a variety of fields spanning from medicine to engineering, from economics to justice systems, just to name a few.
Conceptually, a learning machine possesses some sorts of pre-defined set of hypotheses or feasible decision functions in its mind. During the training process, it will opt for the optimal one under some criteria and according to its past experiences. After the best function is selected, the machine uses it to recognize unknown data during the testing process. In research literature, the pre-defined functions given to a learning machine are usually linear as can be seen in the cases of support vector machines (SVM) [6], twin support vector machines (TWSVM) [14], and the best fitting hyperplane classifier [4]. Some studies also worked on decision functions which are quadratic by nature; for example, twin-hypersphere support vector machine (THSVM) [20] utilized two hyperspheres to solve a classification task.
Employing a hypersphere as a data descriptor has long been an active area of research. Fundamentally, it is strongly related to the minimum enclosing ball problem (MEB). In its simplest form, finding the smallest possible circle around m given points was posed as a problem more than 150 years ago [25]. Since the boom of SVM and kernel methods, MEB was also reformulated with kernel tricks and soft margins, where it becomes known as support vector data description (SVDD) [27]. Although SVDD is primarily designed for one-class classification problems, it is also extended to support multiclass cases such as that in the work by Mu and Nandi [19]. Another formulation to implement two hyperspheres to solve a binary classification is from Peng and Xu [20] where it is named THSVM. Although their approach is quite similar to SVDD, the core of THSVM is built around the spirit of TWSVM. Instead of trying to find the best fitting hyperplane around one class in the feature space, THSVM looks for the best fitting hypersphere which is closest to one class but as far as possible from the other class. Peng and Xu [20] also argue that the nonparallel hyperplanes of TWSVM cannot efficiently describe two classes, for example, when the samples are drawn from two distinct Gaussian distributions. As a result, they experimentally show that THSVM is superior to TWSVM in such a case.
In this paper, we are interested in designing a learning machine which is formulated based on hyperellipsoids. An ellipsoid is a tempting candidate since it is the next simple, convex, smooth geometrical shape to a plane and a sphere. The decision boundary generated by two ellipsoids is also quadratic with more degree of flexibility than the one offered by two spheres. One of the earliest interests in the minimum ellipsoid covering a set of points is by Fritz John [15] in 1948 where he determined the significant properties of the minimum ellipsoid. Since then, the research on minimum ellipsoid have become an active area of research [1, 30].
The idea of using ellipsoids in pattern classification can be traced back to the work by Rosen [21] where the minimum ellipsoid defined by the trace of matrix is considered as a data descriptor. Many researchers also worked on using a single ellipsoid for pattern separation. For example, Titterington [29] proposed the use of the minimum volume covering ellipsoid (MVCE) as trimming devices which can be used for outlier rejection. Glineur [9] presented a method for pattern separation using two concentric ellipsoids with maximal separation ratio. Calafiore [3] also proposed a robust ellipsoidal fitting scheme to construct an ellipsoidal primitive to represent a class of data based on the difference-of-squares geometric error criterion. The possibility of formulating a soft-margin version of MVCE, like in SVM, is also suggested by Sun and Freund [24]. Gotoh and Takeda [10] later show that the soft-margin MVCE can be formulated from the perspective of conditional value at risk (CVaR) minimization.
The majority of works on MVCE usually either focuses on improving numerical optimization or performing one-class classification. Following the concept of SVDD as an extension of MEB, kernel methods are also introduced to MVCE in order to allow classification on more complex data [7, 34]. However, the formulation of MVCE involves the computation of the outer products among samples. As a result, the regular kernel trick often used in SVM by replacing an inner product with a kernel function is not directly applicable. Instead, the formulation of kernelized MVCE can be achieved through empirical feature mapping or kernel principal component analysis [28].
Although many studies have been conducted on evaluating the performance of the kernelized MEB, the formulation of MVCE with kernel methods has not yet been comparatively displayed with its spherical counterpart, especially on binary classifications. Therefore, in this paper, our objective is to design a supervised learning algorithm which constructs a decision boundary using two kernelized hyperellipsoids following the footprints of THSVM. Precisely, one hyperellipsoid with soft margins is used as a data descriptor for one class and is also located as far as possible from the other class. In addition, the hyperellipsoid is also kernelized through the help of empirical feature mapping. The proposed method is called twin-hyperellipsoidal support vector machine (TESVM) and its performance is also evaluated on both artificial and standard benchmark datasets which are publicly available. Our experimental results show that TESVM outperforms THSVM in many datasets in term of classification accuracy. Moreover, for the average results, TESVM is also an alternative to TWSVM and SVM.
This paper is organized as follows. We overview two classifiers which are fundamentally built on hyperspheres, namely SVDD and THSVM in Section 2. In Section 3, the formulation of the proposed algorithm for binary classification is described in details, followed by some experimental results in Section 4. The concluding remarks are provided in Section 5.
Background
This section briefly reviews two learning techniques whose concepts are considered the crucial motivation behind the development of the proposed hyperellipsoid-based classifier.
Support vector data description
The SVDD considers a hypersphere as its underlying shape for data descriptions. Since the solution of SVDD is always in the form of a closed spherical boundary, SVDD is inherently suitable for one-class classification. Initially, it was proposed for domain description [26] for samples without labeling, and later extended to support the case where each sample has either positive (target) or negative (outlier) label [27]. Given a set of m samples,
By using the method of Lagrange multipliers, the dual formulation of Equation (1) can be obtained as
Given a test sample
It can be observed that the dual formulation of SVDD in Equation (2) follows closely with the dual of SVM, including the box constraints as well as the inner products among samples. As a result, SVDD can be naturally extended to handle more complex data pattern with kernel tricks as in the case of SVM by simply substituting the inner product with the desired kernel function.
One closely related formulation to SVDD which forms two hyperspheres to perform binary classification is called THSVM [20]. Although there exist some research studies, such as in [13, 19], working on an extension of SVDD for binary classification, the concept of THSVM was differently initiated. THSVM was built upon the spirit of TWSVM. Instead of finding the best fitting hyperplane for each class of data where each hyperplane is also located as far as possible from the other class, THSVM replaces the set of hyperplanes with hyperspheres. In other words, TWSVM implements a hyperplane as a data descriptor while THSVM reformulated the problem with hyperspheres. As a result, the formulation of THSVM is rather close to SVDD as both methods implement a hypersphere to describe one class of data.
In a two-class scenario, given a set of m samples,
Let
The center of the optimal hypersphere can be computed from
After the optimal hypersphere representing class
Furthermore, similar to SVDD, we can apply kernel tricks to THSVM by replacing the inner products with the desired kernel function. The radii and centers of the hyperspheres are not required to be explicitly computed as Equation (4) can also be rewritten in terms of inner products between samples.
The SVDD and THSVM create a classification rule by finding the best fitting hypersphere around one class. In practice, the nature of spherical shapes, however, is likely to be too conservative for data descriptions. Therefore, in this section, we present a novel binary classifier which is based on hyperellipsoids to provide more relaxation to the decision boundary. The proposed method is named “twin-hyperellipsoidal support vector machine (TESVM)” as its idea is built upon THSVM.
TESVM Formulation
Let
For a given set of m training samples,

Decision boundary of TESVM (solid line).
From Equations (5) and (6), ν1, C1, ν2, and C2 are hyperparameters. It is also important to note that we also assume that the affine hull of all training samples must span
Since Equations (5) and (6) differ only which target class is considered, Equation (5) becomes the main focus in this section. The first two terms in the objective represent the volume of
From Equation (5), the Lagrangian can be formed as
In order to make the formulation more compact, we assign the index for each sample in class
The first derivatives of Equation (8) are
By rewriting Equation (8) with KKT conditions, the corresponding dual formulation of Equation (5) becomes
The optimization Equation (15) can be solved using standard semidefinite programming solvers such as SDPT3 version 4.0 [31] which supports log-determinant in the objective function.
At the optimality, we also have the complementary slackness, for i = 1, 2,. . . , m1,
After
The concept of TESVM is formulated based on constructing hyperellipsoids for class descriptions; however, in practice, data are often unstructured or drawn from unknown distributions. Many linear classification algorithms overcome such a limitation with the help of kernel methods. By replacing, all inner product terms between two samples with a kernel function to alternatively define a similarity between two samples, the methods such as SVM, TWSVM, THSVM, and SVDD, are extended to work seamlessly in the feature space. For TESVM, since its formulation is not expressed in terms of inner products, it is impossible to apply the same kernel method. One possible approach which can be used to extend the capability of TESVM to the feature space is through the use of empirical feature map [22].
Given the training samples
Although we consider TESVM to be an extension of THSVM, the formulation of TESVM is also closely related to the MVCE problem with negative samples [28]. This connection is similar to the idea that THSVM is another form of double SVDDs with negative samples. In general, given
The dual formulation of Equation (20) has exactly the same form as Equation (15) but with little modification. That is the dual problem will have m Lagrange multipliers, where αm1+1, αm1+2,. . . , α
m
are not just a constant. The m Lagrange multipliers are also required to be summed to 1 and are confined within the box constraints
In order to evaluate the performance of the proposed TESVM, we conduct some experiments on several artificial and standard datasets and provide the comparison in term of accuracy. Since TESVM is considered as an improvement of THSVM, head-to-head comparisons between the two are the main focus. However, the results from SVM and TWSVM are also presented for further validation. All experiments are performed using 10-fold cross-validation and the cross-validation is repeated for 10 times with randomly shuffled samples. All methods are implemented and evaluated using Matlab 2017a under Ubuntu 17.04 and 2.8 GHz Intel Core i5-6402P with 8 GB of RAM. We use SDPT3 [31] through YALMIP [18] as a semidefinite programming solver for TESVM because of its support for the log-determinant objective function. The quadratic programming in THSVM and TWSVM is solved using the MATLAB’s quadprog solver and SVM is implemented using LibSVM [5].
Toy Examples
Three 2-dimensional artificial datasets are presented in this section in order to visually show the strength of TESVM. The details of the data are summarized in Table 1. We manually created the first two datasets, called Toy 1 and Toy 2, to test the performance of the linear classifiers where each dataset was generated based on two distinct Gaussian distributions. For the third dataset, Ho & Kleinberg’s checkerboard dataset [12] is used to show the performance of nonlinear classifiers. It contains two classes of samples generated under the uniform distributions to form a 4×4 tiles of checkerboard. The hyperparameters are chosen by performing a grid search over ranges of parameters. The best set of parameters is the one which provides the best 10-fold cross-validation accuracy. In the case of THSVM, we set C1 = C2 = C and ν1 = ν2 = ν where C ∈ {10 i |i = -5, - 4,. . . , 5} and ν ∈ {0.0, 0.1, 0.2,. . . , 0.9}. In the case of TESVM, we also set C1 = C2 = C and ν1 = ν2 = ν with C ∈ {2 i |i = -4, - 3,. . . , 4} and ν ∈ {10-9, 10-12}. The parameter σ of the RBF kernel for both methods is searched from {2 i |i = -4, - 3,. . . , 5}. For SVM and TWSVM, all parameters including C and σ are searched from {2 i |i = -5, - 4,. . . , 5}.
Accuracy (average (standard deviation)) on toy examples reported from ten rounds of 10-fold cross-validations
Accuracy (average (standard deviation)) on toy examples reported from ten rounds of 10-fold cross-validations
In the first toy dataset, each class contains 100 samples. Both classes are randomly generated from two Gaussian distributions with the same covariance matrix but at the different center and rotation as shown in Fig. 2. Precisely, the distributions are from

Classification on Toy 1 dataset. The dashed and dash-dotted lines representing the class descriptors, and the solid curves show the decision boundaries. The test accuracies on the training samples are (a) 92.5% (b) 80.5% (c) 92.0% (d) 78.5%.
The purpose of Toy 1 dataset is to illustrate the circumstance when THSVM loses the spirit of TWSVM, even though it is said to be a successor. One benefit of linear TWSVM over the linear SVM is that linear TWSVM is inherently able to deal with the so-called “cross-planes” dataset [23]. However, THSVM has no such an ability as presented in Fig. 2. When the samples from two classes form a cross shape, the accuracy obtained from THSVM on the training data can be as low as 50%. This is not the case for TESVM as it offers less conservative class descriptors. According to Fig. 2, TESVM significantly outperforms THSVM and SVM, while on par with TWSVM. That is TESVM, THSVM, TWSVM, and SVM achieve the test accuracy on the training set at 92.5%, 80.5%, 92.0%, and 78.5%, respectively. As a remark, the decision rules in Fig. 2 are tested on the entire training samples. Therefore, for better generalization, we also provide the test accuracy from the 10-fold cross-validation in Table 1.
For the second toy dataset as depicted in Fig. 3, the samples in each class are generated from two different Gaussian distributions with unbalanced numbers of samples. The first class has 50 samples randomly drawn from

Classification on Toy 2 dataset. The dashed and dash-dotted lines representing the class descriptors, and the solid curves show the decision boundaries. The test accuracies on the training samples are (a) 99.2% (b) 80.0% (c) 97.6% (d) 97.2%.
Next, the checkerboard dataset is chosen to show the performance of the algorithms for a nonlinear separable case. We use the RBF kernel due to its popularity and success with real-world data. Figure 4 displays the decision boundaries from the four algorithms. It can be observed that the decision boundary obtained from TESVM is rather complex with more curvature than from THSVM. Although it is reported from the figure that TESVM provides better test accuracy on the training data than THSVM, i.e., 99.30% compared with 98.80%, TESVM gives slightly less accuracy from the 10-fold cross-validation. In our view, both methods when used with the RBF kernel are on par with each other in term of performance on this dataset. According to Table 1, all the classifiers provide accuracies with a similar magnitude, 96.15%, 96.45%, 95.99%, and 95.75%, respectively for TESVM, THSVM, TWSVM, and SVM. Therefore, additional datasets are required so as to further evaluate their performance.

Classification on the checkerboard dataset. The dashed and dash-dotted lines representing the class descriptors, and the solid curves show the decision boundaries. The test accuracies on the training samples are (a) 99.3% (b) 98.8% (c) 98.5% (d) 98.2%.
In this section, publicly available standard datasets from the well-known UCI Machine Learning Repository [17] are used to compare the performance of TESVM, THSVM, TWSVM, and SVM. The details of the datasets are shown in Table 2. All the datasets contain two labels of data, except for Iris and Thyroid which have three classes. Therefore, Iris and Thyroid datasets are formed two-class problems by using one-against-all strategy.
Standard benchmark datasets used in the experiments
Standard benchmark datasets used in the experiments
Since THSVM separately finds two separate hyperspheres describing two classes, and TESVM also finds two separate hyperellipsoids, we thus consider two scenarios of hyperparameter selections for TESVM and THSVM. In Scenario I, the hyperparameters for solving two optimization subproblems are set to be the same, while in Scenario II, the hyperparameters are allowed to be different. As a result, it is expected that the second scenario should reflex more flexibility of the decision boundary than the first scenario. The experimental results for each scenario with linear and RBF kernels are shown in Table 3. The best performers between TESVM and THSVM for each dataset, each scenario, and each type of kernels are shown in boldface.
The comparisons between TESVM and THSVM on the two scenarios with linear and RBF kernels on the standard datasets
In the first scenario, the search ranges for the best hyperparameters are identical to the ranges specified in Section 4.1. However, for the second scenario, we set c1 ≠ c2 and ν1 ≠ ν2 for both THSVM and TESVM, where c1, c2 ∈ {10
i
|i = -3, - 2,. . . , 3} and ν1, ν2 ∈ {0.0, 0.1, 0.2,. . . , 0.9} for THSVM, and c1, c2 ∈ {2
i
|i = -4, - 3,. . . , 4} and ν1, ν2 ∈ {10-9, 10-12} for TESVM. The kernel parameter σ is always set the same for one pair of optimizations and is searched from the set {2
i
|i = -4, - 3,. . . , 5}. In addition, for TESVM to avoid degenerate cases when the matrix inside log-determinant objective function Equation (15) has some zero eigenvalues, we also add a small diagonal matrix γ
From Table 3, the results indicate that the second scenario likely provides better classification accuracy since the decision boundary is formed by searching a larger search space of hyperparameters. However, that is not always true, as can be seen, for example, from the case of linear THSVM with Transfusion dataset where Scenario II has lower accuracy than Scenario I. That is because the best hyperparameters are determined based only on one run of 10-fold cross validations while the results are reported from ten independent runs with randomly shuffled data.
Additionally, similar to the results in Section 4.1, it can be observed that linear TESVM tested with the standard datasets also performs better than linear THSVM on both scenarios as reported in Table 3 on most of the datasets. That is 11 and 12 datasets out of 15 datasets for Scenario I and II, respectively. While linear TESVM performs worse in some datasets, the differences are rather small. As a result, we conclude that linear TESVM provides better prediction results than linear THSVM.
The classification accuracy from both THSVM and TESVM with the RBF kernel on the standard datasets also further shows the advantage of using hyperellipsoids over hyperspheres. According to Table 3, we observe that TESVM with RBF kernel provides better accuracy on 13 and 12 datasets out of 15 datasets on the first and the second scenarios, respectively. Although the accuracies are not drastically improved from Scenario I to Scenario II and also from THSVM to TESVM for the case of RBF kernels, slight enhancements in the accuracies in many datasets are still a good indicator to show that the less conservative class descriptors of hyperellipsoids can help squeeze the performance from the less flexible spherical shape of THSVM, even together with the RBF kernel.
Moreover, we further compare both THSVM and TESVM with SVM and TWSVM. The setups of SVM and TWSVM are the same as described in Section 4.1. The comparisons of all four classifiers are provided in Table 4 for both linear and RBF kernels where the best accuracies for each type of kernels are highlighted in bold letters. The accuracy values for TESVM and THSVM in Table 4 are extracted from the best scenario in Table 3 for each dataset and each type of kernels. According to Table 4, TESVM and SVM perform better than THSVM and TWSVM in term of the number of datasets for both linear and RBF kernels. Overall, the average accuracy across all 16 standard datasets shows that TESVM is superior to the other methods. In fact, its performance is close to SVM’s and even better in many datasets.
Accuracy of the TESVM, THSVM, TWSVM, and SVM with linear and RBF kernels on the standard datasets
One drawback of TESVM that we can observe during the experiments is the proper choices of ν1 and ν2. According to Equation (15), the term
It is also important to note that the computational complexity of solving the TESVM’s semidefinite program is unfortunately much higher than the quadratic program of THSVM. Furthermore, the complexity of the algorithms also depends on the number of training samples. Therefore, in this paper, we deliberately chose the test datasets which have rather small numbers of samples. In addition, for the sake of simplicity, TESVM was also implemented using the generic semidefinite programming solver. As a result, an important future work is to improve the computational efficiency of TESVM. For example, it is possible to apply the dual reduced Newton algorithm [24] which also utilizes the structure of MVCE to solve the optimization. Another possibility is to apply active-set strategies [24] or a scheme like that in [33] to select only some subsets of data which are important for the creation of hyperellipsoids.
In this paper, we proposed a novel TESVM binary classifier which constructs two hyperellipsoids where each hyperellipsoid acts as a data descriptor for one class and is also as far as possible from the other class. The idea of TESVM can be considered an extension of THSVM as well as a modified version of the soft-margin MVCE problem with negative samples. Although TESVM cannot directly apply kernel tricks as in the case of SVM, TESVM with kernel methods can be achieved through the help of empirical feature mapping which maps all samples into an estimated empirical feature space. We evaluated the performance of the proposed method against THSVM, TWSVM, and SVM on both artificial and standard real-world datasets, and the experimental results support that TESVM provides better prediction accuracy than the other methods on most of the datasets in term of 10-fold cross-validation accuracy. Particularly, as TESVM is a direct improvement of THSVM, the results also show that the ellipsoidal boundary of TESVM is superior to the spherical boundary of THSVM.
Future works may involve practically applying the proposed method to solve real-world multiclass classification problems. As a precursor step, therefore, it is necessary to find a technique which can enable TESVM to handle larger datasets. Since most training time is contributed to the search for the best hyperparameters, a metaheuristic approach such as Cuckoo search algorithm [35] may be one possibility to help reduce the overall training time.
