Abstract
The support vector machine (SVM) has gained prominence in classification problems. One characteristic of it is that the optimal decision hyperplane is only determined by support vectors (SVs). Inspired by this, we propose two novel sample selection methods based on the idea of potential SVs. One is the distance-based sample selection method (DSSM), it calculates the distance between each pair of samples from different classes, and then selects those nearer to the other class as potential SVs. The other is the K-nearest neighbors (Knn)-based sample selection method (KSSM), which is presented on account of the boundary nearest SVM (BNSVM). The BNSVM regards the nearest neighbors between two classes as potential SVs. While our KSSM extends the neighbor number from one in BNSVM to k. In addition, we present a method to choose an appropriate k. Afterwards, we do classification with these potential SVs. These selected samples contain useful information for classification which ensures the accuracy of classifiers. Moreover, the DSSM and KSSM work fast since they are only related to the distance and do not need to do iterations. Experiments conducted on small datasets and large-scaled ones show the validity of our DSSM and KSSM.
Keywords
Introduction
During the past few decades, support vector machine (SVM) [1–3] has become a powerful tool applied to a variety of real-world problems, like human face detection, text categorization and financial applications. Motivated by statistical learning theory [4], SVM is going to find the optimal separating hyperplane with the maximum margin between two classes. One notable advantage of SVM is its sparsity, which means that only a minority of support vectors (SVs) can determine the separating hyperplane. SVM solves a quadratic programming problem (QPP), and the solution is global and unique once obtained. However, it always takes much time to solve this QPP especially for a large database.
Consequently, many fast algorithms have been proposed to settle this problem. We can roughly divide them into three categories. The first kind of methods is proposed to change the model, among which the twin support vector machine (TSVM) [5] and the least squares support vector machine (LSSVM) [6, 7] are relatively classical and successful. TSVM aims to construct two nonparallel hyperplanes. It gets faster operation speed by solving two smaller QPPs rather than a single large one as in the conventional SVM. LSSVM modifies the primal QPP of SVM in least squares version and solves it with equality constraints. Therefore, the LSSVM finally solves a pair of linear equations, which makes it much more efficient.
Another kind of approaches to accelerate the training speed is at the algorithm level, such as, the decomposition method [8], chunking algorithm [1], LIBSVM [9] and sequential minimal optimization (SMO) [10] algorithm. Each of them decomposes the original problem into several small and simple subproblems, and then solves these subproblems by iteration methods. These methods show good properties on storage and calculation of large matrixes.
The third category is based on reduced sets. These methods subtract superfluous information from training data in advance, and then build the model based on the reduced dataset. The feature selection methods [10–12] remove the irrelevant features to reduce the dimension of input space, thus speed up the computation procedure. However, they help little for the large-scale dataset. The reduced support vector machine (RSVM) [13] uses a rectangular kernel to reduce the size of the kernel matrix. Nevertheless, we should point out that the subset is extracted randomly which may influence the classification accuracy. As for sample selection methods [12, 14], similar to the feature selection methods, they are going to get rid of redundant samples by some reasonable strategies. The purpose is to speed up the computation procedure without lowering the precision. What calls for special attention here is the strategy or theory used to judge the importance of samples. That is to say, what kind of samples should be selected is a key point.
It is well known that the decision hyperplane of SVM greatly depends on the SVs. On this point, many sample selection techniques have been put forward so far. For example, the Core vector machine (CVM) [16] adopts an efficient approximate minimum enclosing ball (MEB) algorithm, then obtains the approximately optimal solutions with core sets. This method can effectively reduce the time and space complexity of the model, but it is an iteration method whose initial value and convergence criterion will greatly influence the algorithm. The Fast condensed nearest neighbor (FCNN) rule in [17] computes a training-set-consistent subset from the nearest neighbor decision rule. It is also an iteration algorithm with incremental learning. It shows good learning efficiency, but the selected samples are probably not typical enough, which will lead to the decrease of classification accuracy.
The boundary nearest support vector machine (BNSVM) proposed by Feng et al. [18] also scales down the training data beforehand. By applying the nearest neighbor method, it finds each sample’s nearest neighbor from the other class, and use them to form a new reduced training set. However, this strategy can’t find enough potential SVs, and the impact of noise samples is too large. Both of them will affect the classification accuracy. Then the author suggest k (k > 1)nearest neighbors to improve the above defects in [19]. Nonetheless, they didn’t give the method for selecting a suitable parameter k, and the Knn is calculated one by one inefficiently. The experiments in that paper only set k a fixed value.
Based on the study above, we suggest two fast and efficient sample selection methods for SVM. Both of them try to find the potential SVs based on the prior knowledge of data distribution. First, we introduce the distance-based sample selection method (DSSM), whose main idea is that samples close to the other class are likely to be SVs. The second algorithm we present is the K-nearest neighbor (Knn)-based sample selection method (KSSM), which is an extension of boundary KNNSVM. We first use the sample incremental curve and the accuracy curve to determine a small range of k, which is important but ignored in boundary KNNSVM. Then, we no longer calculate KNNs from different classes respectively. Instead, we consider the between-class matrix and find k smallest values through each row (column), and then get the potential SVs.
These sample selection methods can be seen as the effective sample pretreatment step beforehand. Firstly, DSSM and KSSM can achieve simply since they have their basis only on the distance from different classes of points. Secondly, they are presented according to the theory of SVs. On one hand, these SVs are few which make the SVM faster. On the other hand, they are decisive for SVM classifier. Overall, these two sample selection algorithms cut down the computational cost of SVM, but do not affect the classification accuracy.
This paper is structured in the following: Section 2 reviews the basic concepts of the SVM, TSVM and LSSVM. Details of DSSM and KSSM are introduced in Section 3, both the linear and nonlinear cases are included. Section 4 analyzes these two sample selection methods. Section 5 discusses the experimental results to investigate the effectiveness of the proposed algorithms. Conclusions drown from this study are in the last section.
Related works
Consider the problem of binary classification wherein a linearly inseparable dataset X ∈ Rl×n, each row denotes one sample with n-dimensional features. The corresponding class label is represented by Y ∈ Rl×1. Let matrix A ∈ Rp×n stand for the data points from class +1 and matrix B ∈ Rq×n denote the data points of class -1.
Support vector machine
SVM shows excellent performance in machine learning and pattern recognition tasks. The central idea of SVM is to find an optimal separating hyperplane by maximizing the margin between two classes of samples. Based on the structural risk minimization principle, SVM finds this optimal hyperplane
For the nonlinear case, SVM maps the training samples into a high dimensional space via a nonlinear map φ (x) : R n → R d . And it is possible to do all the necessary operations in the feature space by using the kernel trick k (x i , x j ) = φ (x i ) T φ (x j ) as k (x i , x j ) is an inner product in the feature space.
TSVM seeks two nonparallel hyperplanes for two classes, such that each hyperplane is closer to the class it belongs and far from the other class. These two nonparallel hyperplanes will be obtained by solving two smaller sized QPPs (ref221) and (ref222) instead of a single large one as in conventional SVM [5],
A new data point is assigned to +1 or -1 depending on which of the two hyperplanes it lies closer to. TSVM could also be extended to the nonlinear case which can be found in [5].
The formulation and solution of LSSVM are much simpler than the classical SVM since it only needs to solve a system of linear equations for both linear and nonlinear cases. The formulation of LSSVM is obtained by modifying inequality constraints to equality ones from conventional SVM. It is as follows [6],
This section gives detailed introduction to our proposed two sample selection methods, i.e., DSSM and KSSM.
Distance-based sample selection method
In this section, we propose our new sample selection algorithm DSSM, whose main idea is to find the potential SVs, and only use them to construct the classifier. We find that one obvious characteristic of SVs is that they are closer to the other class. So we start from the distance between classes to search for these decisive samples.
Specifically, we first calculate the distance between samples from different classes, we will get the following distance matrix D ∈ Rq×p,
We set d
c
as a proper cutoff parameter which we will talk about later. For each class, we define two vectors f
P
and f
N
whose elements are just 0 or 1, to seek the potential SVs,
At last, we delete positive samples corresponding to fP.j = 0 from A and negative samples corresponding to fN.i = 0 in B. For the nonlinear case, the distance d ij will be calculated by kernel tricks as d ij = k (x i , x j ). The DSSM algorithm is described below,
For the KSSM, the Knn method is used to select samples. For each sample, if it is the Knn of any samples from the other class, we will identify it as a potential SV [18, 22]. Here, we adopt a more convenient way to look for these k nearest neighbors between two classes. First, we construct a between-class distance matrix W, which is the same as D in (12). Then, we will generate two nearest neighbor matrices in the input space, i.e. W
P
∈ R
q×p
and W
N
∈ R
p×q
, which are defined as follows,
After sample selection by either DSSM or KSSM, the training sample matrix X is reduced to with only the potential SVs reserved. denotes the selected positive SVs and stands for the negative ones. Then, we use the new to construct SVMs which have been given in Section 2.
The DSSM and KSSM each has a truncation parameter, i.e. d c and k, which is used to control the quantity of selected samples. As is well-known, a suitable parameter is important for the algorithm. However, existing sample selection methods using KNN algorithm or distance do not mention how to select a suitable parameter. Thus, we propose an effective method to settle this problem. The parameter k in KSSM stands for the number of neighbors for each sample. More samples will be selected with the increase of k. And when k reaches to a certain value, the growth of selected samples gradually close to zero.
Thus, we draw a sample incremental figure to find the range of k firstly. Then, the exhaustive method is used to choose the best value.
Take the Musk data as an example, which is shown in Fig. 1. In order to find a suitable range of k, we first draw the sample incremental curve as the growth of k. We can see that this curve convergence to zero. And when k is larger than 30, the sample increment becomes very small. So, we narrow the scope of parameter k from 80 to 30. Further, we display the accuracy tendency of the SVM as k changes form 1 to 30. From it, we find that when k is larger than 5, the classification accuracy tends to stabilize at a high level. That is to say, for the Musk data, k = 5 is the best choice considering both time and accuracy. To illustrate this parameter selection method more convincingly, we test all the datasets appeared in our experiments. Which surprised us is that almost all of the data have such rules, and the range of k is from 1 to 30 or much more restricted. This range is acceptable, because it consumes a small amount of time on grid search for the best parameter, and guarantees a great classification ability.
For the DSSM algorithm, we use the similar strategy. The only difference is that we should choose a proper order of magnitude of d c according to the size of the dataset. This is often easy to reach. For a small dataset, usually multiple of the sample dimension n can meet our requirement, or 1% to 10% of pq will get similar results, the percentage could be orders of magnitude smaller if the dataset is large. For example, for the Musk dataset, we set d c as i/10000 (i = 1, 2, ⋯ , 30) of pq.
Analysis of algorithm
As we can learn that the DSSM and KSSM are both taking motivation from the idea of SVs and they select the potential SVs with their own strategies, respectively. This section will talk about the characteristics of these two algorithms, including their time advantage and prediction accuracy. We point out the deficiency of the two s as well.
Geometric significance of our sample selection methods for SVMs
Our two sample selection algorithms are inspired by the idea of SVs in SVM. SVs are samples located at the border between different classes and they are going to determine the classification hyperplane. For the conventional SVM, the geometric sense of DSSM (KSSM) is similar to do sample selection beforehand since SVM is built based on SVs.
As for TSVM, the first item of the objective function (4) is to minimize the sum of squared distances from positive samples to their corresponding hyperplane, which involves all of the samples from class +1. After sample selection, we only need to consider the distance from potential SVs to the hyperplane. The constraints in (4) becomes making negative SVs (instead of all the negative samples) at least one distance from the positive hyperplane.
The geometric meaning of LSSVM is similar to TSVM. Actually, constructing LSSVM based only on SVs makes it pay more attention to sample distribution around the separating hyperplane. We have known that LSSVM loses the sparsity because LSSVM does not have the meaning of SVs, so our sample selection methods are equivalent to make the LSSVM sparse in advance.
Fig. 2 shows the geometric meanings of three models, i.e., SVM, TSVM and LSSVM with or without two sample selection methods on the linearly separable artificial dataset. And Fig. 3 gives the nonlinear case of SVM and KSSM+SVM on the artificial datasets. From them we find that SVM with DSSM or KSSM focus more on distribution information of SVs, and the selected SVs can well describe the sample distribution of two classes. In addition, the separating hyperplane of LSSVM is influenced most obviously.
Relationship between our sample selection methods and SVs in SVM
We have known that for conventional SVM, SVs play a decisive role on the constructing of the classifier. The DSSM and KSSM sample selection methods are proposed based on the idea of it. Thus, for the conventional SVM, samples selected by our algorithms are just the potential SVs we need. This shows the effectiveness of our algorithms since we remove a large number of non SVs for training, meanwhile guarantee the accuracy of the model.
As for LSSVM, it loses the sparsity and there is no concept of SVs. Then our sample selection methods can be regarded as weighted methods. Samples located near the separating hyperplane are more important for classification, so we give them the weights of 1, while for those far from another class, the weights are 0 which means that they have no contribution to the classification hyperplane.
Computational cost
The standard way to train SVM is to introduce Lagrange multipliers α i and optimize them by solving a dual problem. Because there is a flat part in the loss function, the vector α is usually sparse. The point x i corresponding to α i ¬ =0 is called SV. Let l SV denote the number of SVs. A recent theoretical result [23] shows that l SV grows as a linear function of l. Thus, the training and testing complexities are respectively, and O (l SV ). And the computational complexity of TSVM is around O (l3/4) under the assumption that the patterns in two classes are approximately equal. The training samples will greatly reduced after sample selection which we can see from Fig. 3. Therefore, the computational complexity of SVM becomes , where stands for the number of samples selected. And much less computational cost will be reached for TSVM and LSSVM which are similar to SVM.
Limitation of our DSSM and KSSM
For the conventional SVM, the influence of noisy points on separating hyperplane is always a problem need to be solved. Here, we divide these noisy points into two categories, the outliers and noises within the domain between two classes. In conventional SVM, the second kind of noises are easy to be mistaken as SVs and then affect the classification accuracy. Our DSSM and KSSM can eliminate the influence of outliers for they are far from the decision boundary and will be deleted by our sample selection methods. However, noises around the border area of two classes can not be identified effectively by DSSM or KSSM, and this is the deficiency of our algorithms.
Numerical experiments
Seventeen benchmark datasets
To evaluate our proposed DSSM and KSSM, experiments are performed on seventeen benchmark datasets from the UCI machine learning repository [17]: Hepatitis (H), Ionosphere (I), Australian (A), LSVT (L), Monks (M), Pima (P), Spectf Heart (SH), BCI-1a (Ba), BCI-1b (Bb), Liver Disoder (LD), DBWorld e-mails (D), Sonar (S), Heart (He), Breast Cancer1 (B1), Breast Cancer2 (B2), Breast Cancer3 (B3), and Magic gamma (Mg). The characteristics of seventeen benchmark datasets are shown in Table 1.
For each dataset, we integrate DSSM and KSSM into the SVM, TSVM and LSSVM. Meanwhile we use the 5-fold cross-validation to evaluate the performance of nine algorithms. All experiments are implemented in Matlab R2014a on Windows 7 running on a PC with system configuration Intel(R) Core(TM) 2 Duo CPU E7500 (2.93 GHz) with 4.00 GB of RAM.
For these classifiers, an important thing is to choose the appropriate parameters. Here, we only consider the Gaussian kernel k (x i , x j ) = exp (- ∥ x i -x j ∥ 2/γ2) for its great generalization performance, and the optimal parameters are tuned by a grid search. For convenience, we set c1 = c2 = c for TSVM. We select parameter c and kernel parameter γ in these algorithms from the set {2 i |i = -1, 0, ⋯ , 8}. We have introduced a way to choose the optimal range of the truncation parameters in Section 3. Here we set the parameters uniformly for convenience. The optimal value d c for DSSM here is set to be multiple of the sample dimension n and the proportion is selected from the set {1, 2, ⋯ , 10} (for high dimensional datasets or large scaled data, d c is chosen from 0.1% to 10% of pq) and k for KSSM is from the set {2, 4, 6, 8, 10}.
Experimental results and discussions
The experimental results on sixteen small-sized datasets and one medium size dataset, i.e. Magic-gamma are summarized in Table 2, where Averages ± standard deviations of the training accuracy with SVM, TSVM and LSSVM are shown. “Time" denotes the mean value of the time taken by five experiments, and each consists of training time and testing time. DistSVM means that SVM incorporates with DSSM, KnnSVM represents the combination of SVM and KSSM. The same meanings are for DistTSVM, KnnTSVM, DistLSSVM and KnnLSSVM. The optimal parameters used in the experiments are presented in Table 3.
From the perspective of prediction accuracy, KnnSVMs (KnnSVM, KnnTSVM, and KnnLSSVM) outperform the original SVMs (SVM, TSVM, and LSSVM) and DistSVMs (DistSVM, DistTSVM, and DistLSSVM). We can comprehend that KnnSVM performs better than SVM on 11 datasets out of 16 datasets, and similar results can be found for TSVM and LSSVM. That is to say, select points by KSSM can improve the training accuracy of SVMs. As for another sample selection method, DSSM, we can see that the performance of DistSVM and DistTSVM are slightly lower than SVM and TSVM. And the DistLSSVM has comparable accuracy with LSSVM. Hence, we can conclude that it is effective to do sample selection for training points by finding the potential SVs. In addition, the KSSM finds potential SVs more accurately than DSSM. The main reason lies in that KSSM calculates Knns for every points, which is more careful than DSSM because DSSM only cares about points with small distance to the other class.
In terms of computational time, the DistSVMs work fastest in most cases, KnnSVMs follows, and the original SVMs cost the most running time. The reason is that after sample selection methods, the number of training points decreases to only potential SVs left, while the running time of SVMs is positive correlated to the number of training points. To demonstrate more quantitatively, take the LSSVM as an example, we record the number of samples selected by DistLSSVM and KnnLSSVM on all of the seventeen UCI datasets and display the results (except the Magic-gamma dataset) in Fig. 4. It is obvious that the number of samples significantly decreased after operated by DSSM or KSSM. Moreover, as we all know, LSSVM is fast for it solves a pair of linear equations rather than QPPs. Then our DistLSSVM and KnnLSSVM further accelerate it by deleting more samples and make the LSSVM more efficient.
We also find out that the advantage becomes more apparent with the increase of samples. For example, the running time of SVM on B3 dataset is 1.85s, however, the time of the DistSVM and KnnSVM are only 0.05s and 0.06s, respectively. This result is reasonable because the number of SVs among training samples is minority whether the size of training points is large or small. This further illustrates that our sample selection methods are suitable for SVMs with large-scaled training samples.
In the experiment on dataset Magic-gamma, we find that the SVM, TSVM and LSSVM do not work on above computer system because of lower RAM. In [18], it also testifies the effectiveness on this medium size dataset. The weighted rough ν-TSVM yields the testing accuracy (78.60%) on this dataset, and the running time is 3288.5s. In our experiment, we also use the same Gaussian kernel function and 5-fold cross-validation. First, we use the DSSM (KSSM) algorithm to select the potential SVs from dataset, and then apply the SVM, TSVM and LSSVM, respectively, on these potential SVs. Unfortunately, the KnnSVM does not work for lower RAM. Other results are also listed in Table 2. We will find that the best prediction accuracy is 75.19% for DistSVM, and it takes only 9.22s. It shows the great time advantage of the DSSM. As for the KSSM, the prediction accuracies are 84.7% for KnnTSVM and 84.28% for KnnLSSVM, and their running times are 488.4s and 105.55s, respectively. Both of them are superior to the results in [18]. Thus, we can conclude that the KSSM is more accurate for selecting potential SVs and the DSSM has more advantage in computing time. Therefore they are both suitable for SVMs with large-scaledataset.
Musk dataset
This is a dataset about molecules collected by AI Group at Arris Pharmaceutical Corporation [17]. Our goal is to predict whether any conformation of a new molecule is a musk or non-musk, then determine the label of this new molecule. To generate this data set, all the low-energy conformations of the molecules were generated to produce 6598 conformations. And 1017 conformations are judged by human experts to be musks and the remaining 5581 conformations are judged to be non-musks. Each conformation has 166features.
In this section, we compare the performance of five sample selection methods on SVM, among which the CVM, BNSVM and FCNN are latest fast sample selection methods. We have described details of them in the introduction. For convenience, we set the values of parameters γ as same as the best parameters of SVM. The numerical simulation results are demonstrated in Table 4, all of which are shown as average values with 5-fold cross-validation.
Compare the results of these five algorithms on the Musk dataset, we will find that our KnnSVM takes 11.57 seconds which is the second fastest, and yields the highest testing accuracy than other five methods. The DistSVM obtains slightly better results than FCNN. The CVM performs the worst on this data. Both testing accuracy and training time indicate the superiority of our two algorithms.
Conclusion
Two novel sample selection methods, i.e., DSSM and KSSM, are proposed in this paper, they are both based on the idea of SVs in SVM. The DSSM calculates distance between each pair of points from different classes, and chooses the points closer to the other class as the potential SVs. The KSSM algorithm uses a between-class matrix to find the Knns more efficiently, and regards them as the potential SVs. We also give the approach to choose proper parameters in these two algorithms. After selecting samples, we construct the classifier only by using these potential SVs, which makes SVM much faster under the guarantee of good accuracy. Experimental results on seventeen datasets and the Musk data show the effectiveness of our algorithms. How to find the SVs more efficiently and get rid of the noisy points may be considered potential areas for our future work.
Acknowledgments
The authors gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation. This work was supported in part by National Natural Science Foundation of China (No. 61153003) and Chinese Universities Scientific Found.
