Abstract
Kernel based Support Vector Machines, SVM, one of the most popular machine learning models, usually achieve top performances in two-class classification and regression problems. However, their training cost is at least quadratic on sample size, making them thus unsuitable for large sample problems. However, Deep Neural Networks (DNNs), with a cost linear on sample size, are able to solve big data problems relatively easily. In this work we propose to combine the advanced representations that DNNs can achieve in their last hidden layers with the hinge and
Introduction
Kernel Support Vector Machines (SVM; [1]) were very popular in the 1990s and early 2000’s because of the powerful models they provided. However they are somewhat out of fashion in the current big data era. The main reason for this is their cost, at least quadratic (and often worse) with respect to sample size, in a sharp contrast with the linear cost in sample size of Deep Neural Networks (DNNs) that currently dominate big data machine learning (ML).
This cost is hard to avoid in the usual kernel formulation of SVMs. In fact, while at the end, SVMs are linear models on extended representations
Many proposals have appeared in the literature to apply SVMs on large datasets, usually focused on Support Vector classification (SVC) problems. Among them we can mention incremental learning [2], ensemble learning [3] or the cutting planes method [4]. Nevertheless, it can be said that, unless substantial hardware resources are committed, current kernel SVM training methods are not time competitive for datasets with more than about 100,000 patterns. Recently, GPU compliant implementations of SVM training have been proposed [5] but, while greatly accelerating SVM training, the memory limitations of large sample SVMs still remain.
Larger datasets can be handled by working with linear SVMs using the LIBLINEAR library [6] over either the primal or dual problems, often using coordinate descent [7] to solve the minimization problems. However, efficient linear SVM models can only be expected when the original features have very large dimensions so that feature projections are no longer needed. When sample dimension is just moderate, linear SVM models are usually much less powerful that their Gaussian counterparts. Pegasos [8] can be used on a kernel setting but, as with LIBLINEAR, it must work with homogeneous models
However, and in a different vein from these works, we first observe that the non differentiability of the SVM primals is rather mild; in fact, Pegasos exploits subgradient descent essentially by skipping the single non differentiability at 0 of the hinge loss, or at
This opens the way to try to solve the primal problem directly working with an
Notice that, at the end, this point of view also leads to a linear model acting on a hidden representation. However, and although seemingly close, the final models will be quite different. Following the terminology in [13], DNNs belong to the category of iterative summarization methods; on the other hand, Gaussian SVMs, the most popular way to derive non linear SVMs, are placed in [13] within the case-based reasoning methods, jointly for instance with
In any case, Gaussian SVM comparisons are only possible for datasets up to a given size. In fact, as of today, they have substantial train and, more so, hyperparameterization costs with sample sizes above
The difference lies, however, in their respective training goals. On standard DNNs the goal is the estimation of posterior class distributions in classification or to mean squared error minimization in regression. On the other hand, hinge loss minimization in our proposed Deep SVM classifier training means that a maximum margin model is learned on the last hidden layer; similarly, the
We will focus here our attention in the two-class classification and regression performance of fully non linear DNNs versus that of Deep SVMs. We will first work with feed-forward fully connected architectures for classification and regression but we will also consider image classification datasets where we will use problem specific architectures already proposed in the literature and whose results are close to the state of the art. In summary, our main contributions are:
The proposal of Deep SVMs with the same architectures and training procedures of either standard or advanced DNNs but using the hinge and A statistical comparison of the performance of DNNs and deep SVMs with standard, fully connected feedforward architectures in mid size classification and regression problems. A statistical comparison of the performance of DNNs and Deep SVMs over large image datasets, using problem specific advanced deep architectures that are representative of the current state of the art.
In other to streamline the presentation, we will use the acronyms
Putting together our results here and those in [15], a first conclusion is that GSVCs, DNNCs and our proposed DSVCs have similar performances in two-class classification problems. However, the situation for regression problems is different as we will numerically show that DSVRs have an hedge over DNNRs, not only when mean absolute error (MAE) is the cross validation (CV) score used for model hyperparametrization but also when using mean squared error (MSE) as the CV score.
The rest of the paper is organized as follows. We give a short unified exposition to standard SVMs in Subsection 2, where we also address training and test complexity and model sparsity. We present our deep SVM proposal in Section 3, where we also briefly discuss training complexity. For the sake of the reader, we briefly recover our results in [15] and Sections 5 and 6 contain our comparison of standard DNNs versus Deep SVMs. More precisely, in Section 5 we will compare, using the same datasets of Section 4, standard fully connected DNNCs and DSVCs first and then DNNRs and DSVRs. In both cases we will also address the statistical significance of our results. On the other hand, in Section 6 we shall work with three large image datasets, the MNIST, MNIST Fashion and CIFAR 10 problems. For the first two we will use the well known LeNet-5 architecture [16] to compare the cross entropy and hinge losses (i.e., the corresponding DNNC and DSVC models) on the ten 2-class problems of distinguishing on MNIST each of the 0 to 9 digits from the others, and similarly for MNIST Fashion. We will follow the same approach with the ten class CIFAR 10 problems but here the cross entropy and hinge losses will act on top of a ResNet32 V1 model [17], a substantially more complex deep residual network. The comparisons will also include LIBLINEAR models and we will address their statistical significance. Finally, the paper ends with a discussion, conclusions and pointers to further work.
Primal and dual SVC and SVR
We will briefly review here primal and dual SVC and SVR in a unified view following [18]. Assume a sample given as a set of triplets
Now if we take
On the other hand, given an initial sample
which is just the standard formulation of the
While its loss is not differentiable, Eq. (2.1) it is, nevertheless, a convex minimization problem. Because of this, one does not attempt to solve it directly; instead, standard Lagrangian optimization is applied to reduce Eq. (2.1) to the equivalent problem of minimizing the general dual function
where
In any case, a clear drawback of trying to solve Eq. (2.1) directly is that it would result in plain linear models, possibly not powerful enough to deal with the problems at hand. The dual formulation overcomes this as it can be stated in a kernel setting by replacing the inner products
Notice that the dual problem has to be solved in a space with dimension
Moreover, an often cited property of SVMs is that their solution is sparse, i.e., the number of nonzero optimal
SVC (red) and Logistic Regression (blue) classifiers for two class linearly separable problems. The SVC hyperplane has a maximum margin while the Logistic Regresion one is closer to the minority class.
It is well known that large margins usually imply better generalization. For linearly separable problems and a
Of course, the usefulness of linear classifiers is rather small and, as we have just discussed, the standard way to introduce non linearity into SVMs is to use the kernel trick. However, the price to pay for this is the need to solve the dual problem, with a cost usually at least quadratic with respect to sample size. But, in any case, we point out that the loss in Eq. (2.1) can be written as
For classification this is just the hinge loss
with
Observe that in both cases the non differentiability of the primal loss is rather mild and in fact very similar to that of the ReLU activations routinely handled in Deep Neural Network (DNN) packages. Moreover, if
More precisely, consider a DNN with linear outputs, a weight set
where
where
Assume that such a network has been trained, let
over
which is just the problem solved in SVC when
The cost of training them is just that of training a standard DNN, which depends on three inputs. The first is the architecture or, more precisely, the number
cost, which now grows linearly with
In any case, the classification and regression performance of such networks, which we will call Deep SVMs, has to be compared with that of standard Gaussian SVC and SVR. This was done in [15] and we will briefly recall it in Section 4. However, and as mentioned before, even if training Gaussian SVMs on large datasets is, as of today, too costly and deep SVM models can be an alternative, it is clear that in the large sample setting there is an obvious counterpart to Deep SVMs, precisely standard cross entropy or mean squared error DNNs. We will compare them with Deep SVMs in Sections 5 and 6.
In this section we will briefly summarize for the convenience of the reader the comparisons between Gaussian two class classification (GSVC) and regression (GSVR) models and their deep counterparts DSVC and DSVR given in [15], where we used for comparison the classification datasets a4a, a8a, australian, cod-rna, diabetes, german.numer, ijcnn1, w7a and w8a, and the regression datasets abalone, bodyfat, cadata, cpusmall, housing, mg, mpg and space_ga. All are taken from the LibSVM repository [21] (which includes references to papers where they have been used) and their training and, when available, test sample sizes and dimensions are given in Table 1.
Number of train and (when available) test patterns and dimensions in the two-class (top) and regression (bottom) problems
Number of train and (when available) test patterns and dimensions in the two-class (top) and regression (bottom) problems
We compared GSVC and GSVR models against DSVC and DSVR networks with 1, 3 and 5 hidden layers to be trained using the adam optimizer over 200 pattern minibatches; more details are given in [15]. For GSVCs we must choose optimal values of two hyperparameters, the regularization constant
In the classification problems having a separate test set available, GSVC and DSVC model hyperparameters were found by 4-fold stratified cross validation (CV) over the train set using accuracy as the CV score; we then computed test accuracies over the test set. When no test set is available (i.e., in three classification and all regression datasets), model performance was measured by a nested, two loop, 4-fold CV approach, in which each one of the four outer folds is kept for testing and the other three folds are passed to the inner loop, where best hyperparameters are obtained again by 4-fold CV and tested on the outer test folds. The regression score was the mean absolute error (MAE).
We report in Table 2 the GSVC accuracies and those of DSVC models with with 1 (DSVC1), 3 (DSVC3) and 5 (DSVC5) layers. Similarly, we report now in Table 3 the GSVR mean absolute errors (MAEs) and those of DSVR models with 1 (DSVR1), 3 (DSVR3) and 5 (DSVR5) layers. For a better reading, in both tables we give in the last column the accuracy or MAE of the best performing DSVC or DSVR model. As it can be seen, GSVCs give the largest accuracies on the four datasets, a4a, cod-rna, german.numer and w7a while for the other five datasets the highest accuracy is achieved by a DSVC model. In regression, GSVRs give the smallest MAE on the four datasets, bodyfat, cpusmall, housing and mpg and a DSVR model gave the smallest MAE for the other four datasets. As concluded in [15], the performance of GSVC and GSVR models, on the one hand, and DSVC and DSVR models is fairly balanced.
Accuracies in the two-class problems (taken from [15])
MAEs in the regression problems (taken from [15])
Train and test times in minutes for the larger regression (top) and classification (bottom) datasets
To finish this Section we will briefly discuss time comparisons between Gaussian and Deep SVM models. We have performed execution time experiments for both, but we observe first that measuring times is not as straightforward as it may look, for we are using two quite different implementations. For standard SVMs we use the scikit-learn implementation that has at its core the extremely efficient C
Accuracy comparisons of one, three and five hidden layer models when hyperparameterized with hinge (DSVC) and binary cross-entropy (DNNC) scores. Larger accuracies in bold face
With all those caveats, we give in Table 4 for the largest regression (top) and classification (bottom) datasets, timings for train (tr_svm/dnn/mlp columns) and test (ts_svm/dnn/mlp columns), which correspond to LIBSVM, Keras and scikit-learn MLPs, respectively. Model hyperparameters for each dataset are those corresponding to the best cross validation estimators just described. As it can be seen, SVM training times are of the order of magnitude of those of MLPClassifier or MLPRegressor. The exceptions are a4a (a rather small dataset) and w8a (where SVMs take too long). As expected, Keras times are longest but they catch up as sample size increases. But notice also that for all datasets but the relatively small abalone and space_ga, LIBSVM test times are about 10 times longer than Keras’ test times, and near or above 50 times those of MLPClassifier or MLPRegressor. Therefore, and technology and implementation issues aside, these performances essentially confirm what is to be expected on theoretical grounds, as discussed at the end of Sections 2 and 3.
Recall that for large datasets the counterpart to our DSVC and DSVR proposals are standard deep networks for either two-class classification with binary cross entropy as the training loss and sigmoid outputs (i.e., DNNC in our notation) or for regression with the squared loss for training and linear outputs (i.e., DNNR in our notation). In both cases we will use Keras implementations of networks with one, three and five hidden layers, and linear outputs. As in [15], these have 100 units on the intermediate ones and 0.1
Also, and as before, Tikhonov regularization will be applied in all layers, with two different parameters,
Mean absolute errors and
values for DSVR and DNNR models with one, three and five hidden layers. Smaller values in bold face when absolute error distributions are statistically different
Mean absolute errors and
Mean squared errors and
We will work with the same CV based hyperparametrization scheme of the previous section but for classification we will use here the
The situation in regression will be slightly different. Notice that if we retain MAE as the validation score, as done in [15], this would help the DSVR models in detriment of DNNR models that aim to minimize the squared error. And inversely, if mean squared error (MSE) is used as the validation score, DNNRs would have then the advantage over DSVRs. Because of this we will compare DNNRs and DSVRs hyperparametrized using both MAE and MSE for CV scoring. Of course, this will result in possibly different hyperparameters but also on a fairer model comparison.
MAE scores are given in Table 6 and MSE ones in Table 7. Both tables also give
Both tables show in bold face the smallest MAE or MSE values when the null hypothesis of the error distributions being the same can be rejected at the
The situation is more balanced with respect to MSE values. As it can be seen in Table 7, DSVRs with 1 hidden layer give a smaller MSE in 4 datasets, in 1 dataset when they have 3 hidden layers and in 2 datasets when having 5 layers; in contrast, DNNRs gives a smaller MSE in two problems using three hidden layers and in one using five. We put also here in italics the smallest values over the three architectures considered when it is significant at the 0.05 level; this is achieved by a DSVR model in 3 problems and in one by a DNNR one. While DSVR also seem to the advantage here, is clearly smaller than in the case of MAEs.
Summing things up, we can say that the classification performance of DNNC and DSVC models is fairly similar but, on the other hand, DSVR models seem to perform clearly better for regression, when MAE is the preferred metric and also slightly so when MSE is considered. As a possible reason for this is that DSVR models can tune an extra parameter, the
In this Section we will compare the hinge and cross entropy losses on three large image datasets using more advanced DNN architectures. More precisely, we will first apply two class versions of the well known LeNet-5 convolutional neural network [16] on both the MNIST and MNIST Fashion problems. While quite simple for today’s deep proposals, LeNet-5 has still a fairly complex structure, depicted for instance in Fig. 2 of [16]. Next, we will consider a more advanced Residual Network architecture, ResNet32 V1, on the CIFAR 10 image classification problem. The ResNet structure is still more complicated, as it can be seen, for instance, in Fig. 3 of [17].
Basic dataset information is given in Table 8. GSVC models are very costly to hyperparametrize now, but for illustration purposes we will also give results for LIBLINEAR, the ad-hoc implementation for linear SVMs. While in general not competitive with GSVCs, this is less so for high dimensional problems, as the ones under consideration here. While all of them are 10 class problems, we will consider for each datasets ten 2-class problems where each single class has to be distinguished from the remaining nine classes.
Number of train and test patterns and dimension in one versus rest MNIST, MNIST fashion and CIFAR 10 problems
Number of train and test patterns and dimension in one versus rest MNIST, MNIST fashion and CIFAR 10 problems
MNIST is by now a classical character recognition problem that has been widely used as a DNN benchmark, with many architectures being proposed over the years. According to Rodrigo Benenson’s web page [22], the state of the art networks as of 2016 achieve a test error of 0.21% [23]. The error rate when we apply the LeNet-5 implementation on the full MNIST 10 class problem is of 1.11%; a 0.95% error rate is reported in Y. LeCun web page [24] for LeNet-5. (Notice that here and the following experiments different executions may result in slight random changes in accuracy values). In our experiments we will use the MNIST dataset available in [24], which has 60,000 training and 10,000 test patterns given as 28
Our LeNet-5 implementation is fairly standard, with two convolutional layers with 3
Accuracy comparisons of DNNC, DSVC and LibLinear for LeNet5 nets on MNIST (left: train vs test, right: test vs train; larger accuracies in bold face)
Accuracy comparisons of DNNC, DSVC and LibLinear for LeNet5 nets on MNIST (left: train vs test, right: test vs train; larger accuracies in bold face)
We report test accuracies under two different scenarios, a first one were we hyperparametrize and train the networks on the training set and evaluate them on the test subset, and a more difficult one, where we switch the datasets, hyperparametrizing and training now on the 10,000 pattern test set and testing the resulting models on the 50,000 patterns of the train datasets. We also use here
We will use the same LeNet-5 network on the MNIST Fashion dataset [25]. It also contains 60,000 training and 10,000 test patterns in 10 classes of clothing products (trousers, pullovers, …) given as 28
We have performed over the MNIST Fashion dataset the same two-class experiments done for standard MNIST, hyperparameterizing again separately the Tikhonov regularization penalties of the dense hidden layer weights and of the output weights; we have used the same
Accuracy comparisons of DNNC, DSVC and LibLinear for LeNet5 nets on MNIST fashion (left: train vs test, right: test vs train; larger accuracies in bold face)
Accuracy comparisons of DNNC, DSVC and LibLinear for LeNet5 nets on MNIST fashion (left: train vs test, right: test vs train; larger accuracies in bold face)
Our last experiment with advanced deep architectures will deal with the CIFAR-10 dataset. It contains 60,000 32
We shall use the ResNet32 V1 implementation in [27] of the deep residual networks proposed in Subsection 4.2 of [17]. The network has three stacks of five residual blocks, each with two convolution layers plus a residual connection. The convolutional layers at each stack result in 16, 32 and 64 output channels respectively. Originally batch normalization is applied but no regularization takes place; moreover, the outputs of the last convolutional layer are flattened and fed into ten softmax model outputs. This results in a total of about 468,000 trainable weights. In our case we just feed these flattened outputs into either a sigmoid output for the DNNC model or a linear one for the DSVC one. We point out that training here is quite costly, with a single epoch taking about 3 minutes. Because of this we have performed a limited hyperparametrization in our CIFAR 10 experiments, considering only
The test accuracy reported in [17] is 92.49% using data augmentation; the best accuracy in [22] is 96.53%. With our implementation we have achieved a 85.33% accuracy after 200 epochs without data augmentation and 92.04% with it. Test results are given now in Table 11 for the train vs test problem (left) and for the test vs train one (right); the table also shows mean accuracies. Again, for easier reading we put in bold face largest accuracies for each problem. LIBLINEAR results are quite bad, as it seems to assign all patterns to the majority class. Applying a Wilcoxon signed rank test along the previous lines yields a
Accuracy comparisons of DNNC, DSVC and LibLinear for ResNet32 V1 nets on CIFAR 10 (left: train vs test, right: test vs train; larger accuracies in bold face)
Accuracy comparisons of DNNC, DSVC and LibLinear for ResNet32 V1 nets on CIFAR 10 (left: train vs test, right: test vs train; larger accuracies in bold face)
Finally, and as a summary of the previous results, we give in Table 12 the sample sizes, the values of the Wilcoxon test statistics (i.e., the values returned by the Wilcoxon test algorithm) and the
Summary of Wilcoxon’s signed ranked test results for the accuracies in the MNIST, MNIST Fashion and CIFAR 10 problems, as well as for all these problems
We have proposed deep DSVC and DSVR models which jointly learn an optimal last hidden layer representation and a maximum margin hyperplane acting upon it. Our first motivation is that Gaussian SVMs are not time competitive over large datasets. A second motivation is that the mild non differentiability of the hinge and
After reviewing the results in [15], we have compared our proposals against standard DNN cross entropy classifiers or squared error regressors. We do so because the counterparts in large problems of our deep SVM proposals are precisely standard DNNs. Our experiments tentatively show that DNNC and DSVC models perform similarly in classification problems, as we cannot statistically reject the null hypothesis of equal DNNC and DSVC accuracies over two-class problems (with the caveats mentioned in Section 5). The situation is different, however, in regression problems. In principle, results should be now influenced by the scoring function used to hyperparametrize regularization penalties. Indeed, when using the mean absolute error (MAE) for this, we can reject in favor of our DSVR models the null hypothesis of the DSVR and DNNR error distribution being the same. On the other hand, one would expect a better DNNR performance when using mean squared error (MSE) CV scoring; however, this turns out not to be so and DSVR models have also a light advantage here. We point to the extra
As lines for further work,we first mention that large margins for DNNs have been studied in [28, 29, 30, 31], in some cases using sigmoid/softmax outputs and binary or categorical cross entropies. On the other hand, and as mentioned, Deep SVMs naturally maximize the margins achieved in the last hidden layer. It is thus natural to compare both margins for classification problems. A second research line has to do with the kernels induced by Deep SVMs on the last hidden layer, as there has recently been a flurry of activity on possible kernel structures on the last hidden layer of highly overparametrized DNNs [32, 33, 34, 35]). It is thus natural to study the possible relationship between these kernels and those induced by Deep SVMs on the rather wide last hidden layers considered here. Finally, one of the main advantages of the current DNN paradigms is the great flexibility they allow to define neural architectures (see for instance [36, 37, 38, 39, 40, 41]), which results in their wide application [42, 43, 44, 45, 46, 47]; thus, a natural question is how to take advantage of losses such as the hinge or
Footnotes
Acknowledgments
With partial support from Spain’s grants TIN2016-76406-P. Work supported also by the UAM-ADIC Chair for Data Science and Machine Learning. We gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at UAM.
