Abstract
Classification problem is an important research direction in machine learning. Nonparallel support vector machine (NPSVM) is an important classifier used to solve classification problems. It is widely used because of its structural risk minimization principle, kernel trick, and sparsity. When solving multi-class classification problems, NPSVM will encounter the problem of sample noises, low discrimination speed and unrecognized regions, which will affect its performance. In this paper, based on the multi-class NPSVM model, two improvements are made, and a directed acyclic graph fuzzy nonparallel support vector machine (DAG-F-NPSVM) model is established. On the one hand, for the noises that may exist in the data set, the density information is used to add fuzzy membership to the samples, so that the contribution of each samples to the classification is treated differently. On the other hand, in order to reduce the decision time and solve the problem of unrecognized regions, the theory of directed acyclic graph (DAG) is introduced. Finally, the advantages of the new model in classification accuracy and decision speed is verified through UCI machine learning standard data set experiments. Finally, Friedman test and Bonferroni-Dunn test are used to verify the statistical significance of this new method.
Keywords
Introduction
Machine learning is a common research hotspot in the field of artificial intelligence and pattern recognition, and support vector machine (SVM) is a supervised machine learning method. SVM is based on the structural risk minimization principle of statistical learning theory and the VC dimension theory [1]. It is proposed to solve the classification problem with the help of optimization theory. Compared with traditional models based on empirical risk minimization, SVM has better model generalization ability, and stronger learning ability for small sample, non-linear and high-dimensional data.
Initially, SVM was proposed for the binary classification problem. The former Soviet Union scholar Vladimir N. Vapnik and others discussed and established a standard support vector machine (C-SVM) [2]. In practical application research, many improved models have been put forward on the basis of C-SVM [3]. Some scholars proposed some parallel SVMs. For the least squares support vector machine (LSSVM) [4–6] least squares function with equality constraints is used as a loss function instead of the complex quadratic programming problem of SVM, so the training speed is relatively fast. Since the training points in LSSVM all contribute to the decision function, the algorithm is not sparsity. For the ν-support vector machine (ν-SVM) [7–9], a parameter ν lets one effectively control the number of support vectors. However, compared to regular C-SVM, the formulation of ν-SVM is more complicated, so up to now there have been no effective methods for solving large-scale classification problems with ν-SVM. Since the parallel SVMs does not consider the difference in data distribution of samples, some scholars proposed nonparallel SVMs on this basis. For the generalized eigenvalue proximal support vector machine (GEPSVM) [10–12], each plane is as close as possible to one of the data sets and as far as possible from the other one. The nonparallel planes are eigenvectors corresponding to the smallest eigenvalues of two related generalized eigenvalue problems. However, as far as the classification rules of GEPSVM are concerned, classifying the sample to be tested as the nearest hyperplane will inevitably produce a large number of misclassified samples in the intersection area of the two fitted planes. For twin support vector machine (TWSVM) [13–15], it seeks two nonparallel proximal hyperplanes such that each hyperplane is closer to one of the two classes and is at least one distance from the other. The result of this strategy is that TWSVM can solve two smaller quadratic programming problems (QPPs). But it must calculate the inverse of the matrix in the process of solving QPPs, which is difficult or even impossible for large data sets. For NPSVM [16], it seeks two nonparallel hyperplanes, such that each class locates as much as possible in the ɛ-band of the hyperplane and each hyperplane is at least one distance from the other. Compared with GEPSVM and TWSVM, NPSVM has the advantages of structural risk minimization principle, kernel trick and sparsity, so it has become a research hotspot in recent years [17–22].
In practical applications, there are more common multi-class classification problems, and the application scope of NPSVM is limited. Therefore, how to effectively extend NPSVM to multi-class classification has become a research focus in this field. When establishing the multi-class classification model, some scholars consider all samples at one time and solve the optimization problem of a multi-objective function to achieve multi-class classification. Some scholars construct a series of binary classifiers in a certain way to realize multi-class classification. The latter has become a more widely used method by scholars due to its advantages such as reducing classification cost, improving classification accuracy and simplifying objective function. At present, two well-known strategies are One-Versus-Rest (OVR) strategy [23] and One-Versus-One (OVO) strategy [24]. In the training phase, OVR takes samples of a certain class as a separate class, and the samples of the other classes together as another class to form binary classification problems. In all binary classifiers, the class with the maximum value of the classification function is determined as the final result. But it has problems such as unrecognizable regions, unbalanced samples, and long training time. OVO constructs a classifier between every two classes, and evaluates the result of each decision function through voting. The class with the most votes is the final class. In practical applications, OVO is considered to be one of the most effective decomposition strategies due to its small solution scale, high classification accuracy, and fast training speed [25, 26]. Although OVO has these advantages in the training phase, it also has certain shortcomings. As the classes of samples increase, classifiers also increase rapidly, which will affect the decision-making speed. OVO also has problem of unrecognizable areas. In the voting phase, there may be situations that the voting results of several classes are the same, which will affect the performance of the classification. For OVO-NPSVM, when solving the multi-class classification problem, the above situation may be encountered. So this paper introduces the theory of directed acyclic graph (DAG) [27, 28] to solve the problem of unrecognizable area and speed up the decision-making.
Whether it is a binary classification problem or a multi-class one, the classifier will be affected by the noises in the samples, so the classification hyperplane is often not the true optimal one. At present, many scholars use the membership degree idea of fuzzy set theory to assign different samples with different penalty weights, so that the samples have different contributions to the objective function. This reduces the influence of noises to some extent. Lin [29, 30] et al. first proposed the fuzzy support vector machine (FSVM). According to the distance between the samples and the class center, the fuzzy membership degree with distinguishing ability was added to the training samples, but the support vector was not considered enough. Wu [31] et al. proposed a fuzzy multi-class classification support vector machine algorithm for unbalanced data. The algorithm uses the distance from the training samples to the class center and the weighted class overlap method to design fuzzy membership function for the samples. Ju [32] et al. proposed using fuzzy set theory, interval-valued fuzzy set theory and NPSVM theory to construct fuzzy nonparallel support vector machines and interval-valued fuzzy nonparallel support vector machines. Although the classification models with two membership degrees have better classification accuracy, the cost is that it takes time to adjust the parameters. In the current era of big data, a simple and effective method of adding membership should be the first choice. As OVO-NPSVM may be affected by noises when solving multi-class classification problems, this paper establishes a membership degree based on density for samples. The membership degree takes into account the structural information of the samples and effectively distinguishes the importance of different samples.
In summary, a novel nonparallel SVM, termed directed acyclic graph fuzzy nonparallel support vector machine (DAG-F-NPSVM) for multi-class classification is proposed in this paper. And it has the following incomparable over NPSVM. The theory of DAG is introduced to solve the problem of unrecognizable area in OVO-NPSVM and speed up decision-making. The fuzzy membership is added to the samples to solve the problem of noises, thus improving the classification accuracy.
In recent years, a series of achievements have been made in the research of multi-class classification support vector theory and algorithm. Among them, it is of great theoretical significance to study how to reduce the influence of sample noises and speed up decision-making. In addition, the application of multi-class classification model to recommendation algorithm, text classification, image recognition and other fields is also of great practical significance.
In this paper, many acronyms are used. To make it easier to understand, we summarize the important ones in Table 1.
The full names and Acronyms of the main classifiers in the paper
The full names and Acronyms of the main classifiers in the paper
The structure of this article is as follows: Section 2 briefly reviews the basic concepts of NPSVM. The third section introduces the DAG-F-NPSVM method and gives the corresponding algorithm. The fourth section is the results and analysis of numerical experiments. The last section is the conclusion.
In this section, a review of linear and nonlinear NPSVM will be described briefly.
Consider the binary classification problem with the training set:
and
For the linear classification problem, the two nonparallel hyperplanes (1) and (2) are solved by the following two convex QPPs, so that each class of points is located in the ɛ-band of the hyperplane as much as possible, and each hyperplane is at least one distance from the other.
and
In the objective function of (3),

Geometrical illustration of NPSVM in R2.
Introduce Lagrange multiplier vectors
and
By solving the dual problems (5) and (6), w+ and w- can be obtained respectively, and the expressions are:
and
Choose a component of
and
The decision functions are:
and
A new test point x is classified as +1 or –1 depending on which of the two hyperplanes it lies closer to, i.e.
In many cases, the data samples cannot be separated by the linear class boundary. Therefore, we can use the kernel trick to extend the linear NPSVM to the nonlinear case. After the sample are projected into the high-dimensional feature space, they are easier to separate. By introducing kernel function K (x
i
, x
j
) = φ (x
i
) · φ (x
j
) and the corresponding transformation x = φ (x), where
and
Its decision function and discriminant method are also determined in a similar way to linear NPSVM:

Flow chart of DAG-F-NPSVM.
and
The way to identify an unknown sample is similar to the linear problem.
In this paper, on the basis of multi-class classification NPSVM, the idea of fuzzy membership in fuzzy set theory and DAG in graph theory are introduced at the same time, thereby constructing DAG-F-NPSVM model. The introduction of fuzzy membership is to reduce the influence of sample noise on finding the optimal hyperplane. The introduction of the directed acyclic graph theory is to overcome the problem of unrecognizable areas in the multi-class classification and reduce the decision-making time. The detailed flow chart of this method is shown in Fig. 2.
Fuzzy membership based on sample density
Adding membership degree to the samples can improve the classification accuracy of the model to a certain extent. However, it is not advisable to add too complex membership in the classification of large data sets. The membership degree added to the samples in this paper is an improved one related to density. It makes use of the distribution information of samples and treats each sample differently for the importance of decision function. At the same time, it also considers the simplicity and operability in calculation.
Given data set
Let
Obviously,
Establish fuzzy membership based on sample density for x
i
:
The membership degree consists of the product of two ratios. First of all, find the number of homogeneous samples in the circle of samples x with radius r. After that, divide it by the number of homogeneous samples in the densest circle of the class to obtain the first ratio. Then divide it by the number of all samples in the r circle of the samples to the second ratio. The membership degree of this product form indicates that when x i is densely distributed with homogeneous points and relatively sparsely distributed with heterogeneous points, u (x i ) approaches 1. Meanwhile, when x i is sparsely distributed with homogeneous points and relatively densely distributed with heterogeneous points, u (x i ) tends to be 0. In this way, from the membership degree u (x i ), we can see the relative density of x i .
In the training phase, DAG-F-NPSVM uses the pairwise classification method. That is, one class and the other class are used for training. For the M-class classification problem, it is necessary to construct (M-1)/2 NPSVMs and solve M(M-1) QPPs to obtain M(M-1) nonparallel hyperplanes. In the testing phase, this method constructs all classifiers into a two-way DAG, which includes M(M-1)/2 nodes and M leaves. Among them, each node is an NPSVM classifier and is connected to two nodes (or leaves) in the next layer. The vertex contains only one node, which is called the root node. The second layer has two nodes, and so on, the i th layer contains i nodes, the bottom layer contains M leaf nodes, among them the i th node of the j th layer points to the i th and i + 1 th nodes of the j + 1 th layer. When an unknown sample is classified, it starts from the root node at the top. According to the classification results of the root node, the left node or right node of the next layer is used to continue the classification until a leaf at the bottom is reached. The class represented by the leaf is the class of the unknown sample.
Linear DAG-F-NPSVM
Consider the training set for a multi-class classification problem:
We aim to construct M nonparallel hyperplanes
In the class M, if the samples of the i
th
class is trained with the samples of the j
th
class. It is required that samples of the i
th
class should be located in the ɛ-band of its hyperplane as much as possible, and the i
th
hyperplane should be at least one distance from the j
th
hyperplane. Then the linear DAG-F-NPSVM classifier solves following two QPPs:
and
In order to obtain the solutions of (24) and (25), we need to solve the dual problem. Taking formula (24) as an example, construct the Lagrangian function:
From Equation (27) we can get:
Then substituting Equation (39) into Equation (26) and combining Equations (27)–(36), the dual problem of the original problem (24) can be obtained:
When formula (39) is obtained, a component
Similarly, we can also get the dual problem of the original problem (25):
and
When a component
Once we get solutions (w
ij
, b
ij
) and (w
ji
, b
ji
) in problems (24) and (25). A new testing point x is classified as i
th
or j
th
depending on which of the two hyperplanes it lies closer to, i.e.,
The decision method by DAG-F-NPSVM is shown in Fig. 3. If we consider four classes, DAG-F-NPSVM will comprise 4*3/2 = 6 internal nodes. However, a new test point only needs to be judged by three classifiers to get the judgment result.

The decision method of four-class classification of DAG-F-NPSVM.
In the original problems (24) and (25), the kernel functions K (x, x
T
) = φ (x) · φ (x)
T
can be applied directly in order to extend the linear DAG-F-NPSVM to the nonlinear case. The corresponding optimization problems are as follows:
and
The way to identify an unknown sample is similar to the linear problem.
Regardless of the linear or non-linear DAG-F-NPSVM model, if the fuzzy membership takes a constant value of one, it means that all samples are treated equally and given the same weight. The model will degenerate into a directed acyclic graph nonparallel support vector machine (DAG-NPSVM). Therefore, DAG-NPSVM is a special case of DAG-F-NPSVM.
The algorithm for the DAG-F-NPSVM is described as follows:
Training phase
Input the training set A and B
for i = 1 to M, j = j+1 to M
If i th class is trained with j th class.
1. Define fuzzy membership degree for x by the formula (22);
2. Choose appropriate parameters ɛ > 0, C1, C2 > 0 for problem (40), C3, C4 > 0 for problem (42);
3. Choose appropriate kernel K, construct and solve the two convex QPPs (46) and (47), then obtain α(*) and β;
4. Construct the decision functions
and
separately.
end
for a test data, define i = 1 and j = M.
while (i !=j)
If Class i < Class j
j = j - 1
else
i = i + 1
end
end
return Class i
When DAG-F-NPSVM algorithm is used for the M-class classification problem with l samples, M(M-1)/2 binary NPSVM classifiers are constructed to solve M(M-1) QPPs. Assume that samples are evenly distributed among the classes, so that each class contains approximately l / M samples. The DAG-F-NPSVM classifier is trained with a pairwise classification method. That is, all possible binary classifiers are constructed from the training set of M classes, and each classifier is only trained between two classes. Each class provides constraints for another class. In this way, each binary classifier is given l/M constraints. Therefore, the training complexity of the DAG-F-NPSVM classifier is:
Since fuzzy membership degree is added before training, it doesn’t affect training complexity.
Numerical experiment and analysis
Experimental environment and data
In order to verify the performance of DAG-F-NPSVM, we compare it with OVO-SVM, DAG-SVM, OVO-NPSVM, OVO-F-NPSVM and DAG-NPSVM on different types of data sets. These methods are all performed on a PC with Intel(R) Core (TM)i5-3317U 1.70 GHz processor and 4GB memory in MATLAB 2014b. In this paper, from the UCI database [34] these 10 data sets are selected for training and testing, including Iris, Wine, Seeds, Soybeans, E coli, Glass, Vowels, Dermatology, Zoo, and Yeast. The number of samples, feature dimensions, and the number of classes of each data set are shown in Table 2.
Basic data set description
Basic data set description
In order to compare the test results fairly, the five-fold cross-validation method is adopted in this article, that is, the data set is randomly divided into five groups, four of which are used as training samples, and the other group is used as test samples. Repeat the experiment five times to make each group of data equal as a test sample. Different parameter settings have a certain impact on the classification results. The grid search method is used to optimize the parameters, and the kernel function uses Gaussian kernel function (RBF):
Comparison of classification results of several multi-class classification models
Comparison of classification results of several multi-class classification models
Test time comparison of several multi-class classification models (s)
Table 3, it can be found that among the 10 data sets, the classification accuracy of DAG-F-NPSVM ranks first on 8 data sets, two of which are tied for first on one data set. Therefore, the average accuracy of DAG-F-NPSVM ranks first, OVO-F-NPSVM ranks second, DAG-NPSVM ranks third, OVO-NPSVM ranks fourth, DAG-SVM ranks fifth, and OVO-SVM ranks last. DAG-NPSVM is an improvement of DAG-SVM, which increases the average accuracy of DAG-SVM by 3.41%. For OVO-NPSVM and OVO-SVM, the average accuracy of OVO-NPSVM is also 3.44% higher than that of OVO-SVM. It shows that in this experiment, for the same multi-class classification strategy, the classify-cation result of NPSVM is much better than SVM. Compared with the multi-class classification under the DAG and OVO strategies, the average accuracy of DAG-SVM is about 0.1% higher than that of OVO-SVM, the average accuracy of DAG-NPSVM is about 0.1% higher than that of OVO-NPSVM, and the average accuracy of DAG-F-NPSVM is about 0.19% higher than that of OVO-F-NPSVM, which proves that the accuracy of the classifier in DAG strategy is slightly higher than that in OVO strategy. DAG-F-NPSVM is another improvement of DAG-NPSVM. It increases the average accuracy of DAG-NPSVM by 0.75%. And OVO-F-NPSVM is 0.64% higher than that of OVO-NPSVM, indicating that adding fuzzy member-ship will increase the classification accuracy and achieve the purpose of reducing the impact of noises in the sample. A new classification method is effective for data sets containing noises. The corresponding training time is also listed in Table 4. It can be found that:
The training time between DAG-SVM and OVO-SVM is roughly the same. The training time between DAG-F-NPSVM, DAG-NPSVM, OVO-NPSVM and OVO-F-NPSVM is also similar. This shows that DAG strategy and OVO strategy both use pairwise classification approach in the training phase. When comparing the training time of DAG-NPSVM and DAG-SVM, OVO-NPSVM and OVO-SVM, it can be seen that the training time of DAG-NPSVM is about a quarter of that of DAG-SVM. Similarly, the training time of OVO-NPSVM is about a quarter of the time of OVO-SVM. This is because the two-classifier NPSVM solves two smaller convex QPPs, while SVM solves a large convex QPP. In theory, the training time of NPSVM is a quarter of SVM. Although DAG-F-NPSVM is an improvement of DAG-NPSVM, and OVO-F-NPSVM is also an improvement of OVO-NPSVM, their training time does not increase due to the increase of fuzzy membership, which is added before training. It can be shown that adding fuzzy membership in DAG-NPSVM and OVO-NPSVM, respectively, not only improves the classification accuracy to a certain extent, but also does not change the computational complexity of the model.
Table 4 lists the test times of ten data sets under six classifiers. From the Table, we can see that the test times of DAG-NPSVM and DAG-F-NPSVM are almost the same. The average test time of DAG-F-NPSVM is 50.42% less than that of OVO-F-NPSVM, which shows the absolute advantage of DAG in the test phase. Because under the DAG strategy, the classification results can be obtained without accessing all the classifiers, which fully reflects its advantages in the testing phase.
Statistical test
In order to further analyze the performance of the six algorithms, Friedman test [35] and Bonferroni-Dunn test [35] are used to test if there is a significant difference between the experimental results.
Friedman test
Friedman test is proved to be a simple non- parametric statistic method. To compute the Friedman statistic, the average ranking based on classification accuracy in Table 3 is listed in Table 5. Under the null hypothesis that all the algorithms are equivalent, the Friedman statistic can be computed as follows:
Average ranking based on classification accuracy in Table 3
Average ranking based on classification accuracy in Table 3
According to (48) and (49), for the nonlinear case, we can obtain
Now, Bonferroni–Dunn test can be used to further analyze the relative performance among the comparing algorithms. Here, the difference between the average ranks of DAG-F-NPSVM and one baseline is compared with the following critical difference (CD):
We have q α = 2 . 326 at significance level α = 0.1, and thus CD = 1.95 (M = 6, N = 10). To visually show the relative performance of DAG-F-NPSVM comparing with other algorithms, Fig. 4 provides the CD diagram, among them the average ranks of each comparing algorithm are plotted along the axis. The lowest (best) rank on the axis is to the right since we perceive the algorithms on the right side as better. And any comparing algorithm with the average rank within one CD is interconnected with DAG-F-NPSVM. Otherwise, any other algorithm whose average rank is one CD outside DAG-F-NPSVM is considered to have significantly different performance with DAG-F-NPSVM. As shown in Fig. 4, the average ranks of OVO-SVM (5.35-1.65 = 3.7 > 1.95), DAG-SVM (5.05-1.65 = 3.4 > 1.95) are both one CD outside DAG-F-NPSVM, while the average rank of OVO-NPSVM (3.55-1.65 = 1.9 < 1.95) and DAG-NPSVM (3.25-1.65 = 1.6 < 1.95) are less than but close to one CD. Although The average ranks of OVO-F-NPSVM (1.95-1.65 = 0.3 < 1.95) is less than to one CD. This is because compared with this classifier, the new classifier in this paper has a significant advantage in decision-making speed rather than classification accuracy.

Comparison of the DAG-F-NPSVM against other five comparing algorithms with the Bonferroni–Dunn test.
The results show that DAG-F-NPSVM is statistically significantly better than OVO-SVM and DAG-SVM. Although the DAG-F-NPSVM is not statistically superior to OVO-NPSVM, DAG-NPSVM and OVP-F-NPSVM, however, compared to the three methods in Table 5, DAG-F-NPSVM can obtain better accuracy on most data sets. This shows that DAG-F-NPSVM has better generalization ability.
NPSVM is widely used as an important binary classifier for solving classification problems. Moreover, it can be extended to solve multiple classification problems. But the problems of sample noises, unrecognized regions and low decision speed affect its performance. To improve the situation, DAG-F-NPSVM model is established in this paper. This new method proposes improvements in two aspects. Firstly, as the sample noises in the data set affect the determination of the optimal hyperplane, fuzzy membership degree is added based on sample density. In this way, different samples play different roles in determining the optimal hyperplane. Secondly, DAG theory is introduced to solve the problem of unrecognizable regions and speed up the decision-making speed of the model. The experimental results on UCI benchmark data sets show that DAG-F-NPSVM is superior to several other multi-class support vector machines in terms of classification accuracy and decision speed.
Although DAG-F-NPSVM has the above advantages, there are also some disadvantages that need to be improved. First of all, the DAG strategy classification algorithm has the phenomenon of error accumulation in the classification process. How to overcome this influence needs further research. Secondly, it is necessary to optimize the additional parameters when adding fuzzy membership degree to the samples. Although the grid search method can search for ideal parameters, this work requires a certain amount of calculation time. In other words, the improvement of classification accuracy comes at the cost of parameter optimization time. Therefore, how to quickly and effectively select the best parameter is a problem for further study.
Footnotes
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NO. 71771028), Beijing Intelligent Logistics System Collaborative Innovation Center and Major scientific research project of Beijing WUZI university (NO. 2019XJZD06)
