Abstract
In the last decades, numerous optimization-based methods have been proposed for solving classification problems in pattern recognition. These methods mainly construct a straight line or a hyperplane to separate a given data set to be two classes. In this paper, we propose a new nonlinear classifier based on the Choquet integral with respect to a signed efficiency measure, and the boundary is a broken line (two considered attributes) or a Choquet broken-hyperplane (more considered attributes). Firstly, the Choquet distance of two points in n-dimensional space is proposed. Secondly, according to the Choquet distance, two nonlinear classification optimal models are presented. Finally, some experimental results show that the efficiency of the models for solving classification problems. The related results enrich the research of classification problems in pattern recognition.
Introduction
Classification is an important topic in areas of machine learning, statistics, and pattern recognition. Given a data set, which includes n attributes, m classes, l numerical records of each considered attributes, and the corresponding classes, a classification is a procedure to determine which class a new record of attributes belongs to. Essentially, this type of classification is an optimization problem and has been widely applied in knowledge discovery and data mining. Numerous approaches have been proposed for solving the classification problems, such as decision trees [10, 11], Bayesian networks [22], neural networks [21], Support vector machine [3], and etc.
Recently, the classification methods based on nonlinear integrals are studied and some encouraging results have been obtained. In these methods, using a nonlinear integral to fuse the values of a new record and to determine which class it belongs to. Paper [9] uses the fuzzy integral in the field of image processing. Grabisch and Sugeno propose to use fuzzy t-conorm integrals for classification [15]. Paper [16] uses the Choquet integral with respect to the non-additive measure on statistical pattern classification based on possibility theory. An optimization-based classification model is later proposed with a non-additive measure [17]. Paper [14] proposes the 2-additive classification with feature selections. Paper [1] gives a classification method by using Choquet integrals and Logistic regression. Paper [18] uses Choquet integral in supervised classification. Classification can also be achieved by directly separating the date using the weighted Choquet integral projection [13] or using a penalized signed fuzzy measure [6]. A detailed discussion of geometric meaning of the contributions from feature attributes in nonlinear classification can be found in [8]. Paper [2, 24] gives the method for multiple classifiers and using the fuzzy integral as aggregation Operators.
We can use the Choquet integral [5] as a special nonlinear integral and make it to be a key role in classification. In particular, Wang and his collaborators make an outstanding contribution in this field [6, 26].
The advantage of above methods by using nonlinear integrals is that they consider the interaction among attributes’ contributions towards the classification. As we have known, that in many real data sets, such an interaction among the considered attributes exists. Most existing classification methods do not care the such type interaction. The classifying boundary depend on the observed values of attributes. If the above-mentioned interaction exists, we may use it to get more reasonable results.
In this paper, based on the works of Wang, we use the Choquet integral for classification. A new concept “Choquet distance” is proposed and the optimal models based on Choquet distance for data classification are given.
The paper is organized as follows. After the introduction, in Section 2, nonadditive set function and nonlinear integrals are described. Section 3 proposes the Choquet distance. In Section 4, a new optimization-based classification models with the Choquet distance are developed. Section 5 shows the performance of proposed models in experimental results. Finally, Section 6 summarizes some conclusions from this research.
Fundamental concepts
Let X = {x1, x2, ⋯ , x n } be the attributes set. In classification, a data set consisting of l example records, called training set, is given. Each record contains the values of feature attributes and the corresponding class. Positive integer l is called the data size. The values of attributes are numerical, and described by an n-dimensional vector, f = (f (x1) , f (x2) , …, f (x n )), that is, f : X → (- ∞ , ∞). The range of attributes, which is a subset of n-dimensional Euclidean space (simply n-dimensional space), is called the feature space. The set of all possible values of classes, {C1, C2, ⋯ , C m }, is denoted by C, where each C k , k = 1, 2, ⋯ , m, refers to a specified class. Thus, the j-th sample record consists of the j-th observation for all feature attributes and the classifying attribute, and is denoted by (f j , C k j ) = (f j (x1) , f j (x2) , ⋯ , f j (x n ) , C k j ), j = 1, 2, ⋯ , l, where k j belongs to {1, 2, ⋯ , m}.
The objective of classification is to build a classification model expressed by the feature attributes. When a new record of feature attribute is available, we can use the model to determine the class to which the new record belongs. That is to say, based on the training set, the required model should provide mathematical expressions of decision boundaries that divide the feature space into several pieces, each of them correspnds to a class in C.
We present a method of classification based on the Choquet integral [13, 26]. This method can be regarded as the idea of projecting all data points in the feature space onto a real axis through a Choquet integral, and then using a one-dimensional classifier to classify these points according to a certain criterion optimally. At the same time, the classifying attribute is numericalized to be a virtual variable. Good performance of this method comes from the use of the nonadditive measure and the relevant Choquet integral, since the nonadditivity of the set function reflects the importance of the feature attributes, as well as their inherent interaction, toward the discrimination of the points. In fact, each feature attribute plays a respective role in making a decision that points out which class a given observation of feature attributes belongs to. That is, each feature attribute has respective important index reflecting their amounts of contributions toward the decision. Furthermore, the global contribution of several feature attributes to the decision of classification is not just the simple sum of the contribution of each feature to the decision, but may vary nonadditivity. A combination of the feature attributes may have a mutually restraining or a complementary synergy effect on their contributions toward the classification decision. So, the nonadditive set function defined on the power set of all feature attributes is a proper representation of the respective importance of the feature attributes and the interaction among them, and a relevant Choquet integral is a good fusion tool to aggregate the information coming from the feature attributes for the classification. The following are the details of these basic concepts and the mathematics model for the classification problem.
Some related definitions are introduced as follows.
Let P (X) be the power set of X.
μ (∅) =0; A ⊂ B ⇒ μ (A) ≤ μ (B) , ∀ A, B ∈ P (X).
Conditions 1) and 2) imply the nonnegativity. Monotone measure μ is nonadditive in general. If μ (X) =1, then μ is said to be regular.
To calculate the value of the Choquet integral of a given function f on X, the values of f = (f (x1) , f (x2) , ⋯ , f (x
n
)), should be first rearranged into a nondecreasing order, that is
The Choquet integral is a generalization of the Lebesgue-like integral. When μ is a classical measure (i.e., additive), the Choquet integral coincides with the Lebesgue-like integral. The Choquet integral possesses most properties (but not the linearity) of the Lebesgue-like integral and has its naturalsemantics.
The signed efficiency measure defined below is a generalization of monotone measure and is adopted to develop optimization-based nonlinear classification models in general cases.
Nonlinear integrals are used as data aggregation tools to integrate the values of the attributes with respect to a signed efficiency measure. As one of nonlinear integrals, the translatable Choquet integral is more appropriate for applications, such as classification, because it provides very important information in interactions among attributes.
For every d = 1, 2, ⋯ , 2 n - 1, we can use a corresponding bit string d n dn-1 ⋯ d1 to express d, where each d i , i = 1, 2, ⋯ , n is either 1 or 0. For example, when n = 5, 1 is expressed by 00001, 2 is 00010, 3 is 00011, ⋯, and 31 is 11111. Each set E d ∈ P (X) corresponds to a bit string d with E d = ∪ d i =1 {x i }. Thus, μ (E d ) = μ (∪ d i =1 {x i }) can be simply denoted by μ d , d = 1, 2, ⋯ , 2 n - 1. For example, μ31 = μ ({x1, x2, x3, x4, x5}), μ15 = μ ({x1, x2, x3, x4}) , μ10 = μ ({x2, x4}), and so on.
Using the above notation, the translatable Choquet integral can be calculated by:
The validation of (2) is proved as a theorem in [26].
Thus, using (2) we can get:
So
The significance of this alternate formula is that the value of the Choquet integral is now expressed as a linear function of the values of μ. Hence, when the data set of the values of the integrand f and the corresponding integration value are available, an algebraic method can be used to estimate the optimal values of μ.
In the following, the translatable Choquet integral of f with respect to a signed efficiency measure is called the Choquet integral for short. A generalized weighted Choquet integral may be expressed by (C) ∫ (a + bf) dμ where μ is a signed efficiency measure, and a = (a1, a2, ⋯ a n ), b = (b l , b2, ⋯ , b n ) are n-dimensional vectors. We can use it as a projection tools to convert an n-dimensional classification problem to a one dimensional classification problem. We call a and b the matching vectors. They indicate the phase and scaling matching requirements of feature attributes. i.e., they are used to balance the ranges and scales of diverse feature attributes with respective dimensions such that the signed efficiency measure μ can reflect the interaction appropriately.
Based on the Choquet integral, an aggregation tool that projects the feature space onto a real axis is shown. Under the projection, each point in the feature space becomes a value of the virtual variable.
A point f = (f (x1) , f (x2) , ⋯ , f (x n )) in n-dimensional feature space can be regarded as a function defined on X. It represents an observation of the feature attributes.
Equations a1 + b1f (x1) = a2 + b2f (x2) = ⋯ = a n + b n f (x n ) represent a straight line in n-dimensional space, denoted by L, which is just a projection axis. Signed efficiency measure μ on P (X) indicates the importance of feature attributes as well as the interaction. Each point f = (f (x1) , f (x2) , ⋯ , f (x n )) in n-dimensional space is projected to L with value c = (C) ∫ (a + bf) dμ.
If a1 + b1f (x1) ≤ a2 + b2f (x2) ≤ ⋯ ≤ a
n
+ b
n
f (x
n
), then the Choquet broken-hyperplane is (μ (X) - μ ({x2, ⋯ , x
n
})) (a1 + b1f (x1)) + (μ ({x2, ⋯ , x
n
}) - μ ({x3, ⋯ , x
n
})) (a2 + b2f (x2)) + ⋯ + μ ({x
n
})(a
n
+ b
n
f (x
n
)) = c, combining with L : a1 + b1f (x1) = a2 + b2f (x2) = ⋯ = a
n
+ b
n
f (x
n
), then coordinates of projection point of f are
In other cases, we can similarly get the same result, so the coordinates of Q are
From Theorem 1, the following theorem can be obtained.
We can also see that all points whose Choquet integral equals to 1.5 is broken lines L1, and to 1 is broken lines L2, as shown in Fig. 1, they indicate the projection direction on both side of axis L. The broken lines L1 and L2 interact the projection axis L at projection points Q f and Q g respectively.

The illustration of Example 1.
We can also see that all points whose Choquet integral equals to 0.82 is broken-planes S1, and to 0.14 is broken-planes S2, as shown in Fig. 2. The broken-planes S1 and S2 interact the projection axis L at projection points Q f and Q g .

The illustration of Example 2.
We can reveal that the Choquet distance can be considered as the generalization of the distance between two corresponding (n-1)-dimensional hyperplanes.
When n = 3, μ is additive, and

The illustration of Choquet distance when μ is additive and L is orthogonal to Choquet hyperplane in R3.
In order to optimally estimate the values of parameters μ, a, b, the range of the parameters should be classified.
Let u = max (|μ ({x1}) |, |μ ({x2}) |, ⋯ , |μ ({X}) |), v = max (|a1|, |a2|, ⋯ , |a
n
|, |b1|, |b2|, ⋯ , |b
n
|). Then we can get new parameters
From Theorem 3, without loss of generality, we can see that the parameters μ, a, b can be restricted in [–1, 1]. These can make us to solve the optimization model in the next section conveniently.
Now, based on the Choquet distance, we want to find an appropriate aggregation formula that projects the n-dimensional feature space onto a real axis L, such that each point becomes a value of the virtual variable that is optimal with respect to classification. In such a way, the classifying boundary is a Choquet broken-hyperplane.
The weighted Choquet integral projection depends on a, b and signed efficiency measure μ, once these values are available, the n-dimensional classification problem is reduced to a one-dimensional classification problem. We consider a 2-class nonlinear classification problem with classes C1 and C2. Suppose that the learning data consist of l1 sample points f1, f2, ⋯ , f l 1 belonging to class C1 and l2 sample points g1, g2, ⋯ , g l 2 belonging to class C2. Also, suppose that all of these sample points have the same feature attributes x1, x2, ⋯ , x n .
Now we want to find a Choquet broken-hyperplane, denoted by S, has the equation (C) ∫ (a + bf) dμ = c that can separate classes C1 and C2 optimally under a certain criterion. In this model, the signed efficiency measure μ and the projection axis a and b are unknown.
In fact, the successive classification is to find a classifying boundary S, on which the value of Choquet integral of each point is c, such that Choquet broken-hyperplanes {(C) ∫ (a + bf) dμ|f ∈ C1} and {(C) ∫ (a + bg) dμ|g ∈ C2} can be separated by c completely. Furthermore, the best classification boundary S, must make the Choquet distance of farthest right (largest) point (denoted by f) of C1 and the farthest left (smallest) point (denoted by g) of C2 is maximum. In this way, we can get the classifying boundary, a Choquet broken-heperplane, with form {h ∈ R
n
| (C) ∫ (a + bh) dμ = c}, where
Thus, the first optimal classification model can be given as follows:
Maximize
subject to:
Now, another optimal model like the Fisher method can be presented. Once a, b and μ are chosen, let f be the point on L such that (C) ∫ (a + bf) dμ is the arithmetic average of {(C) ∫ (a + bh) dμ|h ∈ C1}. Similarly, g be the point on L such that (C) ∫ (a + bg) dμ is the arithmetic average of {(C) ∫ (a + bh) dμ|h ∈ C2}. The aim of the classification is to find a projection axis L, which is determined by a and b, and signed efficiency measure μ such that the data sets can be well separated when they are projected onto L. Through learning procedure, a projection axis, L, is found such that C (f, g) is maximum and the sum of the Choquet distance from each point in C1 to f and the sum of the Choquet distance from each point in C2 to g are minimized. The second optimal model is expressed as follows.
If f be the point onto L such that (C) ∫ (a + bf) dμ is the arithmetic average of {(C) ∫ (a + bh) dμ|h ∈ C1}. From Theorem 1, the coordinates of f are
When μ is additive and the projection axis is orthogonal to classifying boundary, the second optimal model can be considered as a generalization form of Fisher linear classification.
Applications
In this section, some data classification’s examples, which are nonlinear classification problems, are given to show that the above optimal models based on Choquet integral are effective.

The data of Example 3.

The classification of Example 3.

The data of Example 4.

The classification of Example 4.

The data of Example 5.

The classification of Example 5.
In Fig. 10 the horizontal ordinate is the Choquet integral value of each record, and vertical ordinate is the ordinal number of each record. The dividing line is the Choquet integral value of classifying boundary. From Fig. 10, we can see that the records can be divided effectively.

The classification of Example 5.
Firstly, as shown in Fig. 11, class1 and the other two classes can be classified, but class2 and class3 can not be classified. The detail results are μ ({x1}) = -637/5412, μ ({x2}) = -212/309, μ ({x3}) =445/ 654, μ ({x4}) =1639/2891, μ ({x1, x2}) = -1147/3246, μ ({x1, x3}) =212/419, μ ({x1, x4}) =788/2301, μ ({x2, x3}) = -333/415, μ ({x2, x4}) =524/1635, μ ({x3, x4}) =1355/1508, μ ({x1, x2, x3}) = -2770/3533, μ ({x1, x2, x4}) = -331/727, μ ({x1, x3, x4}) =1168/1821, μ ({x2, x3, x4}) =97/148, μ ({x1, x2, x3, x4}) = -38/32467.

The classification of Iris data for the class1 and class2, class3.
The projection axis L is -586/1759 + 1013/17768 x1 = -1761/3484 + 267/1615 x2 = -699/2588 + 313/4355x3 = -551/2339 + 340/1731x4. The Choquet integral value of the classification boundary is 8/118649. Being the Choquet integral values of the records are very close, so each value times a constant p = 3247/471 in order to show well in Fig. 11.
Secondly, holding the same projection axis and using the optimal model based on Choquet integral again, class2 and class3 can be classified effective in Fig. 12. The detail results are μ ({x1}) =2533/14894, μ ({x2}) = -101/112, μ ({x3}) =1959/4105, μ ({x4}) =901/1721, μ ({x1, x2}) = -1019/2230, μ ({x1, x3}) =305/806, μ ({x1, x4}) =130/3087, μ ({x2, x3}) = -488/1559, μ ({x2, x4}) =590/1489, μ( {x3, x4}) =287/1035, μ ({x1, x2, x3}) =231/2243, μ ({x1, x2, x4}) = -2936/5489, μ ({x1, x3, x4}) =645/1813, μ ({x2, x3, x4}) = -587/641, μ ({x1, x2, x3, x4}) =667/2168. The Choquet integral values of the classifying boundary are –38/803 and 397/11005. Each Choquet integral value times a constant p = 8503/357 in order to show well in Fig. 12.

The classification of Iris data.
In order to reduce the parameters of the problem, the projection axis is considered to across the center of all records. In this restriction, the projection axis can be obtained just solving b, which can reduce 4 unknown parameters. As shown in Fig. 13, class1 and the other two classes can be classified. The detail results are μ ({x1}) = -767/3175, μ ({x2}) = -1837/21899, μ ({x3}) = -968/35353, μ ({x4}) = -157/2588, μ ({x1, x2}) = -403/1145, μ ({x1, x3}) = -82/1781, μ ({x1, x4}) =154/4339, μ ({x2, x3}) = -991/3486, μ ({x2, x4}) =873/7688, μ ({x3, x4}) =61/1850, μ ({x1, x2, x3}) = -485/2292, μ ({x1, x2, x4}) = -47/336, μ ({x1, x3, x4}) =857/2918, μ ({x2, x3, x4}) =2600/12381, μ ({x1, x2, x3, x4}) =1371/1400 . and the projection axis L is -16851pt/1pt2627 + (209/1904) x1 = -111/112 + (377/1163) x2 = -561/1117 + (400/ 2993) x3 = -1165/2696 + (717/1990) x4. The Cho-quet integral value of the classifying boundary is –1039/3415. By a way, being the Choquet integral values of the records are very close, so each value times a constant p = 2309/367 in order to show well in Fig. 13.

The classification of Iris data for the class1 and class2, class3.
Similarly, holding the same projection axis and using the optimal model based on Choquet integral again, class2 and class3 can be classified effective in Fig. 14. The detail results are μ ({x1}) =363/2954, μ ({x2}) = -1185/1207, μ ({x3}) =646/1279, μ ({x4}) = 1109/1838, μ ({x1, x2}) = -440/1021, μ ({x1, x3}) =419/1129, μ ({x1, x4}) =154/3645, μ ({x2, x3}) = -303/968, μ ({x2, x4}) =487/1229, μ ({x3, x4}) =249/872, μ ({x1, x2, x3}) =683/6608, μ ({x1, x2, x4}) = -937/1792, μ ({x1, x3, x4}) = 1373/3840, μ ({x2, x3, x4}) = -596/ 1941, μ ({x1, x2, x3, x4}) =1696/5423 .

The classification of Iris data.
The Choquet integral values of the classifying boundary are –310/4431 and 433/6227. Each Choquet integral value times a constant p = 3963/406 in order to show well in Fig. 14.
In this paper, the Choquet distances between two points in R n are given. According to the Choquet distance, two optimal classification models are proposed. From the applications shown in this work, we can see that the models are effective and useful. Of course, there are some topics to be studied deeply. The number of unknown parameters is large. If the attributes number is n, then the number of unknown parameters may be at the level of 2 n , such that the complexity of the calculation is very high. How to reduce the number of parameters is a very important topic to explore. Using k-interactive measures and/or hierarchical method is a feasible way.
Footnotes
Acknowledgments
This work is partially supported by National Natural Science Foundation of China (Grant No. 11501435), Scholarship Fund for Studying Abroad by China Scholarship Council, Scientific Research Program Funded by Shaanxi Provincial Education Department (Program No. 12JK0878) and Science and Technology Plan Project Funded by Science and Technology Bureau of Xi’an City (Program No. CXY1441(2)).
