Classifier selection method based on clustering and weighted mean

Abstract

Currently, Multiple Classifier System (MCS) attracts more and more attentions and has become one of the research hotspots in the pattern recognition field. Classifier selection is a commonly used strategy for MCS to achieve the final decision. A classifier selection method based on clustering and weighted mean is proposed in this paper. In the method, multiple clusters are selected according to the distances between cluster centers and the input sample. Then, the average performance of each classifier on selected clusters is calculated. The best classifier on the nearest cluster and the classifier with the best average performance are picked out. According to the reliability of their outputs which are estimated by confusion matrix, one of them is selected to make the final decision of the system. A number of benchmark data sets from KDD’99, UCI and ELENA database were used to evaluate the proposed method. It can be seen form the experimental results that the proposed method performs well.

Keywords

Multiple classifier system classifier selection clustering weighted mean

1 Introduction

A single classifier is always being used in a traditional pattern recognition system. However, in recent years, it had been found in many experiments that the samples wrongly classified by distinct classifiers were generally not the same. It means that complementary information about the object to be recognized can be potentially offered by different classifiers and effective combination of the complementary information is expected to considerably improve the performance of a pattern recognition system. After this discovery, Multiple Classifier System (MCS) gains increasing attention and has become a research hotspot of pattern recognition. Through a great deal of experiments and applications, it has been proved that MCS can not only achieve better performance but also promote the efficiency and robustness of the pattern recognition system. Currently, many application areas have adopted MCS, such as remote sensing image recognition [1], image classification [2], character recognition [3], face recognition [4] and so on.

In the research of MCS, an essential issue is to find a suitable strategy which combines the classifiers’ outputs to achieve a final decision. Generally, there are two types of combination strategies: classifier fusion and classifier selection. Classifier fusion supposes that all classifiers are equally “experienced” in the whole feature space. So, all the outputs are fused to achieve the final decision in a certain way. According to the different outputs of classifiers, this type strategy can be further divided into three categories. When the output is in a decision form, majority voting, Naive Bayes, Behavior Knowledge Space (BKS) [5] are typical fusion methods, etc. When the output is in a rank form, the representative fusion method is Borda Count [6]. When the output is in a measure form, there are many fusion methods, such as Max, Min, Mean, Sum, Product, Median, Decision Templates (DT) [7], Dempster-Shafer theory [8], Neural Networks fusion [9] and Support Vector Machine (SVM) fusion [10], etc. While classifier selection considers each classifier has expertise in certain local regions of the feature space. So, it endeavors to find which classifier has the highest local accuracy in the vicinity of an input sample. This classifier is nominated to stand for the system and make the final decision. Clustering and selection (CS) [11] and dynamic classifier selection [12 –14] are typical classifier selection methods.

In this paper, a classifier selection method based on clustering and weighted mean is proposed. In the method, multiple clusters are selected according to the distances between cluster centers and the input sample. Then, the average performance of each classifier on selected clusters is calculated. The best classifier on the nearest cluster and the classifier with the best average performance are picked out. According to the reliability of their outputs which are estimated by confusion matrix, one of them is selected to make the final decision of the system.

This paper is organized as follows: Section 2 illuminates and analyzes the proposed method in detail; Section 3 introduces the experiments including experiment description and analysis; the conclusions of the paper are in Section 4.

2 The classifier selection method

2.1 Basic idea

Clustering and selection (CS) is a commonly used method for classifier selection. The method divides the training sample set into many clusters by a clustering algorithm and defines each cluster as an area. The performance of every classifier is estimated in each area. The input samples which fall into a certain area are classified by the classifier with the highest local accuracy in the area. However, a test sample is not always belonging to the nearest cluster, and CS method making classifier selection based on the performance in the nearest cluster may have a bigger error. In the feature space, there are some overlapping areas which include the feature points from different classes, which can bring error to the classification result. When a test sample lies in an overlapping area, it may be approximately equidistant from some cluster centers. Furthermore, the cluster center can’t fully reflect the element distribution for it is only the average of all the elements in a cluster.

To solve the problem, this paper proposes a method based on clustering and weighted mean and calls it CWM method for short. CWM method selects more than one cluster for a test sample. If the distance between the sample and the nearest cluster is d_near, the clusters at a distance of up to d_near * (1 + r) from the test sample are selected. Then, the average performance of each classifier on the selected clusters is calculated. In the process of performance calculation, weighed value is adopted according to the distances between the test sample and each selected cluster. The test sample is more probable to belong to a nearer cluster, so the weighted value is larger. The best classifier on the nearest cluster and the classifier with the best average performance are picked out. According to the reliability of their outputs which are estimated by confusion matrix, one of them is selected to make the final decision of the system.

2.2 Method description

2.2.1 Relative expressions

Let C = {C₁, C₂, …, C_l} be the set of trained classifiers for selection; ω = {ω₁, ω₂, …, ω_m}, the set of potential class labels; Rⁿ, the n-dimensional feature space; X = [x₁, x₂, …, x_n] ^T, the n-dimensional feature vector, X ∈ Rⁿ. Given an input pattern X, the output of the i-th classifier is denoted as: $C_{i} (X) = [c_{i, 1} (X), c_{i, 2} (X), \dots, c_{i, m} (X)]^{T}$ (1)

Where c_ij (X) stands for the classifier C_i consider the probability that X belongs to the class ω_j.

The result after selection is denoted as:

$\begin{matrix} C (X) & = & C_{i} (X) = S (C_{1} (X), C_{2} (X), \dots, \\ C_{l} (X)), (1 \leq i \leq l) \end{matrix}$ (2)

Where S is the rule of selection.

To judge the class which the input pattern belongs to need a certain rule, it usually adopts the maximum method: $c_{i, y} (X) = {max}_{j = 1}^{m} c_{i, j} (X), then X \in ω_{y}$ (3)

The classification error of the t-th classifier C_t can be represented by a m × m confusion matrixCM^t: ${CM}^{t} = [\begin{matrix} n_{1, 1}^{t} & n_{1, 2}^{t} & \dots & n_{1, m}^{t} \\ n_{2, 1}^{t} & n_{2, 2}^{t} & \dots & n_{2, m}^{t} \\ ⋮ & ⋮ & ⋮ & ⋮ \\ n_{m, 1}^{t} & n_{m, 2}^{t} & \dots & n_{m, m}^{t} \end{matrix}]$ (4)

Where the element $n_{i, j}^{t}, i, j = 1, 2, \dots, m$ represents the number of the i-th class of training samples which are recognized by classifier C_t as the j-th class samples. The sum of the elements of the i-th row in the matrix represents the total amount of the training samples which belong to the i-th class, while the sum of the elements of the j-th column in the matrix represents the total amount of training samples which are recognized as the j-th class sample. The element $n_{i, i}^{t}$ represents the number of training samples which are correctly recognized. Set the total amount of training samples is N, normalize confusion matrix by dividing N, at this time, meanings of various elements change to corresponding percentages. The normalized confusion matrix is represented as: ${CM}^{t / N} = [\begin{matrix} n_{1, 1}^{t / N} & n_{1, 2}^{t / N} & \dots & n_{1, m}^{t / N} \\ n_{2, 1}^{t / N} & n_{2, 2}^{t / N} & \dots & n_{2, m}^{t / N} \\ ⋮ & ⋮ & ⋮ & ⋮ \\ n_{m, 1}^{t / N} & n_{m, 2}^{t / N} & \dots & n_{m, m}^{t / N} \end{matrix}]$ (5)

Here, the reliability of output vector of classifier C_t can be weighed by $n_{h, h}^{t / N}$ , and the value of h should satisfy the following condition: $c_{t, h} = {max}_{j = 1}^{m} (c_{t, j})$ . Thus the reliable output of classifier C_t can be represented as: $C_{t}^{'} (X) = n_{h, h}^{t / N} * C_{t} (X)$ (6)

2.2.2 Training process

Use each classifier to classify the training samples, acquire the confusion matrix CMⁱ (i = 1, 2, …, l) and calculate the reliability of each corresponding classifier $n_{h, h}^{i, N} (h = 1, 2, \dots, m)$ .

Pick the parameter K, the number of clusters and cluster the training sample set Z into K clusters CL_j, j = 1, 2, …, K, using, e.g. K-means clustering algorithm.

Calculate the arithmetic means of the points in the respective clusters as the cluster centers V_j.

For each cluster CL_j, estimate each classifier’s error rate $e_{i}^{j}$ using only its elements.

Return V_j, $e_{i}^{j}$ and $n_{h, h}^{i, N}$ .

2.2.3 Classification process

Given the input sample X ∈ Rⁿ, calculate the distances between it and V_j, and denote them by d_j. Find the shortest distance d_near and the best classifier C_p on the nearest cluster.

Select the cluster CL_j which meets the condition: $d_{j} \leq d_{near} * (1 + r), (r \geq 0)$ (7)

The set of selected clusters is denoted by Ω.

For each CL_j ∈ Ω, calculate its weight: $W_{j} = 1 - \frac{d_{j}}{\sum_{u = 1}^{K} d_{u}}$ (8)

When d_near = 0, $W_{j} = \frac{1}{m}$ , where m is the number of elements in set Ω.

According to the weights, calculate the average performance of each classifier on the set Ω: $e_{i} = \sum W_{j} * e_{i}^{j}$ (9)

Where the classifier with the best average performance is denoted as C_q

Use C_p and C_q to classify X respectively. Let the output vectors be:

$\begin{matrix} C_{p} (X) & = & [c_{p, 1} (X), c_{p, 2} (X), \dots, c_{p, m} (X)]^{T}, \\ C_{q} (X) & = & [c_{q, 1} (X), c_{q, 2} (X), \dots, c_{q, m} (X)]^{T} \end{matrix}$ (10)

According to the Equation (6), the reliable outputs of them can be denoted as:

$\begin{matrix} C_{p} (X) & = & n_{h, h}^{p / N} [c_{p, 1} (X), \\ c_{p, 2} (X), \dots, c_{p, m} (X)]^{T}, \\ C_{q} (X) & = & n_{h, h}^{q / N} [c_{q, 1} (X), \\ c_{q, 2} (X), \dots, c_{q, m} (X)]^{T} \end{matrix}$ (11)

In C_p (X), let the maximum be p_max (X), the second be p_2max (X), in C_q (X), the maximum, q_max (X) and the second, q_2max (X), the rule is:

if p_max (X) - p_2max (X) ≥ q_max (X) - q_2max (X), select C_p (X) as the system output; otherwise, C_q (X) is selected.

3 Experiments

3.1 Experiment description

The samples were selected from KDD’99 intrusion detection database (ftp5c and ftp2c), UCI database (glass, bupa, pima, thyroid, wine, German, heart, vehicle) and ELENA database (clouds, satimage, phoneme). Table 1 is the simple description of 13 data sets. More information of the selected data sets can see in previously [15].

Currently, there are many practical classification algorithms. The k Nearest Neighbor (k-NN) algorithm, as a typical representative of non-parametric algorithms, is used as the base classifier in this paper, for it is simple and reliable. The parameter k is selected automatically and optimally by leave-one-out algorithm.

For KDD data, extracted features can be divided into three categories: intrinsic features, traffic features and content features. So, three k-NN classifiers were trained using distinct category of features on each training set A and tested by the corresponding category of features on each test set B. For UCI and ELENA data, 6-fold cross validation was used: each data set was divided into 6 subsets with similar distribution and size. Each time, a subset was selected as the test set and the remainder were training sets. On each training subset, a k-NN classifier was trained and the corresponding 5 classifiers constructed the classifier set. The test results are the average of 6 times.

3.2 Experiment analysis

CS and CWM methods were tested with different number of clusters (K = 3, 5, 8) and the clustering method was K-means. For K-means method uses random initial cluster center, the clustering result was obtained firstly and then two methods were tested with the same clusters. Each test is repeated for three times and the average is as the experimental result (see Table 2, the numerical value in the brackets is K.). In Table 2, the performance of CWM method is the minimum of test results with different parameter r (r = 0.05, 0.1, 0.15, …, 0.5).

From the Table 2, it can be seen that, for the 13 test data sets (namely total 39 conditions), the error rate of CWM method is lower than that of CS method in 31 conditions, only on the data set wine and K = 3, the error rate of CWM is higher and the same as the CS method in the rest conditions. General speaking, the performance of CWM method is obviously better than that of CS method.

However, Table 2 gives just the best performance of CWM method and the value of r distinctly influences its performance. The smaller r is, the fewer clusters are selected. Extremely, if only a cluster is selected, there is no difference between CWM and CS method. On the contrary, more obviously different clusters are selected also has negative effects on CWM method. So, the key problem of CWM method is how to find an appropriate parameter r.

In order to select appropriate r self-adaptively, the following steps were added to the training process of CWM method: input a set of r and use every value as the parameter to classify the training data set. Then, the parameter r, at the time when CWM method has the lowest error rate, is selected to use in the trained CWM method.

CWM method was tested after adding the above steps and the set of r was {0.05, 0.1, 0.15, …, 0.5}. For it had been found from the above experiments that the performance of CWM method didn’t change with the value of r on two data sets (thyroid and heart) and was lower than that of CS method on the data set wine, the method were tested only on the rest 10 data sets (see Table 3).

For distinct comparison, the experimental results are shown in Fig. 1, where the data are the average results of each method discarding the value of K. In order to display expressly, three figures are used.

From Fig. 1, it can be seen that, for most cases, the CWM method with self-adaptive parameter selection cannot achieve the lowest error rate in Table 2, but its performance is still better than that of the CS method. Therefore, the steps of parameter selection are effective in a sense. On the set German, its performance is equal to the lowest error rate; on the set pima, its error rate even lower than the lowest error rate.

The main reason is because the self-adaptive parameter selection is aiming at training set. In cross validation, the training set was changing every time result in the changing of selected r, while the lowest error rate in Table 2 was achieved with fixed r in the whole process of cross validation, which also shows self-adaptive parameter selection is more flexible. On two KDD data sets, the error rate of CWM method is obviously higher than the lowest error rate. The reason is that the training set is significantly smaller than the test set and the parameter optimization has a bigger difference between them. Adequate training data is the precondition of parameter optimization selection.

4 Conclusion

In this paper, a method of classifier selection, called CWM, is proposed. In the method, multiple clusters are selected according to the distances between cluster centers and the input sample. Then, the average performance of each classifier on selected clusters is calculated. The best classifier on the nearest cluster and the classifier with the best average performance are picked out. According to the reliability of their outputs which are estimated by confusion matrix, one of them is selected to make the final decision of the system. And the experimental results prove that the proposed method is effective.

Footnotes

Acknowledgments

This work is partly supported by the key scientific research project of colleges and universities of Henan Province (No. 15A520017); the soft science research project of Henan Province (No. 152400410513); the science and technology research project of Henan Province (No. 162102210062), the key projects of National Natural Foundation (No. U1261206) and the doctoral fund of Henan Polytechnic University (No. B2010-46).

References

Wozniak

, Graña

and Corchado

, A survey of multiple classifier systems as hybrid systems, Information Fusion16(1) (2014), 3–17.

Chowdhury

, Verma

and Stockwell

, A novel texture feature based multiple classifier technique for roadside vegetation classification, Expert Systems with Applications42(12) (2015), 5047–5055.

Chacko

A.M.M.

and Dhanya

P.M.

, Multiple classifier system for offline malayalam character recognition, Procedia Computer Science46 (2015), 86–92.

De-la-Torre

, Granger

, Radtke

P.V.W.

, Sabourin

and Gorodnichy

D.O.

, Partially-supervised learning from facial trajectories for face recognition in video surveillance, Information Fusion24 (2015), 31–53.

Dainotti

, Pescapé

, Sansone

and Quintavalle

, Using a behaviour knowledge space approach for detecting unknown IP traffic flow, Lecture Notes in Computer Science6713 (2011), 360–369.

Zahid

M.A.

and Swart

, The borda majority count, Information Sciences295 (2015), 429–440.

Bigdeli

, Samadzadegan

and Reinartz

, Fusion of hyperspectral and LIDAR data using decision template-based fuzzy multiple classifier system, International Journal of Applied Earth Observation and Geoinformation38 (2015), 309–320.

Yang

, Han

D.Q.

and Han

C.Z.

, A novel diversity measure of multiple classifier systems based on distance of evidence, Acta Aeronautica et Astronautica Sinica33(6) (2012), 1093–1099.

Abdallah

A.C.B.

, Hichem

and Gader

, Adaptive local fusion with neural networks, Lecture Notes in Computer Science6353 (2010), 486–491.

10.

Zhang

W.B.

, Ji

H.B.

and Wang

, Adaptive weighted feature fusion classification algorithm, Systems Engineering and Electronics35(6) (2013), 1133–1137.

11.

Kuncheva

L.I.

, Switching between selection and fusion in combining classifiers: An experiment, IEEE Transactions on Systems, Man and Cybernetics32(2) (2002), 146–156.

12.

Hao

H.W.

, Wang

Z.B.

, Yin

X.C.

and Chen

Z.Q.

, Dynamic selection and circulating combination for multiple classifier systems, Acta Automatica Sinica37(11) (2011), 1290–1295.

13.

Cavalin

P.R.

, Sabourin

and Suen

C.Y.

, Dynamic selection approaches for multiple classifier systems, Neural Computing and Applications22(3-4) (2013), 673–688.

14.

Britto

A.S.

, Sabourin

and Oliveira

L.E.

, Dynamic selection of classifiers-A comprehensive review, Pattern Recognition47(11) (2014), 3665–3680.

15.

A.Z.

and Huo

Z.Q.

, Experimental comparison of six fixed classifier fusion rules, Procedia Engineering23 (2011), 429–433.