Fuzzy cost-based feature selection using interval multi-objective particle swarm optimization algorithm

Abstract

Cost-based feature selection is an important data preprocessing technique in classification problems. This paper focuses on the practical case in which the cost associated with a feature is a fuzzy number. First, a fuzzy transforming method is introduced to transform fuzzy cost-based feature selection problems into ones with interval numbers. Second, an effective feature selection algorithm based on interval multi-objective particle swarm optimization is proposed. In this algorithm, a risk coefficient that decision makers are willing to bear when deleting a solution is used to update the archive, and an interval crowding distance measure is adopted to evaluate the distribution of non-dominated particles. Finally, the feasibility of the presented algorithm is validated by simulation results, which show that the algorithm is capable of generating an excellent approximation of the true Pareto front.

Keywords

Feature selection, fuzzy, interval, multi-objective, particle swarm optimization

1 Introduction

Feature selection (FS) is one of the most important issues in the field of data mining and pattern recognition [3, 10]. Its purpose is to select a subset of features from a larger set of features, which reduces the complexity of the classifier while still allowing a successful classification task [8]. Owing to their effective global search capability, heuristic search methods have been used in recent years to solve FS problems, such as the genetic algorithm [19], ant colony optimization [18], and differential evolution [1, 27].

However, those studies usually assume that the data are already stored in datasets and are available without charge. In real-world applications, data are not free: various costs, such as money, time or other resources, are needed to obtain the feature values of objects [6, 15]. Although there have been a few attempts to deal with this cost-based FS problem [6, 25], all the existing approaches assume that the cost associated with any feature is precise. In many practical cases, however, it is difficult or even impossible to specify the exact cost of some features. Instead, decision makers can only give fuzzy values, such as “about 0.9” or “high”. Because some coefficients then become fuzzy numbers rather than precise values, the existing algorithms are hardly applicable to fuzzy cost-based FS.

Particle swarm optimization (PSO) is a global search method inspired by the foraging behavior of birds or fish [9]. It has been applied to the feature selection problem [4, 24], as well as to many other complicated problems [16, 28]. However, there is no existing work on fuzzy cost-based feature selection problems.

The overall goal of this paper is to solve this kind of feature selection problem by PSO. The main contributions are as follows: 1) a fuzzy transforming method is introduced, by which fuzzy cost-based feature selection problems are converted into ones with interval numbers; 2) an effective feature selection algorithm based on multi-objective particle swarm optimization is proposed; 3) some effective operators, such as the interval probability dominance and the interval crowding distance measure, are used to improve the capability of PSO.

2 The proposed method

This paper uses a binary string to represent a solution for the feature selection problem. For a set of data with D features, if a bit is 1, the corresponding feature is chosen into the feature subset; otherwise, it is not. Then, the decision variable of this problem can be described as follows: $X = (x_{1}, x_{2}, \dots, x_{D}), \quad x_{i} \in \{0, 1\}, \; i = 1, 2, \dots, D$ (1)

Without loss of generality, we adopt a trapezoidal fuzzy number ${\tilde{e}}_{i} = (a_{i}, b_{i}, c_{i}, d_{i})$ to express the cost associated with the i-th feature, where a_i and d_i are the lower corner points of the trapezoid, and b_i and c_i are the upper corner points. The bigger the fuzzy number, the higher the cost of the corresponding feature. Thus, the cost values of all D features can be represented as follows:

$\tilde{E} = [{\tilde{e}}_{1}, {\tilde{e}}_{2}, \dots, {\tilde{e}}_{D}], \quad {\tilde{e}}_{i} = (a_{i}, b_{i}, c_{i}, d_{i}), \; i = 1, 2, \dots, D$ (2)

For any feature subset that contains multiple features, this paper adopts the maximum value among the cost values of the selected features to express the cost of the feature subset, namely: ${\tilde{f}}_{1}(X) = (a^{'}, b^{'}, c^{'}, d^{'}) = \max_{x_{i} \neq 0} \{ x_{i} \cdot {\tilde{e}}_{i} \mid i = 1, 2, \dots, D \}$ (3) Following the fuzzy transforming method proposed in [11], the fuzzy number ${\tilde{f}}_{1}(X)$ can be transformed into an expected interval as follows: $EI({\tilde{f}}_{1}(X)) = [{EI}_{1}, {EI}_{2}] = \left[ \frac{a^{'} + b^{'}}{2}, \frac{c^{'} + d^{'}}{2} \right]$ (4)
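The cost objective can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the fuzzy maximum in Eq. (3) is approximated component-wise on the four corner points, that an empty subset has zero cost, and that the expected interval of a trapezoidal number (a, b, c, d) takes its standard form [(a + b)/2, (c + d)/2]. The names `subset_cost` and `expected_interval` are hypothetical.

```python
from typing import Sequence, Tuple

# Trapezoidal fuzzy number (a, b, c, d) with a <= b <= c <= d.
Trapezoid = Tuple[float, float, float, float]


def subset_cost(mask: Sequence[int], costs: Sequence[Trapezoid]) -> Trapezoid:
    """Fuzzy cost of the feature subset encoded by the 0/1 vector `mask`.

    Assumption: the fuzzy maximum is approximated by taking the maximum
    of each corner point over the selected features.
    """
    selected = [e for x, e in zip(mask, costs) if x == 1]
    if not selected:
        # Assumption: selecting no feature costs nothing.
        return (0.0, 0.0, 0.0, 0.0)
    return tuple(max(e[k] for e in selected) for k in range(4))


def expected_interval(f: Trapezoid) -> Tuple[float, float]:
    """Expected interval [(a + b) / 2, (c + d) / 2] of a trapezoidal number."""
    a, b, c, d = f
    return ((a + b) / 2.0, (c + d) / 2.0)


if __name__ == "__main__":
    # Illustrative costs only; these are not values from the paper.
    costs = [(0.1, 0.2, 0.3, 0.4), (0.5, 0.6, 0.7, 0.8), (0.2, 0.3, 0.4, 0.5)]
    f1 = subset_cost([1, 0, 1], costs)
    print(f1, expected_interval(f1))
```

With the first and third features selected, the subset cost is the third trapezoid, and its expected interval is approximately [0.25, 0.45].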

Next, we consider the classification error rate of a solution, i.e., the second objective function f₂ (X). There are many effective classifiers, such as the Support Vector Machine [2, 23]. Because it is easy to implement, this paper adopts the leave-one-out cross-validation (LOOCV) of the one nearest neighbor (1-NN) classifier to evaluate the classification error rate of a solution. This method has been used successfully in [14, 24].
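The second objective can be sketched as follows, using NumPy. This is only an illustration: the Euclidean distance, the treatment of an empty feature subset (error rate 1.0), and the function name `loocv_1nn_error` are assumptions that the text does not specify.

```python
import numpy as np


def loocv_1nn_error(X, y, mask) -> float:
    """Leave-one-out error rate of a 1-NN classifier on the selected features.

    X: (n_samples, n_features) data, y: (n_samples,) labels,
    mask: 0/1 vector marking the selected features.
    """
    mask = np.asarray(mask, dtype=bool)
    y = np.asarray(y)
    if not mask.any():
        # Assumption: an empty subset is treated as always misclassifying.
        return 1.0
    Z = np.asarray(X, dtype=float)[:, mask]
    # Pairwise squared Euclidean distances between all samples.
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    # Leave-one-out: a sample may not be its own nearest neighbor.
    np.fill_diagonal(dist, np.inf)
    nearest = dist.argmin(axis=1)
    return float(np.mean(y[nearest] != y))


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 misleads.
    X = np.array([[0.0, 5.0], [0.1, -5.0], [1.0, 5.0], [1.1, -5.0]])
    y = np.array([0, 0, 1, 1])
    print(loocv_1nn_error(X, y, [1, 0]), loocv_1nn_error(X, y, [0, 1]))
```

The pairwise distance matrix needs memory quadratic in the number of samples, which is acceptable for small datasets such as those used in Section 3.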

Based on the above discussion, a fuzzy cost-based feature selection problem can be transformed into an interval bi-objective optimization problem, as follows: $\min y = F(X) = (EI({\tilde{f}}_{1}(X)), f_{2}(X))$ (5)

2.1 Particle swarm optimization

PSO adopts a simple velocity-displacement model and regards each individual in the swarm as a particle without size in the search space. Suppose that the position of the i-th particle is P_i = (p_i1, p_i2, …, p_iD), its velocity is V_i = (v_i1, v_i2, …, v_iD), the best position found so far by this particle (i.e., the local leader) is Lbest_i = (lbest_i1, …, lbest_iD), and the best position found so far by the swarm (i.e., the global leader) is Gbest = (gbest₁, gbest₂, …, gbest_D). Then, the new position of this particle is calculated as follows: $v_{id}(t + 1) = w \cdot v_{id}(t) + c_{1} \cdot r_{1} \cdot ({lbest}_{id}(t) - p_{id}(t)) + c_{2} \cdot r_{2} \cdot ({gbest}_{d}(t) - p_{id}(t))$ (6) $p_{id}(t + 1) = p_{id}(t) + v_{id}(t + 1), \quad i = 1, 2, \dots, N, \; d = 1, 2, \dots, D$ (7)

where t is the iteration number, N is the size of the swarm, c₁ and c₂ are acceleration coefficients, r₁ and r₂ are random numbers in [0, 1], and w is the inertia weight.
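As a minimal sketch, the velocity and position updates of Eqs. (6) and (7) can be written for one particle as below. The defaults w = 0.6 and c₁ = c₂ = 2.0 are the settings reported in Section 3. Drawing r₁ and r₂ independently for every dimension is an assumption, and `pso_step` is a hypothetical name.

```python
import numpy as np


def pso_step(p, v, lbest, gbest, w=0.6, c1=2.0, c2=2.0, rng=None):
    """One PSO update of a single particle.

    p, v, lbest, gbest are arrays of length D. Returns (new position, new velocity).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: fresh random numbers in [0, 1) for every dimension.
    r1 = rng.random(p.shape)
    r2 = rng.random(p.shape)
    v_new = w * v + c1 * r1 * (lbest - p) + c2 * r2 * (gbest - p)  # Eq. (6)
    p_new = p + v_new                                              # Eq. (7)
    return p_new, v_new


if __name__ == "__main__":
    p = np.array([0.2, 0.8])
    v = np.array([0.1, -0.1])
    # When both leaders coincide with the particle, only inertia acts.
    print(pso_step(p, v, lbest=p, gbest=p, rng=np.random.default_rng(0)))
```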

2.2 The proposed PSO-based algorithm

Since the objective values of the particles are interval vectors rather than precise ones, the traditional PSO is not applicable to the multi-objective feature selection problem with interval coefficients. This paper therefore proposes an improved PSO-based multi-objective feature selection algorithm.

2.2.1 Encoding of particle

This paper adopts the probability-based encoding strategy proposed in the previous work [24]. In this strategy, an individual is represented as a vector of probabilities: $P_{i} = (p_{i,1}, p_{i,2}, \dots, p_{i,D}), \quad p_{i,j} \in [0, 1]$ (8)

where the probability p_i,j > 0.5 means that the j-th feature will be selected into the i-th feature subset.
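The decoding rule above amounts to thresholding each probability at 0.5. A minimal sketch (the function name `decode` is hypothetical):

```python
import numpy as np


def decode(p):
    """Map a probability vector to a 0/1 feature mask: select feature j iff p[j] > 0.5."""
    return (np.asarray(p, dtype=float) > 0.5).astype(int)


if __name__ == "__main__":
    print(decode([0.9, 0.5, 0.1, 0.51]))
```

Note that a probability of exactly 0.5 does not select the feature, because the inequality is strict.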

2.2.2 Update of the archive

In order to update the archive, we introduce a risk coefficient μ that decision makers are willing to bear when deleting a solution. This risk coefficient determines the dominance degree among the solutions saved in the archive. In each iteration, only new particles that are dominated with a probability less than μ can enter the archive. Herein, the probability-based dominance relationship proposed in [13] is used to compare two solutions with interval objective values. Generally, the higher the value of μ, the more solutions satisfy the requirement of the decision maker; the smaller the value of μ, the better the convergence of the solutions obtained by the algorithm.
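The admission test can be sketched as follows. The probability-based dominance of [13] is not reproduced in this paper, so the sketch substitutes a simple stand-in: each interval objective is treated as an independent uniform random variable, and the probability that one solution dominates another is taken as the product of the per-objective probabilities of being smaller. This stand-in, the tie convention of one half for equal crisp values, and the names `prob_less`, `dominance_prob` and `admit` are all assumptions.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def _cdf_integral(y: float, lo: float, hi: float) -> float:
    """Antiderivative of the CDF of U[lo, hi] evaluated at y (requires hi > lo)."""
    if y <= lo:
        return 0.0
    if y >= hi:
        return (hi - lo) / 2.0 + (y - hi)
    return (y - lo) ** 2 / (2.0 * (hi - lo))


def prob_less(a: Interval, b: Interval) -> float:
    """P(A < B) for independent A ~ U[a] and B ~ U[b]."""
    (al, au), (bl, bu) = a, b
    if au == al and bu == bl:  # both values are crisp
        return 1.0 if al < bl else (0.5 if al == bl else 0.0)
    if au == al:               # A is crisp: P(B > al)
        return min(1.0, max(0.0, (bu - al) / (bu - bl)))
    if bu == bl:               # B is crisp: P(A < bl)
        return min(1.0, max(0.0, (bl - al) / (au - al)))
    return (_cdf_integral(bu, al, au) - _cdf_integral(bl, al, au)) / (bu - bl)


def dominance_prob(fa: Sequence[Interval], fb: Sequence[Interval]) -> float:
    """Stand-in probability that fa is smaller than fb in every objective."""
    p = 1.0
    for a, b in zip(fa, fb):
        p *= prob_less(a, b)
    return p


def admit(new: Sequence[Interval], archive: List[Sequence[Interval]], mu: float) -> bool:
    """Admit `new` only if every archive member dominates it with probability < mu."""
    return all(dominance_prob(old, new) < mu for old in archive)


if __name__ == "__main__":
    old = [(0.2, 0.4), (0.1, 0.1)]  # (cost interval, crisp error rate)
    print(admit([(0.5, 0.7), (0.2, 0.2)], [old], mu=0.1))  # clearly dominated
    print(admit([(0.0, 0.1), (0.3, 0.3)], [old], mu=0.1))  # cheaper, so not dominated
```

The crisp error rate f₂ is represented here as a degenerate interval, which is why `prob_less` handles zero-width intervals separately.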

When the number of elements in the archive exceeds its maximum capacity N′, the elements with the worst distribution are deleted. In traditional multi-objective optimization, the crowding distance is commonly used to evaluate the distribution of solutions and, importantly, it is parameter-free. For multi-objective optimization problems with interval fitness, our previous work proposed an improved crowding distance calculation method, called the interval crowding distance [29].

Taking an element a_i in the archive as an example, its interval crowding distance can be expressed as: $CD(a_{i}) = \frac{1}{M} \sum_{m = 1}^{M} {CD}_{m}(a_{i})$ (9) ${CD}_{m}(a_{i}) = D({\bar{f}}_{m}(a_{i - 1}), {\bar{f}}_{m}(a_{i})) + D({\bar{f}}_{m}(a_{i}), {\bar{f}}_{m}(a_{i + 1}))$ (10) $D(\bar{a}, \bar{b}) = e^{- c(\bar{a}, \bar{b})} \times \sqrt{(a^{+} - b^{+})^{2} + (a^{-} - b^{-})^{2}}$ (11) $c(\bar{a}, \bar{b}) = \frac{2 \max \{0, \min \{a^{+} - b^{-}, b^{+} - a^{-}, w(\bar{a}), w(\bar{b})\}\}}{w(\bar{a}) + w(\bar{b})}$ (12)

where $D(\bar{a}, \bar{b})$ represents the relative distance between interval $\bar{a}$ and interval $\bar{b}$, $c(\bar{a}, \bar{b})$ represents the overlap degree between the two interval numbers, $a^{-}$ and $a^{+}$ are the lower and upper bounds of $\bar{a}$, and $w(\bar{a})$ is its width.
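Equations (9)-(12) can be sketched in Python as below. Several details are assumptions, since the text leaves them open: w is taken as the interval width, the overlap degree of two zero-width intervals is set to 0, neighbors are determined by sorting the archive on interval midpoints for each objective, and boundary elements receive an infinite distance as in the classical crowding distance. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound a-, upper bound a+)


def width(a: Interval) -> float:
    return a[1] - a[0]


def overlap(a: Interval, b: Interval) -> float:
    """Overlap degree c(a, b) of Eq. (12)."""
    total = width(a) + width(b)
    if total == 0.0:
        return 0.0  # assumption: two crisp values do not overlap
    inner = min(a[1] - b[0], b[1] - a[0], width(a), width(b))
    return 2.0 * max(0.0, inner) / total


def interval_distance(a: Interval, b: Interval) -> float:
    """Relative distance D(a, b) of Eq. (11)."""
    return math.exp(-overlap(a, b)) * math.hypot(a[1] - b[1], a[0] - b[0])


def crowding_distances(objs: List[Sequence[Interval]]) -> List[float]:
    """Interval crowding distance of every archive element, Eqs. (9)-(10)."""
    n = len(objs)
    if n == 0:
        return []
    m_count = len(objs[0])
    cd = [0.0] * n
    for m in range(m_count):
        # Assumption: neighbors are found by sorting on interval midpoints.
        order = sorted(range(n), key=lambda i: (objs[i][m][0] + objs[i][m][1]) / 2.0)
        # Assumption: boundary elements are always kept (infinite distance).
        cd[order[0]] = cd[order[-1]] = math.inf
        for k in range(1, n - 1):
            prev, cur, nxt = order[k - 1], order[k], order[k + 1]
            cd[cur] += (interval_distance(objs[prev][m], objs[cur][m])
                        + interval_distance(objs[cur][m], objs[nxt][m])) / m_count
    return cd


if __name__ == "__main__":
    archive = [[(0, 1), (3, 3)], [(2, 3), (2, 2)], [(4, 5), (1, 1)]]
    print(crowding_distances(archive))
```

When the archive is pruned, the elements with the smallest crowding distance would be the first candidates for deletion.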

With this update method, two considerations must be balanced. On the one hand, when the particle P_i (t + 1) is dominated by elements of the archive Ar (t + 1) with probability μ > 0, we cannot conclude that P_i (t + 1) is poor, only that it is inferior to other elements with probability μ; therefore, P_i (t + 1) should in principle also be saved into the archive. On the other hand, because comparing probability-dominance relations is computationally expensive, it is impractical to save all such particles. To balance the two aspects, this paper introduces the tolerance coefficient μ of the decision maker.

2.2.3 Leader of particles

The global leader of a particle, Gbest, is the best global position currently found by neighbors of this particle. In this paper we select the global leader of the particle from the external archive based on the distribution density of elements in the archive. Because the objective values of elements in the archive are interval vectors, the interval crowding distance measure [29] is used to measure their distribution density.

The local leader, Lbest, is the best position achieved by a particle itself so far. Supposing that the old local leader of the particle P_i (t) is Lbest_i (t), the new local leader is updated as follows: ${Lbest}_{i}(t + 1) = \begin{cases} P_{i}(t + 1), & \text{if } P_{i}(t + 1) \succ {Lbest}_{i}(t) \\ P_{i}(t + 1), & \text{if } P_{i}(t + 1) \,\|\, {Lbest}_{i}(t) \text{ and } rand > 0.5 \\ {Lbest}_{i}(t), & \text{otherwise} \end{cases}$

where P_i (t + 1) ≻ Lbest_i (t) means that P_i (t + 1) dominates Lbest_i (t), P_i (t + 1) || Lbest_i (t) means that P_i (t + 1) and Lbest_i (t) do not dominate each other, and rand is a random number in [0, 1].
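The local leader update can be sketched as below. The dominance test is passed in as a function so that an interval-based relation can be used; the crisp Pareto dominance `pareto_dominates` shown here is only a stand-in for illustration, and both function names are hypothetical.

```python
import random


def pareto_dominates(fa, fb) -> bool:
    """Crisp Pareto dominance for minimization (stand-in for the interval relation)."""
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))


def update_lbest(p_new, f_new, lbest, f_lbest, dominates, rng=random):
    """Return the new local leader and its objective values.

    The new position replaces the leader if it dominates the leader, or with
    probability one half if neither dominates the other.
    """
    if dominates(f_new, f_lbest):
        return p_new, f_new
    if not dominates(f_lbest, f_new) and rng.random() > 0.5:
        return p_new, f_new
    return lbest, f_lbest


if __name__ == "__main__":
    # The new position is better in both objectives, so it becomes the leader.
    print(update_lbest("new", (0.1, 0.1), "old", (0.2, 0.2), pareto_dominates))
```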

2.2.4 Implementation of the proposed algorithm

Combining the above improved operators into PSO, Algorithm 1 shows the detailed steps of the proposed PSO-based multi-objective feature selection algorithm.

Algorithm 1: the proposed algorithm
Parameters: the maximal number of generations, T_max; the swarm size, N; the archive size, N′.
Step 1) Initialize the particles; set the archive A (0) = ∅; set the Lbest of each particle to be the particle itself;
Step 2) Let t = 1.
Step 3) Iteration
for i = 1, 2, ..., N, do
Step 3.1) Calculate the objective values of the particle P_i (t) by Equation (5);
Step 3.2) Save the feature subset P_i (t) into A (t), and prune the archive A (t) by the above method;
Step 3.3) Update the Lbest and Gbest of the particle P_i (t);
Step 3.4) Update the particle position by Equations (6) and (7);
end for
Step 4) Stopping criterion: if t < T_max, let t = t + 1 and go to Step 3; otherwise, stop the algorithm.

3 Comparison results and analysis

To validate the proposed PSO-based multi-objective feature selection algorithm, we compared it with two typical fuzzy multi-objective optimization algorithms: FPDMOPSO [17] and MFACPSO [20]. The three algorithms used the same parameter settings: swarm size 30, archive capacity 30, learning factors c₁ = c₂ = 2.0, maximum number of generations 70, and inertia weight 0.6. In addition, the value of μ in the proposed algorithm was set to 0.1. Several well-known real-world datasets, namely Glass, Vehicle, WDBC and Ionosphere, were selected. These datasets are available in the UCI Repository (Http://www.ics.uci.edu/~mlearn/MLRepository.htm).

Without loss of generality, we adopt the triangular function to describe the fuzzy cost of a feature. The vectors E1 and E2 give the midpoints and widths of those triangular fuzzy functions, respectively. Note that the cost vector of each dataset is composed of the leading elements of E1 and E2: the assignment begins at the first elements, 0.52 and 0.05, and proceeds in order until the last feature has been assigned.

$E1 = [0.52, 0.18, 0.26, 0.68, 0.85, 0.68, 0.25, 1, 0.66, 0.92, 0.6, 0.72, 0.26, 0.24, 0.86, 0.96, 0.58, 0.67, 0.65, 0.69, 0.15, 0.31, 0.56, 0.88, 0.22, 0.66, 0.77, 0.64, 0.26, 0.58, 0.92, 0.01, 0.17, 0.36]$

$E2 = [0.05, 0.1, 0.04, 0.08, 0.06, 0.01, 0.03, 00, 0.09, 0.044, 0.01, 0.05, 0.02, 0.08, 0.1, -0.96, 0.1, 0.001, 0.05, 0.11, 0.05, 0.089, 0.04, 0.05, 0.01, 0.05, 0.06, 0.01, 0.08, 0.12, 0.05, 0.04, 0.002, 0.07]$

Furthermore, the two-set coverage (SC) metric [5] is employed to compare the dominance degree of the two algorithms. However, this paper replaces the traditional Pareto dominance relationship with the interval dominance probability in order to calculate the dominance degree between two solutions. To evaluate the distribution of solutions throughout the Pareto optimal set, the spacing metric (SP) [5] is adopted.
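For reference, the two metrics can be sketched in their commonly used crisp forms. This is an illustration under assumptions: the dominance test is passed in as a function so that the interval dominance probability can replace the crisp relation, the spacing metric is written in the form usually attributed to Schott, and applying it to crisp points (for example interval midpoints) is an assumption, since the text does not state how interval objectives enter SP. The function names are hypothetical.

```python
import math


def pareto_dominates(fa, fb) -> bool:
    """Crisp Pareto dominance for minimization (stand-in for the interval relation)."""
    return all(a <= b for a, b in zip(fa, fb)) and any(a < b for a, b in zip(fa, fb))


def set_coverage(A, B, dominates) -> float:
    """SC(A, B): fraction of the solutions in B dominated by at least one solution in A."""
    if not B:
        return 0.0
    return sum(any(dominates(a, b) for a in A) for b in B) / len(B)


def spacing(front) -> float:
    """SP: standard deviation of nearest-neighbor distances (Manhattan) on the front."""
    n = len(front)
    if n < 2:
        return 0.0
    d = [
        min(sum(abs(x - y) for x, y in zip(front[i], front[j]))
            for j in range(n) if j != i)
        for i in range(n)
    ]
    mean = sum(d) / n
    return math.sqrt(sum((mean - di) ** 2 for di in d) / (n - 1))


if __name__ == "__main__":
    print(set_coverage([(0, 0)], [(1, 1), (0, -1)], pareto_dominates))
    print(spacing([(0, 2), (1, 1), (2, 0)]))
```

A larger SC(A, B) together with a smaller SC(B, A) indicates that A converges better than B, and a smaller SP indicates a more even distribution.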

The three algorithms are applied to the four data sets, i.e., Glass, Vehicle, WDBC and Ionosphere. For each data set, each algorithm is run 20 times independently. Tables 1 and 2 show the SC and SP values obtained by the three compared algorithms.

From Table 1, we can see that for the data set Glass with a small number of features, the proposed algorithm achieves similar SC results when compared to FPDMOPSO and MFACPSO. However, for the data sets Vehicle, WDBC and Ionosphere, the SC average obtained by the proposed algorithm is better than that of FPDMOPSO. Taking Vehicle as an example, over 38% of solutions obtained by FPDMOPSO are dominated by those of the proposed algorithm. Also, the proportion of MFACPSO dominated by the proposed algorithm is more than 19%. However, the proportions of the proposed algorithm dominated by FPDMOPSO and MFACPSO are both less than 3%. Hence, for the four data sets, the proposed algorithm shows the best convergence when compared with FPDMOPSO and MFACPSO.

From Table 2 we can see that for the data sets Glass, Vehicle and WDBC, the Pareto solution sets obtained by the proposed algorithm have the best distribution when compared with FPDMOPSO and MFACPSO. For the data set Ionosphere, the proposed algorithm has the second best distribution, with an SP average slightly bigger than that of FPDMOPSO. Overall, the proposed algorithm also achieves a good distribution when compared with FPDMOPSO and MFACPSO.

Furthermore, Fig. 1 displays the Pareto fronts produced by the three compared algorithms. In each figure, the horizontal axis represents the classification error rate, and the vertical axis represents the cost value of the feature subset. We can see from Fig. 1 that, for the dataset Vehicle, the proposed algorithm and FPDMOPSO are better than MFACPSO. Both the proposed algorithm and FPDMOPSO find the boundary point with 0 cost value. For the dataset WDBC, the proposed algorithm and FPDMOPSO both have good Pareto fronts, which dominate part of the solutions of MFACPSO. For the dataset Ionosphere, compared with the proposed algorithm and FPDMOPSO, the solutions produced by MFACPSO are over-concentrated in the left part of the Pareto front (f2 < 0.07).

4 Conclusion

Aimed at fuzzy cost-based feature selection problems, this paper proposed an improved multi-objective particle swarm optimization algorithm. The proposed algorithm was applied to several data sets and compared with two fuzzy multi-objective particle swarm optimization algorithms; the experimental results showed that it has clear advantages in both convergence and distribution.

Acknowledgments

This work was jointly supported by the National Natural Science Foundation of China (No. 61473299, 61473298, 61573361), Outstanding Innovation Team of China University of Mining and Technology (No. 2015QN003), and the Natural Science Foundation of Jiangsu Province (No. BK20130207).

References

1. Al-Ani, Alsukker and Khushaba, Feature subset selection using differential evolution and a wheel based search strategy, Swarm and Evolutionary Computation 9 (2013), 15–26.

2. … and Sheng V.S., A robust regularization path algorithm for ν-support vector classification, IEEE Transactions on Neural Networks and Learning Systems (2016), DOI: 10.1109/TNNLS.2016.2527796, pp. 1–8.

3. Chen B.J., Shu H.Z., Coatrieux et al., Color image analysis by quaternion-type moments, Journal of Mathematical Imaging and Vision 51(1) (2015), 124–144.

4. Xue, Zhang and Browne, Particle swarm optimization for feature selection in classification: A multi-objective approach, IEEE Transactions on Cybernetics 43(6) (2013), 1656–1671.

5. Zitzler and Thiele, Multiobjective evolutionary algorithms: A comparative case study and the strength Pareto approach, IEEE Transactions on Evolutionary Computation 3(4) (1999), 257–271.

6. Min, Hu Q.H. and Zhu, Feature selection with test cost constraint, International Journal of Approximate Reasoning 55 (2014), 167–179.

7. Liu and Motoda, Less is more, in: Computational Methods of Feature Selection, Liu, Motoda (Eds.), Taylor and Francis Group, LLC, New York, USA, 2008, pp. 3–17.

8. Roberto H.W., George D.C., Renato F.C. et al., A global-ranking local feature selection method for text categorization, Expert Systems with Applications 39(17) (2012), 12851–12857.

9. Kennedy and Eberhart R.C., Particle swarm optimization, Proceedings of the IEEE International Conference on Neural Networks (1995), 1942–1948.

10. …, Li X.L., Yang and Sun X.M., Segmentation-based image copy-move forgery detection scheme, IEEE Transactions on Information Forensics and Security 10(3) (2015), 507–518.

11. Mariano, Mar, Amelia and Victoria Rodríguez, Linear programming with fuzzy parameters: An interactive method resolution, European Journal of Operational Research 177 (2007), 1599–1609.

12. Chen L.F., Su C.T., Chen K.H. and Wang P.C., Particle swarm optimization for feature selection with application in obstructive sleep apnea diagnosis, Neural Computing & Applications 21(8) (2012), 2087–2096.

13. Philipp and Daniel E.S., An optimization algorithm for imprecise multi-objective problem functions, Proceedings of the IEEE Congress on Evolutionary Computation (CEC 2005) (2005), 459–466.

14. Chuang L.Y., Yang C.H. and Li J.C., Chaotic maps based on binary particle swarm optimization for feature selection, Applied Soft Computing 11(1) (2011), 239–248.

15. Turney P.D., Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm, Journal of Artificial Intelligence Research 2 (1995), 369–409.

16. Cheng and Jin, A competitive swarm optimizer for large scale optimization, IEEE Transactions on Cybernetics 45(2) (2015), 191–205.

17. Ganguly, Sahoo N.C. and Das, Multi-objective particle swarm optimization based on fuzzy-Pareto-dominance for possibilistic planning of electrical distribution systems incorporating distributed generation, Fuzzy Sets and Systems 213 (2013), 47–73.

18. Tabakhi and Moradi, Relevance-redundancy feature selection based on ant colony optimization, Pattern Recognition 48(9) (2015), 2798–2811.

19. Jing S.Y., A hybrid genetic algorithm for feature subset selection in rough set theory, Soft Computing 18(7) (2014), 1373–1382.

20. Niknam, Meymand H.Z. and Mojarrad H.D., A novel multi-objective fuzzy adaptive chaotic PSO algorithm for optimal operation management of distribution network with regard to fuel cell power plants, European Transactions on Electrical Power 21(7) (2011), 1954–1983.

21. Bolón-Canedo, Porto-Díaz et al., A framework for cost-based feature selection, Pattern Recognition 47 (2014), 2481–2489.

22. Wang, Yang, Teng, Xia and Jensen, Feature selection based on rough sets and particle swarm optimization, Pattern Recognition Letters 28(4) (2007), 459–471.

23. Wen X.Z., Shao, Xue and Fang, A rapid learning algorithm for vehicle classification, Information Sciences 295(1) (2015), 395–406.

24. Zhang and Gong D.W., Feature selection algorithm based on bare bones particle swarm optimization, Neurocomputing 148 (2015), 150–157.

25. Zhang, Gong D.W. and Cheng, Multi-objective particle swarm optimization approach for cost-based feature selection in classification, IEEE/ACM Transactions on Computational Biology and Bioinformatics (2015), DOI: 10.1109/TCBB.2015.2476796, pp. 1–13.

26. Zhang, Gong D.W. and Zhang J.H., Robotic path planning in uncertain environment using multi-objective particle swarm optimization, Neurocomputing 103 (2013), 172–185.

27. Zhang, Gong D.W. and Rong, Multi-objective differential evolution algorithm for multi-label feature selection in classification, Lecture Notes in Computer Science 9140 (2015), 339–345.

28. Zhang, Gong D.W. and Ding Z.H., A bare-bones multi-objective particle swarm optimization algorithm for environmental/economic dispatch, Information Sciences 192 (2012), 212–227.

29. Zhang, Zhang W.Q., Guo et al., An improved PSO algorithm for interval multi-objective optimization systems, IEICE Transactions on Information and Systems E99-D(9) (2016), 2381–2384.