Intelligent fuzzy rough set based feature selection using swarm algorithms with improved initialization

Abstract

This paper focuses on fuzzy rough sets, a fusion of fuzzy set and rough set theory, for feature selection. Swarm algorithms are used to select an appropriate feature subset, with the fuzzy rough dependency measure as the fitness function. The paper demonstrates that, by optimizing this fitness function, swarm algorithms are capable of selecting the best subset of features. Further, an attempt has been made to improve swarm-based algorithms, namely Intelligent Dynamic Swarm (IDS) and Particle Swarm Optimization (PSO), through a modified initialization of solutions for the feature selection task. Improvements in the size of the reducts and in their classification accuracy are observed when initialization is done using the proposed method. Statistical t-tests have also been performed to validate the results.

Keywords

Feature selection, fuzzy rough set, rough set, particle swarm optimization, intelligent dynamic swarm, classification accuracy, t-test

1 Introduction

Feature selection is a key task in pattern recognition and many other application areas. It is an important preprocessing step for almost any application: it counters the curse of dimensionality and retains only the useful and significant features of the problem at hand.

In datasets involving a large number of features, there may exist irrelevant (noisy) and/or redundant features [1]. Since such features may degrade computational efficiency, a process known as feature selection or attribute reduction becomes necessary, which detects and discards these irrelevant and redundant features while maintaining acceptable classification accuracy. Thus, a feature selection technique is expected to select a reduced subset of features carrying the relevant information from a dataset [5, 26].

The correspondence between the reduced features and the class labels should reflect the correspondence between the unreduced features and the class labels, i.e. the underlying meaning of the dataset should not be altered. Thus, feature selection reduces the dimension of the problem at hand and lowers the computational time and space requirements. A number of feature selection techniques exist [2, 26] (and many more in the literature), but no single technique is able to guarantee an optimal feature set.

To select features, fuzzy rough sets are used, with the fuzzy rough dependency measure serving as a metric of how strongly the decision feature depends on the selected set of features. This dependency measure is the fitness function, and it is maximized using swarm algorithms. This paper demonstrates that, by optimizing the fitness function, swarm algorithms are capable of producing the best subset of features.

Even for a dataset with a moderate number of features, the number of possible feature subsets in the search space is very high (2^N, where N denotes the number of input features). Hence swarm intelligence (population-based) techniques such as Particle Swarm Optimization (PSO) [25] and Intelligent Dynamic Swarm (IDS) [2] have been used in the literature. These techniques employ random initialization of the population.

This paper proposes an alternative method for initializing the population used to search for an optimal set of features. Such an optimal set of features is referred to as a reduct in the literature.

The preliminary findings obtained by applying fuzzy rough set theory with PSO and IDS to search for optimal features (reducts) using the proposed initialization were published in [23] by the authors of this paper. The wider applicability and validity of the proposed methods still needed to be examined on test datasets having a larger number of features and objects. A statistical analysis of the significance of the results was also needed to establish that they were not obtained by chance. Hence, in this paper, the statistical t-test and non-parametric tests (the Wilcoxon and Friedman tests) have been performed to establish the statistical significance of the proposed method. Additional datasets, namely Cleveland, Glass, Lung and Soybean small, have been used to show its general applicability. Further, an additional classifier, PART, has been used to verify the classification capability of the resulting reducts. The proposed method has additionally been compared with the Genetic Algorithm (GA). It is established that, by applying the proposed method, the reducts improve while an acceptable mapping of features to class labels is maintained.

2 Background

2.1 Fuzzy rough sets [14, 18]

Fuzzy rough sets, a fuzzy generalization of rough sets [29], were introduced by Dubois and Prade [4]. Fuzzy sets are a continuous generalization of set characteristic functions, while rough sets are a calculus of partitions. Combining these two notions leads to the consideration of rough approximations of fuzzy sets [1]. A more general approach has been suggested by Radzikowska and Kerre [1]. They defined a family of fuzzy rough sets, each member of which is known as an (I, T) fuzzy rough set and is determined by a fuzzy implicator I and a triangular norm T [1].

Fuzzy rough sets deal with both vagueness (through fuzzy sets) and indiscernibility (through rough sets). Each of these concepts arises from uncertainty in knowledge [14, 18].

2.1.1 Fuzzy Lower approximation based Fuzzy Rough Feature Selection (L-FRFS)[18]

In the process of dimensionality reduction, the membership grades of feature values to fuzzy sets are not exploited. With fuzzy rough sets, this information can be used to better guide feature selection. The fuzzy rough method for feature selection alleviates the problems encountered by rough set feature selection, such as handling real-valued features and dealing with noise [12]. Discretization of the dataset is not necessary, and fuzzy rough sets can be applied directly to the reduction of continuous or numerical attributes [3].

A different method for feature selection, named L-FRFS, has been introduced in [1, 18]. L-FRFS uses fuzzy partitioning of the input space. The fuzzy lower and fuzzy upper approximations are defined as: $μ_{\underline{R_{P}} A} (a) = \inf_{b \in U} I (μ_{R_{P}} (a, b), μ_{A} (b))$ (1) $μ_{\overline{R_{P}} A} (a) = \sup_{b \in U} T (μ_{R_{P}} (a, b), μ_{A} (b))$ (2)

Here I is a fuzzy implicator, defined as (min (1, 1 - a + b)), and R_P is the Fuzzy Similarity Matrix (FSM) induced by the subset of features P, as described below [14]:

$μ_{R_{P}} (a, b) = T_{e \in P} \{μ_{R_{e}} (a, b)\}$ (3) where T is a t-norm (max (a + b - 1, 0)) and $μ_{R_{e}} (a, b)$ is the degree of similarity between objects a and b for feature e.

The FSM can be constructed using the following equation: $μ_{R_{e}} (a, b) = \max (\min (\frac{e (b) - e (a) + σ_{e}}{σ_{e}}, \frac{e (a) - e (b) + σ_{e}}{σ_{e}}), 0)$ (4) where $σ_{e}$ denotes the standard deviation of attribute e.
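As an illustration of equation (4), the following minimal Python sketch builds the FSM of a single feature from its column of values. The function name `fuzzy_similarity_matrix`, the use of the population standard deviation, and the treatment of a constant feature (all similarities set to 1) are assumptions made for this sketch; they are not specified in the text.

```python
import statistics


def fuzzy_similarity_matrix(values):
    """Fuzzy similarity matrix of equation (4) for one feature.

    values: list of the feature's values, one per object.
    """
    n = len(values)
    # Assumption: population standard deviation; the sample standard
    # deviation is an equally plausible reading of sigma_e.
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # Assumption: a constant feature cannot separate any two objects,
        # so every pair is treated as fully similar.
        return [[1.0] * n for _ in range(n)]
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            left = (values[b] - values[a] + sigma) / sigma
            right = (values[a] - values[b] + sigma) / sigma
            matrix[a][b] = max(min(left, right), 0.0)
    return matrix
```

Each entry equals 1 on the diagonal and falls linearly to 0 once two objects differ by more than one standard deviation on the feature.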

Fuzzy positive region may be defined as:

$μ_{{POS}_{R_{P}} (Q)} (a) = sup_{A \in U / Q} μ_{\underline{R_{P}} A} (a)$ (5)

The fuzzy rough dependency measure $γ'_{P} (Q)$ may be defined as: $γ'_{P} (Q) = \frac{Σ_{a \in U} μ_{{POS}_{R_{P}} (Q)} (a)}{| U |}$ (6)

A fuzzy rough reduct R should preserve the degree of dependency of the entire dataset, i.e. $γ'_{R} (D) = γ'_{C} (D)$, where C is the set of conditional features and D is the decision feature.

The fuzzy connectives used throughout this paper are the Łukasiewicz fuzzy implicator (min (1, 1 - a + b)) and the Łukasiewicz t-norm (max (0, a + b - 1)).

Equation (3) can be used to find the FSM induced by any subset of features P. For example, if P = {m, e}, then $μ_{R_{me}} (a, b) = T \{μ_{R_{m}} (a, b), μ_{R_{e}} (a, b)\}$

For a detailed description of these computations, readers may refer to [18].
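The chain from equation (3) through equations (1) and (5) to equation (6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the decision feature is assumed to be crisp (each object belongs to exactly one class), the per-feature FSMs are assumed to be precomputed, an empty feature subset is assumed to have dependency 0, and all function names are hypothetical.

```python
from functools import reduce


def lukasiewicz_tnorm(x, y):
    return max(x + y - 1.0, 0.0)


def lukasiewicz_implicator(x, y):
    return min(1.0, 1.0 - x + y)


def combined_similarity(fsms, subset, a, b):
    """Equation (3): fold the t-norm over the features in the subset."""
    return reduce(lukasiewicz_tnorm, (fsms[e][a][b] for e in subset))


def dependency(fsms, subset, labels):
    """Equation (6) for feature indices `subset` and crisp class labels.

    fsms: list of per-feature similarity matrices (n x n nested lists).
    """
    if not subset:
        return 0.0  # assumption: no features selected means no dependency
    n = len(labels)
    total = 0.0
    for a in range(n):
        positive = 0.0
        for cls in set(labels):
            # Equation (1): infimum over all objects b of I(R_P(a, b), A(b)).
            lower = min(
                lukasiewicz_implicator(
                    combined_similarity(fsms, subset, a, b),
                    1.0 if labels[b] == cls else 0.0,
                )
                for b in range(n)
            )
            # Equation (5): supremum over the decision classes.
            positive = max(positive, lower)
        total += positive
    return total / n
```

With two objects of different classes, an identity similarity matrix (perfectly distinguishable objects) gives a dependency of 1, while an all-ones matrix (indistinguishable objects) gives 0.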

3 L-FRFS using PSO and IDS

In this work, PSO and IDS are applied to feature reduction, and the fitness function is computed using the dependency measure of L-FRFS suggested in [14, 22]. The search returns the combination of selected features that yields the maximum fitness value. A solution is represented as a string of 0s and 1s, where '1' denotes a selected feature and '0' a dropped feature.

All particles are initialized as random sequences of 0s and 1s. To evaluate the fitness function, the fuzzy lower approximations are first computed for all classes; the fuzzy positive region and the fuzzy rough dependency measure are then computed from them.

The PSO and IDS techniques are introduced in the following subsections.

3.1 Particle Swarm Optimization (PSO)

In this section, the traditional PSO algorithm and its modified version suitable for the feature selection task, have been discussed.

3.1.1 Traditional PSO

The PSO algorithm mimics the social behavior of bird flocking [16, 28]. In PSO, a set of solutions, called the population, is randomly initialized. Each solution X_i tries to improve itself with the help of a velocity term V_i, expressed as a linear function of the best solution it has achieved so far and the global best solution, as described below. $V_{i} = w * V_{i} + r_{1} * C_{1} (P_{i} - X_{i}) + r_{2} * C_{2} (P_{G} - X_{i})$ (7) $X_{i} = X_{i} + V_{i}$ (8) where w is the inertia weight, r₁ and r₂ are random numbers, and C₁ and C₂ are constants of equal value. P_i is the personal best of individual i, P_G is the global best solution, and X_i is the position of the ith solution.
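A single update of equations (7) and (8) for one particle can be sketched in Python as below. The sketch assumes that r₁ and r₂ are drawn once per particle per step from the uniform distribution on [0, 1) (drawing them per dimension is another common choice) and uses the value 2 for C₁ and C₂ quoted in Subsection 5.3. The function name `pso_step` is hypothetical.

```python
import random


def pso_step(position, velocity, personal_best, global_best,
             w, c1=2.0, c2=2.0, rng=random):
    """One continuous PSO update: returns (new_position, new_velocity)."""
    r1 = rng.random()  # assumption: one draw per particle, not per dimension
    r2 = rng.random()
    new_velocity = [
        w * v + r1 * c1 * (p - x) + r2 * c2 * (g - x)  # equation (7)
        for x, v, p, g in zip(position, velocity, personal_best, global_best)
    ]
    new_position = [x + v for x, v in zip(position, new_velocity)]  # equation (8)
    return new_position, new_velocity
```

A particle that already sits on both its personal best and the global best, with zero velocity, stays where it is, since every term of equation (7) vanishes.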

3.1.2 Modified PSO for feature selection

PSO is modified into a binary PSO to suit the selection of features, as suggested in [25], in which the velocity V_i is obtained using the following equation: $V_{i} = round (w * V_{i} + r_{1} * C_{1} * velocityupdate (P_{i} - X_{i}) + r_{2} * C_{2} * velocityupdate (P_{G} - X_{i}))$ (9) where V_i is an integer value.

Function velocityupdate (P_G - X_i) returns an integer.

V_i is then bounded as follows: $V_{i} = \begin{cases} 1, & \text{if } V_{i} \leq 1 \\ round (N / 3), & \text{if } V_{i} > N / 3 \\ V_{i}, & \text{otherwise} \end{cases}$ (10)
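The bounding rule of equation (10) is small enough to state directly in Python. The function name `clamp_velocity` is hypothetical, and Python's built-in `round` is assumed to be an acceptable stand-in for the rounding used in [25].

```python
def clamp_velocity(v, n_features):
    """Bound an integer velocity as in equation (10)."""
    if v <= 1:
        return 1
    if v > n_features / 3:
        return round(n_features / 3)
    return v
```

For N = 30 features, velocities are therefore confined to the range 1 to 10.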

Using this value of V_i, the solution X_i of length N, where N is the total number of features, is updated using the following equation: $X_{i} = positionupdate (P_{G}, X_{i}, V_{i})$ (11)

For further details of these functions and algorithms, readers may refer to [25].

3.1.3 Algorithm for PSO

1. Initialize the population randomly and initialize the parameter values of equation (9).

2. Calculate the fitness of every particle using equation (15).

3. For each particle, compare the current fitness value with its pbest value. If the current value is better, set pbest to the current value and P_i to the current solution X_i.

4. Identify the particle with the best fitness value found so far; denote this value gbest and its position P_G.

5. Update the velocities and positions of all particles using equations (9) and (11).

6. Repeat steps 2 to 5 until a stopping criterion is met.

3.2 Intelligent Dynamic Swarm (IDS)

IDS is closely related to PSO and is an adaptation of PSO for problems involving discrete variables. IDS replaces the PSO update of Equations (7) and (8) with the following expression for the jth element X_ij of individual X_i: $X_{ij} = \begin{cases} P_{ij}, & \text{if } R_{ij} \in (C_{w}, C_{p}) \\ P_{Gj}, & \text{if } R_{ij} \in (C_{p}, C_{g}) \\ x, & \text{if } R_{ij} \in (C_{g}, 1) \\ X_{ij}, & \text{if } R_{ij} \in (0, C_{w}) \end{cases}$ (12)

From the above equation, element X_ij takes the corresponding jth value of P_i (the personal best of individual i) or of P_G (the global best of the population), takes a random value x, or remains unchanged, depending on the range in which the random number R_ij falls. The parameters C_w, C_p and C_g in equation (12) are chosen such that 0 ≤ C_w ≤ C_p ≤ C_g ≤ 1.
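The element-wise rule of equation (12) can be sketched in Python as follows. Two points are assumptions of this sketch: the random value x is taken to be a random bit, since solutions are binary feature masks, and the interval boundaries are handled as half-open ranges. The default parameter values are those quoted in Subsection 5.3; the function name `ids_update` is hypothetical.

```python
import random


def ids_update(position, personal_best, global_best,
               c_w=0.1, c_p=0.4, c_g=0.9, rng=random):
    """Element-wise IDS update of one binary solution (equation (12))."""
    new_position = []
    for x, p, g in zip(position, personal_best, global_best):
        r = rng.random()
        if r < c_w:
            new_position.append(x)  # keep the current element
        elif r < c_p:
            new_position.append(p)  # copy from the personal best
        elif r < c_g:
            new_position.append(g)  # copy from the global best
        else:
            new_position.append(rng.randint(0, 1))  # assumption: random bit
    return new_position
```

With C_w = 0.1, C_p = 0.4 and C_g = 0.9, each element is kept with probability 0.1, copied from the personal best with probability 0.3, copied from the global best with probability 0.5, and randomized with probability 0.1.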

The parameters used for IDS are described in Subsection 5.3.

The algorithm for IDS is as follows:

1. Initialize the population randomly and initialize the parameter values.

2. Compute the fitness of each solution.

3. For each particle, compare the current fitness value with its pbest value. If the current value is better, set pbest to the current value.

4. Identify the particle with the best fitness value found so far and denote it gbest.

5. For each particle, generate a random number r between 0 and 1, then:

(a) keep the current particle unchanged if r lies between 0 and C_w;

(b) otherwise, replace the current particle with the particle holding its pbest value if r lies between C_w and C_p;

(c) otherwise, replace the current particle with the particle holding the gbest value if r lies between C_p and C_g;

(d) otherwise, replace the current particle with a randomly generated particle if r lies between C_g and 1.

6. Repeat steps 2 to 5 until a stopping criterion is met.
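The steps above can be sketched as a short Python loop. This is an illustrative reading of the listed steps, not the authors' MATLAB implementation: it replaces whole particles, as the steps describe, uses a fixed iteration count as the stopping criterion, and takes the fitness function as a caller-supplied argument. The names `ids_search` and `random_solution` are hypothetical.

```python
import random


def ids_search(fitness, n_features, pop_size=20, iterations=50,
               c_w=0.1, c_p=0.4, c_g=0.9, rng=random):
    """Particle-level IDS loop; returns (best_solution, best_fitness)."""
    def random_solution():
        return [rng.randint(0, 1) for _ in range(n_features)]

    # Step 1: random initial population.
    population = [random_solution() for _ in range(pop_size)]
    # Steps 2 to 4: initial fitness, personal bests and global best.
    pbest = [list(s) for s in population]
    pbest_fit = [fitness(s) for s in population]
    top = max(range(pop_size), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = list(pbest[top]), pbest_fit[top]

    for _ in range(iterations):  # step 6: fixed iteration budget
        for i in range(pop_size):
            r = rng.random()  # step 5
            if r < c_w:
                pass  # (a) keep the particle as it is
            elif r < c_p:
                population[i] = list(pbest[i])  # (b)
            elif r < c_g:
                population[i] = list(gbest)  # (c)
            else:
                population[i] = random_solution()  # (d)
            # Steps 2 to 4 for the updated particle.
            f = fitness(population[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = list(population[i]), f
            if f > gbest_fit:
                gbest, gbest_fit = list(population[i]), f
    return gbest, gbest_fit
```

As a usage example, `ids_search(sum, n_features=6)` searches for the mask with the most selected bits under a toy fitness; in the setting of this paper the fitness argument would instead evaluate equation (15).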

4 Proposed methods

To augment the performance of PSO and IDS, a new method for initializing the population is proposed.

In this paper, the population of solutions is generated using a newly proposed initialization technique called Distributed Sampled (DS) initialization. The technique divides the search space into three parts and then draws solution samples randomly from each part.

The proposed DS-initialization technique has been implemented for Particle Swarm Optimization (PSO) and Intelligent Dynamic Swarm (IDS), and its effectiveness has been evaluated on feature reduction for a few benchmark datasets [10] from the literature.

The proposed method of initialization gives better variation in the number of selected features across the population, i.e. low, medium and high. In this method (see Table 1), the first one-third of the population is initialized with a bias towards selecting few features, the second one-third towards a medium number of features, and the last one-third towards a high number of features.

Table 1
Initialization of solutions for random and proposed methods

Proposed DS-initialization			Random initialization
Part of population	For every solution	Value assigned	For the whole population	For every solution	Value assigned
p₁ to p_(p/3)	If rand > p₁	1	P	If rand > p_i	1
	otherwise	0		otherwise	0
p_(p/3)+1 to p_(2p/3)	If rand > p₂	1
	otherwise	0
p_(2p/3)+1 to p_p	If rand > p₃	1
	otherwise	0
Where, 0 < p₁ < p₂ < p₃ < 1			Where, 0 < p_i < 1
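A minimal Python sketch of DS-initialization, following the rule "bit = 1 if rand > threshold" from Table 1, is given below. A threshold t gives each bit a probability 1 - t of being 1. The threshold values (0.8, 0.5, 0.2) are illustrative assumptions, chosen here so that the first third of the population selects few features and the last third selects many, as described in the text; the paper does not state the values it used. The function name `ds_initialize` is hypothetical.

```python
import random


def ds_initialize(pop_size, n_features, thresholds=(0.8, 0.5, 0.2), rng=random):
    """Distributed Sampled initialization of a binary population.

    The population is split into three consecutive groups, and each group
    uses its own threshold when sampling the bits of its solutions.
    """
    population = []
    for i in range(pop_size):
        group = 3 * i // pop_size  # 0, 1 or 2
        t = thresholds[group]
        solution = [1 if rng.random() > t else 0 for _ in range(n_features)]
        population.append(solution)
    return population
```

Random initialization corresponds to using one threshold for the whole population, for example `thresholds=(0.5, 0.5, 0.5)`.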

In the present work, PSO and IDS have been implemented using DS-initialization, and the results have been compared with those of the RANDOM versions and verified using t-tests. Because of DS-initialization, near-optimal solutions are more likely to be present in the initial population; consequently, the population is expected to converge towards optimal solutions in fewer iterations, which reduces the chance of getting stuck in local optima.

5 Experimental evaluation

All the algorithms used in this paper are implemented in MATLAB. The parameters and assumptions regarding the experiments and data are discussed below.

5.1 Dataset

All the datasets are taken from the UCI data repository [10]. The characteristics of the datasets used in the present work are described in Table 2.

Table 2
Characteristics of datasets

S. No.	Dataset	No. of Objects	No. of Features
1	Cleveland	303	13
2	Ecoli	336	7
3	Glass	214	9
4	Ionosphere	351	34
5	Lung	32	56
6	Soybean small	47	35
7	Wine	178	13
8	LSVT	126	310

5.2 Data normalization

In this work, every dataset is scaled into the range [0, 1] using the following equation: $E_{i}^{norm} = \frac{E_{i}^{t} - E_{i}^{\min}}{E_{i}^{\max} - E_{i}^{\min}}$ (13) where $E_{i}^{norm}$ is the normalized value of the ith element of a given feature, $E_{i}^{t}$ is its original value, and $E_{i}^{\min}$ and $E_{i}^{\max}$ are respectively the least and largest values among all the elements of that feature.
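Equation (13) applied to one feature column can be sketched in Python as follows. The handling of a constant column (mapped to all zeros to avoid division by zero) is an assumption of this sketch, and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(column):
    """Scale one feature column into [0, 1] as in equation (13)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0 for every object.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]
```

For instance, the column [2, 4, 6] becomes [0.0, 0.5, 1.0].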

Further, an FSM is computed using equation (4) for each individual feature of the corresponding dataset. These FSMs are then combined to compute the dependency measure for feature sets containing more than one feature.

5.3 Parameter Setting

The parameters used for PSO [25] in equation (7) are as follows: the inertia weight w decreases from 1.4 to 0.4 according to equation (14), and C₁ and C₂ take their frequently used value of 2. $w = (w - 0.4) * \frac{Max\_iter - Current\_iter}{Max\_iter} + 0.4$ (14)
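One reading of equation (14) is a linear schedule from 1.4 down to 0.4, obtained by taking the w on the right-hand side to be the initial weight 1.4. The Python sketch below follows that reading, which is an assumption; if w on the right-hand side were instead the current weight, the decay would be faster than linear. The function name `inertia_weight` is hypothetical.

```python
def inertia_weight(current_iter, max_iter, w_start=1.4, w_end=0.4):
    """Linearly decreasing inertia weight (one reading of equation (14))."""
    return (w_start - w_end) * (max_iter - current_iter) / max_iter + w_end
```

Under this reading the weight is 1.4 at iteration 0, 0.9 halfway through the run and 0.4 at the final iteration.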

The parameters used for IDS [2] are set as C_w = 0.1, C_p = 0.4 and C_g = 0.9.

5.4 Fitness value

The fitness value depends on the fuzzy rough dependency measure $γ'_{P} (Q)$ given in equation (6). It is computed using the following formula [25]: $FitnessValue = 0.9 * γ'_{P} (Q) + 0.1 * \frac{| C | - | R |}{| C |}$ (15)

where $γ'_{P} (Q)$ is the dependency measure of the selected reduct R, i.e. it is evaluated with P = R and Q = D, the decision feature; | R | is the number of features in R and | C | is the total number of conditional features in the dataset.
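Equation (15) translates directly into Python. In the sketch below the dependency value is assumed to be computed elsewhere and passed in; the function and parameter names are hypothetical.

```python
def fitness_value(dependency, n_selected, n_total):
    """Equation (15): weighted sum of dependency and relative reduction.

    dependency: fuzzy rough dependency of the selected subset, in [0, 1].
    n_selected: number of features in the candidate reduct, |R|.
    n_total:    total number of conditional features, |C|.
    """
    return 0.9 * dependency + 0.1 * (n_total - n_selected) / n_total
```

The 0.9 weight means that dependency dominates: a subset with full dependency that keeps every feature scores 0.9, and the remaining 0.1 rewards dropping features.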

5.5 Classification accuracy

In this work, the J48, JRip and PART classifiers are applied to each dataset to calculate stratified tenfold cross-validation accuracy for comparison purposes. J48 is the Java version of the decision-tree-based classifier C4.5 [8, 9]; JRip [24] is based on learning propositional rules; and PART [7] generates rules by repeatedly creating partial decision trees from the data.

The classification accuracies of J48 [9], JRip [24] and PART [7] were computed using the data mining workbench WEKA [7, 11].

5.6 Statistical analysis

To validate the results, statistical t-tests are performed on both the classification accuracy and the reduced subset size, taking the DS-initialized PSO and IDS feature selection methods as the reference. Statistical analysis helps to ensure that the results are not obtained by chance. The t-test is a parametric test based on the assumption that the data groups under comparison are drawn from a normal distribution. In general, the normality of the data is assumed rather than verified, and the t-test is then not fully reliable. In view of this, the results have also been validated using non-parametric tests, namely the Wilcoxon and Friedman tests. A significance level of 0.05 is used in the t-tests and the Wilcoxon test. In Tables 3 and 4, the symbol “*” denotes that the proposed method performs worse than the indicated method, “-” denotes that it performs equally well, and “v” denotes that it performs better. For example, in Table 4, for the Lung(56) dataset, the symbol “v” marked against IDS-RANDOM indicates that IDS-DS performs better than IDS-RANDOM. Similarly, a symbol “-” marked against PSO-RANDOM indicates that the performance of IDS-DS is equal to that of PSO-RANDOM. Thus, a larger number of “v” or “-” symbols indicates that IDS-DS is better than or equal to the other methods listed in Table 4.

6 Result and discussion

Randomly initialized and Distributed Sampled (DS, the proposed) initialized PSO and IDS have been executed 25 times for each of the benchmark datasets. Each run consists of 100 generations with a population size of 100. Classification accuracies obtained with the J48, JRip and PART classifiers are reported in terms of their best, mean and s.d. values. Statistical t-tests are also performed on the classification accuracy and the reduced subset size, with DS-initialized PSO and IDS as the reference.

Our objective is to reduce the number of features while maintaining a high dependency measure. PSO-DS and IDS-DS are better in terms of this objective than random PSO, random IDS and GA. Further, the effect of the optimized reducts on classification accuracy has been investigated.

Table 3 shows the results of the t-test and Wilcoxon test for the PSO-DS method, and Table 4 shows the same for IDS-DS, in terms of subset size and classification accuracy. In these tables, the statistical significance of PSO-DS and IDS-DS relative to PSO-random, IDS-random and GA is tabulated. Under the t-test, the proposed PSO-DS and IDS-DS are comparable or better in every case for all three classifiers used in this work, while the Wilcoxon test marks some cases as worse (“*”).

Table 3
Comparison of Selected number of features and Classification accuracy using different classifiers for PSO-DS method along with Statistical significance using t-tests and Wilcoxon tests

	Feature	Feature				Classification Accuracy
Dataset	Selection	Subset Size				Classifier: J48				Classifier: JRip				Classifier: PART
	Method	Best	Mean(s.d.)	T	W	Best	Mean(s.d.)	T	W	Best	Mean(s.d.)	T	W	Best	Mean(s.d.)	T	W
Cleveland(13)	PSO-RANDOM	6	6.26(0.46)	-	v	52.47	52.09(0.7)	-	v	53.79	53.54(0.5)	-	v	52.14	51.24(1.89)	-	v
	IDS-RANDOM	6	6.51(0.53)	v	v	52.47	51.69(0.8)	-	v	53.79	53.52(0.29)	-	v	52.14	51.14(1.87)	-	v
	GA	7	7.5(0.51)	v	v	55.77	53.33(1.63)	-	*	55.77	54.11(1.03)	-	*	54.45	51.33(1.87)	-	v
	PSO-DS	6	6(0)	52.47	52.47(0)	53.79	53.79(0)	52.14	52.14(0)
Ecoli(7)	PSO-RANDOM	5	5(0)	-	-	82.44	82.44(0)	-	-	81.25	81.25(0)	-	-	80.65	80.65(0)	-	-
	IDS-RANDOM	5	5(0)	-	-	82.44	82.44(0)	-	-	81.25	81.25(0)	-	-	80.65	80.65(0)	-	-
	GA	5	5(0)	-	-	82.44	82.44(0)	-	-	81.25	81.25(0)	-	-	80.65	80.65(0)	-	-
	PSO-DS	5	5(0)	82.44	82.44(0)	81.25	81.25(0)	80.65	80.65(0)
Glass(9)	PSO-RANDOM	8	8(0)	-	-	64.49	64.49(0)	-	-	69.16	69.16(0)	-	-	68.69	68.69(0)	-	-
	IDS-RANDOM	8	8(0)	-	-	64.49	64.49(0)	-	-	69.16	69.16(0)	-	-	68.69	68.69(0)	-	-
	GA	8	8(0)	-	-	66.35	64.74(0.65)	-	*	69.62	68.98(0.66)	-	v	68.69	68.58(0.29)	-	v
	PSO-DS	8	8(0)	64.49	64.49(0)	69.16	69.16(0)	68.69	68.69(0)
Ionosphere(34)	PSO-RANDOM	6	7.34(0.89)	-	v	91.74	89.51(1.34)	-	v	91.45	89.63(1.24)	-	-	91.46	89.58(1.67)	-	-
	IDS-RANDOM	7	8.17(0.72)	-	v	93.73	90.2(2.05)	-	*	92.02	89.18(1.87)	-	-	93.73	89.37(2.42)	-	v
	GA	7	9.75(2.16)	v	v	93.16	89.53(1.97)	-	v	91.45	88.92(1.49)	-	v	92.59	88.94(1.41)	-	v
	PSO-DS	6	7(0.6)	91.74	89.57(1.72)	91.74	89.06(1.78)	91.7	89.53(1.53)
Lung(56)	PSO-RANDOM	5	6.51(1.32)	v	v	71.87	65.89(3.88)	v	v	78.12	66.93(4.89)	v	v	84.37	66.91(8.26)	v	v
	IDS-RANDOM	13	13.42(0.68)	v	v	87.5	71.62(8.99)	-	v	84.37	72.14(8.66)	-	v	81.25	66.13(12.83)	-	v
	GA	16	20.33(3.15)	v	v	84.37	70.58(8.97)	-	v	84.37	70.32(9.17)	-	v	84.37	70.82(7.32)	-	v
	PSO-DS	4	4.91(0.79)	87.5	74.2(7.77)	87.5	76.28(7.1)	87.5	75.26(7.94)
Soybean small(35)	PSO-RANDOM	2	2.59(0.5)	-	v	100	99.47(1.33)	-	v	100	99.12(1.43)	-	v	100	99.63(0.81)	-	-
	IDS-RANDOM	3	3.75(0.63)	v	v	100	99.12(1.92)	-	v	100	98.94(1.44)	-	v	100	99.28(1.87)	-	v
	GA	5	6.57(1.09)	v	v	100	98.22(1.51)	-	v	100	97.86(2.04)	-	v	100	97.88(2.03)	-	v
	PSO-DS	2	2.25(0.45)	100	99.65(0.82)	100	99.65(0.83)	100	99.65(0.83)
Wine(13)	PSO-RANDOM	4	4(0)	-	-	93.82	91.55(2.93)	-	v	92.13	90.39(1.7)	-	v	93.82	91.74(2.42)	-	v
	IDS-RANDOM	4	4(0)	-	-	93.82	93.15(0.67)	-	-	92.13	90.62(1.03)	-	v	93.82	92.73(1.22)	-	-
	GA	5	5(0)	v	v	94.38	89.54(2.56)	v	v	92.69	88.23(2.05)	-	v	94.38	89.75(2.52)	-	v
	PSO-DS	4	4(0)	93.82	93.08(0.81)	92.13	90.92(1.07)	93.82	92.77(1.13)
LSVT(310)	PSO-RANDOM	113	120.95(4.35)	v	v	80.15	75.13(3.31)	-	*	82.53	77.16(2.67)	-	*	81.75	75.39(3.84)	-	*
	IDS-RANDOM	119	122.8(1.98)	v	v	79.37	74.96(3.13)	-	*	86.51	78.72(2.96)	-	*	81.75	75.21(3.33)	-	*
	GA	131	138.34(5.02)	v	v	84.12	75.33(3.49)	-	*	84.92	77.47(4.78)	-	*	82.53	75.57(4.61)	-	*
	PSO-DS	13	16.51(2.12)	82.54	73.15(5.13)	80.95	74.21(3.94)	80.95	73.42(5.52)

Table 4

Comparison of Selected number of features and Classification accuracy using different classifiers for IDS-DS method along with Statistical significance using t-tests and Wilcoxon tests

	Feature	Feature				Classification Accuracy
Dataset	Selection	Subset Size				Classifier: J48				Classifier: JRip				Classifier: PART
	Method	Best	Mean(s.d.)	T	W	Best	Mean(s.d.)	T	W	Best	Mean(s.d.)	T	W	Best	Mean(s.d.)	T	W
Cleveland(13)	PSO-RANDOM	6	6.26(0.46)	-	*	52.47	52.09(0.7)	-	*	53.79	53.54(0.5)	-	-	52.14	51.24(1.89)	-	*
	IDS-RANDOM	6	6.51(0.53)	v	v	52.47	51.69(0.8)	-	-	53.79	53.52(0.29)	-	-	52.14	51.14(1.87)	-	*
	GA	7	7.5(0.51)	v	v	55.77	53.33(1.63)	-	*	55.77	54.11(1.03)	-	*	54.45	51.33(1.87)	-	*
	IDS-DS	6	6.41(0.51)	52.47	51.8(0.82)	53.79	53.53(0.3)	52.14	50.41(2.39)
Ecoli(7)	PSO-RANDOM	5	5(0)	-	-	82.44	82.44(0)	-	-	81.25	81.25(0)	-	-	80.65	80.65(0)	-	-
	IDS-RANDOM	5	5(0)	-	-	82.44	82.44(0)	-	-	81.25	81.25(0)	-	-	80.65	80.65(0)	-	-
	GA	5	5(0)	-	-	82.44	82.44(0)	-	-	81.25	81.25(0)	-	-	80.65	80.65(0)	-	-
	IDS-DS	5	5(0)	82.44	82.44(0)	81.25	81.25(0)	80.65	80.65(0)
Glass(9)	PSO-RANDOM	8	8(0)	-	-	64.49	64.49(0)	-	-	69.16	69.16(0)	-	-	68.69	68.69(0)	-	-
	IDS-RANDOM	8	8(0)	-	-	64.49	64.49(0)	-	-	69.16	69.16(0)	-	-	68.69	68.69(0)	-	-
	GA	8	8(0)	-	-	66.35	64.74(0.65)	-	*	69.62	68.98(0.66)	-	v	68.69	68.58(0.29)	-	v
	IDS-DS	8	8(0)	64.49	64.49(0)	69.16	69.16(0)	68.69	68.69(0)
Ionosphere(34)	PSO-RANDOM	6	7.34(0.89)	-	*	91.74	89.51(1.34)	-	v	91.45	89.63(1.24)	-	-	91.46	89.58(1.67)	-	*
	IDS-RANDOM	7	8.17(0.72)	-	v	93.73	90.2(2.05)	-	*	92.02	89.18(1.87)	-	-	93.73	89.37(2.42)	-	*
	GA	7	9.75(2.16)	v	v	93.16	89.53(1.97)	-	v	91.45	88.92(1.49)	-	v	92.59	88.94(1.41)	-	v
	IDS-DS	7	7.83(0.57)	93.73	89.87(2.16)	92.02	89.32(1.83)	93.73	89.06(2.4)
Lung(56)	PSO-RANDOM	5	6.51(1.32)	v	v	71.87	65.89(3.88)	v	v	78.12	66.93(4.89)	v	v	84.37	66.91(8.26)	v	v
	IDS-RANDOM	13	13.42(0.68)	v	v	87.5	71.62(8.99)	-	v	84.37	72.14(8.66)	-	v	81.25	66.13(12.83)	-	v
	GA	16	20.33(3.15)	v	v	84.37	70.58(8.97)	-	v	84.37	70.32(9.17)	-	v	84.37	70.82(7.32)	-	v
	IDS-DS	5	6.25(0.62)	87.5	78.13(7.18)	87.5	75.25(10.94)	87.5	74.74(8.05)
Soybean small(35)	PSO-RANDOM	2	2.59(0.5)	-	*	100	99.47(1.33)	-	-	100	99.12(1.43)	-	v	100	99.63(0.81)	-	-
	IDS-RANDOM	3	3.75(0.63)	v	v	100	99.12(1.92)	-	v	100	98.94(1.44)	-	v	100	99.28(1.87)	-	v
	GA	5	6.57(1.09)	v	v	100	98.22(1.51)	-	v	100	97.86(2.04)	-	v	100	97.88(2.03)	-	v
	IDS-DS	2	2.75(0.62)	100	99.29(1.38)	100	99.64(0.82)	100	99.64(0.82)
Wine(13)	PSO-RANDOM	4	4(0)	-	-	93.82	91.55(2.93)	-	v	92.13	90.39(1.7)	-	-	93.82	91.74(2.42)	-	v
	IDS-RANDOM	4	4(0)	-	-	93.82	93.15(0.67)	-	v	92.13	90.62(1.03)	-	*	93.82	92.73(1.22)	-	-
	GA	5	5(0)	v	v	94.38	89.54(2.56)	v	v	92.69	88.23(2.05)	-	v	94.38	89.75(2.52)	-	v
	IDS-DS	4	4(0)	93.82	93.55(0.51)	91.57	90.33(0.75)	93.82	93.12(1.26)
LSVT(310)	PSO-RANDOM	113	120.95(4.35)	v	v	80.15	75.13(3.31)	-	v	82.53	77.16(2.67)	-	*	81.75	75.39(3.84)	-	*
	IDS-RANDOM	119	122.8(1.98)	v	v	79.37	74.96(3.13)	-	*	86.51	78.72(2.96)	-	*	81.75	75.21(3.33)	-	*
	GA	131	138.34(5.02)	v	v	84.12	75.33(3.49)	-	v	84.92	77.47(4.78)	-	*	82.53	75.57(4.61)	-	*
	IDS-DS	16	20.76(2.87)	80.95	75.54(4.3)	83.33	75.15(3.84)	78.57	72.61(3.75)

It is observed from Tables 3 and 4 that the fuzzy rough set methodology reduces the size of the feature subset while giving acceptable and comparable classification accuracies. The reduced feature subsets maintain high fuzzy rough dependency measures. Both random and the proposed DS initialization are used with PSO and IDS to perform the feature selection task.

All the datasets are benchmark datasets, and the significance of DS initialization is visible in the reasonably large datasets such as Soybean small, Lung and LSVT. For example, in the case of LSVT, PSO-DS gives a mean subset size of 16.51 and IDS-DS gives 20.76, compared with 120.95 for PSO-random and 122.8 for IDS-random, where the unreduced feature count is 310; the feature subsets (reducts) obtained from all these methods produce comparable classification accuracy.

The Lung dataset behaves similarly: PSO-DS and IDS-DS produce smaller reducts with high classification accuracies compared with the random versions (Tables 3 and 4). For Soybean small, the accuracies achieved are very near to 100 percent with the smallest reducts.

For the Cleveland and Ionosphere datasets too, DS-initialized PSO and IDS reduce the size of the reduct with acceptable classification accuracies. For the smaller datasets Ecoli, Glass and Wine, all the methods give reducts of stable size, i.e. s.d. = 0.

These investigations demonstrate that, in almost all cases, classification accuracy does not suffer.

Tables 3 and 4 also show that, in terms of the statistical significance of classification accuracy, PSO-DS and IDS-DS are always comparable to or better than the random versions of PSO and IDS and than GA.

Thus, it is established that the optimal reducts obtained using PSO-DS and IDS-DS, which have the maximum possible fuzzy rough dependency measure, achieve acceptable accuracies with relatively smaller reduct sizes than those obtained with the RANDOM versions of PSO and IDS and with GA.

Table 5 shows the ranking obtained from the Friedman test. PSO-DS and IDS-DS obtain the two best (lowest) Friedman ranks, ahead of PSO-RANDOM, IDS-RANDOM and GA.

Table 5

Friedman test. (FR: Friedman Rank)

Method	FR	Rank
PSO-DS	2.250	1
IDS-DS	2.750	2
PSO-RANDOM	2.969	3
IDS-RANDOM	3.281	4
GA	3.750	5
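The Friedman rank (FR) of a method is its rank averaged over the test cases, so the FR values of k methods must sum to k(k+1)/2. The sketch below shows one way such mean ranks can be computed and checks that property against Table 5. The toy score matrix is hypothetical and is not taken from the paper; which test cases the paper ranks over is not stated here, so this is only an illustration of the calculation.

```python
def average_ranks(values):
    """Rank values so the largest gets rank 1; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def friedman_mean_ranks(scores):
    """scores[d][m] is the score of method m on test case d; returns mean rank per method."""
    per_case = [average_ranks(row) for row in scores]
    n_cases = len(scores)
    return [sum(r[m] for r in per_case) / n_cases for m in range(len(scores[0]))]


# Hypothetical toy scores (2 test cases x 3 methods), higher is better.
toy_scores = [[90.0, 88.0, 88.0],
              [75.0, 80.0, 70.0]]
toy_ranks = friedman_mean_ranks(toy_scores)

# Consistency check on the FR column of Table 5: five methods, so the sum should be 15.
table5_fr = {"PSO-DS": 2.250, "IDS-DS": 2.750, "PSO-RANDOM": 2.969,
             "IDS-RANDOM": 3.281, "GA": 3.750}
k = len(table5_fr)
fr_sum = sum(table5_fr.values())
print(toy_ranks, round(fr_sum, 3), k * (k + 1) / 2)
```

When smaller values are better (as for reduct size), the scores can be negated before ranking.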

7 Conclusion

DS-initialized PSO and IDS outperform randomly initialized PSO and IDS, as well as GA, on large datasets without compromising classification accuracy. Owing to the distributed-sampled seed population, DS-initialized swarm algorithms produce smaller reducts, as they are able to select appropriate reducts in early iterations. Thus, using fuzzy rough sets, DS-initialized swarm algorithms in general achieve better performance than randomly initialized PSO, IDS and GA, which has been established using the t-test, the Wilcoxon test and the Friedman test.

References

1. Radzikowska A.M. and Kerre E.E., A comparative study of fuzzy rough sets, Fuzzy Sets and Systems 126(2) (2002), 131–155.

2. Bae, Yeh W.-C., Chung Y.Y. and Liu S.-L., Feature selection with intelligent dynamic swarm and rough set, Expert Systems with Applications 37(10) (2010), 7026–7032.

3. Wang, Qi, Shao, Hu, Chen, Qian and Lin, A fitting model for feature selection with fuzzy rough sets, IEEE Transactions on Fuzzy Systems (2016).

4. Dubois and Prade, Rough fuzzy sets and fuzzy rough sets, International Journal of General System 17(2-3) (1990), 191–209.

5. Ren and Ma A.Y., Research on feature extraction from remote sensing image, International Conference in Computer Application and System Modeling 1 (2010), 44–48.

6. Guyon and Elisseeff, An introduction to variable and feature selection, The Journal of Machine Learning Research 3 (2003), 1157–1182.

7. Witten and Frank, Data mining: Practical machine learning tools with java implementations, M. Kaufmann, San Francisco, 2000.

8. Quinlan J.R., C4.5: Programs for machine learning, Morgan Kaufmann, 1993.

9. Quinlan J.R., Induction of decision trees, Machine Learning 1(1) (1986), 81–106.

10. Lichman, UCI machine learning repository, [online]. Available: http://archive.ics.uci.edu/ml, 2013.

11. Hall, Frank, Holmes, Pfahringer, Reutemann and Witten I.H., The weka data mining software: An update, ACM SIGKDD Explorations Newsletter 11(1) (2009), 10–18.

12. De Cock Martine, Cornelis Chris and Kerre Etienne, Fuzzy rough sets: Beyond the obvious, Proceedings of IEEE International Conference on Fuzzy Systems 1, IEEE (2004), 103–108.

13. Verma N.K., Maini and Salour, Acoustic signature based intelligent health monitoring of air compressors with selected features, in Information Technology: New Generations (ITNG), 2012 Ninth International Conference on, IEEE (2012), 839–845.

14. Mac Parthalain N. and Jensen, Unsupervised fuzzy rough set based dimensionality reduction, Information Sciences 229 (2013), 106–121.

15. Mac Parthalain N. and Jensen, Fuzzy-rough feature selection using flock of starlings optimisation, in Fuzzy Systems (FUZZ-IEEE), 2015 IEEE International Conference on, IEEE (2015), 1–8.

16. Eberhart R.C., Kennedy, et al., A new optimizer using particle swarm theory, Proceedings of the sixth international symposium on micro machine and human science, New York, NY, 1 (1995), 39–43.

17. Jensen and Shen, Fuzzy–rough attribute reduction with application to web categorization, Fuzzy Sets and Systems 141(3) (2004), 469–485.

18. Jensen and Shen, New approaches to fuzzy-rough feature selection, IEEE Transactions on Fuzzy Systems 17(4) (2009), 824–838.

19. Kohavi, et al., A study of cross-validation and bootstrap for accuracy estimation and model selection, in IJCAI 14(2) (1995), 1137–1145.

20. Swiniarski R.W. and Skowron, Rough set methods in feature selection and recognition, Pattern Recognition Letters 24(6) (2003), 833–849.

21. Maini, Misra R.K. and Singh, Optimal feature selection using elitist genetic algorithm, 2015 IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions (WCI), IEEE (2015), 1–5.

22. Maini, Kumar, Misra R.K. and Singh, Feature selection with intelligent dynamic swarm and fuzzy rough set, IEEE International Conference on Computing, Communication and Automation 2017 (ICCCA 2017), IEEE (2017), 385–389.

23. Maini, Kumar, Misra R.K. and Singh, Fuzzy Rough Set Based Feature Selection with Improved Seed Population in PSO and IDS, Computational Intelligence: Theories, Applications and Future Directions - Volume II, Springer, Singapore (2019), 137–149.

24. Cohen W.W., Fast effective rule induction, in Proceedings of the twelfth international conference on machine learning (1995), 115–123.

25. Wang, Yang, Teng, Xia and Jensen, Feature selection based on rough sets and particle swarm optimization, Pattern Recognition Letters 28(4) (2007), 459–471.

26. Saeys, Inza and Larranaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23(19) (2007), 2507–2517.

27. Shi and Eberhart, A modified particle swarm optimizer, in Evolutionary Computation Proceedings, 1998. IEEE World Congress on Computational Intelligence., The 1998 IEEE International Conference on, IEEE (1998), 69–73.

28. Shi, et al., Particle swarm optimization: Developments, applications and resources, in Evolutionary Computation, 2001. Proceedings of the 2001 Congress on, vol. 1, IEEE (2001), 81–86.

29. Pawlak, Rough sets, International Journal of Computer & Information Sciences 11(5) (1982), 341–356.

Proposed DS-initialization

Part of population	For every solution	Value assigned
p₁ to p_(p/3)	If rand > p₁	1
	otherwise	0
p_(p/3)+1 to p_(2p/3)	If rand > p₂	1
	otherwise	0
p_(2p/3)+1 to p_p	If rand > p₃	1
	otherwise	0
Where 0 < p₁ < p₂ < p₃ < 1

Random initialization

For the whole population	For every solution	Value assigned
P	If rand > p_i	1
	otherwise	0
Where 0 < p_i < 1
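The two initialization schemes tabulated above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the threshold values (0.3, 0.6, 0.9 for the three parts and 0.5 for the random scheme), the function names and the seeding are all assumptions chosen only to satisfy 0 < p₁ < p₂ < p₃ < 1.

```python
import random


def ds_init(n_particles, n_features, thresholds=(0.3, 0.6, 0.9), seed=0):
    """DS seed population: three (near) equal parts, each with its own threshold.

    The threshold values are assumed for illustration; they only need to be increasing.
    """
    rng = random.Random(seed)
    population = []
    for i in range(n_particles):
        # First, second and last third of the population use p1, p2 and p3.
        part = min(3 * i // n_particles, 2)
        p = thresholds[part]
        # A bit is set to 1 (feature selected) when rand > p, otherwise 0.
        population.append([1 if rng.random() > p else 0 for _ in range(n_features)])
    return population


def random_init(n_particles, n_features, p=0.5, seed=0):
    """Random seed population: one threshold (assumed 0.5 here) for every solution."""
    rng = random.Random(seed)
    return [[1 if rng.random() > p else 0 for _ in range(n_features)]
            for _ in range(n_particles)]


def mean_subset_size(particles):
    """Average number of selected features (1-bits) per particle."""
    return sum(sum(bits) for bits in particles) / len(particles)


# Example with the LSVT feature count (310) and a hypothetical swarm of 30 particles.
pop = ds_init(30, 310)
sizes = [mean_subset_size(pop[start:start + 10]) for start in (0, 10, 20)]
baseline = mean_subset_size(random_init(30, 310))
print(sizes, baseline)
```

Because a larger threshold makes rand > p less likely, each successive part of the DS population starts with sparser feature subsets, so the swarm begins with candidates spread over several subset sizes rather than concentrated around a single density.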


Table 2

Characteristics of datasets

S. No.	Dataset	No. of Objects	No. of Features
1	Cleveland	303	13
2	Ecoli	336	7
3	Glass	214	9
4	Ionosphere	351	34
5	Lung	32	56
6	Soybean small	47	35
7	Wine	178	13
8	LSVT	126	310