Abstract
This paper focuses on Fuzzy rough set, which is the fusion of fuzzy sets and rough sets theory for doing feature selection. For selecting the appropriate feature subset, swarm algorithms are used. The fitness function used here is Fuzzy Rough Dependency Measure. This paper demonstrates that by optimizing the fitness function, swarm algorithms are capable to select the best subset of features. Further, in this paper, an attempt has been made to improve the capability of the swarm based algorithms such as Intelligent Dynamic Swarm (IDS) and Particle Swarm Optimization (PSO) through modified initialization of solutions, for picking the appropriate features for the feature selection task. Improvement in the size of reducts and classification accuracy of these reducts are observed when initialization is done using the proposed method. Statistical t-tests have also been performed for the validation of the results.
Keywords
Introduction
Feature Selection is the key task nowadays in the area of pattern recognition and many other applications. It is an important area of Computer Science, a preprocessing task of almost any application area; it also defies the curse of dimensionality and facilitates only the useful and significant features of the problem at hand.
In datasets where large number of features are involved, there may exist irrelevant (noisy) features and/or redundant features [1]. Since datasets containing such features may deteriorate the computational efficiency, a process known as feature selection or attribute reduction becomes necessary for further proceeding in this case, which detects and discards these irrelevant (noisy) and redundant features while maintaining acceptable classification accuracy. Thus, the feature selection technique is expected to select a reduced subset of feature having relevant information from a dataset [5, 26].
The correspondence between the resulting reduced features and class label features should be the reflection of the correspondence existing between unreduced features and the class label features, i.e. the underlying meaning of dataset should not get altered. Thus, feature selection reduces the dimension of the problem at hand and optimizes the computational time and space complexity. There are a number of feature selection techniques [2, 26] (and many more in literature) that can be used for selection of feature, but no single technique is able to guarantee the production of an optimal feature set.
In order to select features, fuzzy rough set and dependency measure are used, as a metric to evaluate dependency of the decision features on the selected set of features. The fitness function used here is fuzzy rough dependency measure. We maximize the fuzzy rough dependency measure using swarm algorithms. This paper demonstrates that by optimizing the fitness function, swarm algorithms are capable to produce the best subset of features.
In a large dataset, having a moderate number of features, the possible number of feature sets in the search space is very high (2 N , where ‘N’ denotes number of input features), hence swarm intelligence (population based) techniques such as Particle Swarm Optimization (PSO) [25] and Intelligent Dynamic Swarm (IDS) [2] have been used in literature. These techniques employ random initialization of population.
A new alternative initialization method, for initial population to search optimal set of features, is being attempted in this paper. This optimal set of features is referred as reduct in the literature.
The preliminary finding obtained after applying Fuzzy Rough Set theory and PSO and IDS techniques to search for optimal features (reducts) using proposed initialization, was published in [23] by the authors of this paper. The wider applicability and validity of the proposed methods were required on the test datasets having larger number of features and objects. Also, a statistical analysis showing the significance of the obtained results was needed to establish that the results are not obtained by chance. Hence, in this paper, statistical t-test and non-parametric tests (Wilcoxon and Friedman test) have been performed to establish the statistical significance of the proposed method. Also, additional datasets such as Cleveland, Glass, Lung, and Soyabean Small were used to show its general applicability. Further, an additional classifier, PART,has also been used to verify the classification capability of the resulting reducts. The proposed method has additionally been compared with Genetic Algorithm (GA). It is established that, by applying proposed method the reducts improve and acceptable mapping of features to class labels is maintained.
Background
Fuzzy rough sets [14, 18]
Fuzzy rough sets, as a fuzzy generalization of rough sets [29], was introduced by Dubois and Prade [4]. Fuzzy sets are a continuous generalization of set characteristic functions, while rough sets are a calculus of partition. Combining these two notations leads to the consideration of rough approximations of fuzzy sets [1]. A more general approach has been suggested by Radzikowska and Kerre [1]. They defined a family of fuzzy rough sets, each one is known as (I, T) fuzzy rough set, is evaluated by an fuzzy implicator I and a triangular norm T [1].
Fuzzy-rough set deals with vagueness (for fuzzy sets) and the indiscernibility (for rough sets). Each of these concepts occur as results of uncertainty in knowledge [14, 18].
Fuzzy Lower approximation based Fuzzy Rough Feature Selection (L-FRFS)[18]
In the process of dimensionality reduction, membership grades of feature values to fuzzy sets are not exploited. It is possible to use this information to better guide feature selection, while using fuzzy rough sets. Fuzzy rough method for feature selection alleviates the problems encountered by rough set feature selection, such as real-valued features and dealing with noise [12]. Discretization of dataset are not necessary and fuzzy rough set can be directly applied to the reduction of continuous or numerical attributes [3].
A different method for feature selection has been introduced in [1, 18], named as L-FRFS. L-FRFS uses fuzzy partitioning of the input space. Fuzzy lower and fuzzy upper approximations are defined as under:
Here, fuzzy implicator, I defined as (min (1, 1 - a + b)) and Fuzzy Similarity Matrices (FSM, denoted R p ) induced by the subset of features P as described below [14]
FSM can be constructed using the equation.
Fuzzy positive region may be defined as:
The measure of fuzzy rough dependency
A fuzzy rough reduct, R, should preserves, degree of dependency of the entire dataset, i.e.
The fuzzy connectives chosen throughout this paper and in this example are Lukasiewicz fuzzy implicator (min (1, 1 - a + b)) and Lukasiewicz t-norm (max (0, a + b - 1)).
Equation (3) can be used for finding other FSM induced by the subset of features P. For example if P ={m,e}, then
In order to find the detailed description of these computations, readers may refer to [18].
In this work PSO and IDS have been applied for feature reduction and the function of fitness have been computed using dependency measure of L-FRFS suggested in [14, 22]. This method gives the combination of selected set of features, which results in maximum fitness function. The string of 0s and 1s is considered as a solution. Here ’1’ represents selected feature and ’0’ represents dropped features.
All the particles are initialized as a combination of random sequence of 0 and 1, and for fitness function, first of all fuzzy lower approximation are computed for all classes. Consequently fuzzy positive regions and the fuzzy rough dependency measure is computed.
In the next section, PSO and IDS techniques will be introduced.
Particle Swarm Optimization (PSO)
In this section, the traditional PSO algorithm and its modified version suitable for the feature selection task, have been discussed.
Traditional PSO
PSO algorithm mimics the social behavior of bird flocking [16, 28]. In PSO a set of solutions, called population, are randomly initialized. Each solution, X
i
, tries to improve itself with the help of a velocity term, V
i
, expressed as a linear function of best solutions achieved so far since the beginning of the algorithm, and the best solution in the current iteration has been described below.
PSO is being modified and transformed to Binary PSO, to suit the need for selecting appropriate features, as it was suggested in [25], according to which, velocity V
i
, is obtained using the following equation.
Function velocityupdate (P G - X i ) returns an integer.
The V
i
is updated as follows.
Using this value of V
i
, solution X
i
of length N where N is the total number of features is updated using the following equation.
Further for the details of different functions and algorithms, readers may refer [25].
Initialize the population randomly. Initialize parameter values of equation (9). Using equation (15) calculate the function of fitness for each and every particle. pbest value is being compared with current fitness value, for each particle. If the current value is better than the pbest value, then replace this value as the pbest and the current solution X
i
as P
i
. Recognize the particle having best fitness value till that instant, and denoted as gbest and P
g
is its position. Update the velocities and positions of all the particles using equation (9) and equation (11). Repeat steps 2 to 5 until a stopping criterion is met.
Intelligent Dynamic Swarm (IDS)
IDS is closely related to PSO and it is a kind of adaption of PSO for problems involving discrete variables. The IDS implements the PSO procedure by using following expression for the jth element, X
ij
of individual X
i
described in Equations (7) and (8).
From the above equation it may be said that element X ij can take corresponding jth value of P i (personal best of individual i) or P G (Global best of population) or a random number x or will remain unchanged depending on the range in which the random number R ij falls. The values of the parameters C w , C p and C g in equation (12) are selected such that 0 ≤ C w ≤ C p ≤ C g ≤ 1.
IDS is another swarm based algorithm. Parameters used in IDS are described in Subsection 5.3.
Algorithm for IDS is as follows: Initialize the population randomly. Initialize parameter values. Compute the function of fitness for each solution. pbest value is being compared with current fitness value, for each particle. If the current value is better than the pbest value, then replace this value as the pbest Recognize the particle having best fitness value till that instant, and denoted as gbest. For each particle a random number r has been generated, between 0 and 1. Current particle will be kept as it was, if random number r lies between zero and C
w
. Otherwise current particle will be replaced by the particle having pbest value, if random number r lies between C
w
and C
p
. Otherwise current particle will be replaced by the particle having gbest value, if random number r lies between C
p
and C
g
. Otherwise current particle will be replaced by the particle generated randomly, if random number r lies between C
g
and 1. Repeat steps 2 to 5 until a stopping criterion is met.
Augmenting the performance of PSO and IDS, a new method for the initialization of the population is proposed.
In this paper, a population of such solutions is generated using a new proposed initialization technique called Distributed Sampled (DS) initialization, the proposed technique distributes the search space in three parts and then takes solutions-samples randomly from these search spaces.
The effectiveness of proposed DS-initialization technique has been implemented for Particle Swarm Optimization (PSO) and Intelligent Dynamic Swarm (IDS). These methods incorporating DS-initialization have been implemented for feature reduction on few benchmark datasets [10] proposed in the literature.
The proposed method of initialization facilitates better variation in selected number of features i.e. too low, medium, too high. In this method of initialization (see Table 1) the first one-third population is initialized, which is inclined towards selecting least number of features, the second one-third population is initialized and inclined towards selecting medium number of features, and the last one-third population is initialized, which is inclined towards selecting high number of features.
Initialization of solutions for random and proposed methods
Initialization of solutions for random and proposed methods
In the present work, PSO and IDS have been implemented using the DS-initialization, results have been compared with that of RANDOM version, and verified using t-test. Presence of near-optimal solutions in the population would be having higher probability, due to DS-initialization, and consequently, iteration-by-iteration convergence of population towards optimal solutions would be faster, which reduce the chance of getting stucked in the local minima.
Implementation of all the algorithms used in this paper is done in MATLAB, significant constituent of experiments are discussed consequently. Parameters and assumptions considered regarding experiments and data are discussed as follows:
Dataset
All the datasets are accessed from the UCI data repository, [10]. Characteristics of datasets used in the present work, are described in Table 2.
Charateristics of datasets
Charateristics of datasets
In this work all dataset using the following equation are scaled into range [0,1].
Further, FSMs have been computed separately using equation (4) for each of the individual feature of the corresponding dataset individually. These FSMs are utilized to compute dependency measure corresponding to the feature set of more than one feature.
Parameters used for PSO [25] equation (7); the inertia weight w decreases from 1.4 to 0.4 and C1 and C2 will have their frequently used value 2.
Parameters used for IDS [2], are set as; C w = 0.1, C p = 0.4, C g = 0.9.
The value of fitness will depend upon fuzzy rough dependency measure
Where
In this work J48, JRip and PART classifiers are used on each of the dataset, to calculate stratified tenfold cross validation accuracy for comparison purposes. J48 is the Java version of the decision tree based classifier C4.5 [8, 9], classifier JRip [24] is based on learning propositional rules and PART [7] generates rules by means of repeatedly creating partial decision trees from datasets.
We computed the classification accuracy of J48 [9], JRip [24] and PART [7] using data mining workbench WEKA [7, 11].
Statistical analysis
In order to validate results, statistical t-tests are performed for both classification accuracy and for the reduced subset size, with respect to DS initialization of PSO and IDS methods of feature selection. Statistical analysis ensures that the results found are not by chance. The t-test is a parametric test based on the assumption that the subject data groups under comparison are drawn from the normal distribution. In general cases, the normality of data is assumed rather than verified and therefore, the validity of t-test under such circumstances is not reliable. In view of this non-parametric tests such as the Wilcoxon and Friedman tests are used. Therefore, results have also been validated using Wilcoxon and Friedman tests. In t-tests and Wilcoxon test, significance value of 0.05 is taken. The symbol “*” denotes that the proposed methods performs worse than the indicated method, “-” denotes that the proposed methods performs equally well as compared to the indicated method and “v” denotes that the proposed methods performs better than the indicated method. For example in Table 3, Lung(56) dataset, the symbol “v” marked against IDS-RANDOM indicates that IDS-DS performs better than IDS-RANDOM. Similarly, the symbol “-” marked against PSO-RANDOM indicates that performance of IDS-DS is equally good to that of PSO-RANDOM. Thus, more number of “v” or “-” indicates that IDS-DS is either better or equally good as compared to other methods given in the Table 4.
Result and discussion
Randomly initialized and Distributed Sampled (DS, the proposed) initialized PSO and IDS have been executed 25 times for each of the benchmark datasets. Each run is of 100 generations with a population size of 100. Classification accuracies are computed using J48, JRip and PART classifiers in terms of their best, mean and s.d. values. Statistical t-tests are also performed for classification accuracy and for the reduced subset size, with respect to DS initialized PSO and IDS.
Our objective is to reduce the features maintaining high dependency measure. PSO-DS and IDS-DS are better in terms of the above objective as compared to the random PSO, the random IDS and GA. Further, the effect of optimized reducts on classification accuracy has been investigated.
Table 3 shows results of t-test and Wilcoxon test for PSO-DS method and Table 4 shows results of t-test and Wilcoxon test for IDS-DS in terms of classification accuracy. In these tables, statistical significance of PSO-DS and IDS-DS are compared to the PSO-random, the IDS-random and GA has been tabulated. From these values it is evident that the proposed PSO-DS and IDS-DS are always comparable or better in terms of statistical significance for all the three classifiers used in this work.
Comparison of Selected number of features and Classification accuracy using different classifiers for PSO-DS method along with Statistical significance using t-tests and Wilcoxon tests
Comparison of Selected number of features and Classification accuracy using different classifiers for PSO-DS method along with Statistical significance using t-tests and Wilcoxon tests
Comparison of Selected number of features and Classification accuracy using different classifiers for IDS-DS method along with Statistical significance using t-tests and Wilcoxon tests
It is observed from the Tables 3 and 4, that Fuzzy rough set methodology facilitates the reduction in the size of the feature subset, with acceptable and comparable classification accuracies. Feature sets are able to manage the high fuzzy rough dependency measures even with this reduced subset of features. Using random and proposed DS initialization, PSO and IDS are used to perform the feature selection task.
All the datasets are benchmark dataset and the significance of DS initialization is visible in the reasonably large dataset like Soybean small, Lung and LSVT. For example in case of LSVT, PSO-DS gives 16.51 subset size, and IDS-DS gives 20.76 subset size, as compared to 120.95 subset size given by PSO-random and 122.8 given by IDS-random, where the unreduced feature size is 310, and the feature subsets(reducts) evaluated from all these method produce the comparable classification accuracy.
Similar is the case of Lung dataset, where PSO-DS and IDS-DS produce the smaller reducts with high classification accuracies, when compared to random version (Tables 3 and 4). In Soybean small, accuracies achieved are very near to 100 percent with smallest reducts.
In datasets Cleveland and Ionosphere also, DS initialized PSO and IDS reduce the size of the reduct with acceptable classification accuracies. In smaller datasets Ecoli, Glass and Wine, all the methods provide the stable sized reducts, i.e. s.d = 0.
Through these investigations, it has been demonstrated that in almost all of the cases, classification accuracy does not suffer.
It is observed from Tables 3 and 4 that PSO-DS and IDS-DS are also always comparable to or better than random version of PSO, IDS and GA in terms of statistical significance of classification accuracy.
Thus, it is established that optimal reducts obtained using PSO-DS and IDS-DS, having maximum possible fuzzy rough dependency measure have acceptable accuracies, with relatively smaller reduct size than in the case of RANDOM version of PSO, IDS and GA.
Table 5 shows ranking obtained from Friedman test. It is observed from the Table 5 that performance of PSO-DS and IDS-DS are top ranked among PSO-RANDOM, IDS-RANDOM, and GA.
Friedman test. (FR: Friedman Rank)
DS-initialized PSO and IDS outperform random initialized PSO, IDS and GA for large datasets, without compromising with classification accuracy. Due to distributed sampled seed population, DS-initialized swarm algorithms always produce smaller reducts, as they are able to select the appropriate reducts in early iteration. Thus, using Fuzzy Rough Set, DS-initialized swarm algorithms, in general, achieve better performance as compared to randomly initialized PSO, IDS and GA, which has been established using t-test, Wilcoxon test and Friedman test.
