Two-step based feature selection method for filtering redundant information

Abstract

In text classification, many classifiers cannot handle high-dimensional feature spaces, so it is important to filter redundant information from the original feature space efficiently and to obtain high-quality features. To this end, a new two-step based feature selection method is proposed in this paper. First, several definitions (word semantic correlation, set semantic correlation, semantic correlative and semantic correlative set) are introduced, and an algorithm for generating semantic correlative sets is given. Second, the process of the two-step based feature selection method is described: in the first step, a feature subset is obtained by using an optimal feature selection method, and a set of semantic correlative sets is generated from the selected feature subset; in the second step, the redundant information of the selected features is filtered by using the generated semantic correlative sets. Finally, to avoid local optima when searching for the best threshold, the concept of memory recall position is introduced and an improved fruit fly optimization algorithm based on a memory recall mechanism is proposed. In the experiments, two typical classifiers, support vector machine and naïve Bayes, are applied to four datasets (Reuters50, SMSSPAS, WebKB and 20-Newsgroups), and 10-fold cross validation is carried out with F₁ and the receiver operating curve as measurements. Experimental results show that the proposed method achieves higher accuracy than several representative traditional feature selection methods and runs faster than typical mutual information based feature selection methods, illustrating its effectiveness in selecting features for text classification.

Keywords

Feature selection; redundant information; feature space; fruit fly optimization algorithm; support vector machine; receiver operating curve

1 Introduction

In text classification, documents are usually represented in a vector space in which each term is considered a separate dimension (feature). However, as a text dataset usually contains thousands of features, it is very important to deal with the high dimensionality of the original feature space. An excessive number of features not only increases the computational complexity but also degrades the performance of text classification [1]. Therefore, it is desirable to find suitable methods to reduce the dimensionality of the feature space and improve the quality of the selected features.

Feature selection and feature extraction are the two main types of dimensionality reduction methods. Feature selection methods select subsets from the original feature sets according to some criteria of feature importance. Existing feature selection methods in text categorization are based on either document frequency or term frequency [2]. Typical document frequency based feature selection methods include Chi-square (CHI) [3], improved Gini index (IMGI) [4], document frequency (DF) [3], odds ratio (OR) [5], etc. Typical term frequency based feature selection methods include ambiguity measure (AM) [6], normal term frequency based discriminative power measure (DPM_NTF) [7], t-test based feature selection (TTFS) [8], TFSM [9], etc. Feature extraction methods extract sets of new features by transforming the original features into distinct feature spaces. Typical feature extraction methods include principal component analysis (PCA) [10], latent semantic indexing (LSI) [11], partial least square (PLS) [12], multidimensional scaling (MS) [13], latent Dirichlet allocation (LDA) [14], etc. Compared to feature extraction, feature selection maintains the physical properties of the original features, and has been widely used in time series analysis and pattern classification [15].

Filters and wrappers are the two main types of feature selection methods [16]. Filters select features based on classifier-agnostic criteria and are not dedicated to any classification algorithm. Methods such as DF [3], Darmstadt indexing approach (DIA) [17], information gain (IG) [18], CHI [3], mutual information (MI) [19], IMGI [4] and TTFS [8] are all typical filters. Wrappers rely on the performance estimated by one type of classifier to select features and to evaluate the quality of those features [20]. Typical wrappers include forward search (FS), backward search (BS), sequential floating search (SFS), best-first search (BFS) [21], cluster based search (CBS) [22], etc. Compared to wrappers, filters are simpler and have lower computational complexity, and thus have been widely used in text classification.

Traditional filter methods (such as DF, IG, CHI, TTFS, etc.) calculate the discriminative ability of each term with respect to the categories separately. Thus, these methods cannot deal with the redundant information within the selected feature subset, and the classification accuracy may decrease if the selected feature subset is used directly for text classification. To solve this problem, many MI based feature selection methods, such as MIFS [23], NMIFS [24], MIFS-CR [25] and MDMR [26], have been proposed. MIFS introduced a balance parameter and took both relevance and redundancy into account. NMIFS improved the MIFS measurement and solved the problem that the entropy of a feature varies greatly. MIFS-CR defined a new feature redundancy measurement capable of accurately estimating the mutual information between features with respect to the target class. MDMR considered the conditional redundancy between the candidate feature and the selected features, and presented a max-dependency and min-redundancy criterion for multi-label classification learning tasks.

However, the above MI based methods adopt greedy searching strategies to select features incrementally, and thus suffer from high computational complexity and parameter dependency. In this paper, we exploit the execution speed of traditional filter methods and propose a new two-step based feature selection method to solve these problems of MI based methods. The remainder of this paper is organized as follows. Section 2 reviews the related work on feature selection methods. Section 3 gives the motivation of the proposed method. Section 4 describes the proposed method. Section 5 shows the experimental results. Section 6 concludes the paper.

2 Related work

2.1 Traditional feature selection methods of filters

The IG method

IG defines the expected reduction in entropy caused by partitioning the texts according to a word [18]. IG is defined as follows:

$\begin{aligned} IG (t_{i}) = {} & \sum_{c_{k}} p (t_{i}, c_{k}) \log_{2} \left( \frac{p (t_{i}, c_{k})}{p (t_{i}) \times p (c_{k})} \right) \\ & + \sum_{c_{k}} p (\bar{t_{i}}, c_{k}) \log_{2} \left( \frac{p (\bar{t_{i}}, c_{k})}{p (\bar{t_{i}}) \times p (c_{k})} \right) \end{aligned}$ (1) where p (t_i) and $p (\bar{t_{i}})$ denote the probability that a document contains and does not contain the word t_i, respectively; p (c_k) denotes the probability of the documents in category c_k; p (t_i, c_k) and $p (\bar{t_{i}}, c_{k})$ denote the probability that t_i occurs and does not occur in c_k, respectively.
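As a rough illustration of formula (1), the following Python sketch estimates IG from document-level presence and absence counts. The toy corpus, the function name `information_gain`, and the convention that a term with zero joint probability contributes nothing to the sum are assumptions made for this example only.

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """Estimate IG(term) as in formula (1) from document-level presence/absence."""
    n = len(docs)
    class_count = Counter(labels)
    # joint[(present, c)] = number of documents of class c with / without the term
    joint = Counter((term in doc, c) for doc, c in zip(docs, labels))
    p_present = sum(v for (present, _), v in joint.items() if present) / n
    score = 0.0
    for present in (True, False):
        p_t = p_present if present else 1.0 - p_present
        for c, n_c in class_count.items():
            p_tc = joint[(present, c)] / n
            if p_tc > 0:  # assumption: 0 * log(0) is treated as 0
                score += p_tc * math.log2(p_tc / (p_t * n_c / n))
    return score

# Hypothetical toy corpus: each document is a set of words.
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]
ig_ball = information_gain(docs, labels, "ball")  # occurs only in "sport"
ig_team = information_gain(docs, labels, "team")  # spread evenly over both classes
```

In this toy corpus, a word that occurs in exactly one category receives a higher score than a word spread evenly over both categories.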

The orthogonal centroid feature selection (OCFS)

The OCFS method considers the centroid of each class and the centroid of all training samples, and calculates the score of a word according to the obtained centroids [6]. Given a word t_i, the OCFS score of t_i is calculated as follows: $OCFS (t_{i}) = \sum_{j = 1}^{| C |} \frac{n_{j}}{n} {(m_{j}^{k} - m^{k})}^{2}$ (2) where |C| is the number of categories in the dataset, n_j is the number of documents in category c_j, n is the number of documents in the dataset, k is the index of the dimension that corresponds to t_i, $m_{j}^{k}$ is the kth dimension of c_j’s centroid vector m_j, and $m^{k}$ is the kth dimension of the entire dataset’s centroid vector m.

The MI method

The task of feature selection is to find the features that contain as much information about the categories as possible [27]. MI measures the mutual dependence of a feature t_i and a category c_k. The formula for MI is given as:

$\begin{aligned} MI (t_{i}) &= \sum_{c_{k}} MI (t_{i}, c_{k}) \\ MI (t_{i}, c_{k}) &= p (c_{k}, t_{i}) \log \frac{p (c_{k} | t_{i})}{p (c_{k})} \end{aligned}$ (3) where p (c_k) is the probability of the kth category, p (c_k, t_i) is the joint probability that a document belongs to c_k and contains the term t_i, and p (c_k|t_i) denotes the conditional probability that a document belongs to c_k when t_i occurs.

Comprehensively measure feature selection (CMFS)

To solve the problem that DF [3] and DIA [17] focus only on the rows or only on the columns of the feature vector space model, Yang proposed the CMFS method, which calculates the significance of a term from both the inter-category and the intra-category perspective. The CMFS score of word t_i is defined as follows [28]:

$\begin{aligned} CMFS (t_{i}) &= \max_{c_{k}} \{ CMFS (t_{i}, c_{k}) \} \\ CMFS (t_{i}, c_{k}) &= p (t_{i} | c_{k}) \times p (c_{k} | t_{i}) \end{aligned}$ (4) where p (c_k|t_i) denotes the conditional probability that a document belongs to c_k when t_i occurs, and p (t_i|c_k) denotes the probability that the feature t_i occurs in c_k.
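Formula (4) can be sketched in a few lines of Python. The sketch below estimates both conditional probabilities from raw document frequencies, which is an assumption of this example; the estimator used in [28] may differ (for instance, it may be based on term frequencies with smoothing). The function name `cmfs_scores` and the toy corpus are hypothetical.

```python
from collections import Counter

def cmfs_scores(docs, labels):
    """Return {term: max_k p(t|c_k) * p(c_k|t)} using document frequencies."""
    class_docs = Counter(labels)       # number of documents per category
    term_docs = Counter()              # number of documents containing each term
    term_class_docs = Counter()        # documents of category c containing term t
    for doc, c in zip(docs, labels):
        for t in set(doc):
            term_docs[t] += 1
            term_class_docs[(t, c)] += 1
    scores = {}
    for t, df in term_docs.items():
        scores[t] = max(
            (term_class_docs[(t, c)] / n_c) * (term_class_docs[(t, c)] / df)
            for c, n_c in class_docs.items()
        )
    return scores

# Hypothetical toy corpus: each document is a set of words.
docs = [{"ball", "goal"}, {"ball", "team"}, {"vote", "law"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]
scores = cmfs_scores(docs, labels)
ranked = sorted(scores, key=scores.get, reverse=True)  # best features first
```

Ranking the words by this score and keeping the top entries gives the kind of first-step subset that a filter method produces.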

2.2 MI based feature selection methods of filters

In information theory, MI quantifies the reduction in uncertainty about one variable due to the knowledge of another variable [29]. It is therefore widely used in feature selection as a good indicator of the correlation between features. The MI based methods use greedy search strategies to select the best feature subset [25]. Given a candidate feature set F, the main idea of the MI based feature selection methods is to find a subset S (S ⊂ F) which maximizes ∑_{c_k}MI (S, c_k) and minimizes ∑_{t_i∈S, t_j∈S}MI (t_i, t_j), where c_k is a category, and t_i and t_j are two features in S. Because the MI based methods consider the relationship among the selected features, they have become the most widely used methods for filtering the redundant information of a selected feature subset. Some typical MI based feature selection methods are given as follows:

Improved mutual information feature selection (MIFS-ND)

Because reducing the number of irrelevant features can reduce the computation time, Hoque introduced a greedy feature selection method using mutual information. This method uses the feature-feature and feature-class mutual information to determine an optimal set of features [30]. The objective function of MIFS-ND is defined as follows: $\max \{ \sum_{c_{k}} I (t_{i}, c_{k}) - \frac{1}{| S |} \sum_{t_{j} \in S} I (t_{i}, t_{j}) \}$ (5) where S is the set of selected features, |S| is the number of features in S, I (t_i, c_k) is the relevance between the candidate feature t_i and category c_k, and I (t_i, t_j) is the redundancy between the candidate feature t_i and the selected feature t_j.

Multi-label feature selection based on max-dependency and min-redundancy (MDMR)

For multi-label classification learning tasks, Lin presented a max-dependency and min-redundancy criterion [26]. The objective function is described as follows:

$\max \{ \sum_{c_{j} \in C} I (t_{i}, c_{j}) - \frac{1}{| S |} \sum_{t_{l} \in S} ( I (t_{i}, t_{l}) - \sum_{c_{j} \in C} I (t_{i}, c_{j} | t_{l}) ) \}$ (6) where S is the set of selected features, and I (t_i, c_j|t_l) is the relevance between the candidate feature t_i and category c_j given the selected feature t_l.

Cumulate conditional mutual information minimization criterion feature selection (CCM)

Zhang proposed a two-step based feature subset selection algorithm with conditional mutual information [20]. First, formula (7) is used to find a subset S. Second, formula (8) is used to find and filter the redundant features in S. $\max \{ \sum_{c_{j} \in C} \sum_{t_{l} \in S} I (t_{i}, c_{j} | t_{l}) \}$ (7) $\sum_{c_{j} \in C} I (t_{i}, c_{j} | S - t_{i}) = 0$ (8)

Global MI based feature selection (GMFS)

Han used the binary particle swarm optimization (BPSO) algorithm to select the best features and provided a new idea for solving feature selection problems. The objective function is described as follows [31]: $\max \{ \sum_{c_{j} \in C} w_{i} I (t_{i}, c_{j}) - \beta \sum_{j = 2}^{| S |} \sum_{i = 1}^{j - 1} w_{i} w_{j} I (t_{i}, t_{j}) \}$ (9) where |S| is the number of all features. If w_i = 1, the feature t_i is selected; otherwise, t_i is ignored.

3 Motivation

The feature subsets selected by traditional feature selection methods usually contain redundant information, which does not help to improve the text classification accuracy. The MI based selection methods were proposed to avoid selecting redundant features. However, the traditional MI based methods still have the following problems: (1) the correlations between the candidate feature and all categories, and the correlations between the candidate feature and the selected features, must both be calculated, so the time complexities of these methods are high when the number of words and the number of selected features are both large; (2) these methods adopt greedy searching strategies to select features incrementally, which may yield locally optimal solutions; (3) these methods use parameters to control the tradeoff between relevance and redundancy, but the best values of these parameters are hard to choose. These problems have motivated us to design a new feature selection method that identifies a subset of features giving high classification accuracy. In this paper, a new two-step based feature selection method which filters the redundant information is proposed.

4 The proposed method

4.1 Implementation of the proposed method

Mutual information reflects the statistical relationship between two variables, and is therefore widely used to measure the relevance of two words [26, 30]. However, the output values of the MI method are difficult to normalize to the interval [0, 1]. For this reason, we define simple measurements of the relationship between any two words or two word sets. Assume that DS = {d₁, d₂, …, d_i, …, d_{N_d}} is a sample set which contains N_d documents, and TS = {t₁, t₂, …, t_i, …, t_{N_t}} is the set of the words in DS. Some definitions and propositions are given before introducing the proposed method.

Definition 1. (word semantic correlation). The word semantic correlation between two words t_i and t_j (i ≠ j) in TS is defined as: $R_{w} (t_{i}, t_{j}) = p (t_{j} | t_{i}) \times p (t_{i} | t_{j})$ (10) where p (t_i|t_j) denotes the conditional probability that t_i occurs when t_j occurs in a document, p (t_j|t_i) denotes the conditional probability that t_j occurs when t_i occurs in a document.

Definition 2. (set semantic correlation). For any two word sets s₁ ⊂ TS, s₂ ⊂ TS, the set semantic correlation between s₁ and s₂ is defined as: $R_{s} (s_{1}, s_{2}) = \frac{1}{| s_{1} | \times | s_{2} |} \sum_{t_{i} \in s_{1}} \sum_{t_{j} \in s_{2}} R_{w} (t_{i}, t_{j})$ (11)

where |s₁| and |s₂| are the numbers of words in s₁ and s₂, respectively.
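Definitions 1 and 2 can be sketched as follows. The sketch assumes that the conditional probabilities in formula (10) are estimated from document co-occurrence counts and that a word which never occurs has correlation 0 with every other word; the function names `word_corr` and `set_corr` and the toy documents are hypothetical.

```python
from itertools import product

def word_corr(docs, ti, tj):
    """R_w(ti, tj) = p(tj | ti) * p(ti | tj), estimated from document co-occurrence."""
    n_i = sum(ti in d for d in docs)
    n_j = sum(tj in d for d in docs)
    n_ij = sum(ti in d and tj in d for d in docs)
    if n_i == 0 or n_j == 0:
        return 0.0  # assumption: an unseen word is uncorrelated with everything
    return (n_ij / n_i) * (n_ij / n_j)

def set_corr(docs, s1, s2):
    """R_s(s1, s2): mean of R_w over all pairs (ti in s1, tj in s2), formula (11)."""
    total = sum(word_corr(docs, ti, tj) for ti, tj in product(s1, s2))
    return total / (len(s1) * len(s2))

# Hypothetical toy documents, each given as a set of words.
docs = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"c"}]
r_ab = word_corr(docs, "a", "b")            # (2/3) * (2/2)
r_fwd = set_corr(docs, ["a"], ["b", "c"])   # (2/3 + 1/6) / 2
r_bwd = set_corr(docs, ["b", "c"], ["a"])   # same value with the arguments swapped
```

Because each factor lies in [0, 1], both measurements stay in [0, 1], which is the property that motivates them over raw MI values.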

Proposition 1. For any two word sets s₁ ⊂ TS, s₂ ⊂ TS, there exists: R_s (s₁, s₂) = R_s (s₂, s₁).

Proof. From formula (11) we have:

$\begin{aligned} R_{s} (s_{2}, s_{1}) &= \frac{1}{| s_{2} | \times | s_{1} |} \sum_{t_{i} \in s_{2}} \sum_{t_{j} \in s_{1}} R_{w} (t_{i}, t_{j}) \\ &= \frac{1}{| s_{2} | \times | s_{1} |} \sum_{t_{i} \in s_{2}} \sum_{t_{j} \in s_{1}} p (t_{j} | t_{i}) \times p (t_{i} | t_{j}) \\ &= \frac{1}{| s_{1} | \times | s_{2} |} \sum_{t_{i} \in s_{1}} \sum_{t_{j} \in s_{2}} p (t_{j} | t_{i}) \times p (t_{i} | t_{j}) \\ &= R_{s} (s_{1}, s_{2}) \end{aligned}$ (12)

Definition 3. (semantic correlative). Given a predetermined threshold th, for any two word sets s₁ ⊂ TS, s₂ ⊂ TS, if there exists: $R_{s} (s_{1}, s_{2}) \geq th$ (13) then we say s₁ and s₂ are semantic correlative, which is denoted as s₁ ≡ s₂.

Proposition 2. For two word sets s_i and s_j (i ≠ j) in TS, if there exists s_i ≡ s_j, then we have s_j ≡ s_i.

Proof. If there exists s_i ≡ s_j, then by Definition 3 we have R_s (s_i, s_j) ≥ th. According to Proposition 1, we have R_s (s_j, s_i) = R_s (s_i, s_j) ≥ th, so s_j ≡ s_i.

Definition 4. (semantic correlative set). Given a word set S in TS, let |S| denote the number of words in S:

(1) If |S| = 1, then S is a semantic correlative set;

(2) If there exist two subsets s₁ ⊂ S, s₂ ⊂ S (s₁ ≠ s₂) which satisfy (s₁ ∪ s₂ = S) ∧ (s₁ ≡ s₂) = true, then S is a semantic correlative set.

Based on the above definitions and propositions, the process of generating a set of semantic correlative sets from the word set TS is given as follows:

Algorithm 1. Generating a set of semantic correlative sets
Input: a word set TS = {t₁, t₂, …, t_Nt}; a predetermined threshold th; a temporary array mark = {ma₁, …, ma_Nt} = {0}.
Output: a set of semantic correlative sets SCS = {e₁, e₂, …, e_i, …, e_u} (e_i is a semantic correlative set, u is the number of semantic correlative sets in SCS).
1. i ← 1, e_i ← {t₁};
2. while i < N_t
3. if mark[i] = 1, then
e_i ← null, i ← i + 1, e_i ← {t_i}, continue;
4. end if
5. flag ← 0;
for j = i + 1 to N_t step 1
6. if mark[j] = 0 and e_i ≡ {t_j}, then
e_i ← e_i ∪ {t_j};
mark[j] ← 1;
flag ← 1;
7. end if
8. end for
9. if flag = 0, then
i ← i + 1;
e_i ← {t_i};
10. end if
11. end while
12. SCS ← null;
13. for i = 1 to N_t step 1
14. if e_i ≠ null and mark[i] = 0, then
SCS ← SCS ∪ {e_i};
15. end if
16. end for
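The grouping procedure of Algorithm 1 can be sketched in Python as below. This is one possible reading rather than a reference implementation: it assumes that the scan over the remaining words is repeated until the current set stops growing, and that a word already absorbed into a set is never considered again. The function name, the `r_w` callback and the toy correlation values are hypothetical.

```python
def semantic_correlative_sets(words, r_w, th):
    """Greedy grouping in the spirit of Algorithm 1.

    words: list of words; r_w(a, b): word semantic correlation; th: threshold.
    """
    def r_s(s1, s2):
        # Set semantic correlation, formula (11).
        return sum(r_w(a, b) for a in s1 for b in s2) / (len(s1) * len(s2))

    absorbed = [False] * len(words)   # plays the role of the `mark` array
    scs = []
    for i, seed in enumerate(words):
        if absorbed[i]:
            continue                  # already part of an earlier set
        e = [seed]
        grew = True
        while grew:                   # rescan until no further word can join e
            grew = False
            for j in range(i + 1, len(words)):
                if not absorbed[j] and r_s(e, [words[j]]) >= th:
                    e.append(words[j])
                    absorbed[j] = True
                    grew = True
        scs.append(e)
    return scs

# Hypothetical symmetric correlations; unlisted pairs are treated as 0.
pair = {frozenset(("t1", "t2")): 0.9,
        frozenset(("t3", "t4")): 0.85,
        frozenset(("t2", "t3")): 0.2}
r_w = lambda a, b: pair.get(frozenset((a, b)), 0.0)
groups = semantic_correlative_sets(["t1", "t2", "t3", "t4"], r_w, th=0.8)
```

With these toy values, t1 and t2 end up in one set and t3 and t4 in another, so every word belongs to exactly one set.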

For example, as shown in Fig. 1, t₁ - t₉ are the words in TS, and the word semantic correlations between each pair of these words are given in Fig. 1(a). When the threshold th = 0.8, as shown in Fig. 1(b), because the word subset {t₉} and the word subset {t₁, t₂} are semantic correlative, the word set e₁ = {t₁, t₂, t₉} is a semantic correlative set. Further, two more semantic correlative sets are obtained and denoted as e₂ and e₃, which are shown in Fig. 1(c-d), respectively.

Fig.1

Example of semantic correlative sets (th = 0.8, the solid lines denote that the corresponding two words are semantic correlative).

Given a suitable threshold th, if two words both appear in the same semantic correlative set, they are redundant with respect to each other to a great extent. On this basis, we exploit the execution speed of traditional filter methods (such as IG, CMFS, CHI, etc.) and propose a new feature selection method which speeds up the filtering of redundant information while keeping the accuracy. This method mainly contains two steps: (1) obtain a feature subset FS₁ by using an optimal feature selection (OFS) method and generate a set of semantic correlative sets; (2) filter the redundant information from FS₁ to form the final feature subset FS by using the generated semantic correlative sets. Because the number of features in FS₁ is much lower than the number of words in the entire dataset, the time complexity of the proposed method is controllable. The execution process of the proposed method is shown in Fig. 2, and its framework is given in Algorithm 2:

Algorithm 2. Two-step based feature selection method for filtering the redundant information
Input: DS: entire dataset; N₁: number of features selected by OFS; th: predetermined threshold; a temporary array A = {a₁, a₂, …, a_i, …, a_N1} = {0}.
Output: FS: final feature subset.
1. obtain the feature subset FS₁
1.1 for each word t in DS
obtain the OFS value of t by using the OFS method;
1.2 end for
1.3 rank all words in DS by the OFS values in descending order;
1.4 obtain a feature subset FS₁ = {f₁, f₂, …, f_N1} by taking the top N₁ features;
2. obtain the set of all semantic correlative sets SCS = {e₁, e₂, …, e_i, …, e_u} (u is the number of semantic correlative sets in SCS) from FS₁ by using Algorithm 1.
3. filter the redundant information of FS₁ by using SCS and obtain the final feature subset FS:
3.1 for i = 1 to N₁ - 1 step 1
3.2 if a_i = 1 then continue;
3.3 end if
3.4 for j = i + 1 to N₁ step 1
3.5 if a_j = 1 then continue;
3.6 end if
3.7 if f_i and f_j occur in the same semantic correlative set of SCS then
a_j ← 1;
3.8 end if
3.9 end for
3.10 end for
4. obtain the final selected feature subset by using the following formula: $FS = \{ f_{i} \mid a_{i} = 0 \}$ (14)
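The two steps of Algorithm 2 can be sketched as follows. The sketch assumes that the semantic correlative sets form a partition of FS₁ (every word of FS₁ belongs to exactly one set), in which case keeping the highest-ranked word of each set is equivalent to the marking loop of step 3. The names `two_step_select`, `build_scs` and `toy_scs`, as well as the scores, are hypothetical.

```python
def two_step_select(scores, n1, build_scs):
    """Sketch of Algorithm 2.

    scores: {word: OFS score}; n1: size of the first-step subset FS1;
    build_scs: callable mapping a ranked word list to a list of semantic
    correlative sets (for example, an implementation of Algorithm 1).
    """
    # Step 1: rank by OFS score and keep the top n1 words as FS1.
    fs1 = sorted(scores, key=scores.get, reverse=True)[:n1]
    # Step 2: group FS1 into semantic correlative sets.
    scs = build_scs(fs1)
    # Step 3: within each set keep only the highest-ranked word. Because FS1 is
    # sorted, this matches setting a_j = 1 for every later f_j that shares a
    # set with an earlier unmarked f_i.
    group_of = {w: k for k, e in enumerate(scs) for w in e}
    seen_groups, fs = set(), []
    for w in fs1:
        g = group_of[w]
        if g not in seen_groups:
            seen_groups.add(g)
            fs.append(w)
    return fs, scs

def toy_scs(words):
    # Hypothetical grouping: "ball" and "football" are treated as correlative.
    group = [w for w in words if w in ("ball", "football")]
    rest = [[w] for w in words if w not in ("ball", "football")]
    return ([group] if group else []) + rest

scores = {"ball": 0.9, "football": 0.8, "vote": 0.7, "law": 0.3}
fs, scs = two_step_select(scores, n1=3, build_scs=toy_scs)
```

Here "football" is dropped because it shares a set with the higher-ranked "ball", while "law" never enters FS₁.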

Fig.2

Execution process of the proposed method.

Further, before the document representation process, each word that occurs in a document but does not occur in FS is replaced by a feature that is semantic correlative with it. Given a document d = (t₁, t₂, …, t_i, …, t_n) (t_i is a word of d, n is the number of words in d), the word replacement process can be described as: $\text{if } \exists e_{j} \, \exists t_{i}^{'} \, ((e_{j} \in SCS) \land (t_{i} \in e_{j}) \land (t_{i}^{'} \in e_{j}) \land (t_{i}^{'} \in FS) \land (t_{i}^{'} \notin d)) = true, \text{ then } t_{i} \text{ is replaced by } t_{i}^{'}.$
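A minimal sketch of this replacement rule is given below. It assumes that the condition t_i′ ∉ d is evaluated against the original document, so repeated occurrences of the same word are all replaced, and that words without a qualifying feature are left unchanged. The function name and the toy inputs are hypothetical.

```python
def replace_words(doc, fs, scs):
    """Replace words outside FS by a correlative feature from FS, when one exists."""
    fs = set(fs)
    # Map every grouped word to the features of FS that share its set.
    candidates = {w: [f for f in e if f in fs] for e in scs for w in e}
    present = set(doc)  # assumption: membership is tested on the original document
    out = []
    for w in doc:
        sub = None
        if w not in fs:
            # Only features that do not already occur in the document qualify.
            sub = next((f for f in candidates.get(w, []) if f not in present), None)
        out.append(sub if sub is not None else w)
    return out

# Hypothetical final subset and semantic correlative sets.
fs = ["ball", "vote"]
scs = [["ball", "football"], ["vote"]]
doc_a = replace_words(["football", "match", "vote"], fs, scs)
doc_b = replace_words(["ball", "football"], fs, scs)
```

In the first document "football" is mapped to "ball"; in the second it is kept because "ball" already occurs in the document.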

4.2 The selection of threshold th

We use metaheuristic algorithms to search for the optimal value of the threshold th, which is a key parameter that affects the quality of the generated semantic correlative sets. The fruit fly optimization algorithm (FOA) is a global optimization algorithm inspired by the foraging behavior of fruit flies [32]. Compared with other metaheuristic algorithms, FOA has advantages such as few control parameters, a simple computational process, and ease of understanding and implementation [33]. To improve the performance of FOA and eliminate the drawbacks caused by the fixed searching radius in FOA, Pan presented an improved FOA (IFFO) by introducing a new parameter and a new solution generating method [34]. The details of IFFO are given in Algorithm 3:

Algorithm 3. The IFFO algorithm
Input: FN: the number of the fruit flies; λ_min: the minimum value of the searching radius λ; λ_max: the maximum value of the searching radius λ; x_min: the minimum value of the food source position; x_max: the maximum value of the food source position; t: the current iteration; T: the maximum iteration; D: the dimension of food position.
Output: the global best food source position x_b
1: for i = 1, 2, …, FN $x_{i, j} \leftarrow x_{\min} + (x_{\max} - x_{\min}) \times rand ()$ (15) where 0 ≤ j < D, x_i,j is the jth dimension of fruit fly x_i, rand() is a function which returns a value from the uniform distribution on the interval [0,1].
2: end for
3: obtain the global best fruit fly x_b;
4: t ← 0;
5: while (t < T) $\lambda \leftarrow \lambda_{\max} \times \exp (\log (\frac{\lambda_{\min}}{\lambda_{\max}}) \times \frac{t}{T})$ (16)
6: for i = 1, 2, …, FN $d \leftarrow ⌊ 10000 \times rand () ⌋ \bmod D$ (17) $x_{i, j} \leftarrow \begin{cases} x_{b, j} \pm \lambda \times rand (), & \text{if } j = d \\ x_{b, j}, & \text{else} \end{cases}$ (18) where x_b,j is the jth dimension of x_b.
7: if x_i,j > x_max then x_i,j ← x_max end if
8: if x_i,j < x_min then x_i,j ← x_min end if
9: end for
10: obtain the global best fruit fly x_b’;
11: if $f (x_{b}^{'}) > f (x_{b})$ then x_b ← x_b’
where f (x_b) is the fitness of x_b.
12: end if
13: t ← t + 1;
14: end while
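The loop structure of Algorithm 3 can be sketched in Python as below. Several details are assumptions of this sketch: the perturbed dimension is drawn with `randrange` instead of ⌊10000 × rand()⌋ mod D, the sign in formula (18) is chosen at random with equal probability, and the toy objective with its optimum at 0.3 is hypothetical.

```python
import math
import random

def iffo(f, dim, x_min, x_max, lam_min, lam_max, n_flies=20, max_iter=200, seed=0):
    """Maximise f over [x_min, x_max]^dim following the structure of Algorithm 3."""
    rng = random.Random(seed)
    # Steps 1-3: random initial swarm, keep the best fly.
    swarm = [[x_min + (x_max - x_min) * rng.random() for _ in range(dim)]
             for _ in range(n_flies)]
    best = max(swarm, key=f)
    for t in range(max_iter):
        # Formula (16): the search radius decays from lam_max towards lam_min.
        lam = lam_max * math.exp(math.log(lam_min / lam_max) * t / max_iter)
        candidates = []
        for _ in range(n_flies):
            d = rng.randrange(dim)               # formula (17): pick one dimension
            x = list(best)
            x[d] += rng.choice((-1.0, 1.0)) * lam * rng.random()  # formula (18)
            x[d] = min(max(x[d], x_min), x_max)  # steps 7-8: clamp to the bounds
            candidates.append(x)
        challenger = max(candidates, key=f)
        if f(challenger) > f(best):              # step 11: keep only improvements
            best = challenger
    return best

# Hypothetical one-dimensional objective with its maximum at 0.3.
objective = lambda x: -(x[0] - 0.3) ** 2
best = iffo(objective, dim=1, x_min=0.001, x_max=1.0, lam_min=0.001, lam_max=0.5)
```

Because the best position is replaced only when a candidate is strictly better, the search is greedy, which is the behavior that the memory recall mechanism is meant to relax.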

However, IFFO does not memorize the historical best positions, and lacks a mechanism for escaping from local extrema. Moreover, it ignores the possibility that the global best position often lies near the historical best positions. For this reason, this paper introduces the concept of memory recall position, and proposes an improved FOA method based on a memory recall mechanism (MIFFO) to solve the problem that the global best position cannot be improved when IFFO falls into a local optimum.

Fig.3

Improved memory recall mechanism based FOA method.

MIFFO defines the latest m global best positions as the memory recall positions, which are denoted as M = {x_bp,1, x_bp,2, …, x_bp,m} (m is the number of memory recall positions). As shown in Fig. 3, if the current best position x_b is not improved in L iterations, the new positions of the fruit flies will be generated by utilizing the memory recall positions in M, and x_b will be memorized. Given the fruit fly i and a variable k, i’s position is updated using formula (18) in normal cases. However, when x_b is not improved in L iterations, formula (18) in IFFO is rewritten as follows: $x_{i, j}^{'} \leftarrow \begin{cases} x_{bp, k, j} \pm \lambda \times rand (), & \text{if } j = d \\ x_{b, j}, & \text{else} \end{cases}$ (19) where k is initialized to m, and x_bp,k,j and x_b,j are the jth dimensions of x_bp,k and x_b, respectively. On this basis, MIFFO rewrites step 11 of Algorithm 3 as follows:

if x_b is improved in L iterations then
for l = 1 to m - 1
x_bp,l ← x_bp,l+1;
end for
x_bp,m ← x_b’;
k ← m;
else
if k > 1 then k ← k - 1;
else x_bp,k ← x_bp,m;
end if
end if
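The memory bookkeeping of the rewritten step 11 and the recall move of formula (19) can be sketched as follows. The sketch assumes that the memory is already filled with m positions, stores it in a `deque` with `maxlen = m` (oldest position first), and chooses the sign in formula (19) at random; the function names and the toy positions are hypothetical.

```python
import random
from collections import deque

def update_memory(memory, k, new_best, improved):
    """One application of the rewritten step 11; returns the new value of k.

    memory: deque(maxlen=m) holding x_bp,1 .. x_bp,m (oldest first);
    k: 1-based index of the memory recall position used by formula (19).
    """
    m = memory.maxlen
    if improved:
        memory.append(new_best)        # drops x_bp,1 and shifts the rest left
        k = m                          # restart recall from the newest position
    elif k > 1:
        k -= 1                         # step back to an older remembered position
    else:
        memory[k - 1] = memory[m - 1]  # k == 1: overwrite x_bp,1 with x_bp,m
    return k

def recall_candidate(best, memory, k, lam, d, rng):
    """Formula (19): perturb dimension d around x_bp,k and copy x_b elsewhere."""
    x = list(best)
    x[d] = memory[k - 1][d] + rng.choice((-1.0, 1.0)) * lam * rng.random()
    return x

# Hypothetical memory of m = 3 one-dimensional positions.
memory = deque([[0.1], [0.2], [0.3]], maxlen=3)
k = update_memory(memory, 3, [0.4], improved=True)   # memory: 0.2, 0.3, 0.4
k = update_memory(memory, k, None, improved=False)   # k: 3 -> 2
k = update_memory(memory, k, None, improved=False)   # k: 2 -> 1
k = update_memory(memory, k, None, improved=False)   # k stays 1, x_bp,1 <- x_bp,3
candidate = recall_candidate([0.4], memory, k, lam=0.05, d=0, rng=random.Random(0))
```

Each unsuccessful round moves the recall index towards older positions, so the swarm revisits regions that were good earlier in the search.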

4.3 Time complexity of the proposed method

The time complexity of the proposed feature selection method can be analyzed from the following aspects:

In the step of obtaining the feature subset FS₁ in Algorithm 2: according to reference [28], the time complexity of this step is O (|C| × N_a + N_a × log N_a), where |C| is the number of categories and N_a is the number of all words in the dataset.

In the step of generating the set of semantic correlative sets (SCS) in Algorithm 2: generating SCS with Algorithm 1 takes on the order of $N_{1}^{2}$ pairwise checks, thus the time complexity is $O (N_{1}^{2})$ , where N₁ is the number of features in FS₁.

In the step of filtering the redundant information of FS₁ in Algorithm 2: this step also takes on the order of $N_{1}^{2}$ pairwise checks to filter the redundant information in FS₁, thus the time complexity is $O (N_{1}^{2})$ .

By combining the above results, we can obtain the time complexity of the proposed method as follows: $TC = O (N_{a} \times (| C | + log N_{a}) + 2 N_{1}^{2})$ (20)

Further, the time complexities of three typical MI based feature selection methods (CCM, MDMR and GMFS) are given as follows: $T_{CCM} = O (N_{1} \times (N_{d} + N_{a}) \times N_{a} + N_{1}^{3} (N_{d} + N_{1}))$ (21) $T_{MDMR} = O (N_{1} \times (N_{a} - N_{1}) \times (| C | + N_{1} \times (1 + | C |)))$ (22) $T_{GMFS} = O (T \times N_{p} \times N_{a} \times (N_{a} + | C |))$ (23)

where N_d is the number of documents and T is the maximum iteration number of PSO. Generally, N₁ ≪ N_a, and T is set large enough to guarantee convergence. Formula (20) grows as N_a × log N_a plus a term quadratic in N₁, whereas formulas (21)-(23) contain products of N₁ and N_a or of N_a with itself. Therefore, the time complexity of our method is lower than those of the other methods, which indicates that the proposed method is able to reduce the execution time.

5 Experimental results and analysis

The experiments are conducted on an Intel Core(TM) i5 processor with a CPU clock rate of 3.10 GHz and 4 GB of main memory. The vector space model of the selected features is built with Visual Studio 2008, using the C++ standard template library (STL). According to [28], CMFS can achieve high accuracy while remaining efficient, so it is chosen as the OFS method. In this section, we compare the proposed method with three typical traditional feature selection methods (IG, CMFS and MI) and three typical MI based feature selection methods (MDMR, CCM and GMFS).

5.1 Datasets

In order to validate the performance of the proposed method, four widely used textual datasets: Reuters50 (RE), SMSSPAS (SM), WebKB (WE) [28] and 20-Newsgroups(NE) [35] are adopted in the following experiments. The WE dataset contains the web pages collected by the World Wide Knowledge Base project of the CMU text learning group. The RE, SM and NE datasets are all from the UCI machine learning repository, which has been widely used by the researchers all over the world as a primary source of machine learning datasets. Table 1 summarizes general information about these datasets. For ease of computation, we only consider the top 4, 6 and 8 categories with respect to WE, NE and RE, respectively. For the documents of each dataset, stop-words are removed, and the stemming process is executed by using Porter stemming algorithm [36]. Moreover, the 10-fold cross validation is executed to measure the performances of different methods.

Table 1
Details of four datasets used in this paper

Datasets	Number of samples	Number of categories
RE	2500	50
SM	5574	2
WE	8282	7
NE	20000	20

5.2 Performance measurements

Precision (PR), recall (RC), true positive rate (TPR) and false positive rate (FPR) are widely used measurements to evaluate the performances of feature selection methods. The definitions of PR, RC, TPR and FPR are given as follows: $PR = \frac{\sum_{i = 1}^{| C |} {TP}_{i}}{\sum_{i = 1}^{| C |} {TP}_{i} + \sum_{i = 1}^{| C |} {FP}_{i}}, \quad RC = \frac{\sum_{i = 1}^{| C |} {TP}_{i}}{\sum_{i = 1}^{| C |} {TP}_{i} + \sum_{i = 1}^{| C |} {FN}_{i}}$ (24) $TPR = \frac{\sum_{i = 1}^{| C |} {TP}_{i}}{N_{t}}, \quad FPR = 1 - PR$ (25)

In formulas (24) and (25), |C| denotes the number of categories; N_t is the number of testing documents; TP_i is the number of documents which are correctly classified to category c_i; FP_i is the number of documents which are misclassified to category c_i; FN_i is the number of documents which belong to category c_i and are misclassified. Further, the F₁ measurement, which combines the measurements of PR and RC is defined as follows [17]: $F_{1} = \frac{2 \times RC \times PR}{RC + PR}$ (26)
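Formulas (24) and (26) can be computed as in the following sketch, which pools the TP, FP and FN counts over all categories. The function name `pooled_prf` and the toy labels are hypothetical. When every document has exactly one true and one predicted category, each misclassified document counts once as FP and once as FN, so the pooled PR and RC coincide.

```python
def pooled_prf(y_true, y_pred):
    """Return (PR, RC, F1) with TP, FP and FN summed over all categories."""
    categories = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in categories:
        tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))
    pr = tp / (tp + fp) if tp + fp else 0.0
    rc = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f1

# Hypothetical predictions for five test documents.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
pr, rc, f1 = pooled_prf(y_true, y_pred)  # 4 correct, 1 misclassified
```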

As the receiver operating curve (ROC) and the area under ROC (AUC) measurements provide important information about the classifier performance [37], they are also applied to compare different feature selection methods in the following experiments.

5.3 Classifiers

Support vector machine (SVM) [28, 37] and naïve Bayes (NB) [28] are used to validate the performances of different feature selection methods. In this paper, these classifiers are implemented in MATLAB 2013, which is popular in machine learning and data mining. The parameters of SVM are given as follows: cost parameter c = 1.0, tolerance of the termination criterion e = 0.001, Kernel = RBFKernel, gamma = 0.5. Because the multinomial event model can achieve higher accuracy than the multivariate Bernoulli event model in NB [38], we use the former model to classify a document when the NB classifier is used.

5.4 Performance comparisons of IFFO and MIFFO

The parameters are given as follows: number of fruit flies FN = 20; maximum number of iterations T = 2000; dimension of food position D = 1; threshold L = 30; maximum value of the searching radius λ_max = 0.5; minimum value of the searching radius λ_min = 0.001; maximum value of food position x_max = 1; minimum value of food position x_min = 0.001. On this basis, the fitness function of different metaheuristic optimization algorithms is defined as follows: $F (th) = \frac{\sum_{\mu} F_{1} (N_{a} \times \mu, th)}{10}$ (27) where μ is a ratio which ranges from 1% to 10% with a step of 1%, and F₁ (N_a × μ, th) denotes the F₁ value of a classifier when th is used and N_a × μ features are selected. Further, for each classifier, 100 runs are carried out with IFFO and with MIFFO, and the convergence histories of the average objective F values (denoted as F_a) are given in Figs. 4 and 5, respectively.
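The fitness function of formula (27) amounts to averaging F₁ over the ten feature ratios. In the sketch below, `f1_at` is a hypothetical callback that would run feature selection with the given number of features and threshold, train the classifier and return its F₁ value; here it is replaced by a toy function so that the example is self-contained. Rounding N_a × μ to an integer is also an assumption of this sketch.

```python
def threshold_fitness(f1_at, n_a, th):
    """F(th) from formula (27): mean F1 over feature ratios 1%, 2%, ..., 10%."""
    ratios = [k / 100 for k in range(1, 11)]
    return sum(f1_at(round(n_a * mu), th) for mu in ratios) / len(ratios)

# Hypothetical stand-in for "select features, train, evaluate": it ignores the
# number of features and peaks at th = 0.6.
requested = []
def toy_f1(n_features, th):
    requested.append(n_features)
    return 1.0 - abs(th - 0.6)

fit_peak = threshold_fitness(toy_f1, n_a=1000, th=0.6)
fit_off = threshold_fitness(toy_f1, n_a=1000, th=0.5)
```

An optimizer such as IFFO or MIFFO would then search the interval [x_min, x_max] for the threshold that maximizes this value.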

Fig.4

Convergence histories of F_a values when IFFO is used.

Fig.5

Convergence histories of F_a values when MIFFO is used.

When IFFO is used, convergence is reached after about 1000 and 800 iterations for SVM and NB, respectively, while the corresponding numbers for MIFFO are about 1200 and 1000. The average converged F_a values of IFFO are 0.847 and 0.819 for SVM and NB, whereas those of MIFFO are 0.853 and 0.826, respectively. Although MIFFO converges more slowly than IFFO, it improves the F_a values by 0.006 (SVM) and 0.007 (NB), which indicates that MIFFO makes effective use of the experience of the fruit flies and improves the global searching accuracy compared with IFFO.

5.5 Comparison of execution speed

We compare the average execution time (denoted as et_a) of each method when the ratio of selected features ranges from 1% to 10% with a step of 1%; the results are shown in Table 2. The et_a values of the MI based feature selection methods (CCM, MDMR and GMFS) are much higher than those of traditional feature selection methods such as IG, CMFS and MI. The reason is that the MI based methods consider both the correlations between the candidate feature and all categories and the correlations between the candidate feature and the already selected features, which is much more time-consuming. GMFS obtains the maximum et_a value (21.642 s) on the NE dataset, and MI obtains the minimum et_a value (0.436 s) on the SM dataset. Further, the et_a values of GMFS are considerably higher than those of all other methods, because GMFS uses the PSO algorithm to optimize the selected features. The et_a values of the proposed method are slightly higher than those of IG, CMFS and MI, with an average increment of about 0.7 s over those of CMFS. However, the proposed method is clearly faster than MDMR, CCM and GMFS, with the largest saving of about 19.2 s over GMFS on the NE dataset, illustrating the effectiveness of the proposed method in reducing the execution time of the redundant information filtering process.

Table 2
et_a values of seven feature selection algorithms (in seconds)

Datasets	IG	CMFS	MI	MDMR	CCM	GMFS	The proposed method
RE	1.068	1.324	0.936	8.134	10.925	16.478	2.123
SM	0.634	0.597	0.436	4.263	7.694	11.125	1.341
WE	0.971	1.381	0.921	6.306	8.773	13.387	1.956
NE	1.386	1.632	1.162	10.352	14.538	21.642	2.397
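The time gaps discussed in the text can be recomputed directly from Table 2. The short Python sketch below does this; the dictionary layout and the helper names `mean_gap` and `max_gap` are illustrative choices, while the numbers are copied from the table.

```python
# et_a values in seconds, copied from Table 2.
ET = {
    "RE": {"IG": 1.068, "CMFS": 1.324, "MI": 0.936, "MDMR": 8.134,
           "CCM": 10.925, "GMFS": 16.478, "Proposed": 2.123},
    "SM": {"IG": 0.634, "CMFS": 0.597, "MI": 0.436, "MDMR": 4.263,
           "CCM": 7.694, "GMFS": 11.125, "Proposed": 1.341},
    "WE": {"IG": 0.971, "CMFS": 1.381, "MI": 0.921, "MDMR": 6.306,
           "CCM": 8.773, "GMFS": 13.387, "Proposed": 1.956},
    "NE": {"IG": 1.386, "CMFS": 1.632, "MI": 1.162, "MDMR": 10.352,
           "CCM": 14.538, "GMFS": 21.642, "Proposed": 2.397},
}

def mean_gap(slower, faster):
    """Mean of et_a(slower) - et_a(faster) over the four datasets."""
    return sum(ET[d][slower] - ET[d][faster] for d in ET) / len(ET)

def max_gap(slower, faster):
    """Largest et_a(slower) - et_a(faster) and the dataset where it occurs."""
    return max((ET[d][slower] - ET[d][faster], d) for d in ET)

extra_vs_cmfs = mean_gap("Proposed", "CMFS")
saving, dataset = max_gap("GMFS", "Proposed")
```

On these figures the mean gap to CMFS is roughly 0.72 s, and the largest saving over GMFS is about 19.2 s on the NE dataset.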

5.6 Performances of different feature selection methods

In this section, we compare the proposed method with six typical feature selection methods when the ratio of the final selected features ranges from 2% to 10% with a step of 2%.

Table 3 shows the F₁ values of the different methods on the RE dataset. For each ratio, the highest F₁ value is denoted in bold. When SVM is used, the F₁ values of the proposed method are at least as high as those of the other methods when the ratio of selected features equals 4% or 6%. When NB is used, the proposed method outputs better results than the other methods when the ratio of selected features is 4%, 8% or 10%, and it obtains the highest F₁ value (0.749) when 8% of the features are selected. Figure 6 shows the ROC curves of the different methods on the RE dataset, where the TPR and FPR values are averaged over SVM and NB. In Fig. 6, the numbers in brackets are the AUC values of the different methods. It can be seen that the performances of MDMR and CCM are similar. MI obtains the lowest AUC value (0.925) and the proposed method achieves the highest AUC value (0.966), an improvement of 0.009 over that of CMFS, illustrating the effectiveness of the proposed method in selecting the best features.

Table 3
F₁ values of different feature selection methods on RE dataset when SVM and NB are used, respectively

Classifiers	Feature selections	Ratio of selected features
		2%	4%	6%	8%	10%
SVM	IG	0.715	0.722	0.734	0.739	0.733
	CMFS	0.729	0.728	0.736	0.738	0.745
	MI	0.672	0.699	0.719	0.721	0.722
	MDMR	0.735	0.742	0.758	0.766	0.765
	CCM	0.732	0.747	0.762	0.762	0.772
	GMFS	0.729	0.732	0.742	0.745	0.755
	The proposed method	0.729	0.747	0.766	0.755	0.771
NB	IG	0.683	0.711	0.719	0.726	0.724
	CMFS	0.688	0.719	0.722	0.728	0.728
	MI	0.652	0.704	0.709	0.711	0.715
	MDMR	0.713	0.725	0.732	0.744	0.739
	CCM	0.711	0.728	0.736	0.741	0.743
	GMFS	0.706	0.717	0.725	0.728	0.723
	The proposed method	0.709	0.732	0.732	0.749	0.746

Fig.6

ROC curves of seven methods on RE dataset.
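The AUC values reported with the ROC curves can be obtained from (FPR, TPR) points by the trapezoidal rule. The Python sketch below illustrates this together with the averaging of two classifiers' curves. It assumes that both curves are sampled at the same decision thresholds; the function names and the sample points are hypothetical and are not values read from the figures.

```python
def average_curves(curve_a, curve_b):
    """Point-wise mean of two ROC curves given as lists of (fpr, tpr) pairs.

    Assumes both curves were sampled at the same thresholds, in the same order.
    """
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must have the same number of points")
    return [((fa + fb) / 2, (ta + tb) / 2)
            for (fa, ta), (fb, tb) in zip(curve_a, curve_b)]

def auc_trapezoid(points):
    """Area under a ROC curve by the trapezoidal rule."""
    pts = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical curves for two classifiers.
svm_curve = [(0.0, 0.0), (0.1, 0.8), (1.0, 1.0)]
nb_curve = [(0.0, 0.0), (0.3, 0.8), (1.0, 1.0)]
mean_curve = average_curves(svm_curve, nb_curve)
mean_auc = auc_trapezoid(mean_curve)
```

A curve that lies closer to the top-left corner yields a larger area, which is why a higher AUC is read as better discrimination.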

Table 4 shows the F₁ values of the different methods on the SM dataset. Comparing the two classifiers, SVM achieves higher F₁ values than NB, indicating the high accuracy of SVM on text classification problems. When SVM is used, MDMR and CCM generally perform better than the other methods because they consider the conditional redundancy between the candidate feature and the selected features. The proposed method obtains the highest F₁ value (0.924, shared with MDMR) when 8% of the features are selected. Moreover, when NB is used, the proposed method obtains the highest F₁ values, 0.898 and 0.893, when 4% and 8% of the features are selected, respectively. Figure 7 shows the ROC curves of the different methods on the SM dataset, where the TPR and FPR values are averaged over SVM and NB. According to Fig. 7, CMFS performs slightly better than IG, as its AUC value is slightly higher. Moreover, the AUC values of MDMR, CCM and the proposed method are similar, showing that methods which filter redundant information are well suited to finding the most discriminative features.

Table 4

F₁ values of different feature selection methods on SM dataset when SVM and NB are used, respectively

Classifiers	Feature selections	Ratio of selected features
		2%	4%	6%	8%	10%
SVM	IG	0.902	0.909	0.913	0.911	0.902
	CMFS	0.905	0.907	0.916	0.922	0.913
	MI	0.883	0.885	0.879	0.881	0.886
	MDMR	0.909	0.912	0.926	0.924	0.926
	CCM	0.911	0.915	0.924	0.921	0.929
	GMFS	0.907	0.912	0.918	0.913	0.911
	The proposed method	0.908	0.914	0.922	0.924	0.922
NB	IG	0.873	0.878	0.881	0.889	0.886
	CMFS	0.875	0.882	0.889	0.886	0.889
	MI	0.865	0.871	0.876	0.868	0.891
	MDMR	0.888	0.895	0.889	0.883	0.894
	CCM	0.883	0.896	0.891	0.883	0.889
	GMFS	0.881	0.879	0.876	0.879	0.885
	The proposed method	0.879	0.898	0.888	0.893	0.891

Fig.7

ROC curves of seven methods on SM dataset.

Table 5 shows the F₁ values of the different methods on the WE dataset. With SVM, the highest F₁ value (0.868) is obtained by CCM when 10% of the features are selected, and the second highest (0.864) is obtained by the proposed method when 8% of the features are selected. With NB, the proposed method outperforms the other methods when 4% or 10% of the features are selected, illustrating the effectiveness of the proposed method in classifying the web pages of the WE dataset. Figure 8 shows the ROC curves of the different methods on the WE dataset, where the TPR and FPR values are averaged over SVM and NB. The performances of IG and CMFS are similar, with AUC values of 0.978 and 0.977, respectively. MI and GMFS perform worst, with AUC values of 0.965 and 0.973, respectively. Further, the proposed method obtains the second highest AUC value (0.987), which is higher than those of all other methods except CCM.

Table 5

F₁ values of different feature selection methods on WE dataset when SVM and NB are used, respectively

Classifiers	Feature selections	Ratio of selected features
		2%	4%	6%	8%	10%
SVM	IG	0.836	0.845	0.852	0.849	0.851
	CMFS	0.838	0.844	0.852	0.851	0.853
	MI	0.826	0.829	0.833	0.828	0.824
	MDMR	0.841	0.844	0.851	0.855	0.861
	CCM	0.844	0.849	0.854	0.862	0.868
	GMFS	0.838	0.832	0.836	0.829	0.826
	The proposed method	0.840	0.856	0.852	0.864	0.859
NB	IG	0.829	0.812	0.815	0.820	0.818
	CMFS	0.822	0.826	0.819	0.824	0.815
	MI	0.793	0.806	0.802	0.811	0.818
	MDMR	0.829	0.836	0.835	0.843	0.845
	CCM	0.822	0.834	0.838	0.838	0.849
	GMFS	0.819	0.831	0.824	0.831	0.822
	The proposed method	0.824	0.837	0.837	0.842	0.855

Fig.8

ROC curves of seven methods on WE dataset.

Table 6 shows the F₁ values of the different methods on the NE dataset. When SVM is used, the proposed method obtains the highest F₁ value twice (at 4% and 6%). When NB is used, the F₁ values of the proposed method are higher than those of the other methods when 6% or 8% of the features are selected, and its largest improvement is 0.029 over CMFS when 10% of the features are selected. Figure 9 shows the ROC curves of the different feature selection methods on the NE dataset, where the TPR and FPR values are averaged over SVM and NB. The proposed method and MDMR are the two best methods, with AUC values of 0.981 and 0.979, respectively. In general, the MI based feature selection methods perform better than the other methods. The reason is that the features they select contain less redundant information and therefore achieve high category discriminating ability in multi-label classification tasks.

Table 6

F₁ values of different feature selection methods on NE dataset when SVM and NB are used, respectively

Classifiers	Feature selections	Ratio of selected features
		2%	4%	6%	8%	10%
SVM	IG	0.832	0.839	0.833	0.832	0.825
	CMFS	0.829	0.833	0.838	0.841	0.828
	MI	0.803	0.811	0.812	0.821	0.818
	MDMR	0.833	0.842	0.846	0.853	0.861
	CCM	0.832	0.839	0.851	0.842	0.846
	GMFS	0.826	0.835	0.841	0.838	0.836
	The proposed method	0.832	0.847	0.855	0.849	0.858
NB	IG	0.804	0.813	0.821	0.825	0.817
	CMFS	0.801	0.816	0.825	0.827	0.812
	MI	0.782	0.793	0.794	0.801	0.785
	MDMR	0.812	0.824	0.839	0.835	0.842
	CCM	0.809	0.817	0.831	0.829	0.833
	GMFS	0.803	0.815	0.830	0.827	0.828
	The proposed method	0.809	0.822	0.841	0.838	0.841

Fig.9

ROC curves of seven methods on NE dataset.

5.7 Statistical results and discussion

We use the widely used nonparametric two-tailed Wilcoxon signed ranks test [39] to evaluate the performance of our method in terms of F₁ value with the SVM and NB classifiers. The test computes a test statistic, the z value, which measures how statistically significant the difference between two methods is.

The null hypothesis of the Wilcoxon signed ranks test is that the paired differences are distributed symmetrically around zero, i.e., that the two compared methods perform equally. If |z| is less than the critical z value (Z_α), the null hypothesis is accepted; otherwise, it is rejected. When the significance level (α) equals 0.05, Z_α equals 1.96 [40]. On this basis, 10 runs are carried out and the average z values between the proposed method and each of the other methods are computed when the ratio of selected features ranges from 1% to 10% with a step of 1%; the results are shown in Table 7. According to Table 7, the z values are greater than Z_α in 30 of 48 cases, showing that the proposed method significantly outperforms the other methods in 62.5% of the cases. As the traditional feature selection methods (IG, CMFS and MI) do not deal with redundant information, their z values are positive in all cases and greater than Z_α in 22 of 24 cases. When compared with the OFS method (CMFS), the corresponding z values are greater than Z_α in 7 of 8 cases (the exception is SVM on the WE dataset, with z = 1.66), showing the improvement of the proposed method over the OFS method. In addition, the proposed method performs similarly to MDMR and CCM in most cases and outperforms GMFS in 5 of 8 cases, illustrating the effectiveness of the proposed method in filtering redundant information.
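The z statistic of the Wilcoxon signed ranks test can be sketched in Python with the usual normal approximation. This is a minimal illustration under stated assumptions: zero differences are dropped, tied absolute differences receive average ranks, no tie correction is applied to the variance, and the function name `wilcoxon_z` and the sample scores are hypothetical.

```python
import math

Z_ALPHA = 1.96  # critical value for a two-tailed test at alpha = 0.05

def wilcoxon_z(a, b):
    """z statistic (normal approximation) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean) / sd

# Hypothetical paired scores: method A beats method B in every pair.
scores_a = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
scores_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
z = wilcoxon_z(scores_a, scores_b)
reject_null = abs(z) > Z_ALPHA
```

A positive z means the first method tends to score higher, and the null hypothesis is rejected when |z| exceeds 1.96.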

Table 7
Results of statistical analysis when SVM and NB are used, respectively

Datasets	Classifiers	IG	CMFS	MI	MDMR	CCM	GMFS
RE	SVM	2.17	2.04	2.22	1.53	1.02	2.04
	NB	2.33	2.45	2.62	1.76	0.89	2.13
SM	SVM	2.24	2.16	2.71	-0.77	-1.19	1.79
	NB	2.38	1.99	2.62	1.88	1.67	2.47
WE	SVM	2.45	1.66	2.38	1.98	1.66	2.11
	NB	1.94	2.05	2.45	1.83	1.52	1.95
NE	SVM	2.38	1.98	2.12	1.67	1.99	2.02
	NB	2.33	2.06	2.45	2.03	1.56	1.93
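The counts of significant cases can be tallied directly from Table 7. The following Python sketch does so; the data layout and variable names are illustrative choices, and the z values are copied from the table.

```python
Z_ALPHA = 1.96
METHODS = ["IG", "CMFS", "MI", "MDMR", "CCM", "GMFS"]

# z values from Table 7, keyed by (dataset, classifier).
Z = {
    ("RE", "SVM"): [2.17, 2.04, 2.22, 1.53, 1.02, 2.04],
    ("RE", "NB"):  [2.33, 2.45, 2.62, 1.76, 0.89, 2.13],
    ("SM", "SVM"): [2.24, 2.16, 2.71, -0.77, -1.19, 1.79],
    ("SM", "NB"):  [2.38, 1.99, 2.62, 1.88, 1.67, 2.47],
    ("WE", "SVM"): [2.45, 1.66, 2.38, 1.98, 1.66, 2.11],
    ("WE", "NB"):  [1.94, 2.05, 2.45, 1.83, 1.52, 1.95],
    ("NE", "SVM"): [2.38, 1.98, 2.12, 1.67, 1.99, 2.02],
    ("NE", "NB"):  [2.33, 2.06, 2.45, 2.03, 1.56, 1.93],
}

total_cases = sum(len(row) for row in Z.values())
significant = sum(z > Z_ALPHA for row in Z.values() for z in row)
per_method = {m: sum(row[i] > Z_ALPHA for row in Z.values())
              for i, m in enumerate(METHODS)}
share = significant / total_cases
```

This gives 30 significant cases out of 48 (62.5%), split by compared method as IG 7, CMFS 7, MI 8, MDMR 2, CCM 1 and GMFS 5 out of 8 each.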

6 Conclusions

Reducing the dimensionality of the feature space and improving the quality of the selected features are very important in text classification. Traditional filter feature selection methods cannot deal with redundant information, which degrades classification accuracy. Moreover, MI based feature selection methods suffer from high computational complexity and parameter dependency. On this basis, we propose a new two-step based feature selection method for filtering redundant features. To improve on the execution speed of MI based feature selection methods, we first introduce the definitions of word semantic correlation, set semantic correlation, semantic correlative and semantic correlative set; then, based on these definitions, we give the details of generating a set of semantic correlative sets and propose a two-step based feature selection method for filtering the redundant information of the selected features. To search for the best parameter, we introduce the concept of memory recall position and propose an improved memory recall mechanism based FOA algorithm (called MIFFO), which avoids the problem of local optima. The efficiency of the proposed method is examined through text classification experiments with SVM and NB classifiers on four textual datasets: Reuters50, SMSSPAS, WebKB and 20-Newsgroups. By comparing the proposed method with six typical feature selection methods in terms of execution speed and classification accuracy, we find that: (1) the proposed method offers an obvious improvement in classification accuracy over traditional filter methods; (2) the proposed method greatly improves the execution speed while maintaining classification accuracy when compared with MI based feature selection methods.

In the future, we will study the following two aspects in depth: (1) generating a new set of semantic correlative sets which contains more semantic information; (2) investigating new OFS methods to improve the performance of the proposed method and extend it to online applications.

Acknowledgments

This research is supported by the Beijing Natural Science Foundation under grant no. 4174105, the Discipline Generation Foundation of the Central University of Finance and Economics under grant no. 2016XX02, and the Young Teachers Development Fund of the Central University of Finance and Economics (No. QJJ1635).

References

1. Bharti K.K. and Singh P.K., Hybrid dimension reduction by integrating feature selection with feature extraction method for text clustering [J], Expert Systems with Applications 42(6) (2015), 3105–3114.
2. Azam and Yao, Comparison of term frequency and document frequency based feature selection metrics in text categorization [J], Expert Systems with Applications 39(5) (2012), 4760–4768.
3. Yang and Pedersen, A comparative study on feature set selection in text categorization, In Fisher D.H. (Ed.), Proceedings of the 14th International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann, 1997, pp. 412–420.
4. Shang, Huang, Zhu, et al., A novel feature selection algorithm for text categorization, Expert Systems with Applications 33(1) (2007), 1–5.
5. Mosteller, Association and estimation in contingency tables, Journal of the American Statistical Association (American Statistical Association) 63(321) (1986), 1–28.
6. Mengle S.S.R. and Goharian, Ambiguity measure feature selection algorithm, Journal of the American Society for Information Science and Technology 60(5) (2009), 1037–1050.
7. Azam and Yao, Comparison of term frequency and document frequency based feature selection metrics in text categorization, Expert Systems with Applications 39(5) (2012), 4760–4768.
8. Wang, Zhang, Liu, et al., Feature selection based on term frequency and T-test for text categorization [J], Pattern Recognition Letters 45(1) (2013), 1482–1486.
9. Wang, Liu and Zhu, Two-step based hybrid feature selection method for spam filtering [J], Journal of Intelligent & Fuzzy Systems 27(6) (2014), 2785–2796.
10. Joseph A.A., Tokumoto and Ozawa, Online feature extraction based on accelerated kernel principal component analysis for data stream [J], Evolving Systems 7(1) (2016), 1–13.
11. Elghazel, Aussem, Gharroudi, et al., Ensemble multi-label text categorization based on rotation forest and latent semantic indexing [J], Expert Systems with Applications 57(C) (2016), 1–11.
12. Tenenhaus, Vinzi V.E., Chatelin Y.M., et al., PLS path modeling [J], Computational Statistics & Data Analysis 48(1) (2005), 159–205.
13. Kruskal J.B. and Wish, Multidimensional scaling [M], Sage, 1978.
14. Zhang, Clark R.A.J., Wang, et al., Unsupervised language identification based on Latent Dirichlet Allocation [J], Computer Speech and Language 39 (2016), 47–66.
15. Han and Ren, Global mutual information-based feature selection approach using single-objective and multi-objective optimization [J], Neurocomputing 168(C) (2015), 47–54.
16. Liu and Yu, Toward integrating feature selection algorithms for classification and clustering, IEEE Transactions on Knowledge and Data Engineering 17(4) (2005), 491–502.
17. Sebastiani, Machine learning in automated text categorization [J], ACM Computing Surveys 34(1) (2002), 1–47.
18. Yang and Pedersen J.O., A comparative study on feature selection in text categorization [C], in: Proceedings of the Fourteenth International Conference on Machine Learning, 1997, pp. 412–420.
19. Peng, Long and Ding, Feature selection based on mutual information criteria of max-dependency: Max-relevance, and min redundancy [J], IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8) (2005), 1226–1238.
20. Zhang and Zhang, Feature subset selection with cumulate conditional mutual information minimization [J], Expert Systems with Applications 39(5) (2012), 6078–6088.
21. Norvig and Russell S.J., Artificial intelligence: A modern approach [J], Applied Mechanics & Materials 263(5) (2003), 2829–2833.
22. Jaskowiak P.A. and Campello R.J.G.B., A cluster based hybrid feature selection approach [C], Brazilian Conference on Intelligent Systems IEEE, 2015, pp. 43–48.
23. Battiti, Using mutual information for selecting features in supervised neural net learning, IEEE Trans Neural Netw 5(4) (1994), 537–550.
24. Estévez P.A., et al., Normalized mutual information feature selection, IEEE Trans Neural Netw 20(2) (2009), 189–201.
25. Wang, Li and Li, A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure [J], Information Sciences 307 (2015), 73–88.
26. Lin, Hu, Liu, et al., Multi-label feature selection based on max-dependency and min-redundancy [J], Neurocomputing 168 (2015), 92–103.
27. Huang, Cai and Xu, A hybrid genetic algorithm for feature selection wrapper based on mutual information [J], 28(13) (2007), 1825–1844.
28. Yang, Liu, Zhu, et al., A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization, Inform Process Manage 48(4) (2012), 741–754.
29. Swingle, Renyi entropy, mutual information, and fluctuation properties of Fermi liquids [J], Physical Review B Condensed Matter 86(4) (2010), 7794–7794.
30. Hoque, Bhattacharyya D.K. and Kalita J.K., MIFS-ND: A mutual information-based feature selection method [J], Expert Systems with Applications 41(14) (2014), 6371–6385.
31. Han and Ren, Global mutual information-based feature selection approach using single-objective and multi-objective optimization [J], Neurocomputing 168(C) (2015), 47–54.
32. Pan W.T., A new fruit fly optimization algorithm: Taking the financial distress model as an example [J], Knowledge-Based Systems 26(C) (2012), 69–74.
33. , Zuo and Zhang, A cloud model based fruit fly optimization algorithm [J], Knowledge-Based Systems 89(C) (2015), 603–617.
34. Pan Q.K., Sang H.Y., Duan J.H., et al., An improved fruit fly optimization algorithm for continuous function optimization problems [J], Knowledge-Based Systems 62(5) (2014), 69–83.
35. Yang J.M., Liu Z.Y. and Qu Z.Y., A novel feature selection based gravitation for text categorization [J], International Journal of Database Theory and Application 9(3) (2016), 211–228.
36. Porter M.F., An algorithm for suffix stripping [M]. Readings in information retrieval. Morgan Kaufmann Publishers Inc., 1997, pp. 130–137.
37. Nemade P.A. and Pardasani K.R., Fuzzy support vector machine model to predict human death domain protein–protein interactions [J], Network Modeling Analysis in Health Informatics and Bioinformatics 4(1) (2015), 1–12.
38. McCallum and Nigam, A comparison of event models for naive Bayes spam filtering [C], EACL ’03 Proceedings of the Tenth Conference on European Chapter of the Association for Computational Linguistics, Volume 1, pp. 307–314.
39. Taheri and Hesamian, A generalization of the Wilcoxon signed-rank test and its applications [J], Statistical Papers 54(2) (2013), 457–470.
40. Corder G.W. and Foreman D.I., Comparing Two Related Samples: The Wilcoxon Signed Ranks Test [M], Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, John Wiley & Sons, Inc, 2011, pp. 38–56.
