A novel feature selection method considering feature interaction in neighborhood rough set

Abstract

Feature selection has been shown to be a highly valuable strategy in data mining, pattern recognition, and machine learning. However, the majority of proposed feature selection methods do not account for feature interaction when calculating feature correlations. Interactive features are features that have little individual relevance to the class but provide more joint information about the class when combined with other features. Motivated by this observation, a novel feature selection algorithm that considers feature relevance, redundancy, and interaction in the neighborhood rough set is proposed. First, a new information measure called neighborhood symmetric uncertainty is proposed to measure how much information a feature contains about the class label. Next, a new objective evaluation function for interactive feature selection is developed. Then a novel feature selection algorithm named NSUNCMI, based on measuring feature relevance, redundancy, and interaction, is proposed. Experimental results on nine real-world datasets, compared with five representative feature selection algorithms, indicate that NSUNCMI reduces the dimensionality of the feature space efficiently and offers the best average classification accuracy.

Keywords

Feature selection, feature interaction, neighborhood rough set, neighborhood symmetrical uncertainty

1. Introduction

Feature selection is used to select a group of the most informative and useful features. It has extensive applications in fields such as data analysis, data mining, and pattern recognition [1]. How to choose the best feature subset from all candidates is considered a topic worthy of study in various learning tasks. In principle, high-dimensional data can improve classification performance because more features are available. However, the many irrelevant and redundant features in high-dimensional data not only fail to contribute to classification performance and accuracy but also considerably increase the computational complexity and memory storage requirements [2]. Therefore, it is necessary to select a set of representative features before building a model or classifier.

In recent years, researchers have proposed many feature selection algorithms. However, most of them concentrate only on identifying relevant features and removing as many redundant features as possible. Although some recent work has recognized the existence and impact of feature interaction, there is little work on the specific treatment of feature interaction [3]. Yet features may interact with one another and work together to reflect the nature of a problem. Everything in the world is an interrelated organic whole. In biological studies, for example, changes in the physiological and pathological processes of a complex organism are usually affected by molecular interactions. Another classic example of interaction is the well-known XOR problem. Therefore, when building any model or algorithm, the factor of interaction is worthy of in-depth discussion and study.

In 1982, Z. Pawlak put forward rough set theory (RST), based on the view that knowledge is a kind of classification ability. It has been widely applied to feature selection (also referred to as attribute reduction) in recent years [4]. However, it is only suitable for processing discrete data, not continuous data. Therefore, many scholars have extended the classical rough set model, for example to the fuzzy rough set and the dominance-based rough set. Hu et al. [5] introduced the idea of neighborhoods into RST and proposed a new neighborhood rough set (NRS) model based on neighborhood relations. In recent years, various algorithms based on NRS have been widely used in feature selection, image labeling, hyperspectral classification, and other fields.

Feature subset selection can be viewed as a search problem. It aims to choose a feature subset that retains as much of the information in the original features as possible under some predefined criteria [1]. Therefore, it is vital to measure the relevance between classes and features and to measure the redundancy and interaction between features. In recent years, information theory has been widely used to measure the amount of information contained in a feature. Researchers have proposed a variety of feature selection algorithms that increase the relevance between features and classes and the interaction between features while reducing the redundancy between features [8]. The majority of these strategies can be summarized within the following framework [9]:

$\displaystyle J(f_{i})=I(f_{i};D)-\beta\sum\limits_{f_{\textit{sel}}\in S}I(f_{i};f_{\textit{sel}})+\lambda\sum\limits_{f_{\textit{sel}}\in S}I(f_{i};f_{\textit{sel}}|D)$

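where $f_{i}$ represents a candidate feature, $f_{\textit{sel}}$ represents a selected feature, $S$ represents the selected feature set, $D$ is the class, and $\beta$ and $\lambda$ are weighting coefficients for the redundancy and conditional redundancy terms.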
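
The framework above can be sketched for discrete features as follows. This is a minimal illustration rather than any cited author's implementation: the helper names (`entropy`, `mi`, `cmi`, `criterion`) and the XOR toy data are assumptions introduced here, and mutual information is estimated from plain frequency counts.

```python
from collections import Counter
from math import log2


def entropy(*cols):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    joint = list(zip(*cols))
    n = len(joint)
    return -sum(c / n * log2(c / n) for c in Counter(joint).values())


def mi(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def cmi(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def criterion(f_i, selected, d, beta=1.0, lam=1.0):
    """J(f_i) = I(f_i;D) - beta * sum I(f_i;f_sel) + lam * sum I(f_i;f_sel|D)."""
    redundancy = sum(mi(f_i, f_sel) for f_sel in selected)
    conditional = sum(cmi(f_i, f_sel, d) for f_sel in selected)
    return mi(f_i, d) - beta * redundancy + lam * conditional


# Hypothetical XOR data: the class d is a XOR b, so b is useful only jointly with a.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
d = [0, 1, 1, 0]

print(criterion(b, [a], d, beta=1.0, lam=0.0))  # 0.0: without the conditional term, b looks useless
print(criterion(b, [a], d, beta=1.0, lam=1.0))  # 1.0: the conditional term exposes the interaction
```

Setting $\lambda=0$ recovers a purely relevance-minus-redundancy score, which is why criteria of that form cannot reward an interactive feature such as `b`.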

A simple method termed Mutual Information Maximization (MIM) is proposed in [18]. Dash and Liu [11] concentrate on a measure called consistency, which is the proportion of samples correctly recognized according to the majority voting technique. Roberto [12] investigates the application of the mutual information criterion to evaluate a set of candidate features and to select an informative subset to be used as input data for a neural network classifier. Peng [13] studies how to select good features according to the maximal statistical dependency criterion based on mutual information. Mohamed [14] uses mutual information and the ‘maximum of the minimum’ criterion, which alleviates the problem of overestimating feature significance. A new information term denoted Independent Classification Information is proposed in [15]. It combines the newly provided information and the preserved information that is negatively correlated with the redundant information. By choosing features that maximize their mutual information with the class to predict, conditional on any feature already picked, the method in [16] ensures the selection of features that are both individually informative and pairwise weakly dependent. An interaction weight factor, which reflects whether a feature is redundant or interactive, is proposed in [3]. Bennasar [17] introduced a new feature selection method, Feature Interaction Maximization (FIM), which employs three-way interaction information as a measure of feature redundancy. The method in [19] employs interaction information to guide the search, sequentially adds one feature at a time to the currently selected set, and adopts early stopping to prevent overfitting and speed up the search.

From the above, it can be seen that information theory has been used in many feature selection algorithms. Some classic evaluation criteria for feature selection are summarized in Table 1 below.

Table 1
Evaluation criteria of the algorithm

Algorithms	Evaluation criteria
MIM [18]	$J(f_{i})=I(f_{i};D)$
MIFS [12]	$J(f_{i})=I(f_{i};D)-\beta\sum\limits_{f_{\textit{sel}}\in S}{I(f_{i};f_{\textit{sel}})}$
MRMR [13]	$J(f_{i})=I(f_{i};D)-\frac{1}{|S|}\sum\limits_{f_{\textit{sel}}\in S}{I(f_{i};f_{\textit{sel}})}$
JMIM [14]	$J(f_{i})=I(f_{i};D)-\frac{1}{|S|}\sum\limits_{f_{\textit{sel}}\in S}{I(f_{i};f_{\textit{sel}})}+\frac{1}{|S|}\sum\limits_{f_{\textit{sel}}\in S}{I(f_{i};f_{\textit{sel}}|D)}$
MRI [15]	$J(f_{i})=I(f_{i};D)+\sum\limits_{f_{\textit{sel}}\in S}{\left[I(f_{i};D|f_{\textit{sel}})+I(f_{\textit{sel}};D|f_{i})\right]}$
CMIM [16]	$J(f_{i})=I(f_{i};D)-\max_{f_{\textit{sel}}\in S}(I(f_{i};f_{\textit{sel}})-I(f_{i};f_{\textit{sel}}|D))$
IWFS [3]	$J(f_{i})=(IW(f_{i};f_{\textit{sel}})+1)\times SU(f_{i};D)$
FIM [17]	$J(f_{i})=I(f_{i};D)+\mathop{\min}\limits_{f_{\textit{sel}}\in S}(I(f_{i};f_{\textit{sel}};D))$
IGIS [19]	$J(f_{i})=\arg\mathop{\max}\limits_{f_{i}\in C-S}\left[I(f_{i};D)+\frac{1}{|S|}\sum\limits_{f_{\textit{sel}}\in S}{I(f_{i};f_{\textit{sel}};D)}\right]$

In NRS, neighborhood mutual information (NMI), an important information-theoretic measure, is commonly used to assess the correlation between features and classes and between features in classification problems [6]. Because NMI favors features with many values, a new information measure called neighborhood symmetric uncertainty (NSU) is proposed. Meanwhile, neighborhood conditional mutual information (NCMI) is introduced to evaluate the interaction between features; it reflects the influence of a newly added feature on the relationship between features and classes [4]. Accordingly, in this work, a novel interaction feature selection algorithm based on neighborhood symmetric uncertainty and neighborhood conditional mutual information (NSUNCMI) is proposed. The main contributions of this article are as follows [7]:

(1)

Inspired by symmetric uncertainty, we propose a new information measure, namely neighborhood symmetric uncertainty (NSU) in the neighborhood rough set, to measure how much information feature $f$ contains about class label $d$ . NSU can be seen as the normalized NMI by scaling its range to [0, 1] and it is able to overcome the inherent shortcoming of NMI.

(2)

In this paper, the characteristic of the interaction is explored thoroughly. Further, the characteristics of feature correlation, feature redundancy, feature interaction are summarized and analyzed from different angles.

The remainder of this paper is organized as follows [3]. In Section 2, some fundamental information-theoretic notions are reviewed. In Section 3, we give definitions of feature relevance, feature redundancy, and feature interaction from different angles. We then put forward a new feature selection algorithm in Section 4. Experimental results and analysis are presented in Section 5. Finally, we give a brief conclusion and outline future research directions in Section 6.

2. Preliminaries

2.1 Basic concepts

A neighborhood decision system can be expressed as a quintuple $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , where $U=\{x_{1},x_{2},\cdots,x_{n}\}$ is a universe; $C=\{f_{1},f_{2},\cdots,f_{m}\}$ is a conditional attribute set; $D=\{d\}$ is a decision attribute set; $V$ is the union of feature domains such that $V=\mathop{\cup}\limits_{f_{j}\in C\cup D}V_{f_{j}}$ ; $f:U\times(C\cup D)\to V$ is an information function; and $\delta$ is a neighborhood parameter.

Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , $\forall x,y\in U$ , an attribute subset $P\subseteq C$ , a distance function on $P$ is defined as $\Delta_{P}(x,y)=\left(\sum\limits_{a_{i}\in P}|f(x,a_{i})-f(y,a_{i})|^{\tau}\right)^{\frac{1}{\tau}}$ .

Property 1 ([5]). Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , $\forall P\subseteq C$ , $\forall x,y,z\in U$ , the distance function $\Delta_{P}$ satisfies the following properties:

(1)
Non-negativity: $\Delta_{P}(x,y)\geqslant 0$ , $\Delta_{P}(x,y)=0$ iff $x=y$ ;
(2)
Symmetric: $\Delta_{P}(x,y)=\Delta_{P}(y,x)$ ;
(3)
Triangular inequality: $\Delta_{P}(x,y)\leqslant\Delta_{P}(x,z)+\Delta_{P}(z,y)$ .

Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , $\forall x\in U$ , $P\subseteq C$ , the neighborhood granule of $x$ on $P$ is defined as $N_{P}^{\delta}(x)=\{y|{\Delta_{P}(y,x)\leqslant\delta,\;y}\in U\}$ .

Property 2 ([5]). For any $B\subseteq C$ and $x\in U$ , the neighborhood granule $N_{B}^{\delta}(x)$ satisfies the following properties:

(1)
$N_{B}^{\delta}(x)\neq\emptyset$ ;
(2)
$x\in N_{B}^{\delta}(x)$ ;
(3)
$y\in N_{B}^{\delta}(x)\Leftrightarrow x\in N_{B}^{\delta}(y)$ ;
(4)
$\bigcup\limits_{x\in U}{N_{B}^{\delta}(x)}=U$ ;

Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , $P\subseteq C$ , the neighborhood relation on $P$ is defined as $NR_{P}^{\delta}=\{(x,y)\in U\times U|{\Delta_{P}(x,y)\leqslant\delta}\}$ . $NR_{P}^{\delta}$ on the universe is denoted as a relation matrix $M(N)=(r_{ij})_{n\times n}$ , where $r_{ij}=\left\{{{\begin{array}[]{ll}{1}&{x_{j}\in\delta(x_{i})}\\ {0}&\text{otherwise}\\ \end{array}}}\right.$ .
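
The distance function and the relation matrix defined above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the toy samples, and the choices $\tau=2$ and $\delta=0.2$ are assumptions made here for demonstration.

```python
def minkowski(x, y, cols, tau=2):
    """Distance Delta_P(x, y) restricted to the attribute subset `cols`."""
    return sum(abs(x[c] - y[c]) ** tau for c in cols) ** (1.0 / tau)


def relation_matrix(X, cols, delta, tau=2):
    """r_ij = 1 when x_j lies in the delta-neighborhood of x_i on `cols`, else 0."""
    n = len(X)
    return [[1 if minkowski(X[i], X[j], cols, tau) <= delta else 0
             for j in range(n)]
            for i in range(n)]


# Hypothetical samples with two features scaled to [0, 1].
X = [[0.10, 0.20],
     [0.15, 0.25],
     [0.90, 0.80]]

M = relation_matrix(X, cols=[0, 1], delta=0.2)
for row in M:
    print(row)
# [1, 1, 0]
# [1, 1, 0]
# [0, 0, 1]
```

The diagonal of ones and the symmetry of `M` correspond to items (2) and (3) of Property 2.
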
2.2 Information measurements

Definition 1 (Neighborhood Entropy [4]). Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , $\delta\geqslant 0$ , $\forall S\subseteq C$ , the neighborhood relation on $S$ is expressed as $NR_{S}^{\delta}$ . The neighborhood of $x_{i}\in U$ obtained from $S$ is then denoted $\delta_{S}(x_{i})$ . The neighborhood entropy of the sample set with respect to $S$ is defined as

$\displaystyle NE_{\delta}(S)=-\frac{1}{|U|}\sum\limits_{i=1}^{|U|}{\log_{2}\frac{|{\delta_{S}(x_{i})}|}{|U|}}$ (1)

Definition 2 (Neighborhood Joint Entropy [4]). Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , $\delta\geqslant 0$ , $\forall R$ , $S\subseteq C$ , the neighborhood joint entropy of $R$ and $S$ is defined as

$\displaystyle NE_{\delta}(R,S)=-\frac{1}{|U|}\sum\limits_{i=1}^{|U|}{\log_{2}\frac{|{\delta_{R\cup S}(x_{i})}|}{|U|}}$ (2)

Definition 3 (Neighborhood Conditional Entropy [5]). Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , $\delta\geqslant 0$ , $\forall R$ , $S\subseteq C$ , under the condition that $S$ is known, the information entropy of $R$ is expressed as the conditional entropy of $R$ with respect to $S$ , which is defined as

$\displaystyle NE_{\delta}(R|S)=-\frac{1}{|U|}\sum\limits_{i=1}^{|U|}{\log_{2}\frac{|{\delta_{R\cup S}(x_{i})}|}{|{\delta_{S}(x_{i})}|}}$ (3)

Proposition 1. For $\forall R$ , $S\subseteq C$ , $NE_{\delta}(R|S)=NE_{\delta}(R,S)-NE_{\delta}(S)$ .
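
Definitions 1–3 and Proposition 1 can be checked numerically with a short sketch. The helper names, the toy data, and the parameters ($\tau=2$, $\delta=0.2$) are assumptions introduced for illustration; the joint neighborhood $\delta_{R\cup S}(x_{i})$ is computed with the distance function of Section 2.1 on the union of the columns.

```python
from math import log2


def granules(X, cols, delta, tau=2):
    """Neighborhood granule of every sample on the attribute subset `cols`."""
    n = len(X)

    def dist(i, j):
        return sum(abs(X[i][c] - X[j][c]) ** tau for c in cols) ** (1.0 / tau)

    return [[j for j in range(n) if dist(i, j) <= delta] for i in range(n)]


def ne(X, cols, delta):
    """Eq. (1); passing the union of two subsets gives the joint entropy, Eq. (2)."""
    n = len(X)
    return -sum(log2(len(g) / n) for g in granules(X, cols, delta)) / n


def ne_cond(X, r, s, delta):
    """Neighborhood conditional entropy NE(R|S), Eq. (3)."""
    n = len(X)
    joint = granules(X, r + s, delta)
    given = granules(X, s, delta)
    return -sum(log2(len(a) / len(b)) for a, b in zip(joint, given)) / n


# Hypothetical samples: each feature alone splits the data into two groups,
# and the two features together isolate every sample.
X = [[0.10, 0.20],
     [0.15, 0.90],
     [0.80, 0.25],
     [0.85, 0.95]]
delta = 0.2

print(ne(X, [0], delta))             # 1.0
print(ne(X, [0, 1], delta))          # 2.0
print(ne_cond(X, [0], [1], delta))   # 1.0 = NE(R,S) - NE(S), as Proposition 1 states
```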

Definition 4 (Neighborhood Mutual Information [4]). Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , $\delta\geqslant 0$ , $\forall R$ , $S\subseteq C$ , the neighborhood mutual information of $R$ and $S$ is defined as

$\displaystyle\textit{NMI}_{\delta}(R;S)=-\frac{1}{|U|}\sum\limits_{i=1}^{|U|}{\log_{2}\frac{|{\delta_{R}(x_{i})}|\cdot|{\delta_{S}(x_{i})}|}{|U|\cdot|{\delta_{R\cup S}(x_{i})}|}}$ (4)

The neighborhood mutual information can be used to measure the contribution of feature subset $R$ to reduce the uncertainty of classification under the condition that the selected feature subset $S$ is known. If $R$ is a feature and $S$ is a class label, then $\textit{NMI}_{\delta}(R;S)$ measures how much information feature $R$ contains about class label $S$ .

Proposition 2 [4]. For $\forall R$ , $S\subseteq C$ , according to Definitions 1–4 and Proposition 1, we have the following three inferences.

(1)

$\textit{NMI}_{\delta}(R;S)=\textit{NMI}_{\delta}(S;R)$ ;

(2)

$\textit{NMI}_{\delta}(R;S)=NE_{\delta}(R)+NE_{\delta}(S)-NE_{\delta}(R,S)$ ;

(3)

$\textit{NMI}_{\delta}(R;S)=NE_{\delta}(R)-NE_{\delta}(R|S)=NE_{\delta}(S)-NE_{\delta}(S|R)$ .

Definition 5 (Neighborhood Symmetrical Uncertainty). Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , $\delta\geqslant 0$ , $\forall R$ , $S\subseteq C$ , the neighborhood symmetrical uncertainty of $R$ and $S$ is defined as

$\displaystyle\textit{NSU}_{\delta}(R;S)=2\times\frac{\textit{NMI}_{\delta}(R;S)}{NE_{\delta}(R)+NE_{\delta}(S)}$ (5)

$\textit{NSU}_{\delta}(R;S)$ can be seen as the normalized $\textit{NMI}_{\delta}(R;S)$ by scaling its range to [0, 1]. It is able to overcome the inherent shortcoming of $\textit{NMI}_{\delta}(R;S)$ .
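
Equations (4) and (5) can be illustrated with the following sketch. The helper names and both toy datasets are assumptions made here; returning 0.0 when both entropies vanish is likewise an assumed convention, since Definition 5 leaves that case open.

```python
from math import log2


def granules(X, cols, delta, tau=2):
    """Neighborhood granule of every sample on the attribute subset `cols`."""
    n = len(X)

    def dist(i, j):
        return sum(abs(X[i][c] - X[j][c]) ** tau for c in cols) ** (1.0 / tau)

    return [[j for j in range(n) if dist(i, j) <= delta] for i in range(n)]


def ne(X, cols, delta):
    """Neighborhood (joint) entropy, Eqs. (1)-(2)."""
    n = len(X)
    return -sum(log2(len(g) / n) for g in granules(X, cols, delta)) / n


def nmi(X, r, s, delta):
    """Neighborhood mutual information computed directly from Eq. (4)."""
    n = len(X)
    g_r, g_s = granules(X, r, delta), granules(X, s, delta)
    g_rs = granules(X, r + s, delta)
    return -sum(log2(len(a) * len(b) / (n * len(c)))
                for a, b, c in zip(g_r, g_s, g_rs)) / n


def nsu(X, r, s, delta):
    """Neighborhood symmetrical uncertainty, Eq. (5)."""
    denom = ne(X, r, delta) + ne(X, s, delta)
    return 2.0 * nmi(X, r, s, delta) / denom if denom > 0 else 0.0


# Hypothetical data: in `redundant` the two columns carry the same grouping;
# in `unrelated` the groupings are independent of each other.
redundant = [[0.10, 0.12], [0.15, 0.14], [0.80, 0.82], [0.85, 0.88]]
unrelated = [[0.10, 0.20], [0.15, 0.90], [0.80, 0.25], [0.85, 0.95]]

print(nsu(redundant, [0], [1], 0.2))        # 1.0, the upper end of the range
print(abs(nsu(unrelated, [0], [1], 0.2)))   # 0.0, the lower end of the range
```

On both datasets the value from Eq. (4) agrees with $NE_{\delta}(R)+NE_{\delta}(S)-NE_{\delta}(R,S)$ , as stated in Proposition 2 (2).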

Definition 6 (Neighborhood Conditional Mutual Information [4]). Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , $\delta\geqslant 0$ , $\forall R,S,P\subseteq C$ , under the condition of known $S$ , the neighborhood conditional mutual information of $R$ and $P$ is defined as

$\displaystyle\textit{NCMI}_{\delta}(R;P|S)=-\frac{1}{|U|}\sum\limits_{i=1}^{|U|}{\log_{2}\frac{|{\delta_{R\cup S}(x_{i})}|\cdot|{\delta_{P\cup S}(x_{i})}|}{|{\delta_{S}(x_{i})}|\cdot|{\delta_{R\cup S\cup P}(x_{i})}|}}$ (6)

The neighborhood conditional mutual information is the reduction in the uncertainty of $R$ due to knowledge of $P$ when $S$ is known.

Proposition 3 [3]. For $\forall R,S,P\subseteq C$ , we have the following two inferences.

(1)

$\textit{NCMI}_{\delta}(R;P|S)=NE_{\delta}(R,S)+NE_{\delta}(P,S)-NE_{\delta}(R,P,S)-NE_{\delta}(S)$ ;

(2)

$\textit{NCMI}_{\delta}(R;P|S)=\textit{NCMI}_{\delta}(P;R|S)$ ;
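
Equation (6) and Proposition 3 can be illustrated on a continuous variant of the XOR problem. The sketch below rests on assumptions made for demonstration: the class label is stored as a 0/1 column and plays the role of $P$ , and $\delta<1$ so that samples with different labels are never neighbors; all names and data are hypothetical.

```python
from math import log2


def granules(X, cols, delta, tau=2):
    """Neighborhood granule of every sample on the attribute subset `cols`."""
    n = len(X)

    def dist(i, j):
        return sum(abs(X[i][c] - X[j][c]) ** tau for c in cols) ** (1.0 / tau)

    return [[j for j in range(n) if dist(i, j) <= delta] for i in range(n)]


def ne(X, cols, delta):
    """Neighborhood (joint) entropy, Eqs. (1)-(2)."""
    n = len(X)
    return -sum(log2(len(g) / n) for g in granules(X, cols, delta)) / n


def ncmi(X, r, p, s, delta):
    """Neighborhood conditional mutual information NCMI(R;P|S), Eq. (6)."""
    n = len(X)
    g_rs, g_ps = granules(X, r + s, delta), granules(X, p + s, delta)
    g_s, g_rps = granules(X, s, delta), granules(X, r + p + s, delta)
    return -sum(log2(len(a) * len(b) / (len(c) * len(d)))
                for a, b, c, d in zip(g_rs, g_ps, g_s, g_rps)) / n


# Columns 0 and 1 are features; column 2 is the class, the XOR of the two
# features after thresholding at 0.5.
X = [[0.1, 0.1, 0],
     [0.1, 0.9, 1],
     [0.9, 0.1, 1],
     [0.9, 0.9, 0]]
delta = 0.2

# Feature 0 alone says nothing about the class ...
nmi_alone = ne(X, [0], delta) + ne(X, [2], delta) - ne(X, [0, 2], delta)
print(nmi_alone)                          # 0.0
# ... but it determines the class once feature 1 is known.
print(ncmi(X, [0], [2], [1], delta))      # 1.0
```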

3. Definitions of feature relevancy, feature redundancy and feature interaction

In this section, we analyze feature relevancy, feature redundancy and feature interaction from different angles and give the corresponding definitions.

John et al. [20] classified features into three disjoint categories, namely strongly relevant, weakly relevant and irrelevant features, from the point of view of probability.

Definition 7 [20] (Strong relevance). A feature $f_{i}$ is strongly relevant if the following condition is met.

$\displaystyle P(D|C)\neq P(D|{C-\{f_{i}\}})$ (7)

Definition 8 [20] (Weak relevance). A feature $f_{i}$ is weakly relevant if $P(D|C)=P(D|{C-\{f_{i}\}})$ and $\exists S\subset C-\{f_{i}\}$ such that

$\displaystyle P(D|{f_{i},S})\neq P(D|S)$ (8)

Definition 9 [20] (Irrelevance). A feature $f_{i}$ is said to be irrelevant if, $\forall S\subseteq C-\{f_{i}\}$ ,

$\displaystyle P(D|{f_{i},S})=P(D|S)$ (9)

Yu and Liu [21] give the definition of feature redundancy from the point of view of Markov blanket.

Definition 10 [21] (Markov blanket). Given a feature $f_{i}$ , let $M_{i}\subset C$ with $f_{i}\notin M_{i}$ . $M_{i}$ is a Markov blanket for $f_{i}$ if and only if

$\displaystyle P(C-M_{i}-\{f_{i}\},D|{f_{i},M_{i}})=P(C-M_{i}-\{f_{i}\},D|{M_{i}})$ (10)

Definition 11 [21] (Redundancy). Given a full set of features $C$ , a feature $f_{i}$ is redundant, if and only if it is weakly relevant and has a Markov blanket in the set $C$ .

From the point of view of information theory, Zeng et al. [3] give alternative definitions of feature relevance, feature redundancy and feature interaction.

Definition 12 [3] (Relevance). Let $C$ be a full set of features, $f_{i}\in C$ , and ${f}^{\prime}=C-\{f_{i}\}$ . Feature $f_{i}$ is relevant to the class $d$ if and only if $\exists S\subseteq{f}^{\prime}$ , such that

$\displaystyle I(f_{i};d|S)>0$ (11)

Definition 13 [3] (Redundancy). Let $C$ be a full set of features, ${F}^{\prime}\subset C$ , and ${F}^{\prime\prime}=C-{F}^{\prime}$ . The features in $C$ are said to be redundant with each other if and only if $\forall{F}^{\prime}\subset C$ ,

$\displaystyle I(C;D)\leqslant I({F}^{\prime};D)+I({F}^{\prime\prime};D)$ (12)

Definition 14 [3] (Interaction). Let $C$ be a full set of features, ${F}^{\prime}\subset C$ , and ${F}^{\prime\prime}=C-{F}^{\prime}$ . The features in $C$ are said to interact with each other if and only if $\forall{F}^{\prime}\subset C$ ,

$\displaystyle I(C;D)\geqslant I({F}^{\prime};D)+I({F}^{\prime\prime};D)$ (13)

Jakulin [22] studies the relationship between features from the point of view of Interaction Gain (IG). The IG of the variables $X$ , $Y$ and $Z$ is given by

$\displaystyle IG(X;Y;Z)=I(X,Y;Z)-I(X;Z)-I(Y;Z)$ (14)

(1)

$IG(X;Y;Z)<0$ . It means that $X$ and $Y$ are redundant.

(2)

$IG(X;Y;Z)=0$ . It means that $X$ and $Y$ are independent.

(3)

$IG(X;Y;Z)>0$ . It means that $X$ and $Y$ are interactive or complementary.
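
The three cases of Eq. (14) can be reproduced on small discrete examples. The sketch below estimates entropies from frequency counts; the function names and the toy variables are assumptions introduced for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def mi(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def interaction_gain(x, y, z):
    """IG(X;Y;Z) = I(X,Y;Z) - I(X;Z) - I(Y;Z), Eq. (14)."""
    return mi(list(zip(x, y)), z) - mi(x, z) - mi(y, z)


x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
xor = [0, 1, 1, 0]       # class is X XOR Y

print(interaction_gain(x, y, xor))   # 1.0  -> X and Y are interactive
print(interaction_gain(x, x, x))     # -1.0 -> a feature and its copy are redundant
```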

Zeng et al. [11] adopt a normalized form of IG to evaluate objectively whether two variables are redundant or interactive with each other. The Normalized Interaction Gain (NIG) of features $f_{i}$ and $f_{j}$ with respect to the class (written $C$ in Eq. (15)) is defined as follows:

$\displaystyle\textit{NIG}(f_{i};f_{j};C)=\frac{1}{2}+\frac{IG(f_{i};f_{j};C)}{2\times[H(f_{i})+H(f_{j})]}$ (15)

(1)

$0<\textit{NIG}(f_{i};f_{j};C)<0.5$ . It means that $f_{i}$ and $f_{j}$ are redundant.

(2)

$\textit{NIG}(f_{i};f_{j};C)=0.5$ . It means that $f_{i}$ and $f_{j}$ are independent.

(3)

$0.5<\textit{NIG}(f_{i};f_{j};C)\leqslant 1$ . It means that $f_{i}$ and $f_{j}$ are interactive.
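
As a worked illustration of Eq. (15), consider the XOR problem mentioned in the introduction, under the assumption (made here only for illustration) that $f_{i}$ and $f_{j}$ are independent uniform binary features and the class is $f_{i}\oplus f_{j}$ . Then $H(f_{i})=H(f_{j})=1$ bit, neither feature alone carries any class information, and the pair determines the class, so $IG(f_{i};f_{j};C)=1-0-0=1$ and

$\displaystyle\textit{NIG}(f_{i};f_{j};C)=\frac{1}{2}+\frac{1}{2\times[1+1]}=0.75$

which falls in the interactive range. If instead $f_{j}$ is an exact copy of $f_{i}$ and the class equals $f_{i}$ , then $IG(f_{i};f_{j};C)=1-1-1=-1$ and

$\displaystyle\textit{NIG}(f_{i};f_{j};C)=\frac{1}{2}+\frac{-1}{2\times[1+1]}=0.25$

which falls in the redundant range.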

4. Proposed feature selection algorithm

In this section, a novel feature selection method considering feature relevance, feature redundancy and feature interaction is introduced within the framework of the neighborhood rough set. In Section 4.1, we give the definitions of feature relevance, feature redundancy and feature interaction in the neighborhood rough set setting. In Section 4.2, we give the pseudocode of the feature selection algorithm and analyze its time and space complexity [23].

4.1 The evaluation criteria of feature correlations

4.1.1 The relevance measure

Neighborhood mutual information is generally used to measure the relevance between features and classes. However, NMI favors features with many values [2]. Inspired by NMI, we propose a new information measure. For a feature $f_{i}\in C$ , the relevance between $f_{i}$ and class label $d$ is measured by the neighborhood symmetrical uncertainty.

$\displaystyle\textit{NSU}_{\delta}(f_{i};d)=2\times\frac{\textit{NMI}_{\delta}(f_{i};d)}{NE_{\delta}(f_{i})+NE_{\delta}(d)}$ (16)

4.1.2 The redundancy measure

Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , where $f_{i}\in C-\textit{red}$ is the current candidate feature and $f_{\textit{sel}}\in\textit{red}$ is a selected feature, the class-independent redundancy between $f_{i}$ and $f_{\textit{sel}}$ is defined as

$\displaystyle\textit{NSU}_{\delta}(f_{i};f_{\textit{sel}})=2\times\frac{\textit{NMI}_{\delta}(f_{i};f_{\textit{sel}})}{NE_{\delta}(f_{i})+NE_{\delta}(f_{\textit{sel}})}$ (17)

In the second step of the proposed method, in order to avoid redundant features within the selected feature set, we select the feature with the lowest redundancy.

4.1.3 The interaction measure

From the angle of information theory, the interaction between features is expressed as the quantity of information that the addition of a new feature contributes to classification when a particular feature is known. In [4], the authors focus on the interaction between the current candidate feature and the features within the remaining candidate feature set. In our work, we consider the same situation.

Given a $\textit{NDS}=(U,C\cup D,V,f,\delta)$ , where $f_{i}\in C-\textit{red}$ is the current candidate feature and $f\in C-\textit{red}-\{f_{i}\}$ is a feature within the remaining candidate feature set, the interaction between $f$ and the decision class $d$ , under the condition that feature $f_{i}$ is known, is defined as

$\displaystyle\textit{NCMI}_{\delta}(f;d|{f_{i}})=-\frac{1}{|U|}\sum\limits_{k=1}^{|U|}{\log_{2}\frac{|{\delta_{\{f\}\cup\{f_{i}\}}(x_{k})}|\cdot|{\delta_{\{d\}\cup\{f_{i}\}}(x_{k})}|}{|{\delta_{\{f_{i}\}}(x_{k})}|\cdot|{\delta_{\{f\}\cup\{f_{i}\}\cup\{d\}}(x_{k})}|}}$ (18)

4.1.4 The objective evaluation criterion

This section combines the feature correlations discussed above, namely the relevance between features and classes and the redundancy and interaction between features [25]. We propose an assessment criterion for the distinguishing ability of a feature or feature subset, that is, how well it separates distinct classes [26]. Accordingly, a novel objective evaluation function is established as follows.

$\displaystyle J(f_{i})=\textit{NSU}_{\delta}(f_{i};d)-\frac{1}{|\textit{red}|}\sum\limits_{f_{\textit{sel}}\in\textit{red}}\textit{NSU}_{\delta}(f_{i};f_{\textit{sel}})+\frac{1}{|{C-\textit{red}}|}\sum\limits_{f\in C-\textit{red}-\{f_{i}\}}\textit{NCMI}_{\delta}(f;d|{f_{i}})$

With this evaluation function, relevance, redundancy and interaction are considered together when choosing a feature. The neighborhood conditional mutual information is used to evaluate the interaction between the current candidate feature and the remaining candidate features, and the neighborhood symmetrical uncertainty is used to measure both the relevance between features and classes and the redundancy between the candidate feature and the selected features. In this way, we aim to ensure the representativeness of the selected features while limiting the loss of useful information.

4.2 Feature selection algorithm

Based on the comprehensive analysis of feature and class relationships in Section 4.1 [27], a feature selection algorithm is presented in Algorithm 1.

Algorithm 1:
Input: A $\textit{NDS}=(U,C\cup D,V,f,\delta)$ with $U=\{x_{1},x_{2},\cdots,x_{n}\}$ and $C=\{f_{1},f_{2},\cdots,f_{m}\}$ .
Output: A reduct feature subset $\textit{red}_{\textit{best}}$ .
1: Selected feature subset $\textit{red}\leftarrow\emptyset$ ;
2: for $i=$ 1 to $|C|$ do
3: Calculate $NR_{\{f_{i}\}}^{\delta}$ ;
4: Get neighborhood matrix $M(N)$ ;
5: end for
6: for each $f_{i}\in C$ do
7: Calculate $\textit{NSU}_{\delta}(f_{i};d)$ ;
8: end for
9: The feature $f_{s}$ with the maximum relevance is selected;
10: $\textit{red}\leftarrow\{f_{s}\}$ ;
11: $C\leftarrow C\backslash\{f_{s}\}$ ;
12: for each $f_{i}\in C$ do
13: for each $f_{\textit{sel}}\in\textit{red}$ do
14: Calculate $\textit{NSU}_{\delta}(f_{i};f_{\textit{sel}})$ ;
15: end for
16: for each $f_{i}^{C}\in C-\textit{red}-\{f_{i}\}$ do
17: Calculate $\textit{NCMI}_{\delta}(f_{i}^{C};d|{f_{i}})$ ;
18: end for
19: Calculate $J(f_{i})$ ;
20: Select the feature $f$ that satisfies $\arg\mathop{\max}\limits_{f_{i}\in C-\textit{red}}J(f_{i})$ ;
21: Update $\textit{red}\leftarrow\textit{red}\cup\{f\}$ ;
22: $C\leftarrow C\backslash\{f\}$ ;
23: end for
24: The best feature subset $\textit{red}_{\textit{best}}$ is selected by using the classifiers;
25: return $\textit{red}_{\textit{best}}$ ;
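
The greedy part of Algorithm 1 (steps 1–23) can be sketched in Python as follows. This is one possible reading of the pseudocode, not the authors' implementation, and it rests on several assumptions made here: steps 12–23 are interpreted as a loop that scores every remaining candidate with $J$ and then moves the best one into red; the class label is stored as an integer column with $\delta<1$ , so samples of different classes are never neighbors; neighborhoods are recomputed on every call instead of being cached as in steps 2–5; and the classifier-based choice of the final subset (step 24) is left out, so the function returns the full ranking. All function names and the toy data are hypothetical.

```python
from math import log2


def granules(X, cols, delta, tau=2):
    """Neighborhood granule of every sample on the attribute subset `cols`."""
    n = len(X)

    def dist(i, j):
        return sum(abs(X[i][c] - X[j][c]) ** tau for c in cols) ** (1.0 / tau)

    return [[j for j in range(n) if dist(i, j) <= delta] for i in range(n)]


def ne(X, cols, delta):
    """Neighborhood (joint) entropy, Eqs. (1)-(2)."""
    n = len(X)
    return -sum(log2(len(g) / n) for g in granules(X, cols, delta)) / n


def nsu(X, a, b, delta):
    """Neighborhood symmetrical uncertainty of columns a and b, Eq. (5)."""
    ea, eb = ne(X, [a], delta), ne(X, [b], delta)
    nmi = ea + eb - ne(X, [a, b], delta)            # Proposition 2 (2)
    return 2.0 * nmi / (ea + eb) if ea + eb > 0 else 0.0


def ncmi(X, r, p, s, delta):
    """NCMI(r; p | s) via Proposition 3 (1)."""
    return (ne(X, [r, s], delta) + ne(X, [p, s], delta)
            - ne(X, [r, p, s], delta) - ne(X, [s], delta))


def objective(X, fi, red, remaining, label, delta):
    """J(f_i): relevance minus mean redundancy plus averaged interaction."""
    relevance = nsu(X, fi, label, delta)
    redundancy = sum(nsu(X, fi, fs, delta) for fs in red) / len(red)
    interaction = sum(ncmi(X, f, label, fi, delta)
                      for f in remaining if f != fi) / len(remaining)
    return relevance - redundancy + interaction


def rank_features(X, features, label, delta):
    """Return the features in the order in which the greedy search selects them."""
    remaining = list(features)
    first = max(remaining, key=lambda f: nsu(X, f, label, delta))   # steps 6-9
    red = [first]                                                   # step 10
    remaining.remove(first)                                         # step 11
    while remaining:                                                # steps 12-23
        best = max(remaining,
                   key=lambda f: objective(X, f, red, remaining, label, delta))
        red.append(best)
        remaining.remove(best)
    return red


# Toy data: columns 0-2 are features scaled to [0, 1]; column 3 is the class.
X = [
    [0.10, 0.10, 0.10, 0],
    [0.12, 0.50, 0.12, 0],
    [0.14, 0.90, 0.90, 0],
    [0.90, 0.12, 0.92, 1],
    [0.92, 0.52, 0.94, 1],
    [0.94, 0.88, 0.50, 1],
]
order = rank_features(X, features=[0, 1, 2], label=3, delta=0.2)
print(order[0])  # 0: feature 0 separates the two classes on its own
```

To complete step 24 under this reading, one would evaluate a classifier on each prefix of `order` and keep the prefix with the highest accuracy.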

Algorithm 1 consists mainly of the following phases [28]. First, the selected feature set is initialized as an empty set, and the neighborhood relation matrix of the samples with respect to each feature $f_{i}$ is obtained in order to calculate the neighborhood information uncertainty measures. In steps 6–8, we calculate the neighborhood symmetrical uncertainty between each feature and the class. The feature with the greatest relevance is selected in step 9; the feature $f_{s}$ is then added to red and removed from $C$ . In steps 12–23, the evaluation function $J$ is calculated for each candidate feature, and the feature with the largest value is chosen in turn. In the final phase, a wrapper approach is used to find the best feature subset, i.e., the one with the most information and distinguishing ability.

We now analyze the time and space complexity of the NSUNCMI algorithm [29]. In steps 2–5, the computational complexity of the neighborhood relation matrices is $O(mn^{2})$ . In steps 6–8, the computational complexity of the relevance is $O(m)$ . In steps 12–18, the computational complexity is $O(m^{3})$ . To summarize, the time complexity of the algorithm is $O(mn^{2}+m^{3})$ , which is $O(mn^{2})$ when $m\leqslant n$ , and the space complexity is $O(n^{2})$ .

5. Experimental results and analysis

5.1 Experiment setup

The following experiments are run on a personal computer with a 2.5 GHz Intel Core CPU, 4 GB of memory, and Windows 7. Each dataset is divided into a training set (70%) and a test set (30%). The proposed algorithm is compared with five other feature selection algorithms on nine real-world datasets [31]. The compared algorithms are five well-known and commonly used methods: MIFS, MRMR, IWFS, Relief-F and FCBF. Among them, Relief-F is available in the WEKA environment and the others are implemented in Python. The SVM classifier is used to select the top k features that produce the highest accuracy. The quality of the feature selection results is assessed using the average classification accuracies of three classifiers, i.e., KNN, SVM and C4.5.

Table 2
Experimental datasets description

No.	Datasets	Samples	Attributes	Class
1	Iris	150	4	3
2	Wine	178	13	3
3	Vehicle	846	18	4
4	Breast cancer	271	162	2
5	WDBC	569	31	2
6	Sonar	208	60	2
7	Credit	690	15	2
8	Glass	214	10	6
9	Colon tumor	62	2000	2

5.2 Datasets

To verify the effectiveness of our method on real-world problems, nine datasets (i.e., Nos. 1–9) from the UCI machine learning repository are used [32]. The overall characteristics of the datasets are given in Table 2. The nine datasets include binary and multi-class classification problems; the number of samples varies from 62 to 846, and the number of candidate features varies from 4 to 2000.

5.3 Experimental results

5.3.1 Number of selected features

For each dataset, we run each of the six feature selection algorithms to obtain the number of selected features. The results are shown in Table 3.

Table 3
Number of features selected with different algorithms

Datasets	Full set	NSUNCMI	MRMR	Relief-F	MIFS	FCBF	IWFS
Iris	4	2	2	2	2	2	2
Wine	13	6	5	8	8	7	8
Vehicle	18	8	15	12	10	7	8
Breast cancer	162	12	13	16	12	15	10
WDBC	31	8	17	10	19	10	12
Sonar	60	9	19	19	14	10	15
Credit	15	10	10	9	13	6	12
Glass	10	5	8	7	6	5	8
Colon tumor	2000	15	15	46	20	13	27
Avg.	257	8	12	14	12	8	11

Table 3 shows that all six feature selection algorithms clearly reduce the data dimension. Compared with MRMR, Relief-F, MIFS, FCBF and IWFS, the proposed NSUNCMI algorithm selects a relatively small number of features: on average it selects 8 features, the same as FCBF and fewer than the other four algorithms.

5.3.2 Accuracy comparison

Classification performance is considered one of the most effective and direct ways to verify the quality of a feature selection algorithm [34]. To avoid the influence of any single dataset and of calculation error, the classification accuracies of the same feature selection algorithm on the different datasets are averaged, as shown in the row named “Avg.” [35]. The comparisons of the average classification accuracies of the various feature selection algorithms on the three classifiers are given in Tables 4–6.

Table 4
Classification accuracy (%) of selected features (SVM)

Datasets	Raw data	NSUNCMI	MRMR	Relief-F	MIFS	FCBF	IWFS
Iris	96.13 $\pm$ 0.40	97.22 $\pm$ 0.39	96.47 $\pm$ 0.43	95.94 $\pm$ 0.55	96.00 $\pm$ 0.60	96.27 $\pm$ 0.35	95.82 $\pm$ 0.37
Wine	98.33 $\pm$ 2.24	97.50 $\pm$ 1.86	96.63 $\pm$ 0.47	84.31 $\pm$ 5.63	98.54 $\pm$ 0.37	98.66 $\pm$ 3.07	98.37 $\pm$ 0.69
Vehicle	50.00 $\pm$ 1.94	54.15 $\pm$ 2.74	47.20 $\pm$ 1.28	45.83 $\pm$ 2.84	52.66 $\pm$ 1.79	50.56 $\pm$ 2.75	50.93 $\pm$ 1.95
Breast	86.01 $\pm$ 0.85	86.43 $\pm$ 0.97	85.96 $\pm$ 1.70	86.43 $\pm$ 1.35	86.32 $\pm$ 1.27	87.21 $\pm$ 0.86	83.81 $\pm$ 0.53
WDBC	94.91 $\pm$ 3.25	98.05 $\pm$ 2.30	97.80 $\pm$ 0.16	97.02 $\pm$ 1.66	97.72 $\pm$ 0.14	94.51 $\pm$ 1.66	97.66 $\pm$ 0.17
Sonar	75.48 $\pm$ 1.20	82.44 $\pm$ 1.49	77.40 $\pm$ 0.91	72.71 $\pm$ 5.80	78.70 $\pm$ 1.02	79.58 $\pm$ 6.37	75.48 $\pm$ 1.28
Credit	77.15 $\pm$ 1.13	82.76 $\pm$ 1.27	79.67 $\pm$ 1.44	82.48 $\pm$ 2.25	79.88 $\pm$ 1.33	86.34 $\pm$ 2.04	85.51 $\pm$ 0.00
Glass	64.84 $\pm$ 0.84	85.38 $\pm$ 1.33	64.95 $\pm$ 0.93	82.80 $\pm$ 0.58	65.00 $\pm$ 0.64	64.25 $\pm$ 4.06	82.80 $\pm$ 0.98
Colon tumor	76.10 $\pm$ 7.57	93.44 $\pm$ 7.27	93.42 $\pm$ 6.32	90.11 $\pm$ 8.54	84.65 $\pm$ 5.20	89.84 $\pm$ 8.27	93.37 $\pm$ 7.83
Avg.	79.88 $\pm$ 2.16	86.37 $\pm$ 2.18	82.17 $\pm$ 1.52	81.96 $\pm$ 3.24	82.16 $\pm$ 1.37	83.02 $\pm$ 3.27	84.86 $\pm$ 1.53

Table 5

Classification accuracy (%) of selected features (KNN)

Datasets	Raw data	NSUNCMI	MRMR	Relief-F	MIFS	FCBF	IWFS
Iris	95.40 $\pm$ 0.36	97.78 $\pm$ 0.32	96.07 $\pm$ 0.55	96.63 $\pm$ 0.73	95.67 $\pm$ 0.45	96.00 $\pm$ 0.63	97.37 $\pm$ 0.45
Wine	96.04 $\pm$ 4.33	98.68 $\pm$ 0.69	96.47 $\pm$ 3.93	95.56 $\pm$ 0.73	94.98 $\pm$ 3.78	89.66 $\pm$ 0.80	95.05 $\pm$ 0.73
Vehicle	63.39 $\pm$ 2.27	64.72 $\pm$ 2.07	64.95 $\pm$ 1.99	65.48 $\pm$ 2.73	62.49 $\pm$ 0.98	60.97 $\pm$ 2.48	63.93 $\pm$ 2.26
Breast	77.45 $\pm$ 0.91	82.45 $\pm$ 0.79	85.12 $\pm$ 1.10	77.75 $\pm$ 1.51	83.57 $\pm$ 0.80	74.55 $\pm$ 1.09	78.83 $\pm$ 0.57
WDBC	95.31 $\pm$ 0.21	97.27 $\pm$ 0.22	95.89 $\pm$ 0.29	95.75 $\pm$ 0.25	96.20 $\pm$ 0.30	96.22 $\pm$ 0.22	95.36 $\pm$ 0.19
Sonar	83.56 $\pm$ 1.48	87.99 $\pm$ 1.29	86.54 $\pm$ 0.53	87.16 $\pm$ 0.57	84.30 $\pm$ 1.10	79.86 $\pm$ 0.69	87.36 $\pm$ 0.53
Credit	66.38 $\pm$ 0.48	77.92 $\pm$ 0.53	65.67 $\pm$ 0.98	81.04 $\pm$ 1.15	67.04 $\pm$ 0.36	82.13 $\pm$ 1.00	81.62 $\pm$ 0.35
Glass	71.25 $\pm$ 0.69	98.72 $\pm$ 1.27	72.94 $\pm$ 1.11	98.08 $\pm$ 0.33	71.47 $\pm$ 1.62	98.13 $\pm$ 0.38	92.71 $\pm$ 0.56
Colon tumor	76.25 $\pm$ 6.39	84.94 $\pm$ 4.52	84.74 $\pm$ 2.36	77.46 $\pm$ 5.90	79.55 $\pm$ 5.25	82.69 $\pm$ 6.28	84.44 $\pm$ 5.73
Avg.	80.56 $\pm$ 1.90	87.83 $\pm$ 1.30	83.15 $\pm$ 1.43	86.10 $\pm$ 1.54	81.70 $\pm$ 1.63	84.47 $\pm$ 1.51	86.30 $\pm$ 1.26

Table 6

Classification accuracy (%) of selected features (C4.5)

Datasets	Raw data	NSUNCMI	MRMR	Relief-F	MIFS	FCBF	IWFS
Iris	96.13 $\pm$ 0.40	98.20 $\pm$ 0.55	96.47 $\pm$ 0.43	95.78 $\pm$ 0.72	96.00 $\pm$ 0.60	96.27 $\pm$ 0.35	97.28 $\pm$ 0.65
Wine	90.16 $\pm$ 2.74	96.47 $\pm$ 3.08	95.19 $\pm$ 2.53	93.94 $\pm$ 4.10	91.39 $\pm$ 1.04	89.89 $\pm$ 3.07	91.12 $\pm$ 1.15
Vehicle	70.59 $\pm$ 2.44	70.97 $\pm$ 1.93	71.83 $\pm$ 2.56	72.25 $\pm$ 2.53	73.21 $\pm$ 2.94	75.56 $\pm$ 2.75	75.66 $\pm$ 1.47
Breast	86.01 $\pm$ 0.85	90.04 $\pm$ 1.22	85.96 $\pm$ 1.70	86.43 $\pm$ 1.35	86.32 $\pm$ 1.27	87.21 $\pm$ 0.86	82.64 $\pm$ 0.78
WDBC	94.91 $\pm$ 3.25	97.91 $\pm$ 2.59	97.80 $\pm$ 0.16	92.92 $\pm$ 0.67	97.72 $\pm$ 0.14	94.51 $\pm$ 1.66	91.97 $\pm$ 0.83
Sonar	75.48 $\pm$ 1.20	82.55 $\pm$ 3.84	77.40 $\pm$ 0.91	79.18 $\pm$ 5.64	78.70 $\pm$ 1.02	76.58 $\pm$ 6.37	71.06 $\pm$ 2.47
Credit	85.83 $\pm$ 1.36	87.27 $\pm$ 0.92	85.51 $\pm$ 1.04	85.33 $\pm$ 1.24	81.39 $\pm$ 0.80	84.77 $\pm$ 1.50	81.62 $\pm$ 0.76
Glass	68.36 $\pm$ 0.25	96.65 $\pm$ 0.64	98.32 $\pm$ 0.31	98.88 $\pm$ 0.31	98.74 $\pm$ 1.87	68.09 $\pm$ 5.17	98.18 $\pm$ 0.49
Colon tumor	90.00 $\pm$ 7.92	93.50 $\pm$ 8.58	93.42 $\pm$ 7.82	90.00 $\pm$ 8.55	90.44 $\pm$ 7.84	90.21 $\pm$ 7.85	93.37 $\pm$ 7.83
Avg.	84.16 $\pm$ 2.27	90.40 $\pm$ 2.59	89.10 $\pm$ 1.94	88.30 $\pm$ 2.79	88.21 $\pm$ 1.95	84.79 $\pm$ 3.29	86.99 $\pm$ 1.83

From the comparisons in Tables 4–6, we can draw the following conclusions [36].

(1) Compared with the raw data, the classification performance of the NSUNCMI algorithm improves markedly. Relative to the raw-data averages, the average classification accuracy of the proposed algorithm improves by 8.12%, 9.02% and 7.41% on SVM, KNN and C4.5, respectively.

(2) On the whole, the proposed NSUNCMI algorithm performs better than the other feature selection algorithms on most datasets. For instance, when C4.5 is used as the test classifier, NSUNCMI achieves the best classification performance on seven of the nine datasets.
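The two figures quoted above can be re-derived from the tables. The following minimal sketch assumes that the improvement is measured relative to the raw-data average and that a "best" result means a strictly higher mean accuracy. The helper name `relative_gain` is illustrative and not part of the paper.

```python
# Avg. rows of Tables 4-6: (raw data, NSUNCMI) mean accuracy in percent.
avg_accuracy = {
    "SVM": (79.88, 86.37),
    "KNN": (80.56, 87.83),
    "C4.5": (84.16, 90.40),
}

def relative_gain(raw, selected):
    """Relative improvement of `selected` over `raw`, in percent (assumed definition)."""
    return 100.0 * (selected - raw) / raw

gains = {clf: round(relative_gain(raw, sel), 2)
         for clf, (raw, sel) in avg_accuracy.items()}
print(gains)  # {'SVM': 8.12, 'KNN': 9.02, 'C4.5': 7.41}

# Table 6 (C4.5) mean accuracies: NSUNCMI, then [MRMR, Relief-F, MIFS, FCBF, IWFS].
c45 = {
    "Iris": (98.20, [96.47, 95.78, 96.00, 96.27, 97.28]),
    "Wine": (96.47, [95.19, 93.94, 91.39, 89.89, 91.12]),
    "Vehicle": (70.97, [71.83, 72.25, 73.21, 75.56, 75.66]),
    "Breast": (90.04, [85.96, 86.43, 86.32, 87.21, 82.64]),
    "WDBC": (97.91, [97.80, 92.92, 97.72, 94.51, 91.97]),
    "Sonar": (82.55, [77.40, 79.18, 78.70, 76.58, 71.06]),
    "Credit": (87.27, [85.51, 85.33, 81.39, 84.77, 81.62]),
    "Glass": (96.65, [98.32, 98.88, 98.74, 68.09, 98.18]),
    "Colon tumor": (93.50, [93.42, 90.00, 90.44, 90.21, 93.37]),
}

# Datasets on which NSUNCMI has the strictly highest mean accuracy.
wins = [name for name, (ours, others) in c45.items() if ours > max(others)]
print(len(wins), wins)
```

The comparison uses mean accuracies only. It ignores the reported standard deviations, so close results such as Colon tumor (93.50 versus 93.42) count as wins here.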

For feature ranking algorithms, one way to examine the effectiveness of the selected features is to add features to the learner one by one, in the order in which they were selected [3]. In this paper, six datasets with different numbers of selected features are tested on the SVM, KNN and C4.5 classifiers. Figure 1 shows the average classification accuracy on the six datasets [37].

Figure 1.

Average classification accuracy versus different number of selected features.

From careful observation and analysis of Fig. 1, the proposed NSUNCMI algorithm obtains better results than the other five feature selection algorithms [38]. For instance, the highest average accuracies achieved by NSUNCMI on Iris, Wine, Glass and WDBC are 98.20%, 98.68%, 98.72% and 98.05%, respectively. On the whole, the classification accuracy on each dataset increases slowly as the number of features grows, and finally reaches a peak and tends to be stable.
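The incremental evaluation protocol behind Fig. 1 can be sketched as follows. This is an illustration under stated assumptions, not the experimental code of the paper: a leave-one-out 1-nearest-neighbour classifier stands in for SVM, KNN and C4.5, and the toy data `X`, the labels `y` and the feature order `ranking` are hypothetical.

```python
import math

def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-NN classifier restricted to `features`."""
    n = len(X)
    correct = 0
    for i in range(n):
        xi = [X[i][f] for f in features]
        # Nearest other sample in the selected feature subspace.
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(xi, [X[j][f] for f in features]),
        )
        correct += int(y[nearest] == y[i])
    return correct / n

def accuracy_curve(X, y, ranking):
    """Accuracy after adding the ranked features one by one."""
    return [loo_1nn_accuracy(X, y, ranking[:k])
            for k in range(1, len(ranking) + 1)]

# Hypothetical toy data: three features per sample, two classes.
X = [
    [0.00, 0.2, 0.5],
    [0.10, 0.1, 0.5],
    [0.55, 0.0, 0.5],
    [0.45, 1.0, 0.5],
    [0.90, 0.9, 0.5],
    [1.00, 0.8, 0.5],
]
y = [0, 0, 0, 1, 1, 1]
ranking = [0, 1, 2]  # assumed output of some feature ranking algorithm

curve = accuracy_curve(X, y, ranking)
print([round(a, 2) for a in curve])  # [0.67, 1.0, 1.0]
```

On this toy input the accuracy rises when the second feature is added and then stays flat, which mirrors the "reaches a peak and tends to be stable" behaviour described for Fig. 1.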

5.3.3 Computational time

Computation time is also an important measure for evaluating algorithms [39]. It is well known that MRMR is a feature selection method with high computational complexity. We therefore contrast the running time of NSUNCMI with that of MRMR and IWFS. Table 7 compares the running time of the three feature ranking algorithms on the nine datasets.

Table 7
Running time of three algorithms

Datasets	Samples	Attributes	NSUNCMI	MRMR	IWFS
Iris	150	4	0.0072	0.0094	0.0086
Wine	178	13	0.0309	0.0452	0.0265
Vehicle	846	18	0.0472	0.0608	0.0671
Breast	271	162	1.0461	1.2320	1.2257
WDBC	569	31	4.0310	5.2218	4.6270
Sonar	208	60	1.0110	0.0733	0.1029
Credit	690	15	0.0738	0.0964	0.0742
Glass	214	10	0.0244	0.0752	0.0586
Colon tumor	62	2000	0.3524	0.3830	0.2510

From Table 7, we find that the running time of NSUNCMI on Iris, Vehicle, Breast, WDBC, Credit and Glass is lower than that of the other two algorithms. On the remaining three datasets NSUNCMI is not the fastest: it is slower than IWFS on Wine and Colon tumor, and slower than both MRMR and IWFS on Sonar. These running times are nevertheless within an acceptable range.
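The per-dataset comparison in Table 7 can be checked with a short script. The values are copied from the table; the time unit is not stated in the source, so the sketch only compares magnitudes.

```python
# Table 7 running times per dataset: (NSUNCMI, MRMR, IWFS); unit not stated.
times = {
    "Iris": (0.0072, 0.0094, 0.0086),
    "Wine": (0.0309, 0.0452, 0.0265),
    "Vehicle": (0.0472, 0.0608, 0.0671),
    "Breast": (1.0461, 1.2320, 1.2257),
    "WDBC": (4.0310, 5.2218, 4.6270),
    "Sonar": (1.0110, 0.0733, 0.1029),
    "Credit": (0.0738, 0.0964, 0.0742),
    "Glass": (0.0244, 0.0752, 0.0586),
    "Colon tumor": (0.3524, 0.3830, 0.2510),
}
names = ("NSUNCMI", "MRMR", "IWFS")

# Fastest algorithm on each dataset.
fastest = {d: names[t.index(min(t))] for d, t in times.items()}

# Datasets where NSUNCMI is fastest, and where it is slower than both others.
nsuncmi_fastest = [d for d, a in fastest.items() if a == "NSUNCMI"]
nsuncmi_slowest = [d for d, t in times.items() if t[0] == max(t)]

print(nsuncmi_fastest)
print(nsuncmi_slowest)
```

Only Sonar places NSUNCMI behind both competitors. On Wine and Colon tumor it falls between IWFS and MRMR.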

6. Conclusions and future work

The main purpose of feature selection is to select feature subsets that can represent the original data [40]. In this article, we introduced a new feature selection algorithm that not only removes irrelevant and redundant features, but also takes feature interaction into account [30]. First, inspired by mutual information, a new information measure, neighborhood symmetric uncertainty, is proposed in the neighborhood rough set setting. Then, considering relevance, redundancy and interaction together, we put forward an evaluation criterion for feature selection. Based on this criterion, we present the NSUNCMI feature selection algorithm. Experimental results on the nine datasets show that NSUNCMI is able to effectively identify relevant features while eliminating redundant features and preserving interactive features. The problem of feature interaction remains interesting and can be discussed further in future work.

In fact, feature selection has been widely used in data mining, image annotation and data analysis [41]. As an important data preprocessing method, feature selection remains an active research topic. Open questions include how to find better information measures, how to design more effective and reasonable algorithms, and how to learn the information shared by four or more features and use N-way interaction to select more relevant and interactive features. Many complex problems remain to be solved [42]; they will be the subject of our next stage of research.

Acknowledgments

We thank Jingjing Hu, Jing Yu and another person for their contributions to this paper. This research is supported by the Software Project Development Contract (No. 13028001).

References

1. Gao, Zhao, Zhang and Wang, Feature selection considering two types of feature relevancy and feature interdependency, Expert Systems with Applications 93 (2018), 423–434.

2. Lin, Ren, Luo and Qi, A new feature selection method based on symmetrical uncertainty and interaction gain, Biology and Chemistry 83 (2019), 107149.

3. Zeng, Zhang and Yi, A novel feature selection method considering feature interaction, Pattern Recognition 48 (2015), 2656–2666.

4. Wan, Chen, Yuan, Yang and Sang, A novel hybrid feature selection method considering feature interaction in neighborhood rough set, Knowledge-Based Systems 227 (2021), 107167.

5. Liu and Wu, Neighborhood rough set based heterogeneous feature subset selection, Information Sciences 178 (2008), 3577–3594.

6. Gao and Wu, Relevance assignation feature selection method based on mutual information for machine learning, Knowledge-Based Systems (2020), 106439.

7. Wang, Jiang and Jiang, A feature selection method via analysis of relevance, redundancy, and interaction, Expert Systems with Applications 183 (2021), 115365.

8. Kira, L. and Rendell, The feature selection problem: Traditional methods and a new algorithm, in: Proceedings of Ninth National Conference on Artificial Intelligence, 1992, pp. 129–134.

9. and I, Theoretical and Empirical Analysis of Relief-F and RRelief-F, Machine Learning 53 (2003), 23–69.

10. Kira and Rendell L.A., The feature selection problem: Traditional methods and a new algorithm, in: Proceedings of Ninth National Conference on Artificial Intelligence, 1992, pp. 129–134.

11. Dash and Liu, Consistency-based search in feature selection, Artificial Intelligence 151 (2003), 155–176.

12. Roberto, Using mutual information for selecting features in supervised neural net learning, IEEE Transactions on Neural Networks 5 (1994), 537–550.

13. Peng, Long and Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005), 1226–1238.

14. Mohamed, Yulia and Rossitza, Feature selection using Joint Mutual Information Maximisation, Expert Systems with Applications 42 (2015), 8520–8532.

15. Wang, Wei, Yang and Wang, Feature selection by maximizing independent classification information, IEEE Transactions on Knowledge and Data Engineering 29 (2017), 828–841.

16. Fleuret, Fast binary feature selection with conditional mutual information, Journal of Machine Learning Research 5 (2004), 1531–1555.

17. Bennasar, Setchi and Hicks, Feature interaction maximization, Pattern Recognition Letters 34 (2013), 1630–1635.

18. Lewis D.D., Feature Selection and Feature Extraction for Text Categorization, in: Proceedings of the Workshop on Speech and Natural Language, 1992, pp. 212–217.

19. Nakariyakul, High-dimensional hybrid feature selection using interaction information-guided search, Knowledge-Based Systems 145 (2018), 59–66.

20. John G.H., Kohavi and Pfleger, Irrelevant features and the subset selection problem, in: Proceedings of the Eleventh International Conference on Machine Learning, 1994, pp. 121–129.

21. and Liu, Efficient feature selection via analysis of relevance and redundancy, The Journal of Machine Learning Research 5 (2004), 1205–1224.

22. Jakulin, Attribute Interactions in Machine Learning, PhD Thesis, University of Ljubljana, 2015.

23. Zhou, Zhang, Zhou, Guo and Ma, A feature selection algorithm of decision tree based on feature weight, Expert Systems with Applications 164 (2021), 113842.

24. Wang, Jiang and Jiang, A feature selection method via analysis of relevance, redundancy, and interaction, Expert Systems with Applications 183 (2021), 115365.

25. Liu, Huang, Jiang and Zeng, Quick attribute reduct algorithm for neighborhood rough set model, Information Sciences 271 (2014), 65–81.

26. Wan, Chen, Yang and Sang, Dynamic interaction feature selection based on fuzzy rough set, Information Sciences 581 (2021), 891–911.

27. Eric C.C. Tsang, Guo, Chen and Xu, A novel approach to attribute reduction based on weighted neighborhood rough sets, Knowledge-Based Systems 220 (2021), 106908.

28. Sun, Zhang, Qian and Zhang, Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification, Information Sciences 502 (2019), 18–41.

29. Apolloni, Leguizamon and Alba, Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments, Applied Soft Computing 38 (2015), 922–932.

30. Yuan, Chen, Sang and Luo, Unsupervised attribute reduction for mixed data based on fuzzy rough sets, Information Sciences 572 (2021), 67–87.

31. Chen, Fan and Luo, Feature selection for imbalanced data based on neighborhood rough sets, Information Sciences 483 (2019), 1–20.

32. Fan, Zhao, Wang and Huang, Attribute reduction based on max-decision neighborhood rough set model, Knowledge-Based Systems 151 (2018), 16–23.

33. Guo, Wei and He, Spatial-domain steganalytic feature selection based on three-way interaction information and KS test, Soft Computing, 2019.

34. Jiang and Wang, Efficient feature selection based on correlation measure between continuous and discrete features, Information Processing Letters 116 (2016), 203–215.

35. Yang, Bai, Zhang and Deng, Maximum relevance minimum common redundancy feature selection for nonlinear data, Information Sciences 409 (2017), 68–86.

36. Dai and Chen, Feature selection via normative fuzzy information weight with application into tumor classification, Applied Soft Computing 92 (2020), 106299.

37. and Xie, Information-preserving hybrid data reduction based on fuzzy-rough techniques, Pattern Recognition Letters 27 (2006), 414–423.

38. Nie, Yang, Zhang and Li, A general framework for auto-weighted feature selection via global redundancy minimization, IEEE Transactions on Image Processing 28 (2019), 2428–2438.

39. Tang, Dai and Xiang, Feature selection based on feature interactions with application to text categorization, Expert Systems with Applications 120 (2018), 207–216.

40. Zhang, Mei, Chen and Li, Feature selection in mixed data: A method using a novel fuzzy rough set-based information entropy, Pattern Recognition 56 (2016), 1–15.

41. Zhang, Chen, Zhang and Wen, New uncertainty measurement for categorical data based on fuzzy information structures: An application in attribute reduction, Information Sciences 580 (2021), 541–577.

42. Lin, Liu and Wu, Streaming feature selection for multi-label learning based on fuzzy mutual information, IEEE Transactions on Fuzzy Systems 25 (2017), 1491–1507.
