Abstract
Feature interaction is crucial in the process of feature selection. In this paper, a grouping feature selection method based on feature interaction (GFS-NPIS) is proposed. Firstly, a new evaluation function measuring feature interaction is proposed. Secondly, a grouping strategy based on approximate Markov blanket is used to remove strong redundant features. Lastly, a new feature selection method called as GFS-NPIS is given. In order to verify the effectiveness of our method, we compare GFS-NPIS with other eight representative ones on three classifiers (SVM, KNN and CART). The experimental results on fifteen public data sets show that GFS-NPIS outperforms others in terms of classification accuracy and Macro-F1.
Introduction
Feature selection is an important step in data mining and pattern recognition [1]. There are a large number of irrelevant and redundant features existing in data for practical applicating. These irrelevant and redundant features not only increase the computational cost but also have a negative impact on learning models [2]. As a result, it is necessary to identify and remove them. And the feature selection method based on them are more urgent.
Nowadays, in the field of feature selection, most scholars focus on the topic of feature evaluation function and search strategy [3]. For feature evaluation function, there are two common relationships among features, which are relevance and redundancy. Relevance is usually determined by the relationship between a feature and class label, whereas redundancy is determined by the relationship between two features [4]. Apart from relevance and redundancy, feature interaction is also used now. Among massive of feature selection algorithms, most of them only consider feature relevance and feature redundancy, but seldomly consider feature interaction [5]. As a matter of fact, different features may interact and collaborate together. Hence, feature interaction is very important and inescapable in the process of feature selection. If feature interaction is ignored, some very important features will be misjudged as redundant features removal. Therefore, this paper aims to propose an efficient interactive measurement criterion, which can be used in the process of removing redundant features to keep some redundant but interactive features and prevent important features from being removed, thus improving the final classification accuracy.
In the topic of search strategy, most feature methods use greedy ones. Obviously, they are computationally expensive [6]. For this problem. Yu and Liu proposed FCBF [7] method that selects predominant features and removes those highly correlated features through an approximate Markov blanket search strategy. Compared with greedy strategy, approximate Markov blanket is more efficient in execution time.
In this paper, we consider feature interaction into feature selection and use approximate Markov blanket as our search strategy. To sum up, our contributions are as follows. (1) A new concept, Normalized Pure Interaction Score (NPIS), is proposed to measure feature interaction; (2) Feature interaction is considered in removing strong redundant features by approximate Markov blanket search strategy; (3) A new feature selection method, A grouping feature selection method based on NPIS (GFS-NPIS), is proposed.
The rest of the paper is arranged as follows. Section 2 reviews the related works. Section 3 introduces the background theory involved. Section 4 introduces the proposed method in details. Section 5 is the experiments and the analysis of the experimental results. Section 6 is the summary and our future work.
Related works
Suppose we have a data set
Mutual Information Maximum (MIM) [8] algorithm is the earliest feature selection method using mutual information (MI). MIM uses
In terms of redundancy problems, many scholars have studied how to identify the redundancy features more accurately. In 2005, Peng and Ding proposed minimum Redundancy Maximum Relevance (mRMR) [9]. In mRMR, they not only used
Based on mRMR, Conditional Informative Feature Extraction (CIFE) [10] and Feature Selection Based on Joint Mutual Information (JMI) [11] are proposed. In these two ones, the redundancy is further refined to the intra-class feature redundancy. And JMI uses
The above-mentioned algorithms utilize the different evaluation index to identify relevant features and redundant features, but they have two limitations. (1) They are computationally expensive with greedy search strategies; (2) They ignore the feature interaction among different features.
In order to solve the first problem, many researchers have done a lot of work. For example, Yu and Liu proposed Fast Relevance-Based Filter (FCBF) [7] method that determines the selected feature set through an approximate Markov blanket search strategy. FCBF uses
In terms of feature interaction, more and more scholars have studied feature interaction to improve the classification accuracy. In 2020, a new feature selection algorithm based on relevance, redundancy and complementarity (FSRRC) [12] was proposed by Chao Li et al. FSRRC not only has the same search strategy as FCBF in relevancy and redundancy but also considers feature interaction in feature selection. FSRRC uses Eq. (9). to identify the interactive features.
Besides FSRRC, a filter-based feature selection by feature grouping (FFSG) [13] was proposed in 2020. FFSG uses strong approximate Markov blanket to remove redundant features. In FFSG, the redundant features are identified and removed by Eq. (10) firstly. Then, the interactive features are identified and retained by Eq. (11). Finally, the irrelevant features are removed by an evaluation function which are shown as Eq. (12). From Eq. (12), it can be seen that FFSG only considers the change of information between two features, but ignores the influence of the class label on feature interaction.
Yamada et al. [14] proposed HSIC Lasso algorithm a kernel-based mRMR, which is a non-linear feature selector. Instead of mutual information, HSIC Lasso employs the HSIC [15] to measure dependency between two variables. In addition, it uses a penalty term to select a small number of features. This results in a convex optimization problem, for which one can therefore find a globally optimal solution. HSIC Lasso considers a feature-wise kernelized Lasso for capturing nonlinear input-output dependency [16]. HSIC of two variables
Where,
Information theory
Entropy, mutual information and conditional mutual information are the most popular metrics in feature selection. Entropy is used to evaluate the uncertainty of a random variable. The entropy of a random variable
Where
For any two random variables, the concept of entropy can be extended to joint entropy and conditional entropy. Joint entropy of two random variables
where
As is known to all, mutual information (MI) is widely applied in feature selection. MI can be used to measure the degree of the relevance between two random variables and how much information one variable contains about another one. MI of two variables
Here,
In fact, MI tends to favor features with more values. Therefore, Symmetrical Uncertainty (SU) is introduced in many studies. SU is the normalized MI, which is defined as follows [19].
Conditional Mutual Information (CMI) is an extension of MI for measuring the conditional dependence between two variables given a third variable [20]. CMI of two random variables
In this paper, we use a grouping strategy to remove strong redundant features. There are three related concepts about grouping strategy, which are approximate Markov blanket, predominant feature and predominant group.
To understand approximate Markov blanket better, it is required to learn about the concept of Markov blanket. Markov blanket is proposed by Koller and Sahami [21]. Markov blanket is defined as follows.
Here,
MB is calculated by the feature subset, and the involved calculation amount is relatively large. Therefore, a method called as approximate Markov blanket is proposed by Yu and Liu [7]. Approximate Markov blanket is defined as follows.
Based on Approximate Markov blanket, predominant feature is defined as follows.
Detecting all correlated subsets is a combinatorial problem since it requires to take into account
By grouping relevant features, the feature space can be split into some groups so that each group is composed of redundant features. Predominant features are relevant and not redundant, so that each group
Problem
In Section 2 (Related works), we introduced seven feature selection algorithms. MIM uses the mutual information to measure feature relevance. The larger the value of mutual information, the higher the correlation between features and categories. However, MIM doesn’t consider feature redundancy and feature interactivity, which leads to a large number of redundant features in the selected feature set, and the final classification accuracy is very low.
Compared with MIM, the four algorithms (mRMR, CIFE, JMI, FCBF) not only consider feature correlation but also consider feature redundancy. They used different evaluation functions to evaluate the importance of features, but ignored feature interaction. Feature interaction means that two features working together could provide more information than the sum of their individual information. Even if two features are redundant, their combination may be more informative, and their cooperation may provide additional information in some cases. Therefore, neglecting the interaction between features may lose some meaningful information. The loss of important features will eventually reduce the final classification accuracy.
The above algorithms do not consider feature interaction, and there are two algorithms (FSRRC, FFSG) that consider feature interaction. In FSRRC, Eq. (9) is used to evaluate feature interaction. Equation (9) uses CMI to subtract the information difference to evaluate feature interaction, which has a problem of multi-value tendency. FFSG, they used Eq. (12) to identify the interactive features. From Eq. (19), it can be seen that FFSG only considers the change of information between two features, but ignores the influence of the class label on feature interaction.
Generally speaking, the current feature selection methods either don’t consider feature interaction or the evaluation of feature interaction are too single and need to be further improved. Therefore, this paper aims to study feature interaction in the process of feature selection. Taking feature interaction into the process of removing redundant features can prevent some important features from being misjudged as redundant features.
Method
It is well known that feature selection aims to select the features that are highly relevant with the category. Thus, the influence of class label on feature interaction can’t be ignored. And an effective feature interaction criterion is very important for feature interaction. After analyzing Eq. (9), we know that CMI subtracts the information difference can measure feature interaction. However, they ignored the difference of relevancy between two features in Eq. (9). In our view, if the difference value of relevancy between two features is smaller than the value of CMI, we think that
Without considering the information difference between two features of
Here,
In the proposed method, NPIS is used to measure feature interaction and feature interaction is considered in the process of removing redundancy features. Therefore, we also put forward a concept of strong redundant feature. Strong redundant feature can be defined as follows.
Flow chart of GFS-NPIS.
When a candidate feature
Compared with FSRRC and FFSG, NPIS measures feature interaction more comprehensively to retain important interactive features and improve the final classification accuracy.
In this section, we propose a new feature selection method called as GFS-NPIS. Figure 1 shows the flow chart of the proposed method and the detail pseudo-code of the proposed method is shown as in Table 1 (Algorithm1). GFS-NPIS algorithm mainly consists of three steps. Step1 pre-filters and sorts the features to determine the search sequence (Algorithm1, lines 1–7). Step2 uses NPIS to identify interactive features and uses a grouping tactics to remove the strong redundant features by approximate Markov blanket (Algorithm1, lines 8–21). Step3 is to remove irrelevant features by an evaluation function (Algorithm1, lines 22–30).
Pseudo-code of GFS-NPIS
Pseudo-code of GFS-NPIS
In Algorithm1,
Experiment setup
We build and train the model using three popular classifiers, which are Support Vector Machine(SVM) [23], K-Nearest Neighbor(KNN) [24] and Classification and regression tree(CART) [25]. SVM adopts the linear kernel. The number of the neighbors is set to be 3 as KNN classifier. Decision tree classifier chooses Gini coefficient as the measurement index.
Before feature selection begins, we preprocess the data sets. Concretely, we standardize the original data [26] and discrete the continuous data into 10 equal size bins [27]. In addition, we adopt the following strategy in order to ensure the fairness of the experiment in GFS-NPIS. If the number of the selected features is less than 50, the contrast algorithms will choose the same number as GFS-NPIS. If not, all algorithms set the number of selected features as 50. Finally, fifteen data sets and the ten-fold cross method are used to verify the performance of GFS-NPIS. Besides these, it is worth noting that we use classification accuracy and Macro-F1 to evaluate the feature subsets selected by different methods [28].
Data sets
Experiments were performed on fifteen public data sets, which were selected from the public University of California Irvine (UCI) Repository and Kaggle website [29]. The detail descriptions of the data sets are shown in Table 2, in which they are sorted in ascending order according to the number of features. The number of samples from 72 to 9298, the number of features ranges from 60 to 7070 and the number of class label ranges from 2 to 40.
Description of data sets
Description of data sets
The column named “Abbr” is the abbreviation of data set name. The column named “Missing value?” can help us clearly know which data set has missing values. For the missing values, we used the average value of the column where the missing values are located to fill up the missing values. The column named “Unbalance ratio” is an indicator for evaluating the degree of data set balance. Unbalance Ratio can be calculated as follows.
Where,
To verify the effectiveness of GFS-NPIS, eight popular feature selection algorithms (FFSG, FSRRC, FCBF, mRMR, JMI, MIM, CIFE and HSIC-Lasso) are used to contrast. In the experiment results, as shown in Tables 4–5, the maximum value of each row is marked by bold fonts and red color. And if the accuracy gap between two methods is within 0.5, the two algorithms will be supposed to be same performance. The row in the table “Avg” indicates the average value of the corresponding algorithm on all data sets. Finally, we use a Kolmogorov-Smirnov-test [30] to make the test. And in the experimental results, ‘
Accuracy assessment
Table 4 shows the average classification accuracy of nine algorithms under SVM classifier. From Table 4, we can see that GFS-NPIS only does not achieve the highest accuracy on two data sets (WarPIE10P and Lung). Compared with FFSG, GFS-NPIS is better than FFSG on all 15 data sets, especially on Lung_discrete, Musk-1 and Isolet. The reason is that FFSG ignores the influence of the category
Table 4 shows the classification accuracy under KNN classifier. Under KNN, GFS-NPIS does not achieve the highest on four data sets (USPS, Yale, Lung, RELATHE). And we can conclude that the algorithm can obtain different accuracy under different classifiers (USPS and REALTHE), GFS-NPIS achieves the highest accuracy under SVM. But it is inferior to FSRRC and FCBF under KNN.
Table 5 shows the classification accuracy under CART classifier. We can see that the average classification accuracy of our method (GFS-NPIS) is obviously higher than those of other eight algorithms.
Figure 2 is a histogram of the average classification accuracy of nine algorithms under three classifiers (SVM, KNN, CART). It can be clearly seen that GFS-NPIS performs better than other eight algorithms under all three classifiers. Comparing the three classifiers, each algorithm performs best under SVM classifier, followed by KNN and CART in sequence.
Furthermore, nine representative line charts of average classification accuracy using SVM are shown in Fig. 3. As shown in Fig. 3, the x-axis indicates the number of selected features, the y-axis is the average classification accuracy, and different colors indicate different feature selection methods. It can be discovered that GFS-NPIS outperforms other baselines in terms of average classification accuracy in Fig. 3. This is because feature redundancy and feature interaction are both considered in GFS-NPIS and NPIS is a more effective for evaluating feature interaction.
| Dataset | HSIC-Lasso | MIM | CIFE | FCBF | mRMR | JMI | FSRRC | FFSG | GFS-NPIS |
|---|---|---|---|---|---|---|---|---|---|
| Syn | 86.11 0.03(=) | 73.66 0.05(+) | 86.29 0.04(+) | 77.78 0.05(+) | 89.23 0.03(=) | 89.09 0.03(=) | 77.78 0.05(+) |
|
|
| Sem | 67.05 0.03(+) | 60.62 0.03(+) | 64.92 0.03(+) | 74.91 0.04(+) | 67.21 0.03(+) | 65.18 0.04(+) |
|
74.55 0.03(+) |
|
| Usp | 80.78 0.02(+) | 76.03 0.01(+) | 83.56 0.02(+) | 86.82 0.02(+) | 85.78 0.02(+) | 84.58 0.01(+) |
|
86.76 0.02(+) |
|
| Lud | 80.20 0.12(+) | 81.41 0.11(+) | 76.07 0.13(+) | 85.59 0.10(+) | 88.65 0.10(+) | 88.82 0.10(+) | 85.59 0.10(+) | 87.17 0.10(+) |
|
| Mus | 64.19 0.11(+) | 66.02 0.13(+) | 64.50 0.13(+) | 63.89 0.11(+) | 67.38 0.11(+) | 63.87 0.10(=) | 63.89 0.11(+) | 60.65 0.06(+) |
|
| Iso | 56.73 0.05(+) | 55.14 0.05(+) | 64.08 0.04(+) | 77.48 0.05(=) | 72.22 0.05(+) | 70.23 0.06(+) | 77.48 0.05(=) | 78.41 0.06(+) |
|
| Yal | 46.57 0.11(+) | 57.58 0.12(+) | 45.70 0.11(+) | 64.49 0.13(+) |
|
58.11 0.12(+) | 64.49 0.13(+) | 64.67 0.11(+) |
|
| Orl | 75.54 0.06(+) | 79.79 0.06(+) | 61.93 0.06(+) | 78.94 0.05(+) | 80.77 0.05(+) | 80.52 0.05(+) | 86.27 0.04(+) | 85.64 0.05(=) |
|
| Coi | 70.37 0.07(+) | 66.87 0.08(+) | 74.29 0.06(+) | 81.10 0.07(=) | 82.33 0.08(+) | 82.73 0.08(+) | 81.10 0.07(=) | 84.80 0.06(=) |
|
| War | 90.65 0.08(+) | 85.79 0.13(+) | 91.74 0.07(=) | 82.48 0.12(+) | 90.89 0.10(+) | 90.18 0.11(+) |
|
90.03 0.09(+) | 92.72 0.08 |
| Lun | 86.48 0.06(+) | 92.17 0.05(+) | 88.87 0.06(+) | 92.46 0.04(+) |
|
84.65 0.05(+) | 92.46 0.04(+) | 94.57 0.04(+) | 94.52 0.04 |
| Lym | 77.03 0.18(+) | 79.67 0.12(+) | 74.61 0.16(+) | 86.70 0.10(+) | 90.77 0.10(+) | 92.53 0.10(+) | 86.70 0.10(+) | 93.01 0.08(=) |
|
| Rel | 56.42 0.01(+) | 62.11 0.05(+) | 66.30 0.06(+) | 75.88 0.05(=) | 75.35 0.05(+) | 64.30 0.05(+) | 75.35 0.05(+) | 67.50 0.05(+) |
|
| Tox | 51.97 0.12(+) | 60.75 0.12(+) | 52.79 0.11(+) | 63.41 0.14(+) |
|
63.30 0.13(=) | 62.17 0.12(+) | 63.70 0.16(=) |
|
| Leu | 66.40 0.15(+) | 96.44 0.10(+) | 94.52 0.10(+) | 98.01 0.10(=) | 97.96 0.05(=) | 97.99 0.05(=) | 98.01 0.04(=) | 97.18 0.06(+) |
|
| Avg | 70.43 | 72.94 | 72.68 | 79.33 | 81.07 | 78.41 | 80.52 | 81.22 |
|
| W/T/L | 14/1/0 | 15/0/0 | 14/1/0 | 11/4/0 | 10/4/1 | 11/4/0 | 10/4/1 | 10/5/0 |
Classification accuracy (mean
Classification accuracy (mean
Average classification accuracy.
Accuracy comparison on nine feature selection methods.
Table 6 shows the Macro-F1 values of nine algorithms on SVM classifier. It can be seen that GFS-NPIS does not get the highest Macro-F1 on two data sets (WarPIE10P and Lung). Table 7 shows the Macro-F1 values of nine algorithms under 3KNN classifier. It can be found that GFS-NPIS performs not as well on 3KNN as SVM. But it also can be seen that GFS-NPIS achieved the highest average Macro-F1 from the last row. Table 8 shows the Macro-F1 of nine algorithms under CART classifier. It can be seen that GFS-NPIS do not get the highest Macro-F1 only on four data sets (Lung_discrete, Yale, ORL, Leukemia). And GFS-NPIS still achieved the highest average Macro-F1 among nine algorithms. We can draw a conclusion that GFS-NPIS performs better than other eight algorithms in terms of Macro-F1 under three different classifiers.
Macro-F1 using SVM
Macro-F1 using SVM
Macro-F1 using 3KNN
Macro-F1 using CART
Comparison of time complexity
Figure 4 is a histogram of average Macro-F1 for nine algorithms under SVM, KNN and CART respectively. It can be clearly seen that the Macro-F1 of GFS-NPIS is higher than those of other algorithms under three classifiers, and each algorithm performs best on SVM, followed by KNN and CART in sequence.
Average Macro-F1.
A good feature selection algorithm has not only higher classification accuracy but also less time complexity [31]. Table 8 gives the time complexity of nine algorithms. In Table 8, the number of features selected is
In GFS-NPIS, the time complexity of Step1 is
Conclusions and future work
Feature selection aims to remove redundant features and retain useful features [32]. At present, most feature selection methods only considered feature relevance and feature redundancy, but ignored feature interaction. Our study focuses on feature interaction as well as feature relevance and feature redundancy. In this paper, a new algorithm feature selection method GFS-NPIS is proposed. While removing irrelevant and redundant features, GFS-NPIS can select interactive features. The experiment on the fifteen public datasets illustrated that GFS-NPIS could select more important and useful features than HSIC-Lasso, MIM, CIFE, FCBF, mRMR, JMI, FSRRC and FFSG in most cases. Therefore, combining feature relevance, feature redundancy and feature interaction can help to define more important information from the different data set for improving the final classification performance.
In addition, there still exist some problems. For example, it is too complex in the process of calculating NPIS, which leads to time-consuming on some high-dimensional data set (USPS, Tox-171, Leukemia). In a word, our future work is improving the computational performance of NPIS to improve our algorithm GFS-NPIS in terms of accuracy and time consuming.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Hongfang Zhou: Conceptualization; Methodology; Writing original draft. Lei An: Software; Validation; Formal analysis; Writing original draft. Rourou Zhu: Investigation; Writing original draft.
Footnotes
Acknowledgments
The corresponding author would like to thank the support from the National Key Research and Development Plan under the Grant of 2017YFB1402103, the National Natural Science Foundation of China under the Grant of 61971347, the Key Research and Development Program of Shaanxi under the Grant of 2021JZ-49, the Education Department of Shaanxi Province Key Laboratory Project under the Grant of 15JS079, Xi’an Science Program Project under the Grant of 2022GXFW0075.
