Abstract
Feature Selection (FS) is currently a very important and prominent research area. The focus of FS is to identify and remove irrelevant and redundant features from large data sets in order to reduce processing time and to improve the predictive ability of the algorithms. This work presents a straightforward and efficient FS method based on the mean ratio of the attributes (features) associated with each class. The proposed filtering method, here called MRFS (Mean Ratio Feature Selection), uses only equations with low computational cost and basic mathematical operations such as addition, division, and comparison. In the MRFS method, the average of the data associated with each output class is first computed for each attribute. Then, the ratio between the averages extracted from each attribute is calculated. Finally, the attributes are ordered by the mean ratio, from the smallest to the largest value. The attributes with the lowest values are the most relevant to the classification algorithms. The proposed method is evaluated and compared with three state-of-the-art methods in classification using four classifiers and ten data sets. Computational experiments and comparisons against other feature selection methods show that MRFS is accurate and that it is a promising alternative in classification tasks.
Introduction
Classification tasks require an algorithm to process different data types and learn how to identify each class of output, label different patterns in the data, and correctly classify as many samples as possible. However, some factors that make classification hard are found in the data sets themselves. Among these factors are a large number of attributes and attributes that are unrepresentative, highly correlated, or noisy [1, 2]. Unrepresentative data and a large volume of attributes without relevant information can make processing difficult, increase the response time, and decrease the classifiers’ predictive capability. Thus, one way to reduce the probability of classification error and improve predictive capacity is to use Feature Selection (FS) methods. FS methods aim to provide smaller but more representative subsets of attributes, making the algorithms faster and more accurate in data processing [3, 4, 5].
The FS methods use statistical analysis or heuristics to identify the most relevant attributes. The statistical methods are based on equations such as the chi-squared test,
FS algorithms are usually classified into three main classes: filter, wrapper, and embedded approaches. The filter method pre-processes the data, then identifies the most relevant attributes and ranks them from the best to the worst. The wrapper method selects and tests different subsets of attributes with a learning algorithm and chooses the best subset among them. In the embedded method, the selection process is an internal and natural part of the classifier: the selection of subsets of attributes is carried out simultaneously with the training step of the classification process. Each embedded method is developed specifically for its classifying algorithm, so the classifier and the attribute selection process cannot be separated.
As mentioned, filter methods are used to obtain and rank the most relevant attributes from the data set. However, they do not identify how many attributes should be used. Consequently, the number of most relevant attributes processed by an algorithm must be selected manually or with the help of another method. Filter methods appear in several works. In [16] the interaction dominance-based feature selection (IDFS) is proposed. In IDFS, the mutual information and three-way interaction information techniques are used to rank the attributes. The mutual information measures the relevance of the attributes associated with the output classes, while the three-way interaction information is applied to calculate attribute redundancy. Only then does the method use a greedy search to rank the attributes. In [17] the multi-filter feature selection approach (MFFS) is proposed, where initially the Logistic Regression, Support Vector Machine (SVM) and Random Forest (RF) algorithms are used to filter the most relevant attributes. Then, linear correlation (MFFS-lr) and Information Gain (MFFS-IG) are applied to the remaining attributes to measure the level of association and, therefore, to select the set of the best input variables. The stratified feature ranking (SFR) is a method proposed in [18]. To allow the ranking of the attributes, a method called subspace feature clustering (SFC) is first performed, in which the attributes are grouped by their importance according to each class of output. After that, SFR analyzes the groups of attributes and ranks them separately according to the weight produced by the SFC. In [19] the unsupervised feature selection method using feature clustering (EUFSFC) is proposed. This method performs an analysis of the redundancies and the significance of the attributes. The redundancy is calculated using the generated groups, and the significance is measured by grouping the attributes.
In [20] another FS method is presented, in which the ranking is accomplished by a fuzzy model. Processing starts by generating the fuzzy sets and membership function for each attribute using the Fuzzy C-Means clustering algorithm. In a second step, the information value of each attribute in the set is measured, and the most relevant ones are selected.
Several state-of-the-art investigations present algorithms derived from different methods and with great complexity in their development and equations. The current paper proposes a straightforward filter-type feature selection (FS) method for classification problems. Known as MRFS (Mean Ratio for Feature Selection), the proposed method is based on the mean ratio of the attributes of different classes. The mean ratio was chosen for its simplicity, transparency, and low computational cost, even for a large number of attributes. In the proposed method, the averages of the data associated with the different output classes are calculated for each attribute. Afterwards, the ratio of the averages is computed. Finally, the attributes are ordered by the ratio, from the lowest to the highest value. The attributes with the lowest ratio values are those whose class averages differ the most, and in the method, attributes with different averages are more relevant for classification.
After this brief introduction, the rest of the paper is organized as follows. Section 2 introduces the proposed method. Section 3 shows the computational experiments and analyses the results obtained. Finally, Section 4 presents the final considerations.
Proposed approach
The Mean Ratio for Feature Selection uses the difference among sample averages associated with the different output classes to select the most relevant attributes (input variables). The proposed method aims to explore a more direct, efficient, and easily applied solution in environments that require rapid processing. The focus is on a feature selection method with low computational cost equations and basic mathematical operations such as addition, division, and comparison.
Before developing the MRFS method, experiments were performed with statistical tests such as analysis of variance (ANOVA) and Confidence Interval [9, 10] to verify that excluding attributes whose sample sets are statistically equal across different output classes improves the accuracy of the classification algorithms. Figure 1 shows the steps of the statistical analysis performed.
In the first step, a descriptive analysis of the data is carried out. The goal is to extract information to define the data type (categorical or non-categorical) and to determine the best statistical method for its processing (ANOVA or Confidence Interval). The hypothesis test is applied to check for statistical equality (or not) among the different samples of an evaluated population, given a specific significance
in which
Fig. 1. Steps of the data analysis before the development of MRFS.
In the second step, categorical data are processed using the statistical confidence interval model, whereas non-categorical data are processed using ANOVA. The measure evaluated in the results of these experiments is the
In the third step, the attributes that presented statistical equality are deleted, reducing the number of attributes for the classifier. In the fourth step (the test step), the performances of the classification algorithms are compared using the data set with all inputs and the same data set without the attributes that showed statistical equality (deleted in the third step). The results suggest that attributes with statistically similar sample means decrease the predictive capacity of the classifiers.
Note that, independently of the value of the significance level
Based on these results, MRFS was developed to rank the attributes from the best to the worst according to the ratio of the averages of the sample sets associated with the different output classes. It uses only one central element of the statistical methods, the sample means, and leaves out other mathematical equations that would require further processing, such as variance, standard deviation, residual means, Torricelli, Bhaskara, among others. The proposed MRFS is detailed in the next section.
The filter method proposed here is based on the sample averages of each population and aims to be straightforward, effective, and fast. In the proposal, each population is the attribute or input variable of a data set, and the sample averages are the sets associated with the desired output classes. Hence, given a data set with two output classes and four input attributes, each one of the attributes is split into two different sample averages according to the output class. This statement can be represented by Eq. (2):
in which
Despite the well-defined sample levels and separate populations, it is still necessary to measure which sets of samples from the same population have more dispersed averages. To calculate the dispersion between the sample sets, the ratio between the two values is applied. The ratio is a measure that expresses a fraction or a percentage between two values; it is a mathematical formula used to compare two quantities. Thus, the proposed final method can be represented by Eq. (3):
in which
MRFS stands out for its simple equations, which amount to calculating the ratio between two values. The sample averages can be updated from a running sum as each new sample is added to its proper set, so past samples do not need to be revisited to recalculate information values when a new sample is presented. The method also has a low computational cost, unlike other methods in the literature that require more complex calculations, such as the variance or standard deviation, in formulas with more operations. A pseudocode of the MRFS filter method is presented in Algorithm 2.1.
Algorithm 2.1
Data set Input: X, Y
Output: rankCols
extract different classes in Y
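The description above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `mrfs_rank` and `RunningMRFS` are hypothetical, and the exact form of the ratio is an assumption: here it is taken as the smallest class mean divided by the largest class mean, so that low values indicate well-separated averages, and attributes are assumed to be non-negative.

```python
import numpy as np


def _ratio_from_means(means):
    """Assumed ratio: smallest class mean / largest class mean, per attribute."""
    lo, hi = means.min(axis=0), means.max(axis=0)
    # If the largest mean is zero, use 1.0 (no separation between classes).
    return np.divide(lo, hi, out=np.ones_like(hi), where=hi != 0)


def mrfs_rank(X, y):
    """Return (rank_cols, ratio): column indices ordered by ascending ratio."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)  # extract the different classes in Y
    # One row of per-attribute means for each output class.
    means = np.vstack([X[y == c].mean(axis=0) for c in classes])
    ratio = _ratio_from_means(means)
    return np.argsort(ratio, kind="stable"), ratio


class RunningMRFS:
    """Keeps per-class sums and counts so means update without past samples."""

    def __init__(self, n_attributes):
        self.n_attributes = n_attributes
        self.sums = {}
        self.counts = {}

    def update(self, x, label):
        if label not in self.sums:
            self.sums[label] = np.zeros(self.n_attributes)
            self.counts[label] = 0
        self.sums[label] += np.asarray(x, dtype=float)
        self.counts[label] += 1

    def ranking(self):
        means = np.vstack([self.sums[c] / self.counts[c] for c in self.sums])
        return np.argsort(_ratio_from_means(means), kind="stable")


# Toy data: attribute 0 separates the classes, attribute 2 is constant.
X_toy = [[1, 10, 5], [2, 12, 5], [9, 11, 5], [10, 13, 5]]
y_toy = [0, 0, 1, 1]
rank_cols, ratio = mrfs_rank(X_toy, y_toy)
```

In this toy example the first attribute has class means 1.5 and 9.5, so it receives the lowest ratio and is ranked first. The returned indices are 0-based, whereas the rankings in Appendix A appear to use attribute numbers starting at 1.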
MRFS was evaluated using ten data sets and four classifiers. The results were compared with three alternative feature selection methods: Kruskal-Wallis, ReliefF, and Chi-Square. The classifiers used in the experiments were: Artificial Neural Network (ANN) MultiLayer Perceptron, K-Nearest Neighbor (KNN), Naive Bayes (NB), and Support Vector Machine (SVM). The algorithms were developed in Python, using the sklearn library [23].
First of all, it is necessary to find the best setting for the classifiers. For each data set, different settings of the classifiers with all attributes were tested. The setting that obtained the best performance was used in the experiments to compare the FS methods. The structure and parameters tested for the classifiers are as follows:
ANN: hidden layers
KNN: distance
NB: Gaussian, Bernoulli, Categorical, Multinomial
SVM: kernel
Table 1. Description of data sets
Table 2. Best parameter with all inputs for the ANN
Table 3. Experimental results with ANN
The filter FS methods rank the variables by their relevance, but they do not identify which ones should be used by the algorithms. For the experiments, the number of attributes was reduced in steps of 20% of the total number of inputs of each data set. The number of attributes that achieved the best performance is presented in the next sections. The attribute rankings produced by the FS methods and the number of attributes used in the experiments are shown in Appendix A (Attributes ranked by FS methods). To allow replicating the results, the classifiers and the data sets employed in the experiments are available at the URL: shorturl.at/hlD67.
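As an illustration of how a ranking is turned into reduced data sets, the sketch below keeps the best-ranked columns for each subset size. The helper name `select_top_k` is hypothetical, the data matrix is a placeholder, and the code assumes that the attribute numbers listed in Appendix A are 1-based column positions. The ranking and subset sizes are the Heart Disease values for MRFS from Appendix A.

```python
import numpy as np


def select_top_k(X, ranking_1based, k):
    """Keep the k best-ranked columns of X, in ranking order.

    Assumes the ranking lists attribute numbers starting at 1.
    """
    X = np.asarray(X)
    cols = np.asarray(ranking_1based[:k]) - 1  # convert to 0-based indices
    return X[:, cols]


# MRFS ranking and subset sizes for Heart Disease, copied from Appendix A.
heart_ranking = [9, 12, 3, 10, 2, 11, 7, 13, 6, 8, 1, 4, 5]
heart_sizes = [11, 9, 7, 5]

# Placeholder data with 13 attributes; real experiments would load the data set.
X_demo = np.arange(2 * 13).reshape(2, 13)
subsets = {k: select_top_k(X_demo, heart_ranking, k) for k in heart_sizes}
```

Each reduced matrix in `subsets` could then be passed to a classifier, and the subset size with the best accuracy reported.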
The data sets used in the experiments are described in Table 1, which presents the name, the source, the number of samples, the number of attributes, and the number of output classes. The performance of the FS methods was evaluated by the Accuracy (Eq. (4)):
in which
The following sections present the experiments and results with ANN (Section 3.1), KNN (Section 3.2), Naive Bayes (Section 3.3) and SVM (Section 3.4).
Table 4. Best parameter with all inputs for the KNN
Table 5. Experimental results with KNN
Table 6. Best parameter with all inputs for the NB
Fig. 2. Results between FS methods with ANN.
In this section, the FS methods are evaluated using Artificial Neural Network. Initially, the ANN algorithms were executed with all attributes to find the best setting for each data set. Then, the best setting was used to evaluate the FS methods. Table 2 shows the activation function and learning algorithm that obtained better performance for each data set.
Table 3 shows the accuracy and the number of attributes for the experiments with ANN, and Fig. 2 illustrates the results graphically. The MRFS method obtained the best accuracy in four of the ten experiments, Chi-Square in three and Kruskal-Wallis in two. In the experiment with the QSAR data set, MRFS achieved the same accuracy as Chi-Square, but with fewer attributes. In the average of the experimental results with the ANN, the best overall performance was obtained by MRFS, followed by ReliefF, Kruskal-Wallis, and Chi-Square.
KNN experiments
This section evaluates the FS methods with the KNN classifier. Initially, the best parameters of the KNN with all inputs were found. Table 4 shows the parameters that achieved the best result for each data set.
The performance of the FS methods is summarized in Table 5 and Fig. 3. In the experiments with Cervical Cancer, Mobile Price, COIL 2000, QSAR, Spambase, Glass Identification, and Wine, MRFS outperformed the other methods. Kruskal-Wallis achieved the best accuracy in the Page Blocks and South African Health experiments. MRFS and Kruskal-Wallis obtained the best accuracy in the Heart Disease experiment, but MRFS did so with fewer attributes. MRFS achieved the best average performance, followed by Kruskal-Wallis, Chi-Square, and ReliefF.
Table 7. Results with NB
Table 8. Best parameter with all inputs for the SVM
Fig. 3. Results between FS methods with KNN.
The Naive Bayes classifier that achieved the best performance with the complete data sets is presented in Table 6. These classifiers were used to evaluate the performance of the FS methods.
Table 7 and Fig. 4 present the experimental results. The MRFS method obtained the best performance in six experiments. In the experiment with the Cervical Cancer data set, the best results were obtained by MRFS and by ReliefF. With Page Blocks, all FS methods achieved the same accuracy. With the Wine data set, the MRFS, ReliefF, and Chi-Square methods achieved the same accuracy. On average, the best performance was achieved by MRFS, followed by Chi-Square, Kruskal-Wallis, and ReliefF.
SVM experiments
In the experiments carried out with SVM, firstly, the best settings were established. Then, these settings were used to evaluate FS methods. Table 8 presents the best parameter setting for each data set.
Table 9. Results with SVM
Fig. 4. Results between FS methods with NB.
Fig. 5. Results between FS methods with SVM.
Table 9 and Fig. 5 summarize the results with SVM. MRFS obtained the best performance in the experiment with the Spambase data set, while ReliefF was the best with Page Blocks. In the experiments with Heart Disease, Cervical Cancer, COIL 2000, South African Health, Glass Identification, and Wine, MRFS achieved the highest accuracy together with other FS methods. On average across the experiments, MRFS outperformed the alternative FS methods.
This paper introduced a new and straightforward filter method for feature selection (FS). The proposed method, called MRFS, is based on the mean ratio and uses only basic mathematical equations.
Extensive computational experiments in classification tasks suggest that MRFS is promising. The MRFS method was evaluated and compared with three alternative FS methods using four classifiers and ten data sets. MRFS achieved the highest accuracy in 72.5% of the experiments. More specifically, MRFS outperformed the alternative methods in 50% of the experiments, and in 22.5% it achieved the best performance together with other methods. In summary, MRFS gave the best results in 9 out of 10 data sets with Naive Bayes, in 8 out of 10 data sets with K-Nearest Neighbor, in 7 out of 10 data sets with Support Vector Machine, and in 5 out of 10 data sets with ANN.
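The overall percentage can be checked directly from the per-classifier counts reported above. The short sketch below only restates that arithmetic; the variable names are illustrative.

```python
# Data sets (out of 10) where MRFS was best or tied for best, per classifier,
# as reported in the text.
best_or_tied = {"NB": 9, "KNN": 8, "SVM": 7, "ANN": 5}

total_experiments = 4 * 10  # four classifiers, ten data sets
share = 100 * sum(best_or_tied.values()) / total_experiments
print(share)  # 72.5
```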
Future work shall address the development of a new wrapper FS based on MRFS.
Appendix
A. Attributes ranked by FS methods
In this appendix, Table 10 lists the number of attributes used in the experiments; each value results from a reduction step of 20% of the total amount. This table shows the data set name, the number of attributes present in the data set, and the number of attributes selected for the experiments. Tables 11–20 show the attributes ranked by the FS methods. These tables present each FS method and the attributes in the order given by that method.
Table 10. Number of attributes used in each experiment
Data set | All attributes | Selected attributes
Heart Disease | 13 | 11, 9, 7, 5
Cervical Cancer | 33 | 26, 20, 14, 8
Mobile Price | 20 | 16, 12, 8, 4
COIL 2000 | 85 | 68, 51, 34, 17
Page Blocks | 10 | 8, 6, 4, 3
QSAR Biodegradation | 41 | 32, 24, 16, 8
South African Health | 9 | 8, 7, 6, 5
Spambase | 57 | 46, 35, 24, 13
Glass Identification | 10 | 8, 6, 4, 3
Wine | 13 | 11, 9, 7, 5
Table 11. Ranking of attributes – Heart Disease data set
Method | Order of attributes
MRFS | 9, 12, 3, 10, 2, 11, 7, 13, 6, 8, 1, 4, 5
Kruskal-Wallis | 3, 7, 11, 6, 2, 10, 12, 9, 1, 4, 5, 8, 13
ReliefF | 12, 13, 3, 10, 11, 8, 7, 1, 4, 2, 9, 6, 5
Chi-Square | 8, 10, 12, 3, 9, 5, 1, 4, 11, 2, 13, 7, 6
Table 12. Ranking of attributes – Cervical Cancer data set
Method | Order of attributes
MRFS | 15, 18, 19, 20, 25, 24, 16, 22, 21, 31, 32, 33, 29, 27, 28, 30, 23, 17, 14, 13, 12, 26, 10, 11, 9, 6, 5, 7, 4, 1, 8, 3, 2
Kruskal-Wallis | 32, 31, 33, 29, 27, 30, 17, 14, 12, 13, 26, 10, 11, 23, 7, 20, 5, 28, 6, 15, 22, 24, 21, 19, 25, 16, 18, 9, 8, 2, 4, 3, 1
ReliefF | 32, 33, 31, 8, 9, 4, 5, 1, 29, 20, 30, 3, 23, 27, 18, 28, 10, 26, 6, 12, 13, 11, 2, 19, 14, 16, 7, 17, 15, 22, 25, 24, 21
Chi-Square | 32, 31, 33, 9, 29, 27, 13, 30, 20, 6, 11, 17, 14, 12, 26, 1, 23, 28, 10, 4, 18, 7, 5, 16, 3, 25, 19, 21, 24, 8, 2, 15, 22
Table 13. Ranking of attributes – Mobile Price data set
Method | Order of attributes
MRFS | 14, 12, 1, 13, 6, 7, 19, 16, 10, 4, 2, 5, 8, 9, 15, 20, 17, 11, 18, 3
Kruskal-Wallis | 3, 5, 18, 8, 6, 20, 4, 2, 19, 14, 17, 16, 15, 1, 13, 12, 10, 9, 7, 11
ReliefF | 14, 1, 13, 12, 17, 5, 7, 10, 9, 15, 11, 3, 16, 18, 8, 2, 6, 4, 19, 20
Chi-Square | 14, 12, 1, 13, 9, 7, 16, 17, 5, 15, 11, 10, 19, 6, 8, 2, 3, 4, 20, 18
Table 14. Ranking of attributes – COIL 2000 data set
Method | Order of attributes
MRFS | 53, 50, 71, 74, 82, 61, 81, 60, 46, 67, 57, 64, 78, 85, 56, 73, 54, 75, 69, 76, 21, 58, 48, 52, 55, 84, 79, 51, 83, 68, 47, 72, 44, 65, 62, 59, 16, 77, 37, 29, 34, 30, 25, 40, 45, 20, 13, 80, 63, 19, 12, 18, 31, 43, 24, 39, 66, 36, 11, 1, 5, 23, 42, 28, 22, 26, 6, 35, 10, 15, 17, 9, 32, 70, 3, 7, 8, 27, 33, 38, 41, 4, 14, 2, 49
Kruskal-Wallis | 76, 55, 64, 85, 61, 82, 83, 62, 57, 78, 49, 70, 51, 72, 63, 84, 45, 66, 58, 79, 60, 81, 41, 52, 73, 48, 56, 69, 77, 50, 71, 53, 74, 46, 67, 75, 54, 20, 65, 80, 59, 47, 68, 1, 44, 22, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 21, 23, 2, 24, 42, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 43
ReliefF | 44, 65, 55, 64, 85, 78, 82, 47, 61, 57, 76, 62, 68, 1, 5, 59, 25, 83, 84, 40, 16, 81, 20, 58, 39, 60, 51, 63, 72, 49, 17, 26, 42, 79, 14, 70, 41, 33, 3, 43, 19, 27, 69, 15, 71, 74, 77, 2, 18, 66, 56, 50, 28, 48, 30, 36, 31, 35, 53, 8, 4, 45, 7, 32, 6, 22, 10, 80, 38, 12, 73, 23, 13, 9, 11, 24, 29, 21, 34, 37, 52, 75, 46, 67, 54
Chi-Square | 47, 1, 61, 59, 30, 16, 44, 31, 68, 37, 64, 43, 82, 18, 25, 21, 34, 19, 39, 36, 57, 55, 65, 5, 13, 12, 76, 54, 29, 40, 24, 42, 46, 10, 85, 35, 28, 80, 32, 22, 60, 23, 20, 78, 15, 75, 81, 17, 52, 11, 9, 58, 26, 67, 53, 48, 73, 83, 50, 3, 51, 7, 6, 74, 62, 56, 69, 84, 72, 71, 45, 79, 27, 63, 8, 33, 66, 77, 38, 70, 4, 14, 2, 41, 49
Table 15. Ranking of attributes – Page Blocks data set
Method | Order of attributes
MRFS | 9, 4, 7, 10, 2, 8, 3, 1, 5, 6
Kruskal-Wallis | 1, 7, 5, 10, 4, 6, 2, 3, 9, 8
ReliefF | 1, 6, 2, 8, 5, 9, 3, 10, 4, 7
Chi-Square | 3, 8, 9, 10, 4, 7, 1, 2, 5, 6
Table 16. Ranking of attributes – QSAR Biodegradation data set
Method | Order of attributes
MRFS | 28, 21, 19, 41, 20, 6, 4, 3, 25, 26, 40, 24, 29, 5, 33, 32, 11, 34, 7, 38, 23, 10, 14, 31, 9, 36, 30, 13, 39, 8, 1, 27, 35, 15, 22, 37, 17, 18, 16, 2, 12
Kruskal-Wallis | 30, 9, 40, 35, 10, 28, 16, 32, 19, 29, 26, 24, 21, 4, 20, 14, 6, 34, 25, 11, 38, 5, 41, 23, 22, 3, 39, 7, 37, 36, 8, 12, 33, 13, 31, 15, 17, 18, 27, 2, 1
ReliefF | 38, 19, 22, 1, 37, 28, 35, 40, 39, 8, 36, 4, 2, 27, 24, 14, 16, 12, 15, 18, 29, 32, 10, 31, 23, 34, 13, 30, 11, 9, 20, 5, 26, 17, 6, 25, 41, 21, 3, 7, 33
Chi-Square | 8, 11, 7, 34, 5, 23, 41, 33, 30, 3, 31, 39, 10, 38, 36, 15, 14, 6, 1, 13, 9, 12, 25, 27, 32, 37, 20, 40, 22, 35, 4, 24, 21, 26, 29, 16, 19, 28, 17, 18, 2
Table 17. Ranking of attributes – South African Health data set
Method | Order of attributes
MRFS | 2, 9, 3, 5, 4, 8, 1, 6, 7
Kruskal-Wallis | 2, 8, 1, 3, 4, 5, 6, 7, 9
ReliefF | 2, 1, 8, 6, 9, 4, 3, 7, 5
Chi-Square | 9, 2, 4, 1, 8, 3, 6, 5, 7
Table 18. Ranking of attributes – Spambase data set
Method | Order of attributes
MRFS | 41, 27, 29, 4, 32, 42, 31, 25, 26, 34, 23, 7, 30, 20, 35, 48, 44, 46, 53, 15, 24, 33, 28, 39, 43, 16, 47, 17, 56, 11, 8, 22, 36, 52, 37, 9, 55, 38, 6, 54, 45, 18, 21, 57, 5, 51, 49, 13, 40, 10, 1, 3, 14, 19, 2, 50, 12
Kruskal-Wallis | 52, 53, 21, 16, 5, 7, 3, 24, 23, 17, 10, 18, 8, 6, 11, 2, 9, 1, 12, 54, 20, 19, 13, 15, 50, 14, 45, 40, 22, 49, 56, 55, 48, 47, 46, 41, 44, 43, 42, 51, 29, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 28, 27, 26, 25, 4, 57
ReliefF | 19, 16, 7, 52, 21, 8, 18, 24, 53, 5, 11, 17, 9, 10, 57, 6, 50, 23, 1, 56, 20, 2, 13, 55, 4, 22, 46, 14, 38, 49, 54, 45, 40, 12, 47, 28, 51, 3, 36, 34, 33, 32, 27, 31, 48, 35, 44, 15, 41, 39, 29, 42, 30, 37, 43, 26, 25
Chi-Square | 57, 56, 55, 27, 25, 21, 16, 26, 7, 52, 19, 23, 20, 46, 4, 24, 17, 5, 53, 42, 22, 45, 8, 18, 29, 35, 30, 28, 15, 9, 33, 44, 6, 37, 31, 11, 39, 3, 10, 36, 32, 34, 41, 43, 48, 54, 13, 1, 40, 2, 14, 49, 50, 38, 51, 47, 12
Table 19. Ranking of attributes – Glass Identification data set
Method | Order of attributes
MRFS | 9, 7, 10, 4, 5, 1, 8, 3, 6, 2
Kruskal-Wallis | 5, 2, 1, 9, 3, 4, 6, 7, 8, 10
ReliefF | 1, 7, 2, 8, 5, 6, 9, 3, 4, 10
Chi-Square | 1, 9, 4, 7, 5, 3, 8, 10, 6, 2
Table 20. Ranking of attributes – Wine data set
Method | Order of attributes
MRFS | 10, 7, 13, 12, 6, 2, 9, 8, 11, 4, 1, 3, 5
Kruskal-Wallis | 8, 2, 11, 9, 4, 1, 3, 5, 6, 7, 10, 12
ReliefF | 13, 1, 4, 5, 7, 9, 3, 10, 11, 12, 2, 6, 8
Chi-Square | 13, 10, 7, 5, 4, 2, 12, 6, 9, 1, 11, 8, 3
