Abstract
Feature selection focuses on selecting important features that can improve the accuracy and simplification of the learning model. Nevertheless, for the ordered data in many real-world applications, most of the existing feature selection algorithms take the single-measure into consideration when selecting candidate features, which may affect the classification performance. Based on the insights obtained, a multi-measure feature selection algorithm is developed for ordered data, which not only considers the certain information by the dominance-based dependence, but also uses the discern information provided by the dominance-based information granularity. Extensive experiments are performed to evaluate the performance of the proposed algorithm on UCI data sets in terms of the number of selected feature subset and classification accuracy. The experimental results demonstrate that the proposed algorithm not only can find the relevant feature subset but also the classification performance is better than, or comparably well to other feature selection algorithms.
Introduction
Feature selection [1–3] is an efficient method for data pre-processing by selecting important features and deleting redundant features to reduce the influence of data dimension and noise data, so as to improve data tightness and classification ability. Therefore, it has been widely used in knowledge discovery, pattern recognition, data mining and other fields [4–6]. For instance, Pan et al. [7] established an evaluation model to children’s foot & ankle deformity severity based on feature selection algorithm. Hu et al. [8] proposed a feature selection method, which accelerates the global convergence speed and improves the exploration efficiency. Zhao et al. [9] studied a feature selection approach for analyzing high-dimensional small sample data sets. Kaur et al. [10] designed a feature selection algorithm to detect depression. Zhang et al. [11] investigated a new model based on filter feature selection method and the classification performance is improved. Rough set theory is an effective mathematical tool for dealing with uncertainty, inconsistency, and incomplete knowledge [12–15]. Its main advantage is to deal with data directly, and to mine potentially useful knowledge from data sets. Consequently, this theory has been widely used in feature selection in the field of artificial intelligence.
Feature measure plays an important role as the selection basis of important features in feature selection process. Based on rough sets, many different feature measures are proposed and they can be divided into two categories from different viewpoints: the algebra theory approach and the information theory approach. For the algebra theory approach, Pawlak [16] and other scholars proposed efficient feature measures, such as approximation accuracy [20], approximation roughness [17], dependence measure [2] and discernibility matrix measure [18]. This type of feature measure can characterize the dependence between conditional features and decision feature or evaluation the certainty knowledge under the feature subsets. For the information theory approach, Shannon [19] introduced the concept of entropy in physics into information systems and expanded it, then proposed the concept of information entropy. Information entropy can quantify the amount of information in the data. It is used to evaluation the uncertainty knowledge under the feature subset. Many scholars use the information entropy and its extension as the feature measure to select feature subset [21–23]. However, the feature measures proposed in the above literature can not deal with the ordered data.
In many real-world applications, feature values of the collected objects are preferred. That is, the ranking of feature values. For example, in a car information system comprising indexes and the total evaluation, there is consistency between conditional features and decision feature. The total evaluation is better when a car has higher indexs in all aspects. The ordered data is generated. In order to accurately describe the relationship between conditional features and decision feature in ordered data, Greco et al. [24] proposed the dominance-based rough set and introduced the dominance relation which is the expansion of the traditional equivalence relation. The dominance relation is more in line with people’s thinking about the data in practical application. Under the dominance-based rough set model, many scholars have extended the feature measure in the traditional rough set theory to the ordered decision system, and have achieved some results [29, 30]. For robust fuzzy dominance rough sets, Sang et al. [25] designed a feature selection method based on self-adaptive weighted interaction. For the algebra theory approach, Du et al. [26] extended the traditional dominance relation for two different situations where incomplete values appear in incomplete ordered information systems, and proposed a feature selection algorithm based on the discernibility matrix method. Qian et al. [28] proposed a feature selection algorithm based on dependence measure for incomplete ordered information system with fuzzy decision by combining the dominance-based rough set with the α-cut set theory. Yang et al. [27] proposed a feature selection raised a new attribute reduction method based on fuzzy preference relations. For the information theory approach, Sang et al. [31] proposed the incremental feature selection algorithm based on the dominant condition entropy when the dynamic changes of a single object in ordered decision system. Hu et al. [34] proposed the measure of information entropy for ordinal classification. Palangetic et al. [33] studied the application of Ordered Weighted Average (OWA) operators dominance-based rough set. The advantage of the dominance-based rough set is that it can describe the dominance relation between objects in ordered data, and describe the conceptions through the upward and downward unions of classes instead of the particular classes. It can be used to solve the ordered data in practical problems. Therefore, we focus on the feature measure based on the dominance-based rough set for ordered data in this paper.
However, most of the existing feature measures using the dominance-based rough set are single-measure, which may not be considered sufficiently in selecting candidate features, resulting in relatively lower classification accuracies in feature selection. Multi-measure [35, 39] combines different single criterion measures, which can consider different criteria at the same time. The feature subsets are more comprehensive, which are selected by combining different criteria. In the process of feature selection, with more evaluation criteria used, the tendency characteristics of a single criterion can be avoided as much as possible. Li et al. [36] proposed a multi-criterion based attribute reduction, which considers both neighborhood decision error rate and neighborhood decision consistency. Shu et al. [37] proposed a multi-criteria evaluation function for cost-sensitive data with missing values. Sun et al. [38] proposed a multi-criteria fusion feature selection algorithm for fault diagnosis of helicopter planetary gear train. However, the mentioned above feature selection algorithms do not consider the data with dominance relation. Therefore, the research of multi-measure feature selection algorithm for ordered data still needs to be solved.
The contributions of this paper are shown as follows: (1). A multi-measure combining the dominance-based dependence and the dominance-based information granularity is proposed for the ordered decision system, which not only considers the certain information, but also uses the discern information. (2). A heuristic feature selection algorithm based on the feature multi-measure is proposed, which can select the feature subset from the multi-perspective analysis. (3). The comparative experiments prove that the proposed feature selection algorithm based on multi-measure has advantage in classification accuracy, in comparison with other feature selection algorithms.
The remaining part of this paper is organized as follows. Section 2 briefly explains the basic concepts of the ordered decision system and the existing feature measures under the ordered decision system. In Section 3, the multi-measure feature selection algorithm is developed for ordered data, which combines the dominance-based dependence and the dominance-based information granularity. In Section 4, a serious of experiments are constructed by comparing the proposed algorithm with existing three feature selection algorithms under the UCI data sets. Finally, Section 5 summarizes this paper and presents the future work.
Preliminaries
This section mainly introduces the basic knowledge of the dominance-based rough sets and the single-measures of features for ordered decision systems.
In the rough set theory, the data is always defined as a 4-tuple DS = (U, A = C ∪ D, V, f), where U ={ x1, x2, . . . , x|U| } is a nonempty finite set of objects, also called the universe; C ={ c1, c2, . . . , c|C| } is a set of conditional features which characterize the objects, D ={ d } is the decision feature, and C∩ D = ∅; V = ⋃ a∈AV a is the union of feature values and V a represents a domain of feature a ∈ A; f : U × A → V is an information function, specifically, ∀a ∈ C ∪ D, x ∈ U, f (x, a) ∈ V a is the feature value of object x under feature a.
In the decision system DS, when all features are increasing preference or decreasing preference, then the system is called an ordered decision system. It is denoted by ODS = (U, A = C ∪ D, V, f). Without lose of generality, in this paper, the features are characterized by the increasing preference. Table 1 shows an ordered decision system that introduces the students’ grades and general evaluation information, where U ={ x1, x2, . . . , x5 } is represent five students, C ={ c1, c2, c3, c4 } is the set of conditional features, which is used to describe the performance of students in various disciplines, including Chinese c1, Math c2, English c3 and Computer c4. The domain of each conditional features contains the values Excellent, Good, Passing and Failing. Their increasing preference relation is describe as Excellent > Good > Passing > Failing. The numerical values of the decision set are defined as Excellent=4, Good=3, Passing=2, Failing=1. Obviously, 4>3>2> 1 is consistent with the increasing preference of raw data. Evaluation d is the decision feature, the domain of decision feature d contains three symbols with increasing preference Excellent=3, good=2 and poor=1. In addition, the preference relation between the Evaluation d and the conditional features is consistent, that is, the better the students’ performance in various disciplines, the higher the students’ evaluation.
The ordered decision system on the students’ grades and general evaluation information
The ordered decision system on the students’ grades and general evaluation information
The dominance granules of U induced by the dominance relation
Based on the lower approximation, the dominance-based dependence is presented as follows.
From Definition 3, the dominance-based dependence
The dominance-based information granularity of D relative to B is defined as
From Equation (6), the dominance-based information granularity is utilized to evaluate the distinguishing ability of D relative to B. In addition, the smaller dominance-based information granularity of D relative to B means that the feature subset B has more similar classification ability to the decision feature set.
In this section, Section 3.1 presents several drawbacks of a single feature measure. Through the idea of feature multi-measure, the feature multi-measure is proposed by combining with two single criterion measures. In addition, a multi-measure feature selection algorithm is developed to select the feature subset for ordered decision system.
The drawbacks of a single feature measure
At present, the feature single-measure based on dominance-based relation has been proposed for ordered data [34, 40], and the corresponding feature selection algorithm can effectively select feature subsets. However, when using the feature single measure to evaluate features, they have the following shortcomings.
(1). When using single-measure to evaluate feature, if multiple candidate features have the same evaluation value, the most suitable candidate feature can not be selected from multiple features with the same feature measure values.
(2). A single evaluation measure can not quantify the certain information (algebraic theory) and discern information (information theory) contained in candidate features simultaneously. However, the single measure only considers the certain information by the dominance-based dependence or uses the discern information by the dominance-based dependence. According to people’s normal cognition, these selected features are often not really suitable. It may lead to the selection of the optimal feature subset is insufficient and the classification performance is affected.
The important features should have the ability to balance the inconsistent risk and the distinguishing ability. The selected features should have both low inconsistent risk and poor distinguishing ability from decision labels.In view of possible disadvantages mentioned above, the following example is used for detailed explanations.
Simultaneously, the dominance-based information granularity of D relative to a1, a2, a3 and a4 as IG≥ (D| { a1 }) =0.16, IG≥ (D| { a2 })=0.28, IG≥ (D| { a3 }) = 0.20, IG≥ (D| { a4 }) = 0.24. The dominance-based dependence of D relative to a1, a2, a3 and a4 as
A multi-measure for selecting feature subset
In this subsection, to effectively solve the shortcomings in the previous section, a multi-measure method which considers the dominance-based dependence and the dominance-based information granularity simultaneously is proposed for the ordered decision system. The limitations of single criteria measure can be solved by combining two single measures, and the multi-measure method is given.
From Definition 5, it can be observed that the proposed multi-criteria which considers the inconsistency risk and distinguishing ability. Simultaneously, the features are selected from the point of view of the algebra theory and information theory. The lower the value of RG≥ (D|B), the smaller the value of the inconsistency risk and the smaller the distinguishing ability.
Similarly, the dominance-based multi-measure of D relative to {a1, a2}, {a1, a3}, {a1, a4} as RG≥ (D| {a1, a2}) =0.38, RG≥ (D| { a1, a3 }) = 0.36, RG≥ (D| {a1, a4}) = 0.26. Since RG≥ (D| { a1 , a4 }) <RG≥ (D| {a1, a3}), then feature subset {a1, a3} is better than {a1, a4}. For Example 1, when evaluating features, the single-measure cannot be distinguished a1 and a3, {a1, a3} and {a1, a4}. Comparing Example 2 with Example 1, it can be seen that the dominance-based multi-measure can effectively overcome the drawback that single criterion can not distinguish multiple features or feature subsets with the same evaluation, and select the more important candidate features as much as possible.
On the basis of the proposed multi-measure for ordered data, the inner significance and outer significance of features are defined as follows.
It can be seen from Definition 6, the greater the change of dominance-based multi-measure when feature b is delete from feature subset B, it means that feature b is more important. Therefore, the feature b is redundant when Sig in (b, B, D) = 0. Then in this paper, Definition 7 is used to delete redundant features in candidate feature subsets.
It is easy to find that Sig out (b, B, D) is the variation value of the dominance-based multi-measure when feature b is added into B. Specifically, the greater the change value, the more important the feature is. Therefore, in the feature selection process, Definition 7 is used to select important features.
(1) RG≥ (D|B) = RG≥ (D|C);
(2) ∀b ∈ B, RG≥ (D|B) ≤ RG≥ (D|B - { b }) .
By Definition 8, condition (1) ensures that the feature subset has the same distinguishing ability and inconsistency risk as the whole feature set; condition (2) ensures that every feature in the feature subset is indispensable. It is obvious that the optimal feature subset is a minimal set of features which satisfies the same information ability as all features.
Based on Definition 8, the multi-measure feature selection algorithm for the ordered decision system is proposed and the flow chart is presented in Fig. 1.

Flow chart of Algorithm MMFS.
The detailed descriptions of the heuristic feature selection algorithm for ODS are given as Algorithm MMFS.
In this section, a series of comparative experiments are performed. Eight datasets are selected from UCI Machine Learning Repository [41]. Meanwhile, the condition features and decision feature of datasets are preference-ordered and there are monotonicity between them. For example, in the data set car, the acceptability of a car is evaluated by various indexes. Under the conditions of lower price, less maintenance cost, more passengers, more luggage and higher safety performance, the car is more acceptable. The descriptions of eight ordered datasets are shown in Table 2. For the datasets in Table 2, all numerical features are discretized by using the data tool Rosetta. These experiments are applied using PyCharm 2017 and running on a personal computer with Window 10, Intel(R) Core(TM) i5-9400 CPU @2.40 GHz and 8.00G memory. The programming language is Python.
Description of eight ordered datasets
Description of eight ordered datasets
To verify the efficiency of the proposed algorithm MMFS, three single feature measure-based feature selection algorithms are adopted for comparative experiments, which are denoted by FSDD [40], FSDI [40] and FSDE [34], respectively. FSDD is the
Three classifiers C4.5, SVM and KNN are used to evaluate the classification performance of different feature selection algorithms. Meanwhile, 10-fold cross-validation is used to obtain the final classification accuracy. For each data, it is divided into ten parts of equal sizes, nine of which are used as training sets and the remaining one is used as test set. Finally, the average classification accuracy of experiments is taken as the classification accuracy. The results of feature subset of MMFS, FSDD, FSDI and FSDE are shown in Table 3.
The feature subset sizes of MMFS, FSDD, FSDI and FSDE
The feature subset sizes of MMFS, FSDD, FSDI and FSDE
As shown in Table 3, Algorithms MMFS, FSDD, FSDI and FSDE can effectively reduce the feature dimension of data sets. Simultaneously, compared with Algorithms FSDD, FSDI and FSDE, different feature subsets can be selected by using Algorithm MMFS. Most features of feature subset selected by MMFS are among the optimal features selected by FSDD or FSDI, and some different features may be selected. For WDBC data set, compared with Algorithms FSDD and FSDI, one feature is different, the algorithm MMFS chose 13. Specially, in Car data set, the feature subset is only one feature (buying price) by using the algorithm FSDD. This is because the dominance dependence of each feature in the Car data set is the same as that of the complete feature set. However, it is obviously impossible to determine the performance of a car only by the buying price. Moreover, selecting a single feature can not satisfied the following complex machine learning tasks. Furthermore, compared with Algorithm FSDE, the feature subset selected by MMFS often has fewer features. For example, in Libra Movement, the feature subset size of Algorithm MMFS is 23 and the feature subset size of Algorithm FSDE is 84. Compared with the other Algorithms FSDD and FSDI, the feature subsets size are similar. Such as Dermatology data set, the feature subset sizes of MMFS, FSDD and FSDI are 17, 16 and 17. Therefore, the algorithm MMFS can select not only relatively few features, but also potentially important features.
In the following, Tables 4–6 present classification accuracies of MMFS, FSDD, FSDI and FSDE under classifiers C4.5, SVM and KNN. The last row ‘Average’ is the average classification accuracy under eight data sets.
The classification accuracies of MMFS, FSDD, FSDI and FSDE under classifier C4.5
The classification accuracies of MMFS, FSDD, FSDI and FSDE under classifier SVM
The classification accuracies of MMFS, FSDD, FSDI and FSDE under classifier KNN
As shown in Tables 4–6, compared with FSDD, FSDI and FSDE, the proposed algorithm MMFS has higher classification accuracy in most datasets. For example, for SPECTF data set, the classification accuracies of Algorithms FSDD, FSDI, FSDE under C4.5 classifier are 76.41%, 78.16% and 76.66%, respectively. The classification accuracy of MMFS is 78.83%, which improves 2.42%, 0.67% and 2.17% compared with Algorithms FSDD, FSDI and FSDE. Also, under the SVM classifier, Algorithm MMFS has the highest classification accuracy among the four algorithms, which is 80.26%. For Steel Plates Faults data set, under the classifiers of C4.5, the classification accuracies of MMFS is 68.86%, an improvement of 1.23%, 3.00% and 1.94% compared to FSDD, FSDI and FSDE. Meanwhile, for SVM classifier, the classification accuracy record of MMFS is 71.34% in Steel Plates Faults data set, which is 0.44%, 3.61% and 0.11% higher than that of FSDD, FSDI, FSDE respectively. For car data set, the classification accuracy of FSDD is 70%, which is lower than other algorithms. It shows that for some specific data sets, the features can not be completely selected by using a single criterion. Simultaneously, For Post Operative data set, four Algorithms MMFS, FSDD, FSDI and FSDE can get the same classification accuracy. Since the feature subsets selected by the four algorithms are the same. Similarly, for Sonar dataset, the classification accuracies of Algorithms MMFS, FSDD, FSDI and FSDE are 95.73%, 94.33%, 93.67% and 94.40% under the KNN classifier, respectively, the classification accuracies of Algorithms MMFS obtains the highest accuracy result. According to the average classification accuracy of the last row in Tables 4–6, the algorithm MMFS proposed in this paper has the highest average classification accuracies under the classifiers of C4.5, SVM and KNN.
Moreover, a comparison of the number of features chosen by Algorithm MMFS and the classification accuracy under three different classifiers(C4.5, SVM, KNN) are plotted in Fig. 2. Firstly, from Fig. 2, we could notice that the overall classification accuracy of dataset shows an upward trend as the number of features increases. Secondly, classification accuracy will show a stable state, when the important features are selected, no more significant upward or downward changes. Finally, from Fig. 2, we can know that the same dataset displays slightly different results under the different classifiers.

Classification accuracies of four representative datasets on three different classifiers varies with the number of features.
The above analyses show that the proposed algorithm MMFS with a feature multi-measure can select the feature subsets with higher classification accuracy at most cases.
ROC curve is an important evaluation index to measure the effectiveness of the algorithm. AUC is the area of ROC curve, and the value is [0.5, 1]. The closer ROC line is to the upper left corner of the first quadrant, the larger the area of the AUC, which in turn indicates better algorithm performance. Thus, to demonstrate the effectiveness of the MMFS, we choose the ROC curve and AUC to test the performance of MMFS. Then, the ROC curve comparisons between MMFS and other algorithms are presented in Fig. 3. we can obtain from Fig. 3 that the ROC curve of MMFS is closest to the upper left corner of first quadrant in datasets. For example, in the Sonor dataset, the MMFS is represented by a red line, blue line represents FSDD, black line represents FSDI, green line represents FSDE. From Fig. 3(a), the ROC curve of MMFS is closest to the upper left corner of the first quadrant compared to other lines, and the largest area under the curve, therefore, it is confirmed that the MMFS performs better than other algorithms, and the value of AUC recorded in Table 8. From Table 8, under the SVM classifier, the AUC values of MMFS, FSDD, FSDI and FSDE are 0.8774, 0.7701, 0.7500 and 0.8676, respectively, from which we can know Algorithm MMFS obtains maximum in Sonor dataset. Similarly, in Wdbc dataset, compared with Algorithms FSDD, FSDI and FSDE, the value of Algorithm MMFS is highest. In other words, Algorithm MMFS has the largest area under the ROC curve. Namely, MMFS performs better than FSDD, FSDI and FSDE.

The ROC curve comparisons between Algorithm MMFS and other algorithms.
Computational time of MMFS, FSDD, FSDI and FSDE
The AUC values of MMFS, FSDD, FSDI and FSDE under classifier SVM
In order to further compare the experimental results of different algorithms statistically, Friedman Test and Nemenyi Test were selected to verify the effectiveness of the algorithm comparison.
Friedman Test, as a nonparametric statistical test method, assumes that all experimental algorithms have the same classification performance. The formula is defined as
The value of Friedman
In addition, the Nemenyi Test can further analyze the relative performance and differences of all the compared algorithms, and its critical difference formula is defined as:
From the algorithm comparison experiments, we can get T = 8, s = 4 and α = 0.05, therefore q α=2.569, fininally the value of CD0.05 is 1.658. According to Table 9 and CD0.05, then, Nemenyi test results of Algorithm MMFS and other comparison algorithms under the three different classifiers (C4.5, SVM and KNN) are displayed in Fig. 4. Through Fig. 4(a) and 4(c), for the classifiers C4.5 and KNN, there are no segment connection between MMFS and other algorithms, which means that the accuracy of Algorithm MMFS is better than Algorithms FSDE, FSDD and FSDI in terms of statistics. But, under the classifier SVM, no evidence of difference in accuracy between Algorithms MMFS and FSDE.

Comparisons between MMFS and other three algorithms under the Nemenyi test.
As can be seen from the figures above, MMFS is significantly better than FSDD, FSDI, FSDE under three different classifiers. However, there is no significant difference in classification accuracy between FSDE and FSDI.
In order to analyze the computational time of different algorithms, Table 7 shows the running time of MMFS, FSDD, FSDI and FSDE. The column ‘Average’ represents the average computational time under eight datasets.
It can be seen in Table 7, under most datasets, the computational time of Algorithm MMFS is lower than FSDD and FSDE, but slightly higher than FSDI. For example, in Sonar data set, the computational time of MMFS is 32.80s, and computational time are 45.14s, 28.76s and 116.79s by using Algorithms FSDD, FSDI and FSDE. It can be seen from the average running time compared with Algorithms FSDD and FSDE, the average computational time is shortened by 38.49s and 32.34s respectively. Compared with Algorithm FSDI, the average computational time is increased by 8.73s. Consequently, the proposed algorithm MMFS is acceptable in computational time.
As mentioned above, according to a series of experimental results, the proposed algorithm MMFS can select feature subsets with higher classification accuracy without significantly increasing the computational time. Therefore, the proposed algorithm MMFS is feasible and efficient, it can be used to effectively deal with feature subset selection of ordered data sets.
Conclusion and feature work
In many practical applications, the ordered data is ubiquitous. Most of existing feature selection algorithms for ordered data consider the single-measure for selecting feature subset. Which may not be considered sufficiently in selecting candidate features, resulting in relatively lower classification accuracies. To overcome these shortcomings, this paper proposed a multi-measure feature measure, which combines dominance-based dependence with dominance-based information granularity. The feature multi-measrue can select different feature subsets by evaluating the inconsistency risk and distinguishing ability at the same time. On this basis, a heuristic feature selection algorithm is proposed. Experiments results on UCI ordered datasets demonstrate the effectiveness and efficiency of the proposed feature selection algorithm. The proposed algorithm can obtain higher classification accuracy than the compared algorithms at most data sets. At the same time, our next research work will consider the feature selection of ordered data under complex background, such as hybrid ordered data and partially labeled ordered data.
Footnotes
Acknowledgment
This work is supported by National Natural Science Foundation of China (62266018 and 61966016), and Natural Science Foundation of Jiangxi Province (20202BABL202037 and 20192B AB207018).
