Abstract
Feature selection can reduce the dimensionality of data effectively. Most of the existing feature selection approaches using rough sets focus on the static single type data. However, in many real-world applications, data sets are the hybrid data including symbolic, numerical and missing features. Meanwhile, an object set in the hybrid data often changes dynamically with time. For the hybrid data, since acquiring all the decision labels of them is expensive and time-consuming, only small portion of the decision labels for the hybrid data is obtained. Therefore, in this paper, incremental feature selection algorithms based on information granularity are developed for dynamic partially labeled hybrid data with the variation of an object set. At first, the information granularity is given to measure the feature significance for partially labeled hybrid data. Then, incremental mechanisms of information granularity are proposed with the variation of an object set. On this basis, incremental feature selection algorithms with the variation of a single object and group of objects are proposed, respectively. Finally, extensive experimental results on different UCI data sets demonstrate that compared with the non-incremental feature selection algorithms, incremental feature selection algorithms can select a subset of features in shorter time without losing the classification accuracy, especially when the group of objects changes dynamically, the group incremental feature selection algorithm is more efficient.
Introduction
With the development of Internet of things, artificial intelligence and other information technologies, many features are collected to characterize the data in different practical applications. However, the high-dimensional characteristic of data may reduce the efficiency of the classification algorithm and degrade the classification performance of classifiers. Feature selection [1, 2, 3, 4, 5, 6] is an effective method of data preprocessing, which can effectively reduce data dimension, improve the data compactness and the efficiency of knowledge discovery. As an important theory of granular computing [4], rough set theory has becomes an active research work in the fields of feature selection, knowledge discovery, data mining and so on [7, 8, 9, 10]. The biggest advantage of this theory is that it does not need to provide any prior knowledge other than the data itself. It can directly process the data and mine the potentially useful knowledge from data sets. Therefore, we focus on the feature selection method using rough set theory in this paper. For many real-world data sets, most of the data sets are hybrid data including symbolic, numerical and missing features [11, 12, 13, 14, 15, 16, 17, 18]. At the same time, the label information acquisition of data is expensive, which requires expensive resources or a long experimental process for manual labeling to get labeled data. Therefore, the hybrid data with partial label is generated, i.e., only small portions of decision labels for hybrid data are obtained. For example, in the patients’ health system, this system is the hybrid information data which consists of symbolic features such as blood glucose and numerical features such as pain set. In addition, for privacy protection, some patients’ information is hidden, some feature values are missing. Meanwhile, doctors need to judge whether the patient is sick or not according to the patient’s health information, i.e., the label information. Since this judgment process is time-consuming and laborious, a large number of ‘unlabeled’ patients exist in the system. How to select a feature subset from partially labeled hybrid data is a hot topic in the field of knowledge discovery, data mining and intelligent information processing [19, 20, 21]. The relevant feature selection algorithms are mainly for static partially labeled data. However, in the background of big data, the data sets change dynamically in many practical applications [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]. The dynamic changes of data sets include the variations of object set, feature set and feature values. As the main method of processing dynamic data, incremental learning has attracted much attention for many researchers. At present, the research on feature selection of dynamic partial labeled hybrid data is still less and needs further research. Therefore, this paper will propose incremental feature selection algorithms for the partially labeled hybrid data with the variation of an object set.
The main contributions of this paper include as follows. (1) Incremental computations of information granularity are given when an object set in the partially labeled hybrid data changes dynamically; (2) Incremental feature selection algorithms are developed when a single object and group of objects change dynamically in the partially labeled hybrid data, respectively; (3) The efficiency and effectiveness of the proposed incremental feature selection algorithms are demonstrated by experimental results.
The remainder of the paper is organized as follow. Section 2 discusses the previous studies on feature selection and incremental learning. Section 3 briefly reviews some basic concepts revolved in this paper. In Section 4, the non-incremental feature selection algorithm for dynamic partially labeled hybrid data is given. In Section 5, the incremental feature selection algorithms with the variation of a single object and group objects based on the information granularity are proposed, respectively. In Section 6, the experimental analyses on the non-incremental feature selection algorithms and the incremental feature selection algorithms on six UCI data sets are given. Section 7 presents the conclusions and the future work.
Related works
At present, some achievements have been made in the research of feature selection for the partially labeled data. For large-scale single type data, Wang et al. [19] proposed a feature selection algorithm based on the complementary information entropy for partially labeled symbolic data. Dai et al. [20] proposed a feature selection algorithm based on discernible matrix for partially labeled symbolic data. Liu et al. [21] proposed a rough set based feature selection algorithm via ensemble selector for partially labeled numerical data. In recent years, many scholars have proposed corresponding feature selection algorithms for other different types of data. For large-scale hybrid data consisting of categorical and numerical data, Xiao et al. [22] proposed a feature selection algorithm based on feature dependence. Wang and Liang [23] proposed an efficient feature selection algorithm based on the technology of decomposition and fusion. For the heterogeneous data with symbolic and real-valued condition features, Chen and Yang [24] proposed an attribute reduction algorithm by the combination of the classical rough set and the fuzzy rough set. Zhang et al. [25] proposed a feature selection algorithm based on the fuzzy rough set theory for mixed-type data including symbolic and numerical features. Kim et al. [26] proposed the feature selection algorithm based on rough set model for mixed-type data including categorical and numerical features. The above feature selection algorithms are proposed for static data.
Incremental feature selection has attracted much research interest because of the dynamic nature of data in practical application. For example, with the variation of an object set, Yang et al. [27] proposed an incremental feature selection algorithm based on fuzzy rough set for the dynamic incremental objects. Liang et al. [28] proposed a group incremental feature selection algorithm based on information entropy for discrete data. Ma et al. [29] proposed a group incremental attribute reduction based on compressed binary discernibility matrix for discrete data. Shu et al. [30] proposed two incremental feature selection algorithms through the use of the dependency function when a group of objects are added into or deleted from the discrete data, respectively. Zhang et al. [31] proposed an incremental feature selection algorithm by updating the approximations based on neighborhood rough set when a single object or group objects changes dynamically. With the variation of feature set, Jing et al. [32] proposed an incremental feature selection algorithm based on relation matrix for multiple objects and features changing simultaneously. Zheng et al. [33] proposed an incremental attribute reduction algorithm by using the relationship matrix when the attributes increase. Wang et al. [34] proposed the attribute reduction algorithm based on three representative entropies for data set with attributes increasing. With the variation of feature values, Wei et al. [35] proposed an incremental feature selection algorithm based on the discernible matrix of compressed decision table. Luo et al. [36] proposed an incremental method for updating approximations based on dominance rough set. Shu and Hong [37] proposed a rough set-based incremental feature selection algorithm in dynamic incomplete data. However, there are few researches on feature selection algorithms for the partially labeled hybrid data with the variation of an object set.
Preliminaries
In this section, we briefly review some basic concepts revolved in this paper. In addition, the concepts of the partially labeled hybrid decision system and the information granularity are introduced.
The partially labeled hybrid decision system
In the granular computing theory, the data table is represented as a 4-tuple
Given a decision system
The partially labeled hybrid decision system is represented by
The partially labeled hybrid decision system
The partially labeled hybrid decision system
In the following, the related concepts are given as follows, such as the information granularity.
where
where
where
From Definition 4, it can be seen that if
where
From Definition 5, it is shown that
Condition (1) indicates that the subset of features
In this section, in view of the dynamic partially labeled hybrid data, the non-incremental feature selection algorithm based on the information granularity is introduced. When an object set changes dynamically, the traditional (non-incremental) method regards the changed data set as a new data set and recompute the new feature subset. The detailed description is given as Algorithm NIFS.
1 The non-incremental feature selection algorithm based on information granularity (NIFS)
[1] A partially labeled hybrid decision system
Incremental feature selection with the variation of an object set for the partially labeled hybrid data
Since the traditional (non-incremental) feature selection algorithm has a lot of repeated calculations for selecting a new feature subset, the algorithm is inefficient and time-consuming. Therefore, we will select a feature subset by the incremental strategy when an object set changes dynamically. The incremental learning is an effective method for feature selection in dynamic data. On the basis of the original feature selection result, a new feature subset is quickly obtained by updating the information granularity in an incremental manner. Section 5.1 will give the incremental mechanism and the corresponding algorithm when a single object changes dynamically. Section 5.2 will present the group incremental algorithm when a group of objects changes dynamically.
Incremental feature selection with the variation of a single object for the partially labeled hybrid data
When a single object is added into or deleted from the system, the incremental mechanism can effectively reduce calculation time by updating the new information granularity based on the change of the neighborhood granules, decision classes and the exiting information granularity of the original system. On this basis, the incremental feature selection algorithm with the variation of a single object is developed. In the following, we will select a new feature subset based on the previous feature selection results of the system, which is no need to select a new feature subset from scratch.
The incremental mechanisms of the information granularity are analyzed when adding a new labeled object or unlabeled object to the system, the detailed descriptions are given by Theorem 1 and Theorem 2.
Since
then
(2) The neighborhood granule of the new object
Since
As analyzed above, the information granularity of
Since
and
it can be obtained by merging operations that
(2) All the changed neighborhood granules of the labeled objects are
It is similar to case (1), then
(3) The neighborhood granule of
Since
and
then
Because
(4) The neighborhood granule of
Since
As analyzed above, the information granularity of
In the following, the incremental mechanisms of information granularity are analyzed when deleting a labeled object or unlabeled object from the system, the detailed descriptions are given as Theorem 3 and Theorem 4.
Since
and
Because
As analyzed above, the information granularity of
Since
by merging operations, we have
Because
As analyzed above, the information granularity of
According to the above analyses, the corresponding algorithm is developed based on the incremental mechanism when adding and deleting a single object simultaneously, the detailed descriptions are given as Algorithm IFS.
2 The single incremental feature selection algorithm based on the information granularity (IFS)
[1] A partially labeled hybrid decision system
When a group of objects is added into or deleted from the system, the group incremental mechanism is regarded the dynamic objects as a whole, the new information granularity can be updated quickly by analyzing the local changed information granularity between dynamic objects and the original data set. On this basis, the proposed group incremental algorithm can obtain a new feature subset by updating the exiting feature selection results of the original data set.
The updating incremental mechanisms of the information granularity are introduced when adding a group unlabeled objects or labeled objects, the detailed descriptions are given as Theorem 5 and Theorem 6.
where
Since
based on the symmetrical relation,
then
Thus, the information granularity of
where
Since
According to the symmetrical relation,
then
Thus, the information granularity of
The updating incremental mechanisms of information granularity are introduced when deleting a group unlabeled objects or labeled objects, the detailed descriptions are given as Theorem 7 and Theorem 8.
where
since
let
where
the new information granularity becomes
where
since
where
let
According to the symmetrical relation,
thus
the information granularity of
Based on the above discussions, the group incremental feature selection algorithm based on the infromation granulrity is presented when a group of objects changes dynamically, the detailed description is given as Algorithm GIFS.
3 The group incremental feature selection algorithm based on the information granularity (GIFS)
[1] A partially labeled hybrid decision system
In this section, a series of experiments are conducted to verify the efficiency and effective of the proposed feature selection algorithms. The experiments are conducted on six data sets from UCI [40], and the specific information of each data set is shown in Table 2. The experiments are implemented using PyCharm 2017 on a PC with Windows 10, Core(TM) Duo CPU 2.40 GHz and 8 GB of RAM. All of the algorithms are coded in Python.
Data sets table
Data sets table
For each data set in Table 2, the data sets are preprocessed. At first, normalize the numerical features to set the value field to
For the proposed algorithms IFS and GIFS, the parameter values
It can be seen from Fig. 1 that, feature selection results with different number of features and classification accuracies can be obtained by using different parameters to feature selection with the original data sets, the classification accuracies and the sizes of feature subset are constantly change with the values of
Classification accuracies and sizes of feature subset on different parameters 
To verify the efficiency and effectiveness of the proposed incremental algorithms IFS and GIFS, Algorithms IFS and GIFS are compared with the non-incremental algorithm NIFS mentioned by Section 4 and the non-incremental standard feature selection algorithm based on information entropy denoted by NSFS [41] when an object set changes dynamically in each data set presented in Table 2, with regard to the results of feature subset, classification accuracy and the computational time.
Feature subset size and classification accuracy
In order to evaluate the classification performance of IFS and GIFS, the classifiers C4.5, SVM and KNN are used to evaluate the feature selection algorithms. The 10-fold cross-validation is used to obtain the final classification accuracy. For each data set, we divide it into ten parts of equal sizes, nine parts of them are used to find the feature subset, the classification performance of the feature subset is tested on the last one part. At last, the average classification accuracy of experiments is used as the final classification accuracy. Table 3 shows the feature selection results of Algorithms GIFS, IFS, NIFS, NSFS. Table 4 records the average classification accuracies of Algorithms GIFS, IFS, NIFS, NSFS.
The feature subset size of GIFS, IFS, NIFS and NSFS
The feature subset size of GIFS, IFS, NIFS and NSFS
Classification accuracies of GIFS, IFS, NIFS and NSFS with the classifiers C4.5, SVM and KNN
Computational time of Algorithms NIFS, IFS, GIFS and NSFS in six data sets.
It can be seen from Table 3 that the results of feature selection obtained by three algorithms GIFS, IFS and NIFS are similar in most case, because Algorithms GIFS, IFS, NIFS are all based on the information granularity, the calculations of the feature significance are the same. It is noted that the proposed algorithms GIFS and IFS can get the same or similar feature selection results as the algorithm NSFS. Table 3 shows that the algorithms GIFS and IFS can effectively select the feature subset.
It can be seen from Table 4, compared with the classification accuracies without feature selection in each data set, the classification accuracies of Algorithms GIFS, IFS, NIFS, NSFS are improved. Meanwhile, the classification accuracies of algorithms GIFS, IFS, NIFS and NSFS are similar in classifiers C4.5, SVM and KNN at most cases. Take Australian data set for example, the classification accuracies of Algorithms GIFS, IFS, NIFS, NSFS are the same in C4.5, SVM and KNN, which is 86.49%, 85.93% and 85.73%. From the last row of Table 4, the average classification accuracies obtained by GIFS and IFS are similar to the NIFS and NSFS under classifiers C4.5, SVM, KNN, indicating that the proposed algorithms GIFS and IFS can effectively obtain the feature subset.
To verify the efficiency of the proposed Algorithms IFS and GIFS, Fig. 2 presents the computational time by Algorithms GIFS, IFS, NIFS, NSFS when an object set changes dynamically. In each sub-figure, the
It can be seen from Fig. 2, the computational time of Algorithms GIFS, IFS, NIFS and NSFS increases as the variation size of object set increases. In each sub-figure, the computational time of the proposed algorithms IFS and GIFS are obviously shorter than NIFS. Take German data set for example, when the object set changes dynamically for the third time, the computation time of the NSFS, NIFS, IFS and GIFS are 233.26 s, 213.10 s, 68.63 s and 30.58 s. The IFS time consumption decreases by 70.57% and 67.79% relative to the NSFS and NIFS, the GIFS time consumption decreases by 86.89% and 85.64% relative to the NSFS and NIFS. Hence, the proposed algorithms IFS and GIFS are more effective than non-incremental algorithm NIFS and the algorithm NSFS. In addition, as the variation size of object set increases, the variation trend of the computational time by GIFS is more gentle than IFS and the advantage of the GIFS is more significant, indicating that GIFS can save more computational time than IFS.
As mentioned above, the experimental results show that the proposed algorithms IFS and GIFS are more effective than NIFS and NSFS, and can greatly reduce the computation time of feature selection while maintaining the classification accuracy. Meanwhile, GIFS takes less computation time than IFS when a group of objects dynamically changes.
Conclusions and future work
In many practical applications, the partially labeled hybrid data is ubiquitous. For the dynamic partially labeled hybrid decision data, the traditional (non-incremental) feature selection algorithm is given. However, the traditional feature selection is time-consuming when an object set changes dynamically, which cannot be obtained in real time. To solve this problem, we proposed the incremental feature selection algorithms based on information granularity. When a single object or a group of objects dynamically changes, the information granularity is obtained by updating incremental mechanism, a lot of repeated calculations can be avoided. When a group of objects dynamically changes, the group incremental algorithm is more efficient than the single incremental algorithm. Furthermore, a series of experimental results on six data sets from UCI and show that the proposed incremental feature selection algorithms can efficiently obtain new feature subset by using the incremental manner when an object set changes dynamically. At the same time, our next research work will consider how to efficiently select the feature subset according to the dynamic changes of the feature set in the partially labeled hybrid data.
Footnotes
Acknowledgments
This work is supported by National Natural Science Foundation of China (61662023 and 61966016), and Natural Science Foundation of Jiangxi Province (20202BABL202037 and 20192BAB207018).
