Abstract
With the extensive increase of the amount of data, such as text categorization, genomic microarray data, bio-informatics and digital images, there are more and more challenges in feature selection. Recently, feature selection has been widely studied in supervised learning, but there is significantly less work in unsupervised learning because of the absence of class information and explicit search criteria. In this work, we introduce a new measure to assess the importance of features in terms of feature separability. A clustering-based feature selection algorithm is then introduced to conduct the feature selection. The proposed algorithm with nearly linear time complexity selects final feature subset through a ranking procedure based on the separabilities of features and it is applicable to datasets of mixed nature. Experimental results on UCI datasets show that our method, by retaining relevant features, can obtain similar or even better results of classification and clustering for most datasets, and it outperforms other traditional supervised and unsupervised feature selection methods in terms of dimensionality reduction and classification accuracy.
Introduction
In many real-world fields, such as genomic microarray analysis, digital images recognition, text classification and uncertain data management, it is not uncommon to have data sets with hundreds of thousands of features [1, 2]. Feature selection has become an important way to reduce the dimensionality, but it is still considered as an intractable problem in machine learning and data mining. High dimensional data may largely degrade the learning performance and compromise the quality of clustering. Therefore, to retain important features and remove irrelevant and redundant ones, various robust and effective feature selection algorithms have been introduced [3–33]. Generally, feature selection methods can be classified into two groups according to the evaluation criterion, i.e., filter methods [4–17] and wrapper methods [31–33]. The filter methods evaluate the quality of features by using the intrinsic properties of the data and are independent of any learning algorithm. Many existing feature relevance measures are designed to evaluate the quality of the candidate subsets, such as, distance [6–8], correlation [9–12], consistency [13], and rough set [14, 15]. Distance is known as separability, divergence, or discrimination measure [6, 7, 17]. Correlation is widely used to measure the relevance between features and class feature [9, 10]. Consistency is treated as the ratio of the samples that can be used to evaluate the quality of categorical features [13]. On the contrary, the wrapper methods select features based on the results of learning algorithms. In general, the filter methods are generally much faster than the wrapper methods. In this paper, we will focus on the filter methods.
Feature selection has been widely studied in supervised learning, but it is rarely studied in unsupervised learning. It is believed that unsupervised feature selection is less effective than supervised feature selection because of the absence of class information and explicit search criteria. It is not until very recently that several algorithms have been developed to address these issues for clustering. Briefly, there are two fundamentally different approaches for feature selection based on clustering: feature clustering [18–25] and object clustering [26–29]. The former approach removes redundant features by partitioning the original feature set into distinct clusters formed by similar features. After the features are grouped into clusters, such filters select a subset of features from each cluster, discarding other less discriminative ones. Covões et al. [18] and Mitra et al. [19] defined some new indices to measure feature similarity so that feature redundancy is detected. Although they obtained promising results, both of them still require data with continuous features only. Li et al. [22] proposed a localized feature selection algorithm for clustering, which computes adjusted and normalized scatter separability for individual clusters. However, the algorithm must simultaneously compute the number of clusters and the local feature subset. Recently, Zeng et al. [24] presented a new feature selection approach for Gaussian mixture clustering, in which a new feature relevance measurement index was introduced to identify the most relevant features.
The object clustering approach first partitioned a set of objects into clusters by a certain distance measurement, so that objects in the same cluster are more similar to each other than objects in different clusters. Then it selected discriminate features based on the importance of the features according to some defined criteria. Modha et al. [27] introduced a new method for feature weighting in k-means clustering based on intra-cluster and inter-cluster matrices. However, finding optimal weights from a predefined set of variable weights may not guarantee that the predefined set of weights would contain the optimal weights. Wang et al. [28] proposed a feature-weight learning approach to improve the performance of FuzzyC-Means based on a defined similarity measurement and an evaluation function. However, they are complicated and difficult to interpret. Huang et al. [29] presented W-k-means algorithm that can calculate feature weights automatically. Based on the partition in the iterative k-means clustering process, the algorithm calculates a new weight for each feature based on the variance of the intra-cluster distances. The new weights are used to decide the cluster memberships of objects in the next iteration, and the optimal weights are found when the algorithm converges. This approach is dependent on the selection of k value. Unfortunately, many clustering algorithms, such as K-means, FCM, W-k-Means and so on, are only applicable to one type of features.
As far as continuous and categorical features are concerned, many approaches in supervised learning, such as rough set [14], neighborhood rough set [15] and correlation [4, 9], are employed to solve the mixture problems. However, there is very little work that can handle the mixed-type features for feature selection with object clustering in unsupervised learning. To solve this issue, we explore a feature selection method based on single-pass clustering algorithm that is applicable to categorical and continuous features by partitioning the similar objects into the same clusters. The proposed algorithm selects features according to variation of feature separability values. The final feature subset is determined by several iterative clustering results. Experimental comparisons with other supervised and unsupervised filter feature selection methods on UCI data of different dimensionalities shows that the proposed algorithm can obtain competitive results in terms of dimensionality reduction and classification accuracy.
This paper is organized as follows. Section 2 describes some basic concepts, and Section 3 presents the proposed feature selection approach based on single-pass clustering. The experimental results are discussed in Section 4. In Section 5, we conclude this paper by pointing out some issues for future research.
Basic concept
In this section, we begin with a brief introduction to some preliminaries in Section 2.1, and then describe single-pass clustering algorithm in Section 2.2.
Preliminaries
We suppose that dataset D consists of N instances, each instance with m features (m C categorical and m N continuous) where F i (1 ≤ i ≤ m) denotes the i-th feature. To reduce the effect of various measurement units, it is necessary to standardize numerical features.
(1) The difference or distance between objects p and q on F i is named as dif (p i , q i ). For categorical features, , For numerical features, dif (p i , q i ) = |p i - q i |.
(2) The distance between objects p and q, d (p, q) is defined as .
(3) The distance between object p and cluster C, d (p, C) is defined as , where C i is the summary of cluster C on feature F i , dif (p i , C i ) is the distance between object p and cluster C on feature F i . For categorical features, dif (p i , C i ) is defined as the average difference between p and every object in C on F i , that is, dif (p i , C i ) =1 - FreqC|F i (p i )/|C|, while for numerical features dif (p i , C i ) is defined as dif (p i , C i ) = |p i - c i |.
(4) The distance between clusters C1 and C2, d (C1, C2) is defined as , where dif i (C1, C2) is the difference between C1 and C2 on feature F i . For categorical features, dif i (C1, C2) is the average difference among any object p in C1 and object q in C2 and defined as , while for the numerical feature, dif i (C1, C2) is defined as .
Single-pass clustering algorithm
Before giving the feature selection algorithm, we first introduce the procedure of single-pass clustering algorithm [30]. The single-pass clustering algorithm employs the least distance principle to divide dataset. The single-pass clustering algorithm is described as follows: Initialize the set of clusters S, as the empty set, and read a new object p. Create a cluster with the object p. If no object is left in the database, go to (6), otherwise read a new object p, and find the cluster C
j
in S which is closest to the object p. Namely, find a cluster C
j
in S, such that for all in S. If d (P, C
j
) > r, go to (2). Merge object p into cluster C
j
and modify the CSI of cluster C
j
, go to (3). Stop.
The threshold r may influence the quality of clustering and the time-efficiency of the algorithm. As r decreases, the number of produced clusters increases. On the contrary, if r is large enough, we can obtain only a small number of clusters. Therefore, in order to gain meaningful clustering results, we have to choose a proper threshold r. According to the process of clustering, threshold r should be typically greater than the inter-cluster distance and smaller than the intra-cluster distance. Hence, we can logically assume that r should be close to the average distance of any pair of objects. Because of the large dataset, we adopt sampling techniques to develop our strategy for determining the proper threshold. The details are described as follows: Choose randomly N0 pairs of objects in the dataset D. Compute the distances between each pair of objects. Compute the EX which is equal to the average of distances from (b). Select r in [0.5 * EX, EX].
Feature selection algorithm
In this section, we first propose a definition of feature separability for measuring the importance of features and a new feature selection algorithm based on the single-pass clustering algorithm. Then we analyze the time complexity.
Feature separability measure and feature selection algorithm
Now we discuss how to evaluate the importance of features for unsupervised data. A feature is discriminative if it can separate the clusters well. The feature values of a good feature should be partitioned into groups of the same or similar values. Each group, called cluster, consists of values that are similar to each other and dissimilar to feature values of other groups. In other words, the good features are able to distinguish the clusters and should have greater separability values. Based on these considerations, we introduce a new definition to determine the important of features according to feature separability value.
Definition 4 describes the separability of a feature among the clusters. Therefore, it can be used to measure the importance of features.
Based on the aforementioned definitions and the single-pass clustering algorithm, a new feature selection algorithm, named UFS - SC (Unsupervised Feature Selection Based on Single-pass Clustering) is formulated. UFS - SC employs feature separability to measure the importance for the features. The whole procedure of the proposed algorithm is described in Fig. 1.
In the implementation of our algorithm, we observe experiments converge in fewer than 10 iterations, and typically within 3 to 5 iterations.
Time complexity analysis
To simplify the analysis, we start with the assumption that the final number of cluster summary information (CSI) is k in s (3 ≤ s ≤ 5) iterations and each categorical feature consist of distinct values n i . We analyze the time complexity for each step of the proposed algorithm. Step 1 of setting model depends on the size of the data set (N) and the number of features (m). In the worst case, the time complexity of step 1.1 is . In practice, the time complexity can be expected to be O (n · k · m). Step 1.2 computes the feature separability values between any pair of clusters, and its time complexity is . Since we have k << n, in the worst case, the time complexity of step 1 is , which approximates to O (n · k · m). The time complexity in step 2 is O (m log(m)) due to use of quick sort method, and linear time is O (m) in step 3.
The overall computational complexity of the algorithm is estimated as . The time complexity for each stage is nearly linear with the size of dataset, the number of features and the final number of clusters. This implies that our algorithm has good scalability and it can be used in feature selection for higher dimensional data.
Experiments
The evaluation of performance of feature selection algorithms includes three aspects: 1) the degree of dimensionality reduction; 2) accuracy; and 3) efficiency. However, it is inconvenient to compare efficiency because of the variation of experiment environment. In this section, we use experimental results to evaluate the effectiveness of the proposed feature selection algorithm with dimensionality reduction, classification and clustering accuracy.
Experimental design
To study the effect of the proposed algorithm, we perform the experiments on Wine data available from UCI database. Wine data set consists of 178 instances, each instance with thirteen continuous features.
The results with the proposed algorithm are shown in Fig. 2. After calculating the separability values of all features, we draw a figure with descending order according to their separability values. The ordered feature set is 13, 4, 3, 10, 11, 7, 2, 9, 5, 6, 1, 8, 12. From Fig. 2, a further observation is that the separability values of the features have the greatest descent point when ranking them in descending order. In this scenario, the greatest descent point which has the largest inclined rate is considered as the threshold. That is, the features of the first parts in which separability values are higher than the threshold are relevant features. Therefore, we choose the first six features (13, 7, 4, 3, 10, 11) as the goodness feature subsets on Wine data.
Experimental setup
All the experiments with eighteen datasets from UCI machine learning repository [34] are implemented by using Weka [35].
For each data set, we first run all the feature selection algorithms to obtain the newly selected features of each algorithm. The different feature selection methods have been validated using three different classifiers, i.e. C4.5, Ripper and Naïve Bayes. It should be noted that the nature of the decision and the learning process of each classifier are different. Thus, we are interested in checking the goodness of the subsets of selected features, independently from the type of classification rule applied.
Table 1 describes the summaries of each dataset, including the number of instances (Instances), continuous features (Con.), categorical features (Cat.) and classes (Classes). From the table, we can see that the datasets contain different number of instances, features and classes.
The compared methods
In order to assess the performance of the proposed method, an experimental comparison has been done with respect to other supervised and unsupervised feature selection methods. A brief description of the methods is as follows: CFS method [9]: The correlation feature subset method is a filter feature selection approach. It finds an optimal set of features by removing both irrelevant and redundant features and can be applicable to datasets of mixed nature. Consistency algorithm [13]: The study of consistency measure is capable of handling some noise and can be used to remove redundant and/or irrelevant features. In fact, this measure does not incorporate any search bias with regards to a particular classifier. FCBF algorithm [10]: Fast Correlation Based Filter determines feature subset by correlation measure. It can identify relevant features as well as redundancy among relevant features without pair-wise correlation analysis. FarVPKNN algorithm [14]: K-nearest-neighbour relation is computed with continuous features, and categorical features are computed as FarVPDN. NDEM algorithm [15]: Neighbourhood decision error minimization is defined and computed decision positive regions and decision boundary in metric space by neighbourhood rough set. SSF algorithm [18] and MMP algorithm [19]: The two methods select feature subset by the feature similarity measure for continuous features in unsupervised learning.
Among them, the first five algorithms are supervised feature selection algorithms. CFS, FarVPKNN, and NDEM can deal with data with mixed types, while Consistency and FCBF algorithms require every continuous feature to be discretized. In the experiments, we keep special search strategy for each algorithm in the experiments.
Performance of dimensionality reduction
The selected features with different feature selection algorithms, such as FCBF, NDEM, FarVPKNN and UFS-SC, are presented on the order of selecting in Table 2. From the table, we can find that the selected features are distinct when different algorithms are applied.
Figure 3 visualizes a change of classification accuracy with respect to the number of selected features for four data sets. As shown in the figure, in most cases, the performance of our method is significantly better than those of FCBF, FarVPKNN, and NDEM. Generally, increasing the number of features greatly improves the accuracy on Heart and Sonar datasets by the proposed algorithm. Except for our method, the other algorithms climb slowly with increasing number of features in the selection process, and sometimes even drop rapidly. This phenomenon occurs because there are many redundant or noisy features in the feature sets.
Classification and clustering performance on selected features
Table 3 records the number of features selected by each feature selection algorithm and compares the size of selected feature subset of UFS-SC with those of supervised feature selection algorithms (CFS, Consistency and FCBF). The average dimensionality reduction of the datasets shows that all the methods are able to remove nearly 3/5 of the original features. From the average dimensionality reduction of all the datasets, we conclude that our method achieves promising results and its performance of dimensionality reduction is close to FCBF.
In the experiments, each dataset is split into the training set and test set (2/3 for training and the rest for testing). Each dataset is tested by 10 times for randomly shuffling to make sure that the results are not biased by the data sequences.
Tables 4–6 present the learning error rates of C4.5, Ripper and Naive Bayes respectively on different feature sets. From the averaged error rates of all datasets, we observe that, in general, (1) the proposed algorithm UFS-SC improves the accuracy on three learning algorithms compared with other methods on all the data sets; (2) almost all the feature selection algorithms classified on the newly selected features can obtain similar or even less classification error rates than that on the full set.
It is obvious that UFS-SC can maintain or improve the accuracy for most of the data sets according to the individual error rate value. To summarize, UFS-SC provides competitive results to other feature selection algorithms, especially when the cardinalities of the feature subsets are taken into account.
Three imbalanced data sets are selected from Table 1 to compare with the Naïve Bayes accuracy of UFS-SC, SSF, MMP and full sets. Table 7 summarizes the newly selected features of each unsupervised feature selection algorithm (In Tables 7 and 8, the results of MMP and SSF reported here are the best on the three imbalanced datasets).
Table 8 shows the accuracy of Naïve Bayes learning algorithm on the final feature subset with 10-fold cross validation. The results on UFS-SC and full features are also tested 10 times. The acronyms Max, Min and Aver refer to the maximum, minimum and average value of the ten results with UFS-SC respectively.
Compared with MMP and SSF, UFS-SC has two advantages. Firstly, the computational complexity of the developed algorithm is less than those of MMP and SSF, when the number of features is less than the number of instances in the datasets. More precisely, the time complexities of MMP and SSF are estimated at O (m2· n), and the computational complexity of UFS-SC is . Secondly, from Tables 7 and 8, we also observe that UFS-SC not only achieves higher degree of the dimensionality reduction than MMP, but also obtains similar classification accuracy. At the same time, although the performance of dimensionality reduction of the proposed algorithm is less effective than SSF, it has better performance in classification.
From the comparisons, another interesting point deserves our attention is that the classification accuracy is easily affected by the distribution ofdata.
From the previous empirical study, we can conclude that UFS-SC can achieve higher degree of dimensionality reduction efficiently and enhance or maintain predictive accuracy on imbalanced datasets with selected features.
In the following part, we evaluate the efficiency of our method UFS-SC in clustering form the perspectives of number of clusters and the clustering accuracy. In general, 1) with similar or the same numbers of clusters, the higher clustering accuracy is, the better the performance; 2) with the same clustering accuracy, the smaller the number of clusters the better performance. Table 9 presents the clustering performance on full features and the newly selected features based on single-pass clustering algorithm. Each dataset is tested with 10 times of random shuffling to make sure that the results are not biased by the data sequences.
Table 9 shows that the clustering efficiency on UFS-SC is better than on the full feature sets in most cases. Specifically, in ten datasets out of eighteen, the effectiveness is improved on UFS-SC, not only in terms of the enhancement of the clustering accuracy but also in the reduction of the number of clusters. In the other two datasets (Breast, German), although the clustering accuracies on the selected features are lower than those on the full feature sets, the number of clusters is also smaller than that of the original. At the same time, the clustering accuracy of selected features becomes lower in the rest of seven datasets out of twenty, while the number of clusters after UFS-SC is the same as that before feature selection. Therefore, we conclude that the quality of clustering is improved for most datasets on selected features obtained with UFS-SC.
Analysis and discussion
We evaluate the proposed feature selection algorithm on UCI datasets in terms of dimensionality reduction and accuracy. The results of our method are obtained with single-pass clustering as the clustering algorithm. The experimental results discussed above reveal that the new filter method for feature selection introduced in this paper is practical for feature selection and is independent of classification and clustering learning algorithms. The proposed algorithm UFS-SC can handle data with mixed types of continuous and categorical features. It can also efficiently and effectively achieve higher degree of dimensionality reduction and enhance classification accuracy with the selected features. In addition, it can improve the quality of clustering for most benchmark data sets. Empirical results demonstrate that the proposed algorithm is superior to other supervised and unsupervised feature selection algorithms on most datasets in our experiments.
Conclusion
Data with high dimensionality and without class information may pose great challenges to machine learning and data mining. In this paper, we present an unsupervised feature selection algorithm based on single-pass clustering and by distinguishing important features through a ranking procedure by computing feature’s separability values. The proposed feature selection algorithm, called UFS-SC, can remove the irrelevant features by calculating separability value of each feature. The feature separability values are calculated after clustering within selecting of the clustering threshold r by sampling techniques. The final results are determined by clustering for s iterations. The proposed algorithm with nearly linear time complexity can be applied to datasets containing a mixture of categorical and continuous features. Experimental results on UCI datasets have shown that our method, by retaining relevant features, can achieve promising classification and clustering results for most datasets. Compared with other traditional feature selection approaches, the proposed algorithm can obtain similar or even better performance in terms of dimensionality reduction and accuracy in both classification and clustering.
Despite the promising results achieved in this paper, there still exist limitations in our work. Firstly, the comparability for the values of separability on different types of features is worth further analysis. Secondly, the performance of the proposed approach may be dependent on the distribution of data. Thirdly, it cannot remove all the redundant features as much as possible. We will introduce feature clustering method to solve these issues in our future work. We will also interested in examining our approach with other clustering algorithms and extend the application of our algorithm to data with higher dimensionality for parallel computation.
Footnotes
Acknowledgments
This research was supported by the Humanities and Social Sciences Research Youth Foundation of Ministry of Education of China (No.14YJC870021), the National Natural Science Foundation of China (No. 61572145, No. 61402119), the Science and Technology Planning Project of Guangdong Province (No. 2014A040401083, No. 2015A030401093), the Major Program of National Social Science Foundation of China (No. 12&ZD222).
