Abstract
Feature selection is a crucial aspect in classification problems, especially in domains such as text classification, where usually there is a large number of features. Recently, a two-stage feature selection method for text classification which combines class-based and corpus-based feature selection, was introduced. Based on their experiments, the authors conclude what parameter values for both, corpus-based and class-based approaches, allow a feature selection which improves the traditional methods in text classification. In this paper, we revisit this two-stage feature selection method and based on several experiments we come to a different conclusion: the parameters suggested by the original work do not necessarily provide the best results. Based on our experiments, we conclude that by combining the best parameter value for each stage, for the specific corpus under study, the two stage selection method based on coverage policies provides a subset of features which allows to get statistically significant increase over the traditional methods in the success rates of the classifier.
Introduction
The origins of automatic text classification date back to the 1960s. However, in recent years, the enormous growth of on-line documents, mainly due to the expansion of the Internet, has given rise to renewed interest in this paradigm. The primary objective is to assign one or more thematic categories to documents based on their content, that is, a learning task where predefined tags representing categories or classes are assigned to documents [13].
The main research topics in text classification are related to document representation [15], classification algorithms [7], performance metrics [19], feature selection [6, 16], among others. Feature selection is a commonly used task for reducing data dimensionality, taking into consideration the relevance and redundancy among dimensions. There are two different coverage policies employed during the feature selection process: corpus-based and class-based. The corpus-based approach is the traditional approach used in classification problems. It uses a global feature vector extracted from the whole corpus without taking into account any reference to the class. On the other hand, the class-based approach utilizes a distinct feature vector for each class separately. In this way, rare classes are represented equally well as the prevailing classes.
In [14], the authors presented a novel two-stage method for accomplishing the feature selection task, by combining a corpus-based term frequency pruning and a class-based tf-idf filtering. The authors performed experiments and they concluded what is the best combination of parameter values for the two stage feature selection method.
In this work, we revisited [14] and based on our experimental results, we show that the best parameter values found in [14] do not necessarily provide the best results. In addition, we demonstrate that the results can be significantly improved, simply by combining the best parameter value for each stage. In our experimental study, we added two new metrics for class based selection.
The remainder of this paper is organized as follows. First, in Section 2 the related work is presented. Section 3 describes the main concepts and steps of the two stage feature selection method presented in [14]. In Section 4, we show our experimental study including our proposal for increasing the success rate of the classifier over the proposed method. Finally, Section 5 presents some conclusions and future work.
Related work
Feature selection has been a widely studied aspect within text classification. Traditional approaches like [9] present satisfactory results in text classification using all the terms in the document as features instead of limiting the features to a subset of terms. Other approaches like [14] show how an optimal parameter tunning is necessary when only a subset of terms is used. On the other hand, studies like [5] reveal that feature selection may improve the classification performance in terms of accuracy with a significant cut in the solution vector size.
There are a lot of feature selection methods for text classification domains following the filter approach, These methods use different term-weighting metrics such as chi-square, tf-idf, odds ratio, information gain and document frequency, among others. Several studies are focused on comparing metric performances [4, 18], metric combination based on specific measurements [10], and proposing supervised and unsupervised selection algorithms [2, 17].
Another approach for feature selection in text classification is the two-stage feature selection. Under this approach terms of less importance are ignored in the first stage, while in the second stage, feature selection is applied to the terms of highest importance. In the literature several works about two stage feature selection have been proposed [8, 20]. Among them, the most recent is the work proposed in [14], where the authors propose a two-stage method for feature selection in text classification. They neither focus in the selection metric nor classification algorithm, instead, they deal with the coverage policy employed during the feature selection process. Corpus-based and class-based feature selection approaches are analyzed using different metrics. In the first stage they apply a pruning (filter the terms of less frequency) with the corpus-based approach, taking the frequencies of the terms throughout the dataset (corpus) and removing the less frequent ones (the authors compared different values between 2 and 30). In the second stage, they use a class-based filter over the remaining terms using tf-idf focusing on the frequency of terms in the document of a class and favoring those that do not occur commonly in other classes (they took five different numbers of terms between 250 and 4000). Finally, a linear SVM classifier is trained with the features resulting from the two previous stages. The experiments conducted by the authors show that the best classification results measured in terms of MicroF (mean of the document hit rate) and MacroF (average of the success rate within the classes) are obtained when a pruning level threshold (PL = 13) is used in the first stage and by selecting 2000 or 4000 terms in the second stage. However, as we will show later these parameter values do not necessarily provide the best results.
Two stage feature selection
This section briefly describes the method proposed in [14]. Additionally, we describe the term-weighting metrics: Gini Index [18] and Mutual Information [21] since these metrics were included in our experiments into the second stage of the method.
Preprocessing
During the document preprocessing stage, regular expressions are used to detect and discard e-mail addresses, web addresses and dates. In addition, those terms lacking of semantic meaning (stopwords) are eliminated by using the stop list provided by the free version of the MySQL package. Then, Porter’s algorithm is applied in order to obtain the root of the terms and finally, the classical Bag of Words model is used to represent the document collection.
Parameter tunning
Feature selection is performed focusing on both corpus-based and class-based approaches and combining them in such a way that together will increase the performance of the classifier. Thus, the first step was to identify the optimal parameters for each stage independently and then to apply those parameters successively in an second stage.
In order to find the optimal parameter for the first stage, 7 different values of PL between 2 and 30 were proved: 2, 3, 5, 8, 13, 20, and 30. For instance, by using PL = 5 means counting the frequency of all the terms in the corpus and removing those terms whose frequency value does not exceed 5. For each of these values, a supervised classifier was trained representing each document in the collection with the remaining terms. Taken into account the cross-validation classification results for each value of PL in every dataset, the authors determined the global best parameter for this step. It is important to notice, that authors do not provide a clear explanation about what criteria they followed for the selection. We believe that they based this selection in the mean of the results across all datasets.
Then, the best K terms of each class are selected, based on the tf-idf metric and combined into a final set using the classical union set operation. In this way, the final size of the set can be bigger than K, because different features can be selected from each class. It is worth noting that the set union operation was made assuming a possible form in which the authors could have made the selection of the features in the second stage, since this is not explicitly explained in the original paper. Another possible interpretation of a value of K = 2000 will be to select the best 2000 terms from each class based on a given metric and then to combine the selected terms into a final set and select again the best 2000 among all. As well as before, the goal is to identify the parameter K that better classification results provides across all datasets.
In addition to the metrics used by the authors, in this work we implemented two more term-weighing metrics. The metrics used in our experiments are described below.
Class-based Term Frequency - Inverse Document Frequency (tf-idf)
The tf-idf metric was computed according to expression 1:
Where tf i is the frequency of the term i in the document d, c is the total number of documents of class d and c i is the number of documents of the class d where the term i appears.
We included in our study the metric proposed in [18] called popularity within the class (wcp) which is an implementation of the Gini index. This metric addresses two important issues of the selection of characteristics for text classification, the unequal probability distribution of classes and the global goodness of characteristic. The (wcp) metric is computed in two steps: in the first step, given a characteristic, the objective is to transform the original sample space into a normalized sample space of equal class size without altering the distribution of characteristics within the class. To transform the sample space, first the popularity of a characteristic f within a class is defined by a conditional probability of f given a class label c
i
, see expression 2.
Where N (f, c
i
) is the number of occurrences of the term f in all documents of the class c
i
, and V is the vocabulary. In a second step, by normalizing the popularity of a term within all classes, it gives as result the popularity within the class (wcp) as indicated by expression 3.
Where C is the set of all classes.
Mutual information is a metric commonly used in statistical language modeling of term associations and related applications [1]. We compute MI as in [21] where given a term t and a class c, A is the number of times t and c co-occur, B is the number of times the t occurs without c, C is the number of times c occurs without t and N is the total number of documents. The mutual information between t and c is defined as in expression 4, and it is estimated by using expression 5.
Once identified the values of PL and K that provide the best results, the two stage feature selection method proposed in [15] consists in applying corpus-based selection followed by class-based selection. That means, filtering those terms in the corpus that appears less than PL times, and then to select the best K of the remaining terms. Finally, the resulting terms are used as features for the classification stage.
The authors in [14] used Support Vector Machines (SVM) with linear kernel for the classification of documents, arguing that it is one of the most commonly used classifiers today.
Experiments and results
In this section, we present the experiments performed in our study of the two stage feature selection method proposed in [14]. We start with a brief description of the data sets used. Then we replicate in the most faithful way the experiments carried out in [14], but we also add some experiments that show how we can increase the success rate of the classifier over the proposed method, changing the way the best parameter value for each stage is selected. Finally, an statistical test shows the significance of the improvement reached by our proposal over the original method proposed in [14].
Data sets
According to the reference work, three standard data sets were taken from the UCI Machine Learning Repository: Reuters-21578 (Reuters), National Science Foundation Research Award Abstract (NSF) of 2001, and Mini 20 Newsgroups (MiniNg20).
In addition, we generated new data sets from the existing ones by taking the 75%, 50% and 25% of the classes, so that were generated nine more datasets, in addition to the three original ones. This is done primarily to have more confident values for statistical tests.
For the classification stage, in our experiments we applied a five fold cross-validation for each dataset.
Results
We start by replying the experiments of the authors in [14]. Initially, for the corpus based selection we search for the optimal value of PL. Later, for the class based selection we searched for the optimal value of K. Tables from 13 shows the classification results for the different values of the parameter PL for each dataset (including those datasets generated from the original dataset). The best result, in terms of accuracy, for each dataset is marked in bold. On the other hand, Tables from 412 show the classification results for the different values of K for each dataset and those datasets generated from them, using the three metrics respectively. The best result, in terms of accuracy, for each dataset is marked in bold.
Classification accuracy (%) for MiniNg20 datasets applying the corpus based feature selection at different values of PL
Classification accuracy (%) for MiniNg20 datasets applying the corpus based feature selection at different values of PL
Classification accuracy (%) for Reuters datasets applying the corpus based feature selection at different values of PL
Classification accuracy (%) for NFS datasets applying the corpus based feature selection at different values of PL
Classification accuracy (%) for MiniNg20 datasets partitions applying the class based features selection using tf-idf at different values of K
Classification accuracy (%) for MiniNg20 datasets applying the class based feature selection using GI at different values of K
Classification accuracy (%) for MiniNg20 datasets applying the class based feature selection using MI at different values of K
Classification accuracy (%) for Reuters datasets applying the class based feature selection using tf-idf at different values of K
Classification accuracy (%) for Reuters datasets applying the class based feature selection using GI at different values of K
Classification accuracy (%) for Reuters datasets applying the class based feature selection using MI at different values of K
Classification accuracy (%) for NFS datasets applying the class based feature selection using tf-idf at different values of K
Classification accuracy (%) for NFS datasets applying the class based feature selection using GI at different values of K
Classification accuracy (%) for NFS dataset applying the class based feature selection using MI at different values of K
No matter PL = 13 and K = 2000 or 4000 have not appeared as the optimal combination in our experiment, it is important to highlight that in [14] the authors chose these values after testing all the values for PL and K in all the datasets. According to their experiments the combination PL = 13 and K = 2000 or 4000 obtained the best results in almost all the datasets and therefore they generalized suggesting this combination as the optimal one. However, it seems reasonable that the parameter values depend on the dataset under study rather than fixing a single value for all datasets. Following this idea, we propose to choose the optimal value for PL and K for the specific dataset under study. In other words, given a dataset, we propose separately testing all the values for PL and K in order to choose the best values which become the optimal combination for this dataset at applying the two stage feature selection method.
In Table 13, we show the best pairs (PL,K) for each dataset using the three metrics, following our proposal. For example, in this table the pair (5, 4000) in the intersection of row MiniNg20 and the column tf-idf comes from selecting the best value in the second column of Table 1, which corresponds to the value 5, and selecting the best value in the second column of Table 4, which corresponds to the value 4000. The remaining entries in Table 13 are computed in a similar way. As it can be noticed, PL = 13 and K = 2000 or K = 4000 is not frequent, but even more, there is not a predominant combination. This reinforces our idea that the selection of the best parameter for each stage depends on the dataset being studied instead of just fixing a single value for all the datasets.
Best parameter combination of (PL,K) using our approach, for each dataset
In the Tables 14, 15 and 16, we contrast the results, in terms of the accuracy (% of the documents correctly classified), obtained by the two stage feature selection method by using the optimal parameter values suggested in [14] against the results obtained by the values chosen by our proposal. In these tables, for each dataset, we included the combinations (PL = 13, K = 2000) and (PL = 13, K = 4000) jointly with the combination of parameter values according to our proposal which is denoted as (OurPL, OurK).
Classification accuracy (%) for MiniNg20 dataset after applying two stage feature selection based on coverage policies by using the parameter values suggested in [14] versus the parameter values following our proposal
Classification accuracy (%) for Reuters dataset after applying two stage feature selection based on coverage policies by using the parameter values suggested in [14] versus the parameter values following our proposal
Classification accuracy (%) for NFS dataset after applying two stage feature selection based on coverage policies by using the parameter values suggested in [14] versus the parameter values following our proposal
Notice that in our experiments, we included the results obtained by using the three term-weighting metrics tf-idf, Gini index and Mutual Information for the class-based filter selection (second stage of the method) and from the results shown in Tables 14, 15 and 16, we can see that independently of the term-weighting metric used, for most of the datasets, the parameter selection with our proposal (OurPL, OurK) outcomes better classification results.
In the following subsection we will describe the statistical significance tests used to validate our approach.
Since we have multiple data sets and multiple classifiers (the same classifier with different feature selection method) and the distribution of the data is unknown, then we applied a non-parametric test of statistical significance.
We performed a Friedman’s test with all the results, as suggested in [3]. This test resulted in a p-value of 6.657E-7 for which there are significant differences between samples. Then, when we found significant differences, we performed the Bergmann-Hommel’s dynamic post-hoc procedure.
Friedman’s test is a multiple comparison test that aims to detect significant differences between the results coming from two or more classifiers. Post-hoc results will be shown using CD (Critical Distance) diagrams, which present the order of the classifiers, the magnitude of differences between them, and the significance of the observed differences in a compact form.
In a CD diagram, the rightmost classifier is the best classifier, the position of the classifier within the segment represents its rank value, and if two or more classifiers share a thick line it means they have statistically similar behavior [3]. Figure 1 shows the CD diagram with a statistical comparison of the results for all the tested databases.

CD diagram with a statistical comparison of the results.
As we can see in Fig. 1, our proposal with the tf-idf and Gini Index are significantly better than the rest of the methods included in the experiments. In the case of our proposal with Mutual Information, although it does not have significant differences with the other methods, it appears in the third place in the ranking far from the other methods.
Rather than focusing on finding the optimal parameter values for the method such that these values be useful for any dataset, in this paper, we propose that these parameter values would depend on the dataset being studied. For this reason, unlike the original work, we propose to select the optimal parameter values but specifically for the database being studied. Based on our experiments, we can conclude that this simple modification allows to the two-stage feature selection method obtaining a selection of variables that significantly improves the classification results. In addition, in our experimental study we included two more term-weighting metrics, Gini index and Mutual Information. Our results show that independently of the term-weighting metric used, the parameter selection with our proposal gives the best classification results.
As future work, we suggest applying the two-stage feature selection approach to more semantically-oriented text classification methods, such as those using language models, linguistic features, or lexical dependencies. By integrating the concepts of pruning and term selection into those methods, as two consecutive steps may lead to a higher classification performance.
Footnotes
Acknowledgments
The first and second authors gratefully acknowledges CONACyT for their master scholarships 611564 and 611489 respectively.
