Abstract
Obtaining interesting and topic-relevant information is a very important task in Web mining. Text classification using a small proportion of labeled data and a large proportion of unlabeled data, also called semi-supervised learning, is a well-known problem. Despite plenty of research on text classification, however, how to effectively and efficiently apply valuable frequent patterns and deal with high-dimensional data in text classification is still an open issue. Due to the increasing data volumes and plenty of high-dimensional data, both distance measures and time complexity could be influenced by the noisy data. This paper targets on this problem and presents a novel method for text classification called CTFP (Classification based on TFP-tree), which uses TFP-tree (Text-Frequent-Pattern-tree) to generate frequent patterns in tremendous amount of texts and conduct text classification in a relatively low dimensional data space. It effectively reduces the data dimensionality during constructing the classifier. Substantial experiments on three datasets (RCV1, SRAA and Reuters-21578) show that our proposed method can achieve better performance than many existing state-of-the-art methods on precision, efficiency and many other evaluation metrics.
Introduction
Searching or extracting useful information from tremendous amount of data on the Web could boils down to analyzing and classifying features and texts, which are the backbones of increasingly Web application such as opinion mining [1], Web page classification [2, 3], filtering spam [4, 5] and so on. Text classification, as a very important part, has attracted many researchers’ attention. Conventional PU (Positive and Unlabeled) text classification, also known as semi-supervised classification, utilizes labeled and unlabeled data to build a classifier. It usually contains two-step. The first step is collecting reliable negative examples (also called RN) from unlabeled dataset (identified by U). The second step is training classifier using positive examples (identified by P), reliable negative examples and unlabeled examples. Some classification techniques are statistical-based such as Bayes classification (naïve Bayesian classifier and simple Bayes classifier) [6] and some classification techniques are geometrical-based such as SVM [7]. There are also some rule-based classification methods such as decision tree induction [8], IF-THEN rules for classification. However, the texts to be processed, more often than not, are represented as feature vectors, i.e. D = {x1, x2, …, x m }. For example, the feature x1 can be “classification”, and the feature x2 can be “learning”. The parameter m denotes the total number of the features in a dataset instead of the total number of the features in a document. When a term x i is not included in a document, its corresponding feature weight is zero, but it still takes up its own memory space. If so, the algorithms will process a lot of high-dimensional data, which probably leads to high computational cost. Some techniques have been developed to remove redundant or irrelevant features such as removing stop words. Removing stop words plays a role in dimensionality reduction in text classification. It is not sufficient to just applying stop words removing in the process of dimensionality reduction.
To solve this problem, this paper proposes a dimensionality reduction method for text classification based on a novel text frequent pattern generation technique called TFP-tree. Traditional frequent patterns are usually extracted from transaction or relational database and can be used to show the relations among all kinds of items. Frequent patterns can also be used in text classification. For examples, in the economic domain, the features occurring frequently can be “statistic”, “market”, “dollar”, “stock”, etc. In the sport area, the features occurring frequently can be “NBA”, “football”, “race”, “Kobe Bryant”, etc. Intuitively, the features in economic domain and the features in sport area are negative correlation. The typical features occurring in economic domain would probably not appear in sport area or the occurrence frequency would be low. The features in the same class should have some correlation in most cases. Due to the relation among the extracted features, the irrelevant or redundant features can be removed effectively according to their correlation shown in frequent patterns. Frequent itemset mining methods have been applied to many research areas. However, it is still a new attempt to apply them to text classification. Moreover, as we know, the extracted reliable negative examples play an important role in the performance of classifier. So, we also conduct frequent pattern mining on reliable negative examples from the perspective of dimensionality reduction and performance improvement. Motivated by this hypothesis, we extend the conventional two-step PU text classification method to three-step method. The first step, collecting reliable negative examples, remains unchanged. It is not the focus of this paper, so we will not explain this part in detail. You can find the details in our previous work [9–11]. In the second step, a new method for generating frequent patterns is put forward. Also, the benefits and experimental results of dimensionality reduction are illustrated in detail later. In the third step, the final classifier is obtained by iteratively training classifier using the text frequent patterns obtained from the second step. Since training data and testing data do not have the same data distribution, additional processing is needed. We conduct experiments on three datasets to verify the effectiveness and efficiency of our proposed method. The experimental results show that our proposed dimensionality reduction method for text classification outperforms traditional semi-supervised text classification methods both on precision and efficiency with the help of applying frequent patterns generated using TFP-tree.
The contributions of this paper are summarized as follows: We propose a new concept called TFP-tree to deal with high dimensional data. It is used to generate positive frequent patterns and negative frequent patterns, which could be applied to conduct dimensionality reduction effectively. We provide an effective IU (Intersection-Union) framework to solve the problem that training data and test data do not have the same data distribution. Constructing TFP-tree and applying frequent patterns to text data instead of transaction processing are new attempts. Also, taking negative frequent patterns into account is more efficient in the process of removing the redundant and irrelevant features. The experiments demonstrate that the effectiveness and efficiency of TFP-tree-based dimensionality reduction method, in comparison with other state-of-the-art approaches.
The outline of this paper is organized as follows. We first review the related work in Section 2. Section 3 explains how to construct TFP-tree and how to generate frequent patterns. Section 4 illustrates the process of constructing the classifier. Several comprehensive experiments are performed to evaluate our proposed method in Section 5 in which experimental settings, performance metrics, and results are provided. Section 6 draws the conclusion.
Related work
Text classification and frequent patterns generation both have a long history. Text classification aims to automatically collect, organize and obtain the information which users are interested in. Researchers put forward a lot of methods to improve the accuracy and efficiency [12, 13]. However, most of automatic text classification techniques need to handle the high-dimensional data such as SVM, which is quite effective but takes up a lot of time and space.
So far, many researchers have delved into PU text classification and proposed a number of approaches. Support Vector Machines is a very famous text classification technique proposed by Cortes and Vapnik [7]. Since then, there are many other improvements and applications showing up. It is a two-step classification method which can be used in PU text mining. The first step is collecting reliable negative examples (also called RN) from unlabeled dataset (identified by U). Feature selection and feature reduction are major problems of this step. They have been widely studied in PU text mining. Wang et al. [14] put forward a feature selection method which made a lot of innovation ideas in the field of feature reduction. The core feature set improves the classification effect a lot. Saidani and Rassoul [15] proposed a weighted genetic algorithms which optimized the feature selection process for sentiment analysis. They combined a supervised weighting method and stochastic search methods to generate feature subset, which can extracts and selects more efficient features. Jiang and Wang [16] introduced a clustering feature selection method which selected final feature subset based on the separabilities of features. The second step is constructing the classifier using positive examples and reliable negative examples collected by the first step. In the second step, Baccianella et al. [17] put forward a new feature selection method using micro-documents which broke down each training document of length k into k training micro-documents. It makes the features more sensitive to term frequency. Bayes theorem, as another method for classification is based on statistics [6]. It combined posterior probability and class-conditional independence to predict the class label of a document. Pan et al. [18] utilized transfer learning for text classification to tackle the challenge of different data distribution. It aims to deal with the situation that source domain and target domain are related but different in accordance with original features. Tang and Liu [19] proposed a novel feature selection method for high dimensional data which used social context to social data during selecting features. The effectiveness of linked data is shown in their work. However, there are lots of noisy and incomplete data in the whole Web environment. How to get rid of it to improve the effectiveness and efficiency of feature selection is still a challenging problem.
Frequent patterns play an important role in mining the relationship among massive amount of data. Most of frequent pattern mining algorithms use support-confidence framework. The first algorithm for mining frequent patterns was the Apriori algorithm presented by Agrawal and Srikant [20]. There are also a lot of variations of Apriori. They need to generate a huge amount of candidates, which takes a lot of time and space. Another approach, called FP-growth (Frequent Pattern growth), presented by Han et al. could mine frequent patterns without generating the candidates [21]. Cheng et al. [22, 23] applied discriminative frequent patterns in classification. It uses the feature selection to remove the features which are included in the discriminative and redundant frequent patterns. Also, many algorithms have been proposed to deal with the PU classification task which makes use of mining frequent patterns or association rules. The CBA (Classification Based on Associations) [24], CMAR (Classification based on Multiple Association Rules) [25], and CPAR (Classification based on Predictive Association Rules) [26] are three of them and also our baseline algorithms. CBA iteratively constructs the classifier using the generated association rules. Those rules make up a decision list which is similar to the decision tree. It utilizes the generated rules to classify the testing tuples, which seems to be better than C4.5. CMAR constructs the FP-tree only through scanning the database just twice. CPAR generates far fewer association rules than CMAR and CBA. However, it outperforms CMAR and CBA on accuracy and efficiency with large amount of training set. Our proposed CTFP is different from the three methods above. The strategies for mining the frequent pattern and constructing the classifier are both different. As CBA, CMAR, and CPAR use frequent patterns to generate association rules and are rule-based method for classification, we choose these three methods and two clustering-based classification methods as our baseline algorithms.
Generating frequent patterns using TFP-tree
Text classification is the process of assigning predefined category labels to new documents on the basis of the classifier learnt from training examples. Each document can be in multiple, exactly one, or no category at all. By using machine learning technique, the objective is to learn classifier from examples that do the category assignments automatically. Most text classification tasks can be formulated in a way to use binary classifiers. And, multiclassification can also be regarded as a variant of the binary classification. That is, for a task with n classifications, there will be n binary classifiers. Consider an L (L > 2) categories classification task, the method called the one-versus-all works uses L classifiers. When training the ith classifier, ith category document examples are treated as positive examples, and the documents of all the rest of classes can be treated as the negative examples. When classifying a document, the class label choice depends on the largest value yielded by classifier.
As described in Section 1, the whole procedure contains three steps: (1) extracting reliable negative documents, (2) generating text frequent patterns, and (3) constructing text classifier. In the first step, we choose 1-DNFC [11], which uses a threshold lambda to make up a defect of traditional 1-DNF and corrects a bug of 1-DNFII, to identify the reliable negative documents. Suppose the training data
Generating frequent patterns is one of the most important steps in our proposed method. The basic idea is to extract the representative features according to the importance of the features appearing in both positive and negative document sets, so the data dimensionality can be reduced to save the computational cost. FP-growth approach [27], as we know, has already been used for mining frequent itemset in order to avoid generating a costly candidate sets, especially mining the transaction database. However, applying FP-growth into text classification is a relatively novel research method. An obvious difference between transaction and a huge number of texts is that the item in one transaction only occurs once but the words in one document may occur several times. In this paper, we do not focus on FP-growth. Instead, some additional procedures are added into original FP-growth approach. We split all documents, denoted as D, into several paragraphs PGs(D), which include the same number of documents. That is, each dataset yields a series of paragraphs including several documents in order to avoid the situation that too many branches in a tree. In that situation, it is difficult to generate useful frequent patterns. The number of document, also called paragraph length (denoted by parameter n), will be discussed in detail in the following section. We will explain it later.
Let’s look at a motivating example to illustrate how to use the TFP-growth to conduct text mining and classification. Assume that d ={ x1, x2, …, x m } be a set of words representing the document d. What’s more, given a document d i = {(xi1 : wi1) , (xi2 : wi2) , …, (x im : w im )} , w im is used to represent the occurrence times of word x im that shows up in document d j . Suppose there are five documents containing in a paragraph PG1 as shown in the first column in Table 1. Each document is scanned to calculate the occurrence times of each word as shown in the third column. Then we begin constructing the TFP-tree as follows. After creating the root node “null” of the tree, the documents are scanned the first time in the order that shows in Table 1. In the table of Fig. 1, the words are listed in the order of descending occurrence frequency just as processed in TFP-tree (Word_ID). The first document of Table 1 is “{x1 : 2, x3 : 3, x4 : 1}”. The first branch of the tree is constructed with three nodes x1, x3 and x4 in the order that shows in Fig. 1. The following steps are different from those in constructing original FP-tree. Scan the set of paragraphs a second time. There is only one document that covers x1, x3 and x4. So, we mark <x1 : 2> , < x3 : 3 >, and <x4 : 1> in each node of the branch instead of <x1 : 1> , < x3 : 1 > and <x4 : 1>. Next, scan the second paragraph {x1 : 1, x2 : 2, x3 : 1}. It contains x1, x3 and x2 in the order of Word_ID, which results in a branch where x2 is linked to x3. Then the value of node x1 becomes 3 (2 + 1 =3). With this method, the TFP-tree is built after scanning all the documents shown in Table 1. By now, all the leaf nodes are marked with a number which is called support. Notice that the support count of x3 is 3, which is the occurrence number that x3 shows in d1 instead of just incrementing the support count by 1. In addition, < x1, x3>also appear in d2 and d4, and the occurrence numbers are 1 and 2, respectively. So, the node x3 is marked with six (3 + 1 +2 = 6) in the end. In this way, we get the TFP-tree as shown in Fig. 1 as follows. The way of generating the frequent patterns from TFP-tree are as same as the method of generating frequent patterns from original FP-tree, which is not detailed in this paper. According to the TFP-tree above, for item x5, we could obtain the generated frequent patterns: {x1, x5 : 2}.

A TFP-tree covering the words showing up in paragraph PG1 (documents d1 to d5).
The representation of document during generating frequent patterns
In this section, we will focus on how to construct the classifier using the generated frequent patterns. Moreover, we define an IU (Intersection-Union) framework to deal with the situation where the inconsistence of data distribution exists.
Before constructing TFP-tree, document pre-processing and feature extracting are needed. After mining frequent patterns from TFP-tree, text classification is transformed to deal with a small number of low-dimensional data instead of handling a lot of high dimensional data. In the training set, there are a small proportion of positive documents (denoted by P) and a large proportion of unlabeled documents (denoted by U). The frequent patterns generated from the positive document set are called positive frequent patterns (denoted by PFP). Similarly, the frequent patterns generated from the reliable negative documents are called negative frequent patterns (denoted by NFP). It is easy to understand that the frequent patterns in positive examples may be the representative words in the documents we are interested in. Similarly, the words in the negative frequent patterns are probably not included in that kind of documents. So, in the documents of the training data set, we remove all the features which do not occur in the frequent pattern set. The features showing up in the frequent pattern set are regarded as “useful” features in the process of classification and vice versa. However, we do not simply remove the features showing up in the frequent pattern set. As we know, frequent patterns contain k-itemset, where k could be one, two, three and more. It is easy to know that almost every 1-itemset belongs to frequent pattern set. Any feature, that satisfies the min_sup requirement, would be added to 1-itemset. Once it is added to frequent pattern set, it would be removed from the training set. Using 1-itemset to represent the specific-topic document is meaningless. Therefore, we do not treat 1-itemset as the frequent pattern. The positive documents, which get rid of the features not occurring in PFP and NFP, are marked by DP. Also, the negative documents, which get rid of the features not occurring in NFP and PFP, are marked by DN. DP, DN and unlabeled documents make up the whole training set.
The whole dataset contains the training dataset and test dataset. After conducting dimensionality reduction in training dataset, most irrelevant and redundant features have been removed from the training dataset, which results in the inconsistence between training data and test data. Different from the previous methods for constructing SVM classifier, the training data and the test data have the same data distribution in traditional SVM classifier. It means that the dimensionality of the training data and the test data is the same. Then, how to make the dimensionality of each document remain the same after removing those redundant features. We give a simple example to illustrate it. Traditional support vector and the maximum margin like Fig. 2 are used to show how to obtain the maximum margin. However, they are used to show how to reach the consistency of data distribution. In Fig. 2, the nodes like “⚫” represent the features of positive document examples. And the nodes like “◍” indicate the features of negative document examples. All the nodes are the features of the documents after removing the redundant and irrelevant ones. For example, suppose the original feature set includes x1, x2, x3, x4, x5 and x6. After conducting dimensionality reduction, node A is denoted as {x1 : w1, x2 : w2, x3 : w3, x4 : w4} (w i ≠ 0, i = 1, 2, 3, 4) and node B is denoted as {x1 : w1, x2 : w2, x3 : w3, x5 : w5} (w i ≠ 0, i = 1, 2, 3, 5). All the features, say, x1, x2, x3, x4 and x5, belong to the frequent pattern set.

Support vectors and the maximum margin.
Figure 2 shows an example of support vectors and maximum margin. Only the support vector that has the same dimensionality could be used to support vector machine. How could node A and node B be shown in the same hyperplane when the dimensionalities are the different? In this paper, we create a new dictionary and obtain the union of each feature of the frequent patterns to get the same data distribution. If there are only two documents, say, A and B, in the training set, then the dictionary would be x1, x2, x3, x4, x5. The feature x6, which is regarded as an irrelevant feature, is removed from the feature set. The dimensionalities of node A and node B are both five with the corresponding term weighting values. The feature representation of node A would be {x1 : w1, x2 : w2, x3 : w3, x4 : w4, x5 : 0} and the feature representation of node B would be {x1 : w1, x2 : w2, x3 : w3, x4 : 0, x5 : w5}. As we can see, a lot of redundant and irrelevant features have been removed while creating the frequent patterns. However, there exists missing redundant and irrelevant features which would results in the mismatch with the data distribution of test data. Each document in the test data still has the original data dimensionality which is higher than that of the new generated documents after removing the redundant features. To get the final category of the under-classified documents, we plug DP and DN into SVM to find the maximum margin hyperplane. SVM has been applied to text classification over the past decades and achieved good performance.
Then, how to deal with this kind of inconsistence between the new training data and the test data? Here, we propose an IU (Intersection-Union) framework to solve this problem. Note that we have already built a dictionary which contains all the features of the frequent patterns no matter positive ones or negative ones when we construct the classifier. It can be easily to get intersection between the features in the dictionary and the features in the test documents. Suppose dictionary vector
In order to extract the relevant and useful information, in the intersection part, we just intersect the vector
Until then, a certain amount of redundant features have been removed. It is also needed to get the union between the obtained features and the original features in the dictionary. In the union part, to reach the consistence of the data distribution between the test data and the training data, the union of
DP, DN and Dic are used as input. We first expand each document to make them have the same dimensionality. For each document in the test data, we extract the common feature with the dictionary and reserve the corresponding weight. Then, the missing features are restored. DPL and DPN are applied to construct the classifier. Finally, the test data can be labeled as the corresponding category. Also, the dimensionality inconsistence between the training data and test data can be solved.
Datasets
In this paper, we perform several experiments on Reuters Corpus Volume 1 1 (RCV1), Simulated/Real/ Aviation/Auto UseNet data 2 (SRAA) and Reuters-21578 3 , respectively.
Reuters dataset is currently the most widely used standard benchmarks of test collection for text classification. Reuters Corpus Volume 1, which has 804414 documents collected from the Reuters newswire, is used as our training and test dataset. We use “topic codes” set in category codes (topic, industry, and region). In the “topic” hierarchy, four hierarchical groups are included: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets), which contain 789,670 documents. Among 789,670 documents, only 3000 documents of each class are used. 70% of them are selected as training set, and the remaining 30% of them are used as test set.
SRAA dataset, which contains 73218 UseNet articles including four categories: simulated auto, real auto, simulated aviation and real aviation, was generated by Andrew McCallum. Since we conduct binary classification in this paper, 4 topics are used to demonstrate the effectiveness of our proposed method. We first separate auto from aviation, then real from simulated.
Reuters-21,578 dataset, which has 21578 documents collected from the Reuters newswire, is treated as another training and test data set. Of the 135 topics in Reuters 21578, 9980 documents from the most populous 10 topics (as shown in Table 2) are used in this paper.
The number of documents in each topic from Reuters-21578
The number of documents in each topic from Reuters-21578
Two most common metrics (precision and recall) are also applied to reflect the availability of the system [8]. precision for text classification is the fraction of documents assigned that are relevant to the class, which measures how well it is doing at rejecting irrelevant documents. recall is the fraction of relevant documents assigned by classifier, which measures how well it is doing at finding all the relevant documents. We assume that T is the set of relevant documents in test dataset, and C is the set of relevant documents assigned by classifier. Therefore, we use precision and recall in Equations 4 and 5 as follows:
F-Measure, which is defined as the harmonic mean of precision and recall value, is also used to measure the performance of our method. For different specific request, according to the importance of the precision and recall, we define F-Measure in Equation 6 as follows:
In this section, we conduct four experiments. The first experiment is conducted to determine the parameter n which is used to represent the size of a paragraph, that is, paragraph length. It is easy to imagine that if the size of a paragraph is too small, then it will result each feature may occur just once or fewer times. Even a small min_sup could extract it as a frequent feature, which would not reach the goal of dimensionality reduction. However, if the size of a paragraph is too big, the decreasing number of the branches of TFP-tree will lead to a small number of frequent patterns. Therefore, we need to make a tradeoff on the size of a paragraph. Fig. 3 dynamically plots the precision versus paragraph length on three datasets.

Dynamic plot of precision versus paragraph length on three datasets.
Consider the Fig. 3, which plots the precision versus the paragraph length for three datasets. The theoretical upper bound on precision could be obtained dynamically and is plotted on the curve. It shows that the precision is almost unchanged, for a start, and then, the paragraph length continues increasing, it means more and more features are compressed into one paragraph. We realize that one document could form one branch at most. Of course, fewer frequent features in the pattern would drop the final accuracy of classification. The value of precision would drop suddenly due to the too fewer frequent patterns. Before we determine what number the paragraph length should be, min_sup is assigned by 10% in this experiment. Therefore, it can be seen from Fig. 3, pattern length is set by 400 for RCV1 and Reuters-21578 and 300 for SRAA.
Our second experiment demonstrates how the min_sup influences the precision in text classification. We call the number of items in an itemset its size, and an itemset of size k a k-itemset. A frequent itemset is an itemset that has support above min_sup. So, we make selection of frequent n-itemset by comparing selection of min_sup. In the following, we determine the min_sup dynamically by conducting a series of experiments. In this paper, minimum support, denoted by min_sup, is used to represent a threshold to determine which pattern could be regarded as a frequent pattern. We look at the number of frequent patterns according to the min_sup and paragraph length, which is described in Fig. 4.

Dynamic plot of the number of frequent patterns on Reuters-21578 according to the min_sup and the paragraph length.
Once the threshold is given, say 5%, a slightly bigger min_sup at the beginning would greatly decrease the number of frequent patterns. For example, the number of frequent patterns are cut in half with the increasing of min_sup from 5% to 6%. Figure 5 plots the min_sup versus precision for three datasets as follows. The curve on Fig. 5 shows a common feature between min_sup and precision on three datasets. It is found that the points remain largely unchanged at the beginning, but, the precision goes down as min_sup continues to increase. Moreover, when the min_sup increases, it is easy to imagine that the computational time would be reduced. All the datasets show the similar results. That is, the precision could not be greatly influenced by the increasing min_sup at the beginning. Let’s get back to the example in Section 3. {x1, x5 : 2} is a pattern mined from the TFP-tree, it is regarded as a frequent pattern once the min_sup is equal or less than 2, say, 1 or 2. On the contrary, it could not be considered as a frequent pattern if the min_sup is more than 2, such as min_sup = 3. So, the features, x1 and x5, would be removed from the dataset when constructing the text classifier. However, if features x1 and x5 are just the redundant or irrelevant ones because of their small support, it would not decrease the final precision, neither. At the same time, the time complexity and space complexity could both be reduced a lot. Our goal is to make a tradeoff between the precision and time complexity. The reasonable min_sup should obtain the frequent patterns as small as possible under the circumstance the precision does not be affected a lot. We search for the optimal minimum support dynamically by increasing it from 5% to 30%.

Dynamic plot of precision versus minimum support on three datasets.
Thereby, we can conclude that the corresponding minimum support is chosen as long as it would not impact the final precision and the computational cost is reduced a lot. In addition, another character on this figure is that the plotting curves on three datasets are decreasing smoothly without severe fluctuation on them. Despite the corresponding minimum support values are different, it could still reflect a trend that the performance is the best at the points where the minimum support is around 10% on RCV1 and Reuters-21578, and minimum support is 9% on SRAA. The paragraph length and minimum support are both obtained by conducting a series of experiments. They are selected when the precision reaches its peak.
In the third experiment, we fix the two parameters, paragraph length and min_sup, for each dataset for training text classifier. We provide F-Measures of our proposed CTFP and five baseline algorithms C-CRNE (Clustering-based method for Collecting Reliable Negative Examples) [10], CPUE (Clustering Positive and Unlabeled Examples) [29], CPAR, CMAR, and CBA on three datasets in Fig. 6. Obviously, the average F-Measures of our proposed CTFP outperforms three other methods on three datasets.
CPUE identifies the reliable negative examples by clustering positive and unlabeled examples at the same time, which is time-consuming and reduces the accuracy. The error rates increase because the positive examples are used to cluster unlabeled examples directly, which many unlabeled examples are mistaken for non-reliable-negative examples. It would make the number of reliable negative examples reduce a lot. C-CRNE does not extract RN directly. It removes as many probable positive examples as possible. It defines a radius between a document and a class center. After clustering, the rest documents in U are regarded as reliable negative examples. Actually, CBA, CMAR, and CPAR, which conduct text classification through mining frequent patterns and derived rules, are all rule-based classifiers. CBA constructs the classifier based on a heuristic method. It conducts classification according to the ordered rules which are generated based on minimum support and confidence specified in advance. Once the rules are a little strict, the under-classified document would be categorized into the default class, which could affect the final precision. Besides, CBA uses a frequent pattern mining method similar to Apriori, which generates a lot of candidates taking up a lot of space. Different from CBA, CMAR adopts a variant of FP-tree algorithm to mining frequent patterns and rules. Besides, it utilizes a measure called χ2 to find the “strongest” rules. The experimental results show that the precision of CMAR is slightly higher than that of CBA. CPAR filters out a lot of low-ranked rules which leads to the efficiency is much better than CMAR and CBA. Our proposed CTFP employs a dimensionality reduction method to construct a SVM classifier, which collects reliable negative documents first and then generates frequent patterns from positive document set and negative documents set, respectively. The experimental results show that our proposed method, CTFP, performs better than the three baseline algorithms from the perspective of precision, recall, and F-Measures.

Dynamic plot of average F-Measure of CTFP and five other baseline algorithms on different topics over three datasets.
The fourth experiment showed in Fig. 7 compares the running time between CTFP and other six baseline algorithms. As mentioned previously, in the classification process of CTFP, the dimensionality has been reduced significantly when constructing the SVM classifier, which would save a lot of time and memory space. So, it is more efficient. As shown in Fig. 7, the running time of our proposed CTFP can be probably reduced from 20% to 30% compared to the traditional SVM algorithm. C-CRNE and CPUE are both clustering-based classification methods. They are very time consuming at the stage of clustering. Figure 7 only shows the classification time of C-CRNE and CPUE. Compared with C-CRNE and CPUE, the speed up scale is significant during the whole process including clustering and classifying.

Real running time of our proposed CTFP and six baseline algorithms on different topics over three datasets.
We have presented a novel method for text classification based on frequent patterns generated from TFP-tree. This method utilizes a TFP-tree to capture frequent patterns from datasets. After constructing TFP-tree, text classification could be transformed to deal with a relatively low-dimensional data instead of high-dimensional data. There are a lot of researches on text classification and how to generate and make use of frequent patterns. However, combining frequent patterns from positive documents set and negative documents set, and applying them into the area of Web mining are a relatively new attempt. The experimental results suggest that our proposed method is able to mining and classifying data effectively and efficiently from the datasets which has a small positive training data. Besides, it shows that our proposed method outperforms the previous state-of-the-art methods on accuracy, time complexity and space complexity.
Footnotes
Acknowledgments
This work is supported by the National Natural Science Foundation of China under grant Nos. 60903098 and 71473035, MOE (Ministry of Education in China) Project of Humanities and Social Sciences (No. 14YJA870010), the Project of Jilin Provincial Industrial Technology Research and Development (JF2012c016-2), Jilin Provincial Science and Technology Key Project (20150204040GX).
We are grateful to NIST for permitting and mailing RCV1, which is distributed on two CDs, to us.
