Abstract
Readability and accuracy are two important features of any good classifier. For reasons such as acceptable accuracy, rapid training and high interpretability, associative classifiers have recently been used in many categorization tasks. Although features could be very useful in text classification, both training time and the number of produced rules will increase significantly owing to the high dimensionality of text documents. In this paper an association classification algorithm for text classification is proposed that includes a feature selection phase to select important features and a clustering phase based on class labels to tackle this shortcoming. The experimental results from applying the proposed algorithm in comparison with the results of selected well-known classification algorithms show that our approach outperforms others both in efficiency and in performance.
1. Introduction
With the explosive growth of information, the requirements for information acquisition and management have increased rapidly. Information can be presented in both structured and unstructured format. The task of automatically classifying a set of documents into predefined categories is called class labels in this paper. Text classification offers a better understanding of information presented as text in terms of the most frequently used terminology. Various types of applications have been reported in this domain such as genre classification [1], opinion mining [2], classification of web pages based on text [3] and spam filtering [4].
Classification algorithms can be categorized into two main groups, (a) traditional algorithms such as K-Nearest Neighbours [5], Decision Trees [5] and C4.5[5], and (b) association classification algorithms [6–10]. Experimental results in Li et al. [6] and Lie et al. [7] show that association classification algorithms have higher accuracy in comparison to traditional algorithms.
The overall process of the association classification algorithm begins by using one of the association rule mining algorithms like Apriori [11] or Fp-Growth [12]. A dataset of association rules is produced and then a small set of high-quality rules is selected and finally these rules are used for prediction. Since the associative classifier is a rule-based classifier, humans can easily understand its operation, and the prediction result provides a simple and direct interpretation [8]. However, associative classification suffers from efficiency problems owing to the fact that it often generates a very large number of rules in association rule mining, and it also takes an effort to select high-quality rules from among them [9].
In this paper, we have proposed an algorithm called BACA (Bit-priori Association Classification Algorithm) for classifying documents. BACA inherits the idea of CACA (Class-based Associative Classification Algorithm) [10] in using a class-based strategy to filter high-quality features. It reduces the search space of frequent patterns and uses OR-Tree structure in order to synchronize the existing phases in the feature selection process. In comparison with text classification algorithms, BACA has the following advantages: (a) it generates a small set of high-quality rules to predict class labels; (b) it has the ability to predict more than one class label for each unseen document (multilabel classification); (c) it is independent of language in classification; and (d) it uses a binary structure for storing data, producing rules and classification.
The remainder of the paper is arranged as follows: Section 2 gives an overview of the related works in association classification classifiers; in Section 3 the general aspect of Associative classifications is described; in Section 4 our proposed algorithm in text classification is discussed integrally; in Section 5 experimental results using our proposed algorithm are demonstrated and compared with the results of some known algorithms; and finally Section 6 provides a conclusion for this paper.
2. Related work
Many text classifiers have been proposed in the literature, such as using machine learning techniques and probabilistic models, for example, decision trees, naive-Bayes [5], nearest neighbours. A group of classifiers called association rule-based classifiers has recently been proposed, which performs well, and its effectiveness is comparable to that of most well-known text classifiers [13]. Some well-known association rule-base classifiers include CBA [7], CMAR [6], CPAR [9], HARMONY [14], MCAR [15] and CACA [10], which employ different approaches in rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. A brief description of each of them is described as follows.
One of the first algorithms to use an association rule mining approach for classification was proposed in Lie et al. [7] and named CBA. It implements an a priori algorithm [16] in order to discover frequent itemsets in transactional data. Once the discovery of frequent items is completed, CBA proceeds by generating rules from any frequent itemsets that pass minimum confidence. These generated rules are evaluated and a subset of these rules is selected and used for the final classifier. The finding of frequent itemsets and rule generation are implemented in two separate phases in CBA [15].
CMAR has extended the frequent pattern mining method, FP-growth, and constructs a class distribution-associated FP-tree and attempts to mine large databases efficiently. It uses CR-tree structure to store and retrieve association rules and prune them based on confidence, correlation and database coverage. HARMONY directly mines, for each training instance, one of the highest-confidence frequent classification rules that it supports and builds the classification model from the union of these rules over the entire set of instances. CPAR seeks to combine the advantages of both associative classification and traditional rule-based classification. It inherits the basic idea of FOIL [17] in rule generation and integrates the features of associative classification in predictive rule analysis, using a ‘greedy algorithm’ to generate rules directly from training data. HARMONY employs an instance-centric rule generation framework and is guaranteed to find and include the best possible rule for each training instance [14]. However, such a one-rule-per-example principle may work adversely when handling a database with uneven distribution of class labels, because a class with many training examples may have a high prediction score on that class label, which leads to a wrong prediction [18]. The MCAR algorithm uses an intersection technique for discovering frequent itemsets. MCAR consists of two main phases: rule generation and a classifier builder. In the first phase, the training dataset is scanned once to discover a frequent ‘1-item’ in an itemset; the combined rules are then generated to produce candidate itemsets with more items. Any rule with support larger than the minimum support is created as a ‘candidate rule’. In the second phase, rules created are used to build a classifier by considering their effectiveness on the training dataset. Only rules that cover a certain number of training cases are kept in the classifier [15].
Another association classification algorithm called CACA was proposed in Tand and Liao [10]. CACA first scans the training data set, before storing data vertically like the MCAR algorithm. Next frequency counts of each attribute value are produced and sorted into descending order. All frequent ‘disjoint attribute value’ TIDs are intersected to reduce the search space of frequent patterns. A TID of a frequent attribute value holds the row numbers where these attribute values occur in the training dataset. Lastly, each attribute in a class group that passes the minimum confidence is inserted in the Ordered Rule Tree (OR-Tree) as a path from the root node and its support, confidence and class are contained within the last node in the path. CACA classifies the unseen data like the CBA algorithm. Experimental results suggest that CACA performs better with reference to accuracy and computation time than other associative algorithms on the UCI datasets [19, 20].
Our algorithm is inspired by CACA, by its filtering of the search space of frequent patterns, but obtains the high-quality features in a different manner.
The classifiers described above are not implemented for text classification and have a general application. Some algorithms [12, 13, 18] have been specially proposed for text classification and are within the family of association classification algorithms.
3. Association classification
The overall structure of association text classification algorithms is shown in Figure 1. As Figure 1 shows, this process consists of two phases, a training phase and a testing phase. In the training phase, raw data of training documents are pre-processed. In this part, depending on need, stop words and irrelevant characters are removed and words are stemmed and indexed (step 1). By mining data from this pre-processed database, we obtain the classification rules (step 2). Owing to the very large number of classification rules, some redundant rules are filtered out and a small number of important rules are selected. This process is called pruning (step 3). Now we have a database of classification rules. In continuing in the testing phase, before any activity, unknown documents should be pre-processed and then converted to a pattern of words and matched to the classification rules (step 4). In the final step (step 5), according to the matched rules, class labels of unknown documents are predicted.

Association text classification structure.
3.1. Association rule problem
The problem of mining association rules over basket data was introduced by Agrawal [16]. An example of such a rule might be that 98% of customers who purchase tyres and auto accessories also require automotive services. Finding all such rules is valuable for cross-marketing and attached mailing applications. Other applications include catalogue design, add-on sales, store layout and customer segmentation based on buying patterns. As databases involved in such applications are very large, it is imperative to execute fast algorithms during thess tasks [21, 22].
Let
There are two important basic measures for association rules, support(s) and confidence(c). These are usually two predefined thresholds employed by users to drop the rules that are perceived as less interesting or useful to the user. The two thresholds are called minimum support and minimum confidence, respectively. For
3.2. Selecting high-quality features
Most text data has a high-dimensional feature space that causes it to generate many association rules. There are two common approaches to generating high-quality rules. First, all of the association rules are produced. Therefore, many high-order association rules are generated in the induction procedure for the associative classifier. Generally, high-order rules are more informative; hence the classification with those rules performs better [23, 24]. In these methods computational complexity is very high [8]. Second, a small number of high-quality rules have been used regardless of the order of rules (or preferring high-order rules) [14, 25]. To reduce computation time, these methods remove the redundant words and select high-quality features only.
In this paper we use a method of the second type. In our proposed method, high-frequency words are first found and then clustered based on class labels. By this means, high-quality features for each class label are found. Using these features, association rules for each class label are produced; we therefore do not need to consider all of the words for each class label.
4. Proposed method
The general approach of the proposed algorithm will be as follows. First, pre-processing will be applied to documents to convert them to a structured format. In the next step, frequent items for each class label will be identified. This step follows the CACA approach [10]. By this means we have the features (words) that are more effective in documents for each class label. Rules are then produced using only these effective words. In finding frequent items and creating rules, a type of a priori algorithm called Bit-Apriori [26] is used that first saves items in a database in binary mode and second, with the help of binary operations, attempts to find the rules (Figure 2). Finally unseen documents are classified using these rules.

Steps of proposed algorithm.
As explained above, it is clear that feature selection is at the heart of the algorithm that finds desired features based on the occurrence of words in document and class labels.
4.1. Convert text to structured form
Most text corpora have a high-dimensional feature space. Since they are not relational databases, there is no predetermined length for an example record. The average number of words in a document is far less than the vocabulary size, which indicates the sparsity of word distribution in text collections. A document can be modelled statistically by treating the words that constitute the document as the events. There are two different document models; the Multivariate Bernoulli Model and the Multinomial Model [8]. The former ignores the count information of words and uses only binary information (present or not-present), while the latter includes the word count [8]. As it is known that information about the multiple occurrences of words does not yield significant additional assistance for an exact classification [23, 27], we have chosen a combination of these two ways. This means that first we count the occurrences of the word in the document (Table 1) and then we ignore the words with less than the given threshold (Table 2). The threshold is selected based on the assumption that high threshold implies few rules while low threshold implies more rules. Following a number of experiments, we set the threshold to 2 which indicates that words used more than once in a document are considered frequent. A 4-fold cross-validation is used to confirm the selected threshold.
The number of word repetitions in the documents.
Binary table containing frequent items in the documents.
Finally we save the documents in a binary table where the rows of the table are documents and the columns are the frequent words used in each document. All existing words in the document are marked as ‘1’ in the table. As the class of each document is known, a column has been added to the table to contain the value of the class.
4.2. Find frequent words for each class label
Since we want to produce rules that will eventually predict class labels of unseen documents, we are going to find the effective words for each class label. This means that we aim to find commonly used words that are repeated in many documents and have the same class label.
To achieve this goal, it is sufficient to perform a binary ‘And’ operation between columns of each class (C[i]) with every column of words. Subsequently, a count of the number of occurrences of ‘1’ in each column is produced (the final row of Tables 3–5). In the new tables, every cell of the table shows frequent words that are used in different documents and have same class label. In order to clarify this fact, based on the previous example in Table 2, we have seven documents and three classes, namely, C1, C2 and C3. Tables 3–5 present the binary ‘And’ of each class label and words following by counting the number of occurrences of ‘1’ in each column. For example, in Table 3, we can conclude that W2 is presented in three different documents; therefore, using the aforementioned approach, the frequency of W2 in class label C1 is 3.
Frequent words in the documents for class C1.
Frequent words in the documents for class C2.
Frequent words in the documents for class C3.
Consequently, we should save a list of words that are effective in predicting class labels. To this purpose, if the number of occurrences of a word in documents related to a class label is more than a threshold (Min_Document_Repetition), then it will be considered as a frequent word for the class label. In output, the most frequent words that are used in a document with the same class label as the considered class take the value ‘True’. This stage is the second phase of feature selection, for example, for Min_Document_Repetition=2 in our example, see Table 6.
Frequent words for each class label.
The considered constant for Min_Document_Repetition could be different based on our needs. If it is considered as the minimum value, most of the words will be selected as frequent and consequently a huge range of rules will be produced, most of which are not suitable. On the contrary, by considering a big constant, a small number of frequent words will be found that cannot satisfy our purpose. Based on our experiments, setting Min_Document_Repetition to 2 provides the most effective results.
4.3. Producing subsets of frequent words and saving rules
As mentioned above, Table 6 shows the frequent words that could be more useful in predicting the class label of unseen documents. Therefore, if we find the rules that are produced from these words, then we have completed the phase of pruning before producing the rules. As a consequence, the domain of rules is greatly reduced.
For example, considering our example in in Table 6; there are (23) – 1 = 7 subsets in class label C1, 3 subsets in C2 and 15 subsets in C3 whereas, without this work, (26) – 1 = 63 subsets for each class could be produced. It should be noted that, in our example, the word W2 that is frequent for each class label is identified as a stop word and is not included in the generation of rules. Overall, frequent words with more than 75% in class labels are identified as ‘stop words’ and removed.
At this stage the Apriori algorithm is applied. First, every subset of the class labels is generated. If the produced subset (that shows frequent words of a class label) is found to be frequent, the rule will be created. In order to determine a subset as frequent words, a binary ‘AND’ operation is applied between it and the content of Table 2. If the result of this ‘AND’ operation is the same as the subset, it means that the words in the subset will be contained in an unseen document. This is applied to all training documents and finally if the number of documents that satisfy this condition is more than the threshold, the subset is accepted as a predictive rule and is saved for next stage.
The generated rule consists of two parts. The generated subset of the words is placed on the left-hand side of the rule and the name of the class label is placed on the right-hand side. The created rule states that, if an unknown document contains the words of the rule, it could be predicted that the class label of this document will be same as the rule.
Since generating all the subsets of a set is of exponential order (
Figure 3 is an example of finding subsets and counting and checking frequency for an item. We have considered C1, which contains three frequent words. In order to produce suitable rules, first we should generate all the subsets of C1 and then Binary ‘And’ should be applied for every subset of C1 with every row of Table 2. As shown in Figure 3, the result of the logical ‘AND’ operator for the specified subset is the same as the three training documents. If the threshold = 3, the selected subset is considered as frequent words and saved as a rule.

An example of finding a rule.
Now we have found the rules that can predict the class label of an unseen document. Before doing anything, first we save these rules to a table of words. In order to save generated rules, a process similar to the previous steps is followed. In this case, all of the used words in the rules will be identified and then a table of Boolean values is created so that its columns are words used in the rules and its rows are rule numbers. In every row, the cell of a word used in the rule then becomes ‘True’ and the others become ‘False’. Thus a table is created so that in each row the constituent words of the rules are ‘True’.
Among the rules created, there are some rules where the constituent words are the same but they predict different classes, for example, rules like
4.4. Classifying unseen documents
In this step, a binary table is created so that its columns are key words (words that are used in rules) with the order of key words list and its rows are the indices of pre-processed unknown documents. In the next stage, for each document, if the key word is used in it, the cell of that word for the document becomes true. This means that we are going to find key words in unknown documents. After filling in this table, we have found all of the important features of the unknown documents. Now it is sufficient to binary ‘AND’ all rules (saved in binary format) with all rows of the new table. If the result of the binary ‘AND’ operation is equal to the binary value of the rule, it means that the words of the rule are identified as key words for the unknown document. Therefore this rule can predict the class label of the document.
After finishing this step, a set of rules that can predict the class label of the document is collected. At the end, we will examine predicted class labels by counting generated rules for each document. The goal is to extract class labels with the maximum number of rules.
Figure 4 is an example of predicting the class label of an unknown document. In this sample every rule could consist of six words (because the length of each rule is 6) and only three of these six words are used in the unknown document (because three bits have True value). In order to predict the class label of this document, an ‘AND’ operation should be performed between the bit stream of the document and all of the rules. As aforementioned, if the result of this operation is the same as the word-set of the rule, it shows that all of the words that formed the rule exist in the unknown document so this rule could be used for prediction for the document. In Figure 4, it is clear that seven rules could be used for prediction and in this situation the class label that most of the rules refer to is selected.

An example of predicting an unknown document.
In general, we may encounter the same number of votes for some classes, which is natural. In the real world we cannot usually categorize a news article into specific class labels. For instance a number of articles are both political and financial. In this situation, owing to the structure of the algorithm, where only one prediction should be proposed, one class label is selected randomly.
5. Experimental results
5.1. Data
As in any text mining approach, pre-processing is the first phase of the proposed algorithm. Tokenization, normalization and stemming are the main tasks of pre-processing phase which help to convert documents to structured format. The focus of our proposed algorithm is on the output of the pre-processing phase. In addition, because of the binary nature of the approach, documents from any language can be used to evaluate the performance of the algorithm.
To demonstrate the performance of the algorithm, a dataset of Persian news articles, containing 565 documents, including five classes – social, financial, cultural, political and sport – has been collected. This dataset contains almost equal numbers of documents in each class which are collected from several Persian news agencies [28–30]. The categories of the articles have been determined according to the classification of the news agency.
5.2. Cross validation
A supervised learning algorithm needs some of the data to be labelled as training data and some as test data [31]. These two datasets must be separate to prevent false results in evaluating the performances of the methods. Therefore multiple runs of the experiments are usually required with different datasets at each turn. One of the approaches splitting the dataset and running the experiment is N-fold cross-validation. N-fold cross-validation consists of splitting the dataset into N subsets of equal size. At each turn, one set is used for testing and the rest for training the system. In our case, 4-fold cross-validation is used. At each turn, one fold has been used for training and learning and three for testing, in such a way that every subset will be used once for testing purposes.
5.3. Evaluation metrics
We use precision, recall and F-score to measure the effectiveness of the proposed approach. The precision, recall and F-score are calculated based on Table 7 as:
Contingency table.
To deal with multiple datasets (products), we adopt the macro- and micro-average (Yang 1999) to assess the overall performance. The macro- and micro-averaged precision and recall across the n datasets are defined as follows:
The macro-average is calculated by simply taking the average obtained for each dataset, which assigns an equal weight for every dataset and product, whereas the micro-average assigns each dataset a relative weight on the basis of the number of extracted or manually extracted features for the dataset.
5.4. Comparative study
Raw documents are pre-processed into word vectors. For this purpose, a Persian pre-processing method [32] is applied to the dataset. In this step, redundant words, characters and symbols are removed and words with the same root are stemmed. Then the above algorithms are used to classify the documents. Applying stemming strongly reduced the dimension of the vectors, so the number of produced rules was reduced and subsequently the time consumption of the algorithm was reduced.
In order to find the best results for BACA, different ranges for minimum support and minimum confidence are examined. The results show that considering the lowest value for confidence (Conf >0) and a combination of different support ranges for different class labels gives the best answer. The reason for this result is the way that we have produced the rules. As mentioned before, first we find the effective words and then we produce the rules. Since Association Classification algorithms produce readable and understandable rules, they could be modified by domain experts. After reading the rules, we notice that, by considering different ranges of supports for different class labels, the overall result is improved (Table 8). It should be noted that the thresholds that are used in our algorithm are related to the words of the documents. Each document consists of a finite range of words. The number of words repeated in the documents is important for our proposed algorithm, which is not related to the number of documents that are classified.
Comparison of accuracy of different classes in different range of supports.
As mentioned in Section 4, BACA consist of two parts – first feature generation and second the classification algorithm. In order to show better evaluation of the proposed algorithm, different parts of the algorithm were compared separately with some known methods of performing each part. To begin with, we compared generated features by the proposed method (BACA) with those obtained by TF_IDF [33] and Entropy [34] methods. Figure 5 shows a comparison of the accuracy measure of BACA, TF_IDF and Entropy for five main classes, namely, cultural, financial, political, social and sport classes. It is clear that, in each class, BACA presents better performance in terms of feature generation. BACA has shown significant improvement in generating features for political, sport and cultural documents, which are 17.51, 15.41 and 12.83% higher than those for Entropy, respectively. Overall, TF_IDF does not have good accuracy for all classes in comparison with BACA and Entropy.

Comparison of accuracy measure in different classes for Entropy, TFIDF and BACA.
To assess the overall performance of techniques, Figure 6 shows the macro- and micro-average for precision and recall, respectively. As shown, the macro-averaged precision and recall of Entropy are 60.84 and 53.08%, respectively. Moreover, the macro-averaged precision and recall of the TF_IDF are 40.56 and 29.19%, respectively, whereas, those of BACA are 78.64 and 57.53%, respectively. Therefore, the effectiveness of BACA is better than that of Entropy and TF_IDF, as presented in macro-averaged precision and recall. Specifically, macro-averaged precision obtained by BACA is 17.8 and 38.08% higher than that reached by Entropy and TF_IDF. BACA reaches a macro-averaged recall of 57.53%, which is an improvement over Entropy of 4.45% and over TF_IDF of 28.34%. Considering the micro-average measures, we observed similar results to those achieved using macro-average measures.

Comparison of macro- and micro-average for precision and recall for Entropy, TFIDF and BACA.
To evaluate the performance of the second part of the proposed algorithm, the algorithms SVM, NB and KNN [5] were used. In the KNN algorithm, the distance type is cosine and the weighting type is LTC. In addition cross-validation was used to determine k = 20 for the number of nearest neighbours. For this purpose, after pre-processing and generating features, these algorithms were used for classification in the same conditions as BACA. Figure 7 shows the results.

Comparison of accuracy measure between KNN, NB, SVM and BACA.
Figure 7 clearly shows the accuracy of BACA in two classes (financial and sport); in another two classes (cultural and political) the results are almost identical. However, in the social class, BACA is weaker than the other algorithms. One reason for this could be that the definition of social news is not very clear and it could cover various topics like social activities and various topics related to society. Therefore, it covers a wide range of words that might not be suitable for BACA classification. These different results in various classes could be due to the different types of words used in documents of different classes.
To assess the overall performance of techniques, Figure 8 shows the macro- and micro-average for precision and recall, respectively. From the figure, the macro-averaged precision and recall of the SVM are 50.9 and 23.02%, respectively. Moreover, the macro-averaged precision and recall of the NB are 44.61 and 35.18%. KNN has 46.7 and 32.06% macro-averaged precision and recall, whereas the values for BACA are 67.91 and 45.03%, respectively. Therefore, the effectiveness of BACA is better than that of SVM, NB and KNN as presented in macro-averaged precision and recall.

Comparison of macro- and micro-average for precision and recall KNN, NB, SVM and BACA.
6. Conclusion
A new association classification algorithm, named BACA, has been proposed that aims to classify documents. The proposed algorithm has many improved features against traditional and association classification methods that are used for document classification. It (a) produces understandable rules that can easily be interpreted by humans, (b) contains a feature selection part in order to decrease the text dimension, (c) clusters important features based on class labels, (d) requires only one scan on training data, (e) saves features as a binary stream and uses binary operations in all of the processes in order to reduce the memory space needed, (f) has the ability to predict more than one class label for each unseen document (multi-label classification) and (g) it is independent of language in classification.
Using association classification algorithms in text classification is a new method that has been used recently by some researchers. Our experiments on a dataset of Persian documents of five different classes, collected from diverse news agencies, show that BACA can perform well and has shown better accuracy in comparison with KNN, NB and SVM.
As future work, we propose to implement similar association rule-based algorithms and compare their performance with ours. In addition, investigating the multilabelling feature of our classification algorithm could be of interest to a researcher in this domain.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
