Abstract
Although the association classification approach based on frequent patterns has been recently presented, the majority of the methods proposed so far do not deal with the quantitative data directly, and also do not consider the problem of exploring these rules to predict the future behavior of certain variables based on some other known variables. In light of these issues, a new algorithm based on quantitative association rules tree(CRQAR-tree) that synergizes association classification and rule-based TS fuzzy inference is developed to generate the rule tree structure and realize the classification and regression prediction. The classification and regression quantitative association rules are built on the improved Apriori algorithm which offered an efficient way for frequent itemsets learning. To manage the model complexity without sacrificing its predictive accuracy, CRQAR-tree can effectively match the rules to predict new samples that have little contribution over time. The proposed approach is applied to UCI benchmark datasets and a real application, the simulation results show that the performance of the CRQAR-tree is better than other methods, so it is a promising classification and regression algorithm.
Introduction
Today, data mining is being used to analysis a variety of data and discover relationship that may be used to make valid predictions. Association Rule is a type of model and algorithm used to mine data and look for potentially meaningful links among variables (such as values that often occur together).
An new technique [1] that finding classification rules based on Association Rule Mining has attracted an increasing number of attempts in recent years. For example, CBA(Classification based on associations) algorithm [2] is firstly proposed to build classification model, which is similar to Apriori algorithm and also need to scan the database more times to get all the support of candidate frequent itemset. In classifying an unseen case, the basic idea of CBA is to choose the best rule (with the highest confidence) to satisfy the case. In [3], a CMAR (Classification based on multiple association rules) approach is proposed to store and retrieve mined classification association rules by adopting a CR-tree structure. Given a new case for prediction, CMAR selects a small set of high related rules and analyzes the correlation among those rules based on the weighted χ2. CPAR (Classification based on predictive association rules) [4] adopts a greedy algorithm to generate association rules directly from training data, which uses expected accuracy to evaluate each rule and uses the best k-rules to predict the classification. In [5], a multi-objective evolutionary approach is exploited to learn the parameters of information granules to generate the rule-based classifier, which can enhance the classification performance. However, a training data set often generates a huge set of rules. It is challenging to identify, store and match a large number of rules efficiently for classification. Moreover, the practitioners and researchers hope to explore these rules to predict the future behavior of certain variables based on some other known variables. So far, mined Association Rules are not used to solve such regression estimation problems by mining the relationship between variables.
Accordingly, this paper investigates the above issues to present a new algorithm to realize the classification and regression prediction based on the association rules, which consist of three contributions: A DFAC(density-based fuzzy adaptive clustering) algorithm [6] is adopted to discretize each quantitative variable. An improved Apriori algorithm is developed to find association rules. With these rules, the model is reconstructed to implement classification and regression estimation to improve prediction accuracy. In order to facilitate the rules storage and matching, a new data structure called CRQAR-tree (Classification and Regression Quantitative Association Rule tree) is proposed, which combined the advantage of both association classification and rule-based TS fuzzy inference to realize the classification and regression. Unlike FP-tree [7], the CRQAR-tree is built on the basis of the association rule bases, which only scan the rule base twice but not the whole database. By exploring the depth-first search for CRQAR-tree, it is helpful for improving the efficiency and accuracy of the association classification and regression.
The main steps of the algorithm is as follows: (1) Constructing a categorical variable matrix to analyzing the data by means of association rules mining; (2) Finding all kinds of frequent pattern set by exploring an improved Apriori algorithm to generate quantitative association rules; (3) According to each frequent pattern and category, the rules are reconstructed to implement classification and regression prediction with confidence; (4) Building CRQAR-tree for rules storage and matching; (5) Identify the classification and regression quantitative rule to match the new sample and predict its output.
The paper is arranged as follows. Section 2 gives the preliminary definitions related to the work briefly. The quantitative association rule mining based on the improved Apriori algorithm is described in detail in Section 3. After that, Section 4 represents the details of the structure of CRQAR-tree and prediction process respectively. Experimental results are shown in Section 5.And a discussion is given in Section 6.The last Section gives conclusions and future works.
Preliminary definitions
In this section, some preliminary notations and definitions will be presented.
Let Ds ={ T1, T2, ⋯ , T N } be original dataset, where N is the number of data samples, each sample is T j ={ x1, x2, ⋯ , x n }, where T j (1 ≤ j ≤ N) ∈ D S , and x i (1 ≤ i ≤ n) is i-th quantitative variable. Based on the discretization with clustering algorithm, x i is discretized as itemsets {ci1, ci2, ⋯ , ci,f(i)}, where ci,f(i) is a discrete state interval, these state intervals can be denoted by a set of integers such as 1, 2, …, f (i). Notice that such integers are used just for data handling and do not represent any specific ordering.
A categorical variable is an extension of the binary variable in that it can take on more than two states. For an object with a given state value, if the current state exists, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, the object is {x1, x2}, x1 and x2 have 3 states and 2 states respectively. For the discrete itemsets {1, 2} that x1 is at state 1 and x2 is at state 2, binary variables {100⋮ 01 } is generated to encode the categorical variable map. Here, {100} represents the current three state of x1, the first binary variable is set to 1 corresponding to the first state of x1, while the remaining two binary variables are set to 0 corresponding to the other two states of x1.
Let D S be sample dataset, X B be the corresponding categorical variable map matrix, and C L k be the categorical vector of frequent k-itemsets L k , the support S k of frequent k-itemsets L k is counted by the number of “k” in the product of matrix X B and categorical vector .
Let F S k be the support storage vector which records support counts for all frequent k-itemsets. For frequent k-itemsets L k , the binary vector that obtained by encoding the categorical vector C L k is converted to the decimal number d, based on which the support of itemsets L k is stored in the d-th bit of vector F S k .
Given the minimum support threshold is minsup, itemset L k can be called frequent pattern if the support is greater than minsup or equal to it.
Mining the quantitative association rules based on the improved apriori algorithm
Apriori is a classical algorithm [8, 9] for association rule mining, but it is impossible to directly handle the quantitative variable. Moreover, classical Apriori algorithm needs to scan database many times. This extensive scan makes the system consume more time for rule generation. There are a few others’ work in trying to solve this problem for quantitative variables. In [10], the quantitative variables are tackled by fine partitioning the values of the variable and then combining adjacent partitions. In [11], continuous quantitative variables are partitioned by using fuzzy clustering method to transform the original continuous quantitative variable data into fuzzy membership function matrix. However, these partition methods may not be in conformity with the distribution law of the data, and consequently the found association rules have been deviated from the hidden knowledge of data.
To overcome all above shortcomings, the discretization process based on the DFAC clustering algorithm is adopted to deal with the quantitative variable. On this basis, an improved Apriori algorithm is developed to extract the quantitative association rules from the quantitative data, in which the categorical variable matrix is developed to avoid generating a large number of candidate itemsets.
The specific process is as follows:
conf (Rule 1) > conf (Rule 2). conf (Rule 1) = conf (Rule 2) but sup(Rule 1) > sup(Rule 2). conf (Rule 1) = conf (Rule 2) and sup(Rule 1)= sup(Rule 2), but the Rule 1 has fewer itemsets than Rule 2 in the antecedents.
The rule that has the importance value less than 0.5 is excluded.
In order to improve the prediction accuracy of association rules mining models, the classification and regression quantitative association rules is firstly reconstructed by combining the TS fuzzy model with the extracted quantitative association rules, and then a CARA-tree is built to storage and match the rules efficiently, on this basis, the best rule is directly identified to match the new sample and predict its output.
Reconstructing the classification and regression quantitative association rule
For the quantitative association rule A ⇒ B, the input and output variables of frequent itemsets L k are described as the rule antecedent (record as A) and consequent (record as B) respectively. The existing quantitative association rules are rewritten with a similar expression by combining with TS fuzzy inference rules to predict the output values to evaluate the category or regression function (see Equation (2)).
All the existing association rules are grouped according to the output category variable. In the same output category, the antecedent variables corresponding to the rule that has the highest importance calculated according to (1) is determined as the antecedent variables of fuzzy rules. Besides, all data within the output category are selected to fit the consequent function of the rules. So, the rules can be used to not only classify new data to get the output category, but also estimate the future tendency with regression function to realize TS fuzzyinference.
For example, for the input x, the degree of activation of the l-th rule is calculate by
Where represents the membership of x
p
in the l-th rule, as (4):
Where and represent the mean and the standard deviation of x p belong to the l-th rule respectively, which are generated from the clustering results of various variables. The output y of the model is a weighted sum of rule contributions (see Equation (5)):
When mass rules are generated, classifying and predicting the new sample enjoys matching so many rules one by one as to decrease the efficiency and accuracy. In order to solve this problem, a classification and regression quantitative association rule tree is presented.
Unlike FP-tree, the classification and regression quantitative association rule tree is built on the basis of the association rule bases, which only scan the rule base twice but not the whole database. The process is divided into the following two steps.
1) The first scan of the association rule base is to generate the header table.
This process is the same as Apriori, which derives the set of frequent items (1-itemsets) and their support counts (frequencies). The set of frequent items is sorted in the order of descending support count. Moreover, the input variable sets and output variable sets of frequent 1-itemsets are ranked as their support counts respectively to form two part of head table, where one is denoted as filist-1 and the other is denoted as filist-2. If the support counts of two itemsets are same, the itemsets are ranked as their importance by (6).
Where O j represents the itemset corresponding to the output variable and k O represents the category number for the output variable, p (I i , O j ), p (I i ) and p (O j ) represent the support counts of itemsets (I i , O j ), item I i and item O j in the association rule base respectively.
For instance, the scan of the Iris dataset, the header table generated by scanning association rule base is shown in Table 1, which recorded the item variables and other features such as category, support count, importance, maximum and minimum values and spread(square root of data variance). According to the head table, each item variable in association rule base is checked whether matching the corresponding variable category when the new sample is to be classified. The items in each association rule can be denoted as the serial number of header table. For example, in Table 1, the item 1 ≤ x4 ≤ 1.5 corresponding the first row whose variable serial number is 4 and category number is 3 can be denoted as “1”. In this way, all items in association rules can be represented as the serial number.
For example, an association rule is described as:
Which can be transformed into such expression:
2) The second scan of the association rule base is to construct the CRQAR-tree.
First, create the root of the tree, labeled with “null.” Scan association rule base a second time. The items in first association rule are processed in L order (i.e., sorted according to descending support count), and a branch is created for the first association rule. Scan association rule base a third time. If the next association rule contains the same items with the first branch, a common prefix would be shared with the existing path. Otherwise, a new branch is linked to the root and a new node is created and linked as a child of root node. In general, when considering the branch to be added for a association rule, the nodes for the items following the prefix are created and linked accordingly. When the last item of the association rule is scanned (assuming the i-th item of the rule), a new leaf node is added which recorded the output category, linear expressions and confidence of the consequent of the association rule in the corresponding layer (the i-th layer). Such a process can be recursively applied to generate the CRQAR-tree.
When predicting a new case, the best rule in the CRQAR-tree is selected to satisfy the case. There are three possible occasions for matching a case with a set of rules in the CRQAR-tree: 1) the case matches only one rule; 2) the case does not match any rules; 3) the case matches more than one rule. If the matched rules do not agree on the class labels, the rule with the highest confidence is selected to determine the class label C.
The matching process is described as follows: The first step is to discretize new data. The discrete items are denoted as the serial number in the head table. The next step is to predict the category and the real value. By navigating the CRQAR-tree you can assign a value or a category to a case by deciding which branch to take, starting at the root node and moving to each subsequent node until a leaf node is reached. Each node uses the sample data from the head table to choose the appropriate branch. If this branch has no node to match this sample, you can change the other branch. If the example does not match any rules, you can employ the TS fuzzy inference to derive the category and the real value.
Time complexity analysis
For a training dataset comprising S transactions (I itemsets in each transaction) and getting K rules, |L k | and |C k | is the number of frequent and candidate k-itemsets respectively. In the worst case, each item variable will be discretized into S intervals, each transaction contains I × S interval itemsets by the categorical variable map.
According to the learing process of Apriori algorithm, to get the frequent 1-itemsets, the time complexity is O (I × S2), to produce the candidate k-itemsets, the time complexity is O (|L
k
| × |L
k
|), and to get the frequent k-itemsets from candidate k-itemsets, the time complexity is O (|C
k
| × (S + 1)), therefore, the total time complexity of Apriori learning process is summarized by (9):
While the time complexity with our proposed method is summarized by (10):
By adopting the categorical variable matrix to calculate the support, it avoid from generating a large number of candidate itemsets, so the time complexity is O (I × S). To get frequent k-itemsets, the time complexity is O (2|C k |), which is fairly low complexity than Apriori algorithm.
To classifying a new case, the time complexity is O (K × (I – 1) × S) with the quantitative association rule mining algorithm(see Section 3), while the time complexity is O ((I – 1) × S) with the CAQRA-tree, which improve the efficiency of the classification and increase the rule matching speed.
In order to analyze the performance of the proposed CRQAR-tree algorithm, the UCI benchmark datasets [12] (Iris, Wine, Breast Cancer, Pima, Servo, Boston Housing, Abalone, Wisconsin Breast Cancer, Australian, Heart, Hepatitis and Sonar datasets) and a real application are chosen to compare with the other algorithm.
In all experiments, a complete ten-fold cross-validation is carried out. For CRQAR-tree, the minimum threshold of support, confidence and importance are set as 0.05, 0.8 and 1 respectively. For CBA, CMAR, CPAR, GARC methods, the support and confidence threshold are set to 0.01 and 0.5 respectively as their authors suggested. For C4.5, KNN and SVM methods, they are implemented by the Weka toolkit [13] with their default settings. For MLP, the number of hidden neurons is adjusted by using 9 values between 1 and 30 with 1000 training epochs. For ELM, the number of hidden neurons is adjusted between 3 to 200 to get the best results. For RBF, the spread parameter is adjusted by using 19 values between 0.01 and 10 to get the best results.
Datasets
Here, For Iris dataset, the header table generated by scanning association rule base is shown in Table 1. According to Table 1, each item in the corresponding association rules can be denoted as the serial number of header table, based on which the CRQAR-tree is constructed for Iris dataset as shown in Fig. 1. When the CRQAR-tree is applied to predict the new case, the efficiency can be greatly improved.
As shown in Fig. 2, we make a comparison between our proposed CRQAR-tree algorithm and quantitative association rule mining algorithm(see Section 3) on eight datasets in terms of executing time. The executing time of our proposed approach through Iris, Wine, Hepatitis, Pima, Breast Cancer, Australia, Sona and Heart datasets are 0.022 s, 0.043 s, 0.053 s, 0.065 s, 0.081 s, 0.090 s 0.098 s and 0.115 s respectively, while that of the quantitative association rule mining algorithm through above datasets are 0.030 s, 0.100 s, 0.150 s, 0.170 s, 0.209 s, 0.276 s, 0.294 s and 0.312 s respectively. It is obvious that the proposed approach achieves higher speed to match therules.
For the purpose of comparative analysis, other nine methods are applied on eight datasets. The accuracy of different algorithms in each dataset is summarized in Table 2. Here, the classification rate is adopted as follows:
Where n represents the number of data samples, m represents the number of data samples which match the association rules respectively.
It is obvious that the CRQAR-tree shows a better performance than other models especially in the Breast Cancer, Pima and Australia datasets, although it is a slightly worse compared with RBF in the first two datasets, and it still perform better than other eight models. For Sona dataset, the CRQAQ-tree also perform better than other eight models except KNN with a small gap. For Hepatitis and Heart datasets, although the performance of the CRQAR-tree is not the best, it also perform better than other most of models. The last column is the average accuracy of the methods for all the datasets, which also yields that our proposed method can achieve a better classification performance and good scalability.
In order to better illustrate the results shown in Table 3, ties using the paired t-test [20] with a confidence level of 0.95 is detected, and the results is shown in Table 4, which composed of five parts(ties, wins, loses, wins-loses and ranking) of all methods. The first three columns are the number of ties, wins and loses, and the fourth column is the difference between the number of wins and loses, which is used to generate the ranking in the last column. It is obvious that the CRQAR-tree is superior to other methods significantly.
In order to further verify the validity and the performance of regression forecasting, a comparison of RMSE(Root Mean Square Error) with different methods on benchmark datasets are compared as shown in Table 4. Here, The RMSE is adopted for generalization performance evaluation:
Where n represents the number of data samples, and y i represent predictive output and actual output respectively.
It is apparently that the CRQAR-tree exhibits an even better performance than other four methods, where the RMSE values of our proposed method are smaller than other methods especially in the first three datasets. The average value of REMS is also be calculated for every method on different datasets. For CRQAR-tree, the average value of REMS is 7.41, which is also smaller than those of the other methods, so the CRQAR-tree shows better results for regression forecasting.
In recent years, medium plates are widely used in industries, such as construction, ship building, national defense and so on. Therefore, more and more attention is paid to the quality of medium plates [21]. In the manufacturing process of hot medium plates, the mechanical properties of the medium plate plays a very important role. But the boundary conditions in hot rolling process is so sophisticated that it would be difficult to build model to describe the relation between the mechanical properties and the process parameters. In this paper, a CRQAR-tree model is proposed to predict the mechanical properties.
After data preprocessing, 15 input variables are included: original chemical contents such as C, Mn, P, S, Si, some alloying elements of Cr, Ni, Cu, Mo, and three gas composition of H, N, O, and the process parameters such as the during time and discharging temperature of the slab, the middle billet height, the beginning and finish temperature, the roughing rolling temperature, the rolling finish temperature, the thickness of rolled coil, the coiling average temperature and pressure rate. The output variable of model is yield strength.
In order to mine the medium plate dataset, the discretization method based on DFAC is applied and the improved Apriori algorithm is used to generate the quantitative association rules based on the discrete data(see Section 3). The rules are reconstructed by combining with the TS fuzzy inference algorithm to enhance the classification and regression accuracy (see Section 4).
Here only two quantitative association rules for the medium plate are shown as follows:
Where Ci _ T represents the thickness of rolled coil, Ft _ Aver represents the roll-finish average temperature, T _ Aver represents the average temperature, w(Mn) represents the mass fraction of Manganese element, w(O) represents the mass fraction of Oxygen element, w(Si) represents the mass fraction of Silicon element, w(S) represents the mass fraction of Sulfur element, w(Cr) represents the mass fraction of Chromium element, w(H) represents the mass fraction of Hydrogen element, w(Mo) represents the mass fraction of Molybdenum element, ysrel represents the yield strength of l-th rules. 0.0945 < Ci_T ≤ 0.1435 represents the 1-itemsets of the input variable Ci_T and 0.5913 ≤ ys_rel2 ≤ 0.7275 represents the 1-itemsets of the output variable.
In order to improve the efficiency to predict a new case with these rules, the header table of medium plates is shown in Table 5. Based on it, the CRQAR-tree is constructed (see Section 4), in which each intermediate node corresponds to the item of the header table and each leaf node records three parts: the output category, confidence and the linear coefficients for the rule consequent part.
Figure 3 shows the structure of CRQAR-tree for the medium plate dataset. It can be seen that the CRQAR-tree for the medium plates dataset has eight layer, the serial number of each node in each layer corresponds to the serial number of header table for the specific item, the leaf nodes of each path record the output category, linear coefficients for the rule consequent and confidence. Moreover, if the case matches more than one branch, the output category is determined for the leaf node with the highest confidence. If there is no branch to match the case through the CRQAR-tree, the TS fuzzy inference rules derived from all the samples are used to induce the category and the real value according to (5).
In order to evaluate the performance of the CRQAR-tree, the proposed approach is compared with other five methods by using 1000 data for training and 500 data for testing. The results are shown in Table 6.
Notice that the prediction performance of the CRQAR-tree method is superior to that of the remaining methods. The parameters of the remaining methods were selected according to the recommendation of the corresponding authors within each proposal. For the training data, the RMSE based on the proposed approach is 0.1858, while that of the other five methods are 0.2064, 0.2205, 0.9760, 1.1040 and 1.0423 respectively. Similarly, the testing results are also derived that our proposed method has achieved higher performance.
The fitting curve of the mechanical properties with the CRQAR-tree algorithm for training and testing data are given in Figs. 4 and 5. From Fig. 4, one can observe that CRQAR-tree algorithm is responsive to change in the underlying characteristics of the hot rolling system in a timely fashion, where the straight line represents the actual output value and the dotted line represents the predictive output value. Although the fitting errors of testing data in Fig. 5 is slightly higher than that in Fig. 4, the CRQAR-tree algorithm still tracks the system well. It can be concluded that the quality model of medium plates based on CRQAR-tree algorithm is effective and practical.
As we have known, the performance of the association rule mining algorithm can be affected seriously by the given support and confidence, in order to analysis this, the comparisons between the classification rate and the number of association rules with different support and confidence values are shown in Fig. 6 respectively.
It is obvious that the support is much more important than confidence for classification rate and the number of association rules as described in Fig. 6(a) and (b), with increasing the support, the number of association rules and classification rate decreased gradually. For all datasets except Pima dataset, the classification rate is more than 90% when the support value is in the range from 0.05 to 0.08, the number of corresponding association rules is the least. It is benefit to improve the efficiency of the algorithm that the support is in [0.05, 0.08]. Furthermore, as described in Fig. 6(c) and (d), with increasing the confidence, there is little change for the number of association rules and classification rate, the algorithm has a high classification performance when the confidence is set at 0.8. So, setting appropriate threshold of support and confidence will be solved in our future researches.
Conclusion and future works
Accuracy and efficiency are crucial factors in classification and regression tasks in data mining. In this paper, a new classification and regression approach called CRQAR-tree is proposed. CRQAR-tree combined the advantage of both association classification and rule-based TS fuzzy inference, which enhanced the prediction accuracy by employing the quantitative association base on the improved Apriori algorithm and also promoted the efficiency of inducing the rules by utilizing the tree structure. Our experimental results show that the techniques developed in this paper are feasible and shed some light on the great potential application in industrial field.
For the future research, the current algorithm will be extended to the incremental algorithm to mine the association rules with dynamical change of the system, and the optimization strategies will be investigated to improve the efficiency and accuracy of the algorithm for new case.
Footnotes
Acknowledgments
This paper is supported by the Fundamental Research Funds for the Central Universities (No. FRF-UM-15-052), the National Natural Science Foundation of China (Grant No. 61572073), the Higher Education Teaching Reform Project (JG2014Z06) and the Graduate Education Development Funds for University of Science and Technology Beijing.
