Abstract
Machine-learning and data-mining techniques have been developed to turn data into useful task-oriented knowledge. The algorithms for mining association rules identify relationships among transactions using binary values and find rules at a single-concept level or multiple levels. Mining associations among itemsets only by using support and confidence thresholds at different levels of hierarchical data would not give interesting rules both for binary or quantitative data. This paper proposes a two phase algorithm that mines rare generalized fuzzy coherent rules at inter-cross level hierarchies. During phase-I both positive and negative fuzzy coherent rules are mined and in Phase-II, rare generalized fuzzy coherent rules are extracted from the resultant rules obtained from Phase-I. The algorithm framework works on top down methodology in generating positive and negative fuzzy coherent rules and mining rare generalized rules from it. Experiments conducted using synthetic dataset show the performance of the proposed algorithm in terms of the number of rare generalized rules generated, compared to fuzzy multiple-level association rule mining algorithm.
Keywords
Introduction
Data mining is the process of extracting hidden information from large amount of data. Numerous Algorithms for mining frequent and optimal association rules have been proposed [1–3, 32]. Han et al developed a frequent pattern tree [12] to mine frequent rules in two scans of the entire database. Fukuda et al. [10] introduced optimized association rule problem and permitted single instantiated conditions on the left-hand side of association rules. Wu et al. [34] designed an algorithm to enhance the performance of mining both positive and negative association rules. Fuzzy set theory [35] has been increasingly used in intelligent systems for its simplicity. Fuzzy learning algorithms for generating membership functions and inducing rules for a given dataset are proposed [9, 15] and used in specific domains.
Ma et al. [27] proposed fuzzy interesting measures for generating more interesting and meaningful fuzzy rules. Most of the previous method is based on single level concept, with single minimum support and minimum confidence for a whole database. For example if the minimum support is set higher, large amount of potential rules may be discarded or if the minimum support is set low large amount of frequent rules generated may not help the user to get knowledge in business point of view. To overcome this problem Liu et al. [25] proposed an algorithm to specify different minimum support to different items for mining rare itemsets. Lee et al. [21] used maximum constraint (i.e., maximum of the minimum supports) for items having more than one support thresholds.
But mining interesting rules without threshold measures have rather become a challenge in field association rule mining. To overcome the problem Sim et al. [29] introduced concept of mining coherent rules based on the properties of propositional logic. These rules are used in deriving associations among items without using the concept of minimum support and confidence threshold. A frequent generalized itemset and its frequent descendant usually represent the same knowledge. Hence in a multilevel environment, one is bothered in mining information without producing a large number of frequent rules. In the proposed approach, we mine generalized fuzzy coherent rules whose descendants are not coherent. Because of its uniqueness, the generalized rules mined are considered as rare generalized coherent rules which would help analysts in decision making in market-basket analysis. It is a two-phase approach where in phase-I all possible positive and negative coherent rules are generated for a predefined taxonomy. In the second phase rare generalized coherent rules are extracted from the rules obtained from the Phase-I. Let us consider a positive coherent rule X → Y is a rare generalized rule if none of its antecedent’s descendants are coherent. For example, in Table 9, the fuzzy inter-cross coherent rule 1** . Middle → 21* . high that indicates the correlation of a level 1 item with that of level-2 item is considered as rare generalized rule because none of its antecedent’s descendants like {11*, 12*, 111, 112, 121, 122} for each fuzzy region are coherent with the corresponding level-2 consequent items {21*, 22*}. Empirical evaluations show that the percentage of rare generalized fuzzy coherent rules obtained from the proposed approach is high compared to the rare generalized fuzzy association rule obtained from fuzzy multi-level association rule mining (FMARM) [20].
Review of related works
This section reviews related works of mining coherent rules including fuzzy coherent rules.
Fuzzy association rule mining
In real world applications the transaction data mostly consists of quantitative value. To mine the quantitative values, fuzzy mining has been proposed to derive the fuzzy rules from the quantitative transactions. Chan et al. [4] introduced F-APACS algorithm that converts quantitative values into linguistic terms for mining fuzzy association rules. The methodology uses adjusted difference analysis and weight of evidence measure to identify interesting associations. The main advantage is that statistical analysis used for mining overcomes the problem of using user-threshold measures. Kuok et al. [20] proposed a technique for mining fuzzy association rules using interesting measures like significance factor and certainty factor. The rules satisfying these threshold limits is said to be interesting fuzzy association rules. Hong et al. [17] proposed Apriori like algorithm where the linguistic term with maximum cardinality of each item is taken for mining fuzzy association rules. This approach uses predefined minimum support and confidence threshold measure for deriving interesting rules. Mangalampalli et al. [26] proposed fast mining of fuzzy Association rules in large dataset. The computational speedup is achieved by the use of 2-phased tidlist-style processing using horizontal partitions.
Similar to FP-tree, Papadimitriou et al. [28] introduced a method to find fuzzy association rules where quantitative value of every item is transferred to 2-fuzzy regions using membership functions. In their approach local frequent itemsets in each transaction were used for future mining. Hong et al. [18] introduced multiple fuzzy frequent pattern (MFFP) tree methodology for mining complete fuzzy frequent itemset. In this approach, all frequent fuzzy regions of a single itemset are taken up for mining which allows deriving of more fuzzy association rules than the previous models. After several tree based approaches for mining fuzzy frequent item sets, Lin et al. [23] proposed compressed fuzzy frequent pattern tree that would combine the paths of identical transactions of same fuzzy regions with different orders using fewer computations. The entire above approaches practices user defined minimum support threshold for the process of mining fuzzy association rules. The user defined threshold value is hard to set for a quantitative dataset which exposed the mining of fuzzy coherent rules.
Multi-level fuzzy association rule mining
Han and Fu [11] & Srikant and Agrawal [30] were the first to address the method of finding level-wise association rules in a taxonomical data with the later using a uniform threshold across different levels of taxonomy while the former using different min-support thresholds for different levels of taxonomy. The idea of fuzzy generalized association rule was first addressed by Wei et al. [33]. The same authors used the extended version of apriori algorithm called FGAR and HFGAR [8]. The FGAR algorithm deals with fuzzy taxonomies that reflect partial belonging of one item with another item and the HFGAR algorithm deals with linguistic hedges. Hong et al. [14] proposed complete mining of fuzzy multi-level rules by considering all the important linguistic terms for a particular attribute. This method mines all the possible multi-level association rules for a given taxonomy using minimum support and minimum confidence threshold.
Lee et al. [22] developed fuzzy mining at multiple levels of taxonomy using multiple minimum support thresholds for each itemsets. Lin et al. [24] developed a new technique for mining and updating fuzzy generalized association rules under dynamic taxonomies. Chen et al. [5] proposed the use of cumulative probability distribution approach for generating level by level fuzzy association rules. Kousari et al. [19] proposed a new improvement algorithm for mining fuzzy association rules by using different membership functions at different levels of taxonomy. Compared with the significant amount of research work on mining fuzzy association rules there is a lack of effort in the problem of mining fuzzy associations at multiple-levels without predefined minimum support and minimum confidence thresholds. In this paper coherent rule mining is used to mine associations of items at multiple levels for a fixed taxonomy.
Basic concepts of fuzzy coherent rule mining
In association rule mining appropriate minimum support is hard to define, in general association rules are mined by the minimum support. In order to solve this problem, sim et al. [29] proposed a coherent concept which maps the association rules to equivalences. An association rule (p → q) is mapped to an equivalence x ≡ y implication if and only if p → q is true, ¬p → ¬ q is true, ¬p → q is false and p → ¬ q is false. It requires alteast single transaction data to conclude for each condition as true or false; as there are four conditions mapping an association rule to equivalence implication, it cannot be carried out by a single transaction record. For multiple transactions an association rule p → q is mapped to equivalence implication x ≡ y only if
In the same way, other association rules such as ¬p → q, p → ¬ q, and ¬p → ¬ q is mapped to their corresponding implication based on the comparison between supports called pseudo implication. Using the concept of pseudo implications sim et al derived coherent rule.
The importance of any rule in a domain can be easily found out by using coherent logic without the prior knowledge of the domain.
Coherent rules can be mined in two different ways. 1. By considering Anti-monotonic property i.e., For example, if the rule p → q is not coherent, the candidate coherent rules (superset of p) are not generated like pr → q, pm → q etc., 2. Not considering Anti-monotonic property i.e., For example, if the rule p → q is not coherent, possibility of generating candidate coherent rules and checking them for coherency. In the proposed method coherent rules are generated considering the anti-monotonic property. Fuzzy coherent rule mining integrates the concept of mining fuzzy association rules from quantitative data [16] and coherent rule mining [29]. Chen et al. [6] introduced upper limit and lower limit support boundaries to mine high meaningful coherent rules. Chen et al. [7] proposed the first fuzzy coherent mining algorithm. The quantitative transactions are converted into fuzzy set in linguistic terms using predefined membership functions. The complimentary values for each fuzzy region are calculated. The concept of coherency is applied by maintaining a single fuzzy region as consequent and remaining regions as antecedent. This approach mines fuzzy coherent rules and also solves the problem of setting appropriate minimum support in fuzzy data mining.
However, the above approaches mine coherent rules without any specific constraints which motivate to explore the mining of rare generalized fuzzy coherent rules at different levels of taxonomy. By means of Definitions 1 and 2, for a given taxonomy and for a given consequent (i.e., item from any level of taxonomy) generalized inter-cross coherent rules can be mined.
Proposed generalized positive and negative inter-cross fuzzy multiple-level coherent rule mining algorithm
The proposed approach integrates the concept of fuzzy coherent rule mining and mining association rules at multiple-level taxonomical quantitative data. The proposed multiple-level fuzzy coherent rule mining without predefined minimum support and confidence is described below.
Phase-I: A complete set of positive and negative inter-cross fuzzy multiple–level coherent rules.
Phase-II: Rare generalized fuzzy coherent rules.
Encode each item in the given predefined taxonomy using sequence of numbers and the symbol “*”. The encoding schema starts from the root node with value zero and internal node from left to right increasing by one. Transfer the item name in the quantitative data to the encoding format. Set k = 1, where k is used to point the level number being processed, ranges from 1 to t. The items of the first k digits in each transaction (T
i
) of same type are grouped together and their quantitative amounts are summed. The amount of jth group for T
i
is denoted as . Transform the quantitative value of in each transaction T
i
into fuzzy set denoted as through the given membership functions, where h the number of fuzzy regions for , is the lth fuzzy region of , 1 ≤ l ≤ h, and is ’s fuzzy membership value in region . Calculate the scalar of each fuzzy region from the transaction as . Calculate complement value for every fuzzy region , If is the fuzzy value of the lth fuzzy region of , then its fuzzy compliment is calculated as . The result is given as . Collect all level k fuzzy regions into set . Therefore set where q = 1 to p and n = 1 to m. q is the number of items and n is the number of linguistic terms in the given membership functions. The item Y (consequent) is selected by the user from any taxonomical level of the input data set. If the item Y is chosen from the lowest taxonomical level (say k = t) then all its predecessor items (fuzzy regions) are moved to L
k
and its sibling items (fuzzy regions) in the same level are moved to the set . If the item Y is chosen from the middle taxonomical level (say k = t - 1) then all its predecessor and successor items (fuzzy regions) are moved to L
k
and its sibling items (fuzzy regions) in the same level are moved to the set . If the item Y is chosen from the uppermost taxonomical level (say k = 1) then all its successor items are moved to L
k
. Eliminate the level k fuzzy regions of item Y from the set . The new set K
k
consists of . The candidate level k fuzzy coherent rules are generated using the sets K
k
and Y using the following steps. Do the following sub steps to generate candidate fuzzy coherent rules for level k using the two sets K
k
and Y. For each element X in the set of K
k
and for the given element Y, calculate the contingency table given in Table 1 for antecedent X and consequent Y (minimum amount of x, y is taken for calculating amount
xy
) where Q1 : amount
xy
, Q2 : amountx∼y, Q3 : amount∼xy and Q4 : amount∼x∼y. If the candidate fuzzy coherent itemset (X, Y) satisfies the four conditions i.e., Q1 > Q2 and Q1 > Q3 and Q4 > Q2 and Q4 > Q3 then . Otherwise If the candidate fuzzy coherent itemset (X, Y) satisfies the four conditions i.e., Q1 > Q2 and Q1 > Q3 and Q4 > Q2 and Q4 > Q3 then . Go to SUBSTEP 10.1 to calculate the contingency table for next candidate fuzzy coherent rule. If all the rules are checked go to SUBSTEP 10.3 else SUBSTEP 10.1. Check the sets and is empty or not. If yes, go to step 13 else go to STEP 11. Collect derived fuzzy coherent rules of level k in and set h = h+1 to form candidate fuzzy coherent rule X → Y using L and K, where the length of K is h. Fuzzy regions with the same item cannot be used to form candidate fuzzy coherent rules. If there is no candidate fuzzy coherent rule go to STEP 14 else go to SUBSTEP 10.1. Output the derived fuzzy coherent rules of level k in . i.e., . Check for level k fuzzy coherent rules in . If empty, no coherent rules are generated and set k = k + 1 and go to STEP 3.
For each fuzzy coherent rule (For level k = 1) in the set FPCR, check whether the antecedent of the rule has any descendants in the level k = k + 1 to t - 1. (i.e., t is the total number of levels) If the descendant is present. Fetch the next rule. Otherwise put the rule in the set Positive Generalized Rule (PGR). If there are no fuzzy rules in k = 1, set k = k + 1, go to the Step 1.
The same concept is repeated for mining Negative Generalized Rule (NGR) in the set FNCR. For further understanding the pseudo code of the proposed system is given in the Appendix A.
An example for mining generalized positive and negative inter-cross fuzzy multiple-level coherent rule mining algorithm
In this section an example is shown for the proposed phase-I model. A stationary sales transaction from sales data set is shown in Fig. 1 is taken as the example. Each item is represented with its name, and purchasing amount. In Fig. 1 the stationary sales transaction, taxonomy is divided into three classes namely Pen, Note, and Rubber. Each class represents the category and brands. The Triangular fuzzy membership function in Fig. 2 is applied for all the items given in the Fig. 1. In the example the quantity of item are represented by three fuzzy regions: Low, Middle and High. The proposed Fuzzy coherent mining algorithm for the transaction data in Table 2 is given below. Each item in transaction data is first encoded using the predefined hierarchy tree based taxonomy. For example, the item “Natraj gel pen” is encoded as ‘121’ in which the first digit ‘1’ represents the ‘pen’ at level 1 and second digit ‘2’ represent flavors ‘gel’ at level 2, and the third digit ‘1’ represent the brand ‘Natraj’ at level 3 as shown in Fig. 1. Transfer of all the items in quantitative transaction data set in to encoded format are shown in Table 3. The level number currently being processed is stored in k. Initially k is set as 1. The transactions are grouped by (k = 1). The level 1 representation of the transaction is shown in Table 4. Transform the level 1 representation of quantitative transaction into fuzzy set by using predefined triangular membership function. Take the first item transaction T1 as an example the item (1**, 7) is converted into fuzzy set using membership function. The result of all items is shown in Table 5. Generate the complement set from the fuzzy set. Take the one item in transaction T1 using membership function as an example. The fuzzy value of item is converted into complement fuzzy set the result for all items as shown in Table 6. The level 1 item for the given data set is collected in the set (Similarly collect items for next level) Let fuzzy region Y(Consequent) ={21 * . low, 21 * . middle, 21 * . high is chosen from the middle taxonomical level, Then, L
k
= {2** . low, 2** . middle, 2** . high, 21* . low 21* . middle, 21* . high, 211 . low, 211 . middle, 211 . high, 212 . low, 212 . middle, 212 . high} .
Tables 7 and 8 displays the transformed fuzzy set for level2 and level3 data set respectively. After removing L
k
from at each level. The set K
k
say (k = 1) = {1** . low, 1** . middle, 1** . high, 3** . low, 3** . middle, 3** . high} The set K
k
say (k = 2) = {11* . low, 11* . middle, 11* . high, 12* . low, 12* . middle, 12* . high, 22* . low, 22* . middle, 22* . high, 31* . low, 31* . middle, 31* . high, 32* . low, 32* . middle, 32* . high} The set K
k
say (k = 3) = {111 . low, 111 . middle, 111 . high, 112 . low, 112 . medium, 112 . high, 121 . low, 121 . middle, 121 . high, 122 . low, 122 . middle, 122 . high, 221 . low, 221 . middle, 221 . high, 222 . low, 222 . middle, 222 . high, 311 . low, 311 . middle, 311 . high, 312 . low, 312 . middle, 312 . high, 321 . low, 321 . middle, 321 . high, 322 . low, 322 . middle, 322 . high} Calculate the contingency table for every candidate coherent rule whose antecedent in K
k
and consequent in Y. i.e., for k = 1 and h = 1, the CO1, CO2, CO3, CO4 values of the candidate coherent rule (1 ** . medium → 21 * . high) are 0.8, 0.6, 0.2, 4.4 respectively where CO1and CO4 are greater than CO2 and CO3. The rule meets the positive coherent condition, so it is placed in . But for the same level none of the candidate coherent rules meets negative coherency for the given dataset There is no possibility of generating candidate coherent rules at level 1 for h = 2as there is only single fuzzy region in level 1. Possibility in the occurrence of 2-itemset Antecedent is checked for the rest of the levels. Output the derived positive and negative fuzzy coherent rules of each level in and respectively. Check for level k fuzzy coherent rules in and If empty no coherent rules are generated and set k = k + 1 and go to STEP 3. In this example only twelve positive inter-cross fuzzy coherent rules and two negative inter-cross fuzzy coherent rules are obtained from 3 levels. They are outputted in the set ICCR
K
and the results are displayed in Table 9.
From Table 9 using the property of linguistic descendant the following positive and negative rare generalized rules displayed in Table 10 are mined.
Experiments and results
This section presents the evaluation results of the proposed approach. They are implemented in Java on a personal computer with Intel Core i7, 2.93 GHz and 4 GB RAM. The dataset is described in Section 6.1. The performance results are given in Section 6.2.
Dataset descriptions
A synthetic dataset is generated for 42 purchased items as terminal nodes on level 3, 14 generalized items in level 2 and 7 universal items in level 1. Each non terminal node in level 1 and level 2 have 2 and 3 branches respectively. Only terminal nodes would appear in transaction. In each data set, numbers of purchased items in transactions were randomly generated and then the quantities of each purchased items are generated. An item cannot be generated twice in a transaction. Thousand transactions with an average of seven purchased items were considered for the experimental analysis. Three Fuzzy regions used in the proposed approach are Low, Middle and High. Results are shown in the Section 6.2.
Experimental results
The total numbers of positive and negative fuzzy coherent rules generated for an average support threshold of 0.04 and an average confidence threshold of 0.76 is shown in Fig. 3. For the same threshold measure the number of fuzzy coherent rules generated at each level is displayed in Fig. 4. From the Fig. 4 it is observed that irrespective of the number of transactions there may be an increase or decrease in the number of coherent rules mined from higher level to lower level. As per the property of multi-level association rule mining if there are no rules mined in level 1 then there will be no rules generated in the descended levels for the same support and confidence. But in multi-level coherent rule mining, regardless of the number of rules generated in the higher levels there may be an increase or decrease in rules in the bottom levels which are purely based upon positive and negative correlations among itemsets.
The Generalized Positive and Negative Inter-cross Multilevel fuzzy coherent rule mining (GMCRM) is compared with Fuzzy Multiple-Level Association Rule mining (FMARM) [19] for a fixed taxonomy and the statistical analysis is given in Table 11.
From Table 11, it can be observed that the total number of rare generalized rules derived from GMCRM is less than that by FMARM. It should be noted that from the 1332 fuzzy association rules mined, only 71 rules are rare generalized rules however in GMCRM out of 64 fuzzy coherent rules mined 26 rules are rare generalized rules. In addition, the proposed approach uses an average support and confidence values of 0.04 and 0.76 which is larger when compared to the existing approach. Hence from the observation it is evident that association rule mining at a multi-level environment produces more rules of less significance compared to multi-level coherent rule mining (i.e., from the number of rules mined only 5% of rare generalized rules are produced by the FMARM compared to the 41% rare generalized rules by GMCRM). Figure 5 displays the execution time for generating coherent rules in the proposed approach. It shows that the execution times of the mining process increases along with the increase in the number of transactions. The experimental evaluations are also conducted by changing the number of fuzzy regions. From Fig. 6 it is evident that the number of coherent rules mined decrease along with increase in the number of fuzzy regions.
It is justified because the increase in number of regions would further distribute the quantity into different regions. From the above evaluations we can conclude that the proposed approach provides a pathway in finding more reliable rules in a multi-level environment without minimum support and confidence thresholds.
Conclusion and future work
The proposed system gives an insight of mining coherent rules in a multilevel environment. It generates both positive and negative rare generalized fuzzy coherent rules at multi-levels of taxonomy without minimum support threshold and minimum confidence thresholds. The proposed approach works on two phases, in phase-I fuzzy coherent rules are mined at multi-levels for a given taxonomy. During phase-II rare generalized rules are extracted from the fuzzy coherent rules. The experiments conducted with the synthetic dataset shows a large number of rules mined from the existing method are of less significance as the percentage of rare generalized rules produced from existing method is less when compared with the proposed method. From this we arrive to a conclusion that proposed approach produces more significance rules. In the future, the proposed methodology can be extended for time-series databases for producing significant rules.
Footnotes
Appendix
Pseudo code for the proposed system
Acknowledgments
The authors like to thank the anonymous referees for their valuable comments and Sri Ramakrishna Engineering College for providing resources for our experiments.
