Comprehensive concept description based on association rules: A meta-learning approach

Abstract

This paper presents a novel approach to post-processing of association rules based on the idea of meta-learning. A subsequent association rule mining step is applied to the results of “standard” association rule mining. We thus obtain “rules about rules”, which can help us better understand the association rules generated in the first step. We define various types of such meta-rules and report some experiments on benchmark data from the UCI Machine Learning Repository as well as on data from the atherosclerosis risk domain. When evaluating the proposed method, we use the LISp-Miner system.

Keywords

Association rules, meta-learning, LISp-Miner

1. Introduction

Concept description is one of the typical data mining tasks. According to the CRISP-DM methodology, concept description “…aims at understandable descriptions of concepts or classes” [10]. Concept description is thus similar to classification in the sense that there are predefined classes (given by the values of the target attribute) we are interested in. But unlike classification, concept description focuses on the understandability, not on the classification accuracy, of the discovered knowledge. Therefore, association rules, decision rules or decision trees are the preferred tools for modeling these concepts.

In this paper, we focus on concept description using association rules. Association rules were proposed by R. Agrawal in the early 1990s as a tool for so-called market basket analysis [1]. An association rule has the form of an implication

$\displaystyle X\Rightarrow Y$

where $X$ and $Y$ are sets of items and $X\cap Y=\emptyset$ . An association rule expresses that transactions containing items from set $X$ tend to contain items from set $Y$ , so, e.g., a rule

$\displaystyle{\{}A,B{\}}\Rightarrow{\{}C{\}}$

says that customers who buy products $A$ and $B$ also often buy product $C$. Such statements can be used to guide the placement of goods in a store, for cross-selling or to promote new products. The two basic characteristics of an association rule are support and confidence. The support of an itemset $X$ is defined as the proportion of transactions in the data set that contain the itemset $X$. The confidence of an association rule $X\Rightarrow Y$ is defined as support($X\cup Y$)/support($X$).

This idea of association rules can be applied to any data in a tabular, attribute-value form. Data describing values of attributes can be analyzed in order to find associations between conjunctions of attribute-value pairs (categories). Let us denote these conjunctions as Ant (antecedent) and Suc (succedent), and rewrite the association rule as
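
The two definitions can be illustrated with a minimal Python sketch. The function names and the five toy transactions are hypothetical and serve only as an illustration; they are not taken from any data set used in this paper.

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Confidence of the rule X => Y, i.e., support(X u Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical market baskets; each transaction is a set of items.
transactions = [
    frozenset("ABC"),
    frozenset("AB"),
    frozenset("ABC"),
    frozenset("BC"),
    frozenset("A"),
]

print(support("AB", transactions))          # support of {A, B}
print(confidence("AB", "C", transactions))  # confidence of {A, B} => {C}
```

In this toy data, three of the five transactions contain both $A$ and $B$, and two of those three also contain $C$.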

$\displaystyle\textit{Ant }\Rightarrow\textit{Suc}$

When using association rules for concept description, Suc will be a category of the target attribute.

We can again characterize the “strength” of an association rule by support and confidence. Now support is the estimate of the probability $P(\textit{Ant}\wedge\textit{Suc})$ (the frequency of $\textit{Ant}\wedge\textit{Suc}$ is the absolute support) and confidence is the estimate of the probability $P(\textit{Suc}|\textit{Ant})$ .

In association rule discovery, the task is to find all syntactically correct rules $\textit{Ant}\Rightarrow\textit{Suc}$ (i.e., rules in which two different values of an attribute cannot occur) with support and confidence taking values above the user-defined thresholds minsup and minconf, respectively.

There are a number of algorithms capable of performing this task. Probably the best-known algorithm, apriori, proceeds in two steps. In the first step, all frequent itemsets are found by a breadth-first search in the space of itemsets (a frequent itemset is a set of items that is included in at least minsup transactions). This search is controlled using the so-called downward closure property: if a pattern of length $l$ is not frequent, then none of its extensions of length $l^{\prime}>l$ can be frequent. In the second step, association rules with a confidence value of at least minconf are generated from the frequent itemsets [1]. Another well-known algorithm is FP-Growth. This algorithm uses an FP-tree to generate frequent itemsets. This way of representing the itemsets reduces the computational cost because (unlike apriori) it requires only two scans of the whole data [14]. The frequent itemsets found in this way are then again split into an antecedent and a succedent to create rules.
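
The two-step scheme of apriori can be sketched as follows. This is a simplified illustration rather than the reference implementation from [1]: it assumes that minsup is given as a relative frequency, and the function names are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Level-wise (breadth-first) search for itemsets with support >= minsup."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: k / n for c, k in counts.items() if k / n >= minsup}
        frequent.update(current)
        # Join frequent itemsets into candidates that are one item longer.
        size = len(next(iter(current))) + 1 if current else 0
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Downward closure: keep a candidate only if all its subsets are frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, size - 1))]
    return frequent


def rules(frequent, minconf):
    """Split each frequent itemset into antecedent and succedent."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ant in map(frozenset, combinations(sorted(itemset), r)):
                conf = sup / frequent[ant]
                if conf >= minconf:
                    out.append((ant, itemset - ant, sup, conf))
    return out


tx = [frozenset("ABC"), frozenset("AB"), frozenset("ABC"), frozenset("BC"), frozenset("A")]
freq = frequent_itemsets(tx, minsup=0.4)
for ant, suc, sup, conf in rules(freq, minconf=0.9):
    print(sorted(ant), "=>", sorted(suc), sup, conf)
```

The lookup `frequent[ant]` in the second step is safe because, by downward closure, every subset of a frequent itemset has already been stored as frequent.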

The search space of all possible itemsets (or conjunctions of categories) can be very large. For $K$ items, there are $2^{K}$ itemsets; for $K$ categorical attributes $A_{1}$, $A_{2}$, $\ldots$, $A_{K}$ having $v_{1}$, $v_{2}$, $\ldots$, $v_{K}$ distinct values, the number of all possible conjunctions is

$\displaystyle\prod\limits_{j=1}^{K}(1+v_{j})-1$
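
The count follows from the observation that each attribute is either absent from a conjunction or present with exactly one of its values, and that the empty conjunction is excluded. A short sketch (the function name and the example attribute sizes are hypothetical) shows how quickly this number grows:

```python
from math import prod


def n_conjunctions(values_per_attribute):
    """Number of non-empty conjunctions with at most one value per attribute."""
    return prod(1 + v for v in values_per_attribute) - 1


# Three attributes with 3, 3 and 4 distinct values.
print(n_conjunctions([3, 3, 4]))

# Ten binary "item present" attributes give the 2**10 - 1 non-empty itemsets.
print(n_conjunctions([1] * 10))
```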

The main problem when using association rules for data mining is their interpretation. Usually we end up with a huge number of associations and each of them might be interesting for the domain expert or end-user. So a kind of automatic support for the interpretation in the form of post-processing association rules would be of great help. We present some ideas in this direction and show their experimental evaluation using LISp-Miner, a data mining toolbox for mining different types of rules, which is under development at the University of Economics, Prague [19, 21].

The rest of this paper is organized as follows. Section 2 reviews work related to the problem of post-processing of association rules, Section 3 gives an overview of the GUHA method and the 4ft rules, Section 4 introduces the concept of association meta-rules and describes how they can be obtained using the 4ft-Miner procedure, Section 5 summarizes the experimental evaluation of the proposed approach on some benchmark datasets from the UCI Machine Learning Repository, Section 6 shows the experimental evaluation of the proposed approach on data from the atherosclerosis risk domain, and Section 7 concludes the paper. This paper is an extended version of a paper presented at the IDA 2013 Conference [7].

2. Related work

Various approaches have been proposed in the past to post-process a large list of found associations. Baesens et al. [4] distinguish four types of such post-processing: pruning, where redundant rules are identified and removed from the rule list; summarization, where rules are summarized into more general or abstract concepts that are easier to understand by the user; grouping, where rules are grouped together on various levels of abstraction; and visualization. As visualization, together with filtering or selection, is a rather standard option in most systems, we will not discuss this approach in this paper. The other types of post-processing can be carried out either based on the found association rules only or using some domain knowledge.

An application of deduction rules to post-process the results of the GUHA method is described in [18]. A deduction rule has the form

$\displaystyle\frac{\textit{Ant}\approx\textit{Suc}}{\textit{Ant}^{\prime}\approx\textit{Suc}^{\prime}}$

where $\textit{Ant}\approx\textit{Suc}$ and $\textit{Ant}^{\prime}\approx\textit{Suc}^{\prime}$ are association rules. If the deduction rule is sound, then knowing that $\textit{Ant}\approx\textit{Suc}$ is true we also know that $\textit{Ant}^{\prime}\approx\textit{Suc}^{\prime}$ is true. When $\approx$ is interpreted as an implication, the deduction rule

$\displaystyle\frac{\textit{Ant}\Rightarrow_{p}\textit{Suc}_{1}}{\textit{Ant}\Rightarrow_{p}(\textit{Suc}_{1}\vee\textit{Suc}_{2})}$

which says that if at least $p$ per cent of examples satisfying Ant also satisfy $\textit{Suc}_{1}$ then we are certain that at least $p$ per cent of examples satisfying Ant also satisfy $\textit{Suc}_{1}$ or $\textit{Suc}_{2}$, and the deduction rule

$\displaystyle\frac{\textit{Ant}_{1}\Rightarrow_{1}\textit{Suc}}{(\textit{Ant}_{1}\wedge\textit{Ant}_{2})\Rightarrow_{1}\textit{Suc}}$

which says that if all examples satisfying $\textit{Ant}_{1}$ also satisfy Suc then all examples satisfying $\textit{Ant}_{1}$ and $\textit{Ant}_{2}$ also satisfy Suc, are examples of sound deduction rules. Such deduction rules allow us to remove association rules that are logical consequences of other association rules and thus reduce the size of the rule set.

A similar idea, but applied to “Agrawal-like” association rules, can be found in the work of Jorge et al. [15]. The authors define a number of operators that can transform a rule into a set of rules. Antecedent generalization or antecedent least general generalization are examples of operators that can be applied to association rules created to describe a concept. In the former case, a rule $\textit{Ant}\Rightarrow\textit{Suc}$ is transformed into a set of rules $\textit{Ant}^{\prime}\Rightarrow\textit{Suc}$ , where $\textit{Ant}^{\prime}$ is a generalization of Ant. That is, $\textit{Ant}^{\prime}$ is obtained from Ant by deleting one or more categories (attribute-value pairs). In the latter case, rule $\textit{Ant}\Rightarrow\textit{Suc}$ is transformed into a set of rules $\textit{Ant}^{\prime}\Rightarrow\textit{Suc}$ , where $\textit{Ant}^{\prime}$ is again a generalization of Ant, but now $\textit{Ant}^{\prime}$ is obtained from Ant by deleting only one category.

Toivonen et al. understand redundancy of rules on the semantic level, i.e., as rules that cover the same examples, and they define a rule cover as a minimal set of rules that covers all examples [23]. This rule cover can thus be understood as the reduced set of rules presented to the user. To find the rule cover, they propose a method very similar to the set covering algorithm for creating decision rules. The standard set covering algorithm works in an iterative way: in each iteration it creates a rule and removes the covered examples from the data. The process is terminated when there are no uncovered examples. Here, the found association rules are first sorted in decreasing order of support. Then, in each iteration, a rule (with the highest support) is moved to the rule cover and the examples covered by this rule are removed from the data; this process is again terminated when there are no examples left that could be covered by a rule not yet added to the rule cover. The same paper also describes clustering of association rules that have the same succedent; the distance between two rules is again defined semantically, i.e., as the number of examples covered by only one of the rules.
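
The covering procedure described above can be sketched as follows. This is only an illustration of the idea, not the algorithm from [23]: it assumes that each rule is represented by the set of identifiers of the examples it covers, that support is measured by the size of this set, and that a rule is skipped when it covers no remaining example. The function name and the toy rules are hypothetical.

```python
def rule_cover(rules):
    """Greedy rule cover; `rules` maps a rule name to the set of covered examples."""
    remaining = set().union(*rules.values())
    cover = []
    # Rules are visited in decreasing order of support.
    for name in sorted(rules, key=lambda r: len(rules[r]), reverse=True):
        if not remaining:
            break
        if rules[name] & remaining:
            cover.append(name)
            remaining -= rules[name]
    return cover


rules = {
    "r1": {1, 2, 3, 4},
    "r2": {3, 4},
    "r3": {1, 2},
    "r4": {5, 6},
}
print(rule_cover(rules))
```

In this toy example, r2 and r3 cover only examples that are already covered by r1, so they do not enter the cover.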

KEX is an algorithm that learns rules in the form $\textit{Ant}\Rightarrow C(w)$ where Ant is a conjunction of categories, $C$ is a category of the target attribute (the class) and $w$ (called weight) expresses the uncertainty of the rule [6]. KEX performs a heuristic top-down search in the space of candidate rules. In this algorithm the covered examples are not removed during learning, so an example can be covered by several rules. More rules can thus be used during classification, each contributing to the final assignment of an example. KEX uses a pseudo-Bayesian combination function borrowed from the expert system PROSPECTOR [12] to combine contributions from several rules. KEX works in an iterative way, testing and expanding an implication $\textit{Ant}\Rightarrow C$ in each iteration. This process starts with a default rule weighted with the relative frequency of class $C$ and is terminated after testing all implications created according to user-defined criteria. The induction algorithm inserts only those rules into the knowledge base for which the confidence (defined in the same way as the confidence of association rules) cannot be inferred from the weights of the applicable rules found so far. The KEX procedure has been designed to create a set of classification rules. So, for a new example, all rules that cover this example are used to compute the composed weight of the target class. But we can also use KEX to “compress” the set of association rules to get a more condensed representation. 
The rules generated by KEX can be understood as a “core” set of association rules in the sense that, for each association rule created using the same settings for minconf, maxlengthA and minsup, either the composed weight exactly corresponds to the confidence of the association rule (if this association rule is a part of the core set), or the composed weight does not significantly differ from the confidence of the association rule (if this rule is not a part of the core set).

Both semantical (i.e., based on number of instances covered by a rule) and syntactical (i.e., based on the lists of attribute-value pairs that occur in the rules) clustering of association rules can be found in [21]. A similarity of association rules is defined there as

$\displaystyle d_{sc}(\textit{Ant}_{1}\Rightarrow\textit{Suc}_{1},\textit{Ant}_{2}\Rightarrow\textit{Suc}_{2})=(1+\textit{diff}_{\sup}(\textit{Ant}_{1},\textit{Ant}_{2}))\,\textit{sim}_{\textit{att}}(\textit{Ant}_{1},\textit{Ant}_{2})\,w_{1}+(1+\textit{diff}_{\sup}(\textit{Suc}_{1},\textit{Suc}_{2}))\,\textit{sim}_{\textit{att}}(\textit{Suc}_{1},\textit{Suc}_{2})\,w_{2}+(1+\textit{diff}_{\sup}(\textit{Ant}_{1}\wedge\textit{Suc}_{1},\textit{Ant}_{2}\wedge\textit{Suc}_{2}))\,\textit{sim}_{\textit{att}}(\textit{Ant}_{1}\wedge\textit{Suc}_{1},\textit{Ant}_{2}\wedge\textit{Suc}_{2})\,w_{3}$

where $\textit{diff}_{\sup}(A,B)=\textit{support}(A)+\textit{support}(B)-2\times\textit{support}(A\wedge B)$ expresses the difference in support and $\textit{sim}_{\textit{att}}(A,B)=\frac{\left\|A\;\textit{xor}\;B\right\|}{\left\|A\wedge B\right\|}$ expresses the similarity in attribute-value pairs that occur in $A$ and $B$ (note that $\|X\|$ denotes the number of attribute-value pairs in $X$). A hierarchical agglomerative algorithm is used to cluster the rules.

Zaki [25] proposes a method that only mines for non-redundant association rules. He defines a rule $\textit{Ant}^{\prime}\Rightarrow\textit{Suc}^{\prime}$ as redundant if there exists another rule $\textit{Ant}\Rightarrow\textit{Suc}$ such that $\textit{Ant}\subseteq\textit{Ant}^{\prime}$ and $\textit{Suc}\subseteq\textit{Suc}^{\prime}$ and both rules have the same confidence. This method thus favors less complex rules.
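
The redundancy condition, as paraphrased above, can be written down directly. The sketch below applies it as a filter to an already mined rule list, which is not how the method in [25] works (that method avoids generating redundant rules in the first place); the function name, the rule representation and the exact comparison of confidence values are assumptions made for illustration.

```python
def non_redundant(rules):
    """Keep rules for which no simpler rule with the same confidence exists.

    `rules` is a list of (antecedent, succedent, confidence) triples,
    where antecedent and succedent are frozensets.
    """
    keep = []
    for ant2, suc2, conf2 in rules:
        redundant = any(
            (ant, suc) != (ant2, suc2)
            and ant <= ant2
            and suc <= suc2
            and conf == conf2
            for ant, suc, conf in rules
        )
        if not redundant:
            keep.append((ant2, suc2, conf2))
    return keep


rules = [
    (frozenset("A"), frozenset("C"), 1.0),
    (frozenset("AB"), frozenset("C"), 1.0),  # redundant: A => C is simpler
    (frozenset("B"), frozenset("C"), 0.8),
]
print(non_redundant(rules))
```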

A set of reduction techniques for redundant rules was proposed and implemented by Ashrafi et al. [3]. Their techniques are based on the generalization/specialization of the antecedent/succedent of the rules. The authors distinguish between redundant rules with a fixed antecedent and redundant rules with a fixed succedent. With a fixed antecedent, a rule $\textit{Ant}\Rightarrow\textit{Suc}$ is said to be redundant if and only if there exist $n$ rules $\textit{Ant}\Rightarrow e_{1}$, $\textit{Ant}\Rightarrow e_{2}$, $\ldots$, $\textit{Ant}\Rightarrow e_{n}$ that satisfy the minimum confidence threshold and $\textit{Suc}=e_{1}\wedge e_{2}\wedge\ldots\wedge e_{n}$. With a fixed succedent, a rule $\textit{Ant}\Rightarrow\textit{Suc}$ is said to be redundant if and only if there exist $n$ rules $e_{1}\Rightarrow\textit{Suc}$, $e_{2}\Rightarrow\textit{Suc}$, $\ldots$, $e_{n}\Rightarrow\textit{Suc}$ that satisfy the minimum confidence threshold and $\textit{Ant}=e_{1}\wedge e_{2}\wedge\ldots\wedge e_{n}$.

A further possibility is to post-process the rules using some domain knowledge, usually in the form of an ontology. So, e.g., An et al. use an expert-supplied taxonomy (is-a hierarchy) of items for clustering the discovered association rules with respect to their taxonomic similarity. The taxonomy has the form of a tree where upper-level nodes represent generalizations of their children. Both leaf and non-leaf nodes of this taxonomy can be present in the antecedent and succedent of a rule. Each node of this taxonomy is associated with a pair of numbers (horizontal position, vertical position) representing the relative position of the node in the taxonomy tree. A rule is then represented by the average value of these two numbers computed for nodes from the antecedent and by the average value of these two numbers computed for nodes from the succedent. This representation is then used during clustering [2]. Domingues and Rezende iteratively scan the itemset rules and update a taxonomy that is then used to generalize the association mining results. A rule is generalized by substituting an item in the rule (in the antecedent or succedent) by its parent w.r.t. the taxonomy tree. The contingency table is updated accordingly. Repeatedly occurring rules are pruned out [11]. Marinica and Guillet propose an interactive post-processing approach to pruning and filtering the discovered rules. Their approach consists of two main parts: creating a knowledge base and post-processing. The knowledge base consists of an ontology (which expresses general knowledge about the domain), and the post-processing task iteratively applies a set of filters (based on the knowledge base) to the extracted rules in order to select interesting rules [17].

3. GUHA method and LISp-Miner

3.1 Basic principles of the GUHA Method

GUHA is an original Czech method of exploratory data analysis which originated in the 1960s. Its principle is to offer all interesting facts following from the given data with regard to the given problem. The book [13] introduces the main principles of this method, a general theory of mechanized hypothesis formation based on mathematical logic and statistics. Hypotheses defined and studied in that book are relations between two conjunctions derived from values of attributes (columns) of an analyzed data table. Various types of relations between these conjunctions are used, including relations corresponding to statistical hypothesis testing. The original GUHA procedure ASSOC understands the knowledge pattern (called hypothesis) as the expression

$\displaystyle\textit{Ant}\approx\textit{Suc}/\textit{Cond}$

where Ant (called antecedent), Suc (called succedent) and Cond (called condition) are conjunctions of literals and $\approx$ (called quantifier) denotes a relation between Ant and Suc for the examples from the analyzed data table that fulfill the condition Cond. The rule $\textit{Ant}\approx\textit{Suc}/\textit{Cond}$ is true in the analyzed data table if the relation associated with $\approx$ is satisfied for the frequencies $a$ , $b$ , $c$ , $d$ in the corresponding contingency table shown in Table 1. We can denote this fact by

$\displaystyle\approx(a,b,c,d)=1.$

The relation $\approx$ need not be only the so-called founded implication, defined using the confidence measure $a/(a+b)$ of a “standard” association rule proposed by Agrawal (see, e.g., [1]), but also, e.g., a statistical association evaluated using the $\chi^{2}$ test.

Table 1
Fourfold contingency table

	Suc	$\neg$ Suc	$\sum$
Ant	$a$	$b$	$r$
$\neg$ Ant	$c$	$d$	$s$
$\sum$	$k$	$l$	$n$
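
The frequencies in Table 1 can be obtained by a single pass over the data. The following sketch assumes that the truth values of Ant and Suc have already been evaluated for every example; the names `FourFold` and `fourfold` and the toy data are hypothetical.

```python
from collections import namedtuple

FourFold = namedtuple("FourFold", "a b c d")


def fourfold(ant, suc):
    """Count the four cells of the contingency table of Ant and Suc."""
    a = b = c = d = 0
    for x, y in zip(ant, suc):
        if x and y:
            a += 1  # Ant and Suc
        elif x:
            b += 1  # Ant and not Suc
        elif y:
            c += 1  # not Ant and Suc
        else:
            d += 1  # neither
    return FourFold(a, b, c, d)


ant = [1, 1, 1, 0, 0, 0, 1, 0]
suc = [1, 1, 0, 1, 0, 0, 1, 0]
t = fourfold(ant, suc)
print(t, "confidence:", t.a / (t.a + t.b))
```

The marginal sums of Table 1 then follow as $r=a+b$, $s=c+d$, $k=a+c$, $l=b+d$ and $n=a+b+c+d$.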

The basic idea of the GUHA method is to find all hypotheses (knowledge patterns) that are true in the analyzed data in the above-described sense. A GUHA procedure generates (in a top-down way) each particular pattern and tests if it is true in the analyzed data, again in the above-mentioned sense. The output of the procedure consists of all such patterns that do not immediately follow from other, simpler output patterns.

Since the 1960s, the GUHA method has been implemented several times on different computer platforms. We will describe the GUHA procedures which are currently implemented in the LISp-Miner system in the next Section. These procedures differ in the types of knowledge patterns that are produced on the output.

3.2 The LISp-Miner system

LISp-Miner, a freely available system that has, since 1996, been developed at the University of Economics, Prague, implements various GUHA procedures that mine for different types of (mostly rule-like) knowledge patterns.

LISp-Miner is designed to work with categorical attributes. We can distinguish between binary attributes (that can have only two values, e.g., the attribute “gender”), nominal attributes (that can have more than two values and the values are not ordered, e.g., the attribute “color”), and ordinal attributes (that can have more than two values and their values are ordered, e.g., the attributes “education” or “military rank”). Numeric attributes must be discretized (turned into intervals) before running any of the mining procedures. The data preprocessing module LMDataSource can be used to do this – the module offers equidistant discretization (the created intervals have the same width), equifrequent discretization (the created intervals contain roughly the same number of examples) or discretization where the intervals are defined by the user.
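
The difference between the two automatic discretization schemes can be seen in a small sketch. This is not the LMDataSource implementation; the function names are hypothetical, ties between equal values are broken by their order in the input, and the equidistant variant assumes that the values are not all identical.

```python
def equidistant_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum falls on the upper boundary and is put into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equifrequent_bins(values, k):
    """Assign each value to one of k intervals with roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equidistant_bins(values, 3))
print(equifrequent_bins(values, 3))
```

With a single outlier in the data, the equidistant scheme puts almost all examples into the first interval and leaves the middle one empty, whereas the equifrequent scheme places three examples into each interval.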

By now, LISp-Miner consists of 10 data mining procedures: 4ft-Miner (derived from the original GUHA procedure ASSOC), SD4ft-Miner, AC4ft-Miner, KL-Miner, CF-Miner, SDKL-Miner, SDCF-Miner, KEX, ETree-Miner, and MCluster-Miner. The 4ft-Miner is based on the original GUHA procedure ASSOC from the 1960s, the other procedures have been designed during the development of the LISp-Miner system. Most of the procedures mine for various types of rule-like patterns – this makes LISp-Miner more focused on a particular type of models than standard data mining tools. The core of the LISp-Miner system includes the procedures mentioned above. Each procedure is realized using two modules: the data processing module Task (e.g., 4ft-Task) is used to run the analysis; and the data interpretation module Results (e.g., 4ft-Results) is used to display, sort or select the resulting rules. In addition, LISp-Miner also implements (in the DataSource module) a variety of data transformation and data preprocessing methods that can be used to select attributes for the specific data mining tasks, create derived attributes, or discretize numeric attributes [22]. There is also one administration module, LMAdmin, which assigns data and meta-data to a given task (see Fig. 1 for a scheme of LISp-Miner). The concept of meta-data allows (like in RapidMiner, IBM SPSS Modeler or SAS EM) the user to store and reuse inputs for the tasks, as well as the results obtained during the analysis. Both data and meta-data are stored using a database (MS Access or MySQL are usually used).

Figure 1.

Scheme of the LISp-Miner system.

In our study, we will focus only on the 4ft-Miner procedure. An interested reader should refer to [20, 22] for more information about the other LISp-Miner procedures. The 4ft-Miner procedure mines for knowledge patterns that can be understood as 4ft association rules (4ft rules for short) in the form

$\displaystyle\textit{Ant}\approx\textit{Suc}/\textit{Cond}$

where Ant (antecedent), Suc (succedent) and Cond (condition) are cedents and $\approx$ is a quantifier (relationship between Ant and Suc), which is evaluated on the subset of examples that satisfy the condition Cond. If the condition is empty then the procedure analyzes the whole data table. The quantifier $\approx$ is defined using the frequencies $a$ , $b$ , $c$ , and $d$ of the corresponding 2 $\times$ 2 contingency table shown in Table 1.

Table 2

Implemented 4ft quantifiers

4ft quantifier	Parameters	$\approx(a,b,c,d)=1$ iff
Founded implication	$0<p\leqslant 1$, $\textit{Base}>0$	$\frac{a}{a+b}\geqslant p\wedge a>\textit{Base}$
Lower critical implication	$0<p\leqslant 1$, $0<\alpha<1$, $\textit{Base}>0$	$\sum\limits_{i=a}^{a+b}\frac{(a+b)!}{i!(a+b-i)!}p^{i}(1-p)^{a+b-i}\leqslant\alpha\wedge a>\textit{Base}$
Upper critical implication	$0<p\leqslant 1$, $0<\alpha<1$, $\textit{Base}>0$	$\sum\limits_{i=0}^{a}\frac{(a+b)!}{i!(a+b-i)!}p^{i}(1-p)^{a+b-i}\leqslant\alpha\wedge a>\textit{Base}$
Double founded implication	$0<p\leqslant 1$, $\textit{Base}>0$	$\frac{a}{a+b+c}\geqslant p\wedge a>\textit{Base}$
Double lower critical implication	$0<p\leqslant 1$, $0<\alpha<1$, $\textit{Base}>0$	$\sum\limits_{i=a}^{a+b+c}\frac{(a+b+c)!}{i!(a+b+c-i)!}p^{i}(1-p)^{a+b+c-i}\leqslant\alpha\wedge a>\textit{Base}$
Double upper critical implication	$0<p\leqslant 1$, $0<\alpha<1$, $\textit{Base}>0$	$\sum\limits_{i=0}^{a}\frac{(a+b+c)!}{i!(a+b+c-i)!}p^{i}(1-p)^{a+b+c-i}\leqslant\alpha\wedge a>\textit{Base}$
Founded equivalence	$0<p\leqslant 1$, $\textit{Base}>0$	$\frac{a+d}{a+b+c+d}\geqslant p\wedge a>\textit{Base}$
Lower critical equivalence	$0<p<1$, $0<\alpha<1$, $\textit{Base}>0$	$\sum\limits_{i=a}^{a+b+c+d}\frac{(a+b+c+d)!}{i!(a+b+c+d-i)!}p^{i}(1-p)^{a+b+c+d-i}\leqslant\alpha\wedge a>\textit{Base}$
Upper critical equivalence	$0<p<1$, $0<\alpha<1$, $\textit{Base}>0$	$\sum\limits_{i=0}^{a}\frac{(a+b+c+d)!}{i!(a+b+c+d-i)!}p^{i}(1-p)^{a+b+c+d-i}\leqslant\alpha\wedge a>\textit{Base}$
Simple deviation	$\delta>0$, $\textit{Base}>0$	$ad>e^{\delta}bc\wedge a\geqslant\textit{Base}$
Fisher’s quantifier	$0<\alpha\leqslant 0.5$, $\textit{Base}>0$	$\sum\limits_{i=a}^{\min(r,k)}\frac{\binom{k}{i}\binom{n-k}{r-i}}{\binom{n}{r}}\leqslant\alpha\wedge a\geqslant\textit{Base}$
Chi-square quantifier	$\textit{Base}>0$, $\chi^{2}_{\alpha}$	$ad>bc\wedge\frac{n(ad-bc)^{2}}{klrs}>\chi_{\alpha}^{2}\wedge a\geqslant\textit{Base}$
Above average	$q>0$, $\textit{Base}>0$	$\frac{a}{a+b}\geqslant(1+q)\frac{a+c}{a+b+c+d}\wedge a\geqslant\textit{Base}$
Below average	$p>0$, $\textit{Base}>0$	$\frac{a}{a+b}\leqslant(1+p)\frac{a+c}{a+b+c+d}\wedge a\geqslant\textit{Base}$
E-quantifier	$0<p<1$	$\max\left(\frac{b}{a+b},\frac{c}{c+d}\right)<p$
Base	$\textit{Base}>0$	$a\geqslant\textit{Base}$
Support	$0<s\leqslant 1$	$\frac{a}{a+b+c+d}\geqslant s$
Ceil	$\textit{Base}>0$	$a\leqslant\textit{Base}$
Ceil Support	$0<s\leqslant 1$	$\frac{a}{a+b+c+d}\leqslant s$

Table 2 shows the implemented quantifiers (types of 4ft rules), together with the condition on the frequencies $a$, $b$, $c$, $d$ that must hold for a 4ft rule to be considered true. Here $p$, Base and $\alpha$ are user-defined parameters for probability, number of examples, and significance level, respectively; $p\in(0,1]$, Base $\in(0,n]$ and $\alpha\in(0,0.5]$. A-quantifier, support and founded implication are closely related to the “standard” association rules. The above-average and below-average quantifiers are related to the lift criterion $P(\textit{Suc}|\textit{Ant})/P(\textit{Suc})$. All these quantifiers, together with the upper critical implication and the lower critical implication, express different forms of the relation $\textit{Ant}\Rightarrow\textit{Suc}$. The double founded implication, double upper critical implication and double lower critical implication quantifiers express different forms of the relation $(\textit{Ant}\Rightarrow\textit{Suc})\wedge(\textit{Suc}\Rightarrow\textit{Ant})$. The founded equivalence, upper critical equivalence and lower critical equivalence express different forms of the relation $\textit{Ant}\Leftrightarrow\textit{Suc}$. Fisher’s quantifier and the chi-square quantifier are based on statistical tests; as they are symmetrical (we can swap $b$ and $c$ in the formulas), they can also be interpreted as a kind of equivalence between Ant and Suc. In addition to these quantifiers, each frequency from the $2\times 2$ contingency table can be restricted by the relations $<,\leqslant,=,\geqslant,>$.
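
To make the two statistically based rows of Table 2 more concrete, the following sketch evaluates Fisher’s quantifier and the chi-square quantifier for given frequencies. It is a direct transcription of the formulas in the table rather than LISp-Miner code; the function names are hypothetical, the chi-square critical value is passed in by the caller, and all marginal sums are assumed to be non-zero.

```python
from math import comb


def fisher(a, b, c, d, alpha, base):
    """Fisher's quantifier: one-sided hypergeometric tail probability <= alpha."""
    r, k, n = a + b, a + c, a + b + c + d
    tail = sum(comb(k, i) * comb(n - k, r - i)
               for i in range(a, min(r, k) + 1)) / comb(n, r)
    return a >= base and tail <= alpha


def chi_square(a, b, c, d, critical_value, base):
    """Chi-square quantifier: positive association with a significant statistic."""
    r, s, k, l = a + b, c + d, a + c, b + d
    n = r + s
    statistic = n * (a * d - b * c) ** 2 / (k * l * r * s)
    return a >= base and a * d > b * c and statistic > critical_value


# 3.841 is the 0.95 quantile of the chi-square distribution with 1 degree of freedom.
print(fisher(8, 2, 1, 9, alpha=0.05, base=1))
print(chi_square(8, 2, 1, 9, critical_value=3.841, base=1))
print(fisher(5, 5, 5, 5, alpha=0.05, base=1))
print(chi_square(5, 5, 5, 5, critical_value=3.841, base=1))
```

The first table shows a strong positive association between Ant and Suc and satisfies both quantifiers; the second one corresponds to independence and satisfies neither.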

Unlike in the original GUHA procedure ASSOC, where Ant, Suc and Cond were conjunctions of literals, Ant, Suc and Cond in 4ft-Miner are conjunctions of partial cedents, where each partial cedent is a conjunction or disjunction of literals and a literal is defined as $A(\textit{coef})$ or $\neg A(\textit{coef})$ , where $A$ is a categorical attribute and coef is a list of possible values. In 4ft-Miner, coef can be:

• one category; this is simply a single value of an attribute $A$,

• a subset of a given length (e.g., city(London, Paris) is a literal that contains a subset of length 2),

• an interval of a given length (e.g., age(10–20, 20–30) or age(0–10, 10–20) are intervals of length 2),

• a cyclic interval (e.g., if the values of age are 0–10, 10–20, 20–30 and 30–40, then age(30–40, 0–10, 10–20) is a cyclic interval),

• a cut, i.e., an interval of a given length that contains the boundary value (e.g., age(0–10, 10–20) is a cut of length 2 but age(10–20, 20–30) is not a cut),

• a left cut, a cut that contains the first value (e.g., age(0–10, 10–20)),

• a right cut, a cut that contains the last value (e.g., age(30–40) or age(20–30, 30–40)).

While subsets can be created for all attributes, creating intervals and cuts only makes sense for ordinal attributes.
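
The coefficient types for an ordinal attribute can be enumerated with a few lines of code. The sketch below is only an illustration of the definitions above, not the generator used in 4ft-Miner; the function names are hypothetical, and the cyclic variant is assumed to include the intervals that do not wrap around as well.

```python
def intervals(categories, length):
    """All runs of `length` consecutive categories."""
    return [tuple(categories[i:i + length])
            for i in range(len(categories) - length + 1)]


def left_cuts(categories, max_length):
    """Intervals that start at the first category."""
    return [tuple(categories[:n]) for n in range(1, max_length + 1)]


def right_cuts(categories, max_length):
    """Intervals that end at the last category."""
    return [tuple(categories[-n:]) for n in range(1, max_length + 1)]


def cyclic_intervals(categories, length):
    """Runs of `length` categories that may wrap around the end of the list."""
    n = len(categories)
    return [tuple(categories[(i + j) % n] for j in range(length))
            for i in range(n)]


age = ["0-10", "10-20", "20-30", "30-40"]
print(intervals(age, 2))
print(left_cuts(age, 2))
print(right_cuts(age, 2))
print(cyclic_intervals(age, 3))
```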

When comparing this notion of association rules (we will call them 4ft rules) with the “standard” understanding, we will find that:

• a 4ft rule not only consists of an antecedent Ant and a succedent Suc but can also contain a condition Cond; this condition is generated during the rule learning process as well,

• 4ft rules offer a more expressive syntax for Ant, Suc and possibly Cond. If, e.g., the analyzed data contain attribute $A$ with values $a$, $b$, $c$, attribute $B$ with values $x$, $y$, $z$, and attribute $C$ with values $k$, $l$, $m$, $n$, then a 4ft rule can be, e.g.,

$\displaystyle A(b)\wedge B(x\vee y)\Rightarrow\neg C(k)$

• 4ft rules offer more types of relations between Ant and Suc, as shown in Table 2; we can search not only for implications (based on standard definitions of support and confidence of a rule), but also for equivalences or statistically based relations.

When generating a rule, the system first generates an antecedent (as a conjunction of partial cedents) and then generates all possible succedents for this antecedent. If a condition should occur in the rules as well, it is generated after fixing the antecedent and succedent parts of the rule. The generation of conjunctions proceeds in a depth-first way. The user-given parameters are:

• the definition of partial cedents for Ant, Suc and Cond, respectively; each partial cedent is defined by the type of formula (conjunction/disjunction), the minimal and maximal number of literals, and a list of possible literals (each literal is defined by the corresponding attribute, the type of the coefficient, the minimal and maximal length of the coefficient, a specification of whether the literal should be used without negation, with negation or both, and a specification of whether the literal is basic, i.e., must occur in the cedent, or is a so-called remaining one),

• the maximal length of Ant, Suc and Cond,

• the type of relation $\approx$ and the threshold values for its parameters (as shown in Table 2),

• other task parameters, e.g., how to handle missing values.

To simplify the parameter setting, reasonable default values are assigned to these parameters. The default settings for 4ft-Miner correspond to the standard association rules: the quantifier is set to a founded implication with parameter $p=0.9$, and Ant and Suc are created only from attribute-value pairs (i.e., the coefficient in the definition of a literal corresponds to a single value of the attribute). It is also possible to “clone” an existing task, i.e., to reuse and only slightly modify its parameters.

When using 4ft-Miner for concept description, we should prefer the relations expressing implications with the meaning “IF examples have some values of input attributes, THEN they belong to concept C”. So we should choose from among the so-called implicational quantifiers. A quantifier is implicational iff

$\displaystyle(\approx(a,b,c,d)=1\wedge a^{\prime}\geqslant a\wedge b^{\prime}\leqslant b)\Rightarrow\quad\approx(a^{\prime},b^{\prime},c^{\prime},d^{\prime})=1$

is true for all fourfold contingency tables ( $a$ , $b$ , $c$ , $d$ ) and ( $a^{\prime}$ , $b^{\prime}$ , $c^{\prime}$ , $d^{\prime}$ ) [19]. The base quantifier, founded implication, lower critical implication and upper critical implication are such quantifiers (see Table 2).
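
The defining condition can be checked by brute force on small contingency tables. The sketch below is only a finite sanity check, not a proof; it transcribes the founded implication and the above-average quantifier as printed in Table 2, and the function names, the parameter values and the bound on the frequencies are hypothetical choices.

```python
from itertools import product


def founded_implication(a, b, c, d, p=0.9, base=1):
    # Transcribed from Table 2, including the strict comparison with Base.
    return a > base and a / (a + b) >= p


def above_average(a, b, c, d, q=0.5, base=1):
    n = a + b + c + d
    return a >= base and a / (a + b) >= (1 + q) * (a + c) / n


def is_implicational(quantifier, max_count=3):
    """Test the defining condition on all tables with frequencies 0..max_count."""
    tables = list(product(range(max_count + 1), repeat=4))
    for a, b, c, d in tables:
        if not quantifier(a, b, c, d):
            continue
        for a2, b2, c2, d2 in tables:
            if a2 >= a and b2 <= b and not quantifier(a2, b2, c2, d2):
                return False
    return True


print(is_implicational(founded_implication))
print(is_implicational(above_average))
```

The founded implication depends only on $a$ and $b$ and passes the check. The above-average quantifier fails it: for example, it holds for the table (2, 0, 0, 2) but not for (2, 0, 2, 0), although $a$ and $b$ are the same in both tables.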

When looking at the specifications for Ant, Suc and Cond, we get:

•

Cond need not be specified

•

Suc should be the concept specification, i.e., the corresponding value of the target attribute

•

we should decide about the complexity of Ant; here the simplest case is to use only distinct values of input attributes

The experiments described in this paper were carried out using the founded implication quantifier and using only positive literals with a coefficient of length 1. Figure 2 shows a screenshot of the input pane with the input parameters for the 4ft-Miner procedure running on the Monk1 data. Here Ant of the generated rules consists of a conjunction of attribute-value pairs (categories) created from all available input attributes (the length of the conjunction is not restricted), Suc is set to the target concept Class(+), Cond is not used, the relation between Ant and Suc is a founded implication (this corresponds to the “standard” association rule) with minsup $=$ 5% and minconf $=$ 0.9.

Figure 2.

Screenshot with input setting for 4ft-Miner applied to Monk1 data.

4. Association meta-rules

The inspiration for our method comes from the area of meta-learning. Meta-learning is a subfield of machine learning where automatic learning algorithms are applied to meta-data about machine learning experiments. The most widely used approaches to meta-learning (or combining classifiers) are bagging, boosting and stacking [5]. In bagging, each classifier in the ensemble votes with equal weight when classifying a new example; in order to promote model variance, bagging trains each classifier in the ensemble using a randomly drawn subset of the training set. In boosting, the ensemble of classifiers is built incrementally by training each new classifier to emphasize the training instances that previous classifiers misclassified. In stacking, a meta-classifier is built on top of the results of the so-called “base” classifiers that are each separately trained to classify the data.

We propose application of the association rule mining algorithm to the set of “original” association rules obtained as a result of a particular data mining task. This idea thus follows the stacking concept, used to combine classifiers, which however has not been presented yet for descriptive tasks. The input to the proposed meta-learning step will comprise association rules, encoded in a way suitable for the association rule mining algorithm; the result will be a set of association meta-rules uncovering relations between various characteristics in the original set of rules.

We introduced two types of association meta-rules: qualitative and quantitative, and frequent cedents (as an analogy to frequent itemsets) in [8]. Qualitative meta-rules represent the meta-knowledge in the form “if original association rules contain a conjunction Ant, then they also contain the conjunction Suc”, i.e., qualitative rules have the form

$\displaystyle\textit{Ant}\Rightarrow\textit{Suc}$

Quantitative meta-rules represent the meta-knowledge in the form “if original association rules contain a conjunction Ant, then they have quantitative characteristics $Q$ ”, i.e.,

$\displaystyle\textit{Ant}\Rightarrow Q$

Frequent cedents represent the meta-knowledge in the form “conjunction Ant frequently occurs in the original association rules”. With respect to the concept description task, these cedents represent a meta-knowledge about frequent co-occurrence of specific categories in the concept description.

When using the meta-rules and frequent cedents to condense the concept description based on association rules, we get:

•
qualitative meta-rules should be interpreted as “if the concept is described using literals from Ant, it is also described using literals from Suc”,
•
quantitative meta-rules should be interpreted as “if the concept is described using literals from Ant, then this description is ‘strong’ in the sense of values of quantitative characteristics from Q”,
•
frequent cedents should be interpreted as “the concept is frequently described using literals from Ant”.

Encoding of the original rules as data (for the meta-learning step) is the key problem in our approach. We proposed four different representation schemes in the original paper [8]. Ant and Suc can be encoded using either (1) binary attributes, where each attribute represents one possible literal, or (2) the attributes from the original data set. In both cases we may or may not also record whether the literal occurs in Ant or Suc, which gives four different representation schemes. So, to encode an example rule

$\displaystyle\textit{body(o)}\wedge\textit{head(o)}\Rightarrow\textit{class(+)}$

1.
when using the encoding based on binary attributes without distinguishing between Ant and Suc, this rule will be represented using the categories Body_o(true), Head_o(true), and Class_+(true)
2.
when using the encoding based on original attributes without distinguishing between Ant and Suc, this rule will be represented using the categories Body(o), Head(o), and Class(+)
3.
when using the encoding based on binary attributes and distinguishing between Ant and Suc, this rule will be represented using the categories Ant_Body_o(true), Ant_Head_o(true), and Suc_Class_+(true)
4.
when using the encoding based on original attributes and distinguishing between Ant and Suc, this rule will be represented using the categories Ant_Body(o), Ant_Head(o), and Suc_Class(+).

In all four examples above, the notation $A(v)$ denotes the value $v$ of attribute $A$ . Because the succedent Suc is fixed to the attribute-value pair describing the concept within the concept description task, only the first and second representation schemes make sense.
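The four representation schemes can be sketched as a single encoding function with two switches. This is a minimal illustration, not the encoding code used with LISp-Miner; the function name `encode_rule` and the dictionary-based rule representation are assumptions of this sketch.

```python
def encode_rule(ant, suc, binary, distinguish):
    """Encode a rule as attribute -> value pairs under one of the four schemes.

    ant, suc: dicts mapping attribute name to category, e.g. {"Body": "o"}.
    binary: one boolean attribute per literal if True, original attributes otherwise.
    distinguish: prefix attributes with "Ant_" / "Suc_" if True.
    """
    encoded = {}
    for prefix, cedent in (("Ant_", ant), ("Suc_", suc)):
        for attribute, value in cedent.items():
            name = (prefix if distinguish else "") + attribute
            if binary:
                encoded[f"{name}_{value}"] = True
            else:
                encoded[name] = value
    return encoded


rule_ant = {"Body": "o", "Head": "o"}
rule_suc = {"Class": "+"}
# Schemes in the order used in the text: (binary, distinguish).
schemes = [(True, False), (False, False), (True, True), (False, True)]
for number, (binary, distinguish) in enumerate(schemes, start=1):
    print(number, encode_rule(rule_ant, rule_suc, binary, distinguish))
```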

The representation schemes mentioned above can directly be used in situations when literals consist only of attribute-value pairs, i.e., for “standard” association rules. When using the richer syntax of cedents as defined in LISp-Miner, we have to transform the resulting rules into rules containing only attribute-value pairs. To do this, we first transform a negative literal into a positive one by creating a coefficient (set of values) from all values not occurring in the negative literal. Then we change a rule containing a literal with a set of values into a set of rules with a single category using the following tautology:

$\displaystyle A\wedge(B\vee C)\equiv(A\wedge B)\vee(A\wedge C).$

So, e.g., a 4ft rule

$\displaystyle\neg\textit{Head(r)}\wedge\textit{Jacket(r,y)}\wedge\textit{Tie(y)}\Rightarrow\textit{Class(+)}$

must be first transformed (by transforming the negative literal $\neg$ Head(r)) into the rule

$\displaystyle\textit{Head(o,s)}\wedge\textit{Jacket(r,y)}\wedge\textit{Tie(y)}\Rightarrow\textit{Class(+)}$

assuming that the possible values for attribute Head are {r,o,s}, and then each literal containing a list of such values must be transformed into literals containing a single value. After this step, the newly created rules are

$\displaystyle\textit{Head(o)}\wedge\textit{Jacket(r)}\wedge\textit{Tie(y)}\Rightarrow\textit{Class(+)}$

$\displaystyle\textit{Head(s)}\wedge\textit{Jacket(r)}\wedge\textit{Tie(y)}\Rightarrow\textit{Class(+)}$

$\displaystyle\textit{Head(o)}\wedge\textit{Jacket(y)}\wedge\textit{Tie(y)}\Rightarrow\textit{Class(+)}$

$\displaystyle\textit{Head(s)}\wedge\textit{Jacket(y)}\wedge\textit{Tie(y)}\Rightarrow\textit{Class(+)}$

Such transformation of original 4ft rules can only be applied when searching for qualitative meta-rules and frequent cedents as there is no straightforward way to decompose the quantitative characteristics of the original 4ft rule (e.g., support or confidence) into quantitative characteristics of the transformed rules.
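The two transformation steps (complementing negative literals, then distributing the conjunction over multi-valued coefficients) can be sketched as follows. The triple-based rule representation and the name `expand_rule` are hypothetical; the value set of Head is the one assumed in the text.

```python
from itertools import product


def expand_rule(antecedent, domains):
    """Expand a rule antecedent into antecedents with single-value literals.

    antecedent: list of (attribute, values, negated) triples.
    domains: dict mapping attribute name to the list of all its values.
    """
    positive = []
    for attribute, values, negated in antecedent:
        if negated:
            # Replace the negative literal by the complement of its coefficient.
            values = [v for v in domains[attribute] if v not in values]
        positive.append([(attribute, v) for v in values])
    # Distribute the conjunction over the disjunction hidden in each coefficient.
    return [list(combination) for combination in product(*positive)]


domains = {"Head": ["r", "o", "s"]}  # assumed value set, as in the text
rule = [("Head", ["r"], True), ("Jacket", ["r", "y"], False), ("Tie", ["y"], False)]
expanded = expand_rule(rule, domains)
for antecedent in expanded:
    print(" ∧ ".join(f"{a}({v})" for a, v in antecedent), "⇒ Class(+)")
```

The sketch handles only the antecedent; support and confidence are deliberately not carried over, in line with the limitation noted above.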

Another open question concerning the representation of a rule is whether categories not occurring in the original rule should be treated as missing or as “negative” ones. In the first approach, attributes not used in the rule will be encoded using a missing value code. In the second approach, when using the binary representation, categories not used in the rule will get the value “false”, and when using the original attributes, categories not used in the rule will get a new special value interpreted as “not used”. Our initial experiments show that using a missing value code is more suitable as it will prevent the meta-learning step from generating a great number of meta-rules about non-occurrence of literals in the original rules. This option also corresponds to the original notion of association rules where only items that do occur in the market baskets are taken into consideration.

In the experiments reported in this paper, we:

•
Search only for qualitative meta-rules and frequent cedents.
•
Represent a rule using the original attributes. This formally leads to the same structure of data table as for the original data (i.e., the table representing the association rules has the same columns as the table representing the data).

Table 3
Example 4ft rules for Monk1 data

Body(o) $\Rightarrow$ Class(+) (0.1951, 1)

Body(o) $\wedge$ Head(o) $\Rightarrow$ Class(+) (0.1301, 1)

Body(o) $\wedge$ Head(o) $\wedge$ Holding(b) $\Rightarrow$ Class(+) (0.0569, 1)

Body(o) $\wedge$ Head(o) $\wedge$ Holding(s) $\Rightarrow$ Class(+) (0.0569, 1)

Body(o) $\wedge$ Head(o) $\wedge$ Smile(n) $\Rightarrow$ Class(+) (0.0569, 1)

Body(o) $\wedge$ Head(o) $\wedge$ Smile(y) $\Rightarrow$ Class(+) (0.0732, 1)

Body(o) $\wedge$ Head(o) $\wedge$ Tie(n) $\Rightarrow$ Class(+) (0.0894, 1)

$\ldots$

Table 4
Example 4ft rules for Monk1 data encoded as data

id Head Body Smile Holding Jacket Tie Class Support Confidence

1 ? o ? ? ? ? $+$ 0.1951 1

2 o o ? ? ? ? $+$ 0.1301 1

3 o o ? b ? ? $+$ 0.0569 1

4 o o ? s ? ? $+$ 0.0569 1

5 o o n ? ? ? $+$ 0.0569 1

6 o o y ? ? ? $+$ 0.0732 1

7 o o ? ? ? n $+$ 0.0894 1

$\ldots$

•
Use a missing value code to represent categories not occurring in the original rule. This option is more suitable as it prevents the meta-learning step from generating a great number of meta-rules about non-occurrence of literals in the original rules; this option also corresponds to the original notion of association rules where only items that do occur in the market baskets are taken into consideration.
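A minimal sketch of this encoding, assuming the Monk1 column order of Table 4 and "?" as the missing value code (the helper name `rule_to_row` is hypothetical):

```python
ATTRIBUTES = ["Head", "Body", "Smile", "Holding", "Jacket", "Tie", "Class"]
MISSING = "?"


def rule_to_row(literals, fill=MISSING):
    """Turn a rule given as {attribute: value} into a row over ATTRIBUTES."""
    return [literals.get(attribute, fill) for attribute in ATTRIBUTES]


rule = {"Body": "o", "Head": "o", "Holding": "b", "Class": "+"}
print(rule_to_row(rule))                   # unused attributes treated as missing
print(rule_to_row(rule, fill="not used"))  # alternative "negative" encoding
```

Passing a different `fill` value switches to the second approach, in which unused attributes receive a special category instead of a missing value.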

Our approach is not limited to binary classification problems where the data can be considered as examples or counter-examples of a single concept. We can also apply this approach to data containing examples of an arbitrary number of concepts (such data is represented by the Iris dataset in our experiments reported in Section 5). In such a case, when creating the association meta-rules using LISp-Miner we can set the condition Cond to take the values of the target attribute and thus obtain meta-rules only from the original rules covering a specific concept.

We will illustrate the concept of association meta-rules using the Monk1 dataset from the UCI repository [24]. This dataset consists of 123 examples described by six input attributes (head_shape, body_shape, smile, holding, jacket_color, tie) and a binary target attribute, class. Running 4ft-Miner with the input parameters maxlenA $=$ 6, minsup $=$ 5% and minconf $=$ 0.9 (i.e., the same settings as shown in Fig. 2), we obtain 34 association rules, some of them listed in Table 3. These rules have been turned into examples for the subsequent run of 4ft-Miner (Table 4 shows the representation of the rules from Table 3). Table 5 shows some of the meta-rules we obtained (we used the same settings for input parameters as before), and Table 6 shows some of the obtained frequent cedents.

Table 5
Example 4ft meta-rules for Monk1 data

Body(o) $\Rightarrow$ Head(o)

Body(s) $\Rightarrow$ Head(s)

Head(o) $\Rightarrow$ Body(o)

Head(o) $\Rightarrow$ Jacket(r)

$\ldots$

Table 6
Example frequent cedents for Monk1 data

Body(o)

Body(s)

Body(o) $\wedge$ Head(o)

Head(s)

$\ldots$

Notice that, among the meta-rules and cedents, there are rules with syntax similar to the original association rules. But their interpretation is of course different. Let us compare the association rule

$\displaystyle\textit{Body(o)}\wedge\textit{Head(o)}\Rightarrow\textit{Class(+)}$

the meta-rule

$\displaystyle\textit{Body(o)}\Rightarrow\textit{Head(o)}$

and the frequent cedent

$\displaystyle\textit{Body(o)}\wedge\textit{Head(o)}$

The association rule says that the concept Class(+) can be described using the conjunction Body(o) $\wedge$ Head(o), the meta-rule says that whenever the concept is described using Body(o), it is also described by Head(o), and the frequent cedent says that the conjunction Body(o) $\wedge$ Head(o) frequently occurs in the concept description of Class(+).
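To make the distinction concrete, the following sketch computes the support of a cedent and the confidence of a meta-rule over encoded rules. It uses only the seven rows listed in Table 4, so the resulting numbers are illustrative and will differ from those obtained on all 34 rules; the helper names are assumptions of this sketch.

```python
# The seven encoded rules listed in Table 4 (antecedent part only).
rows = [
    {"Body": "o"},
    {"Body": "o", "Head": "o"},
    {"Body": "o", "Head": "o", "Holding": "b"},
    {"Body": "o", "Head": "o", "Holding": "s"},
    {"Body": "o", "Head": "o", "Smile": "n"},
    {"Body": "o", "Head": "o", "Smile": "y"},
    {"Body": "o", "Head": "o", "Tie": "n"},
]


def count(cedent):
    """Number of encoded rules containing every literal of the cedent."""
    return sum(all(row.get(a) == v for a, v in cedent.items()) for row in rows)


def support(cedent):
    """Share of encoded rules in which the cedent occurs."""
    return count(cedent) / len(rows)


def confidence(ant, suc):
    """Share of rules containing ant that also contain suc."""
    return count({**ant, **suc}) / count(ant)


print(support({"Body": "o", "Head": "o"}))       # frequent cedent
print(confidence({"Body": "o"}, {"Head": "o"}))  # meta-rule Body(o) => Head(o)
print(confidence({"Head": "o"}, {"Body": "o"}))  # meta-rule Head(o) => Body(o)
```

Attributes absent from a row are treated as missing, so they never match a literal.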

We can also see a significant “compression” of the list of the built association rules; while we obtained 34 4ft rules, we have only 14 4ft meta-rules (this represents a reduction to 41% of the number of association rules) and 13 frequent cedents (this represents a reduction to 38% of the number of association rules). We use these numbers to evaluate the results for other data as presented in Sections 5 and 6 as well.

Table 7
Basic characteristics of the used UCI benchmark data

Data set No. examples No. attributes No. concepts

Agaricus 8124 22 2

Australi 690 14 2

Breast 289 9 2

Diab 769 8 2

Iris 150 4 3

JapCred 125 10 2

Kr-vs-kp 3196 36 2

Monk1 123 6 2

Tic-tac-toe 958 9 2

Vote 435 16 2

Table 8
Results on UCI benchmark data

Data set	4ft rules (No. rules)	4ft meta-rules (No. rules)	4ft meta-rules (reduction to)	Frequent cedents (No. rules)	Frequent cedents (reduction to)
Agaricus	2072	12	0.6%	22	1%
Australi	1734	226	13%	25	1%
Breast	93	16	17%	8	9%
Diab	106	44	42%	19	18%
Iris	58	36	62%	16	28%
JapCred	777	23	3%	14	2%
Kr-vs-kp	972	74	8%	47	5%
Monk1	34	14	41%	13	38%
Tic-tac-toe	18	12	67%	9	50%
Vote	5554	25	0.5%	9	0.2%

5. Experiments on benchmark data

Our initial experiments were carried out using some benchmark data sets from the UCI Machine Learning Repository [24]. The characteristics of the data (number of examples, number of attributes and number of concepts) are shown in Table 7. Table 8 summarizes the results. The first column in this table shows the numbers of 4ft rules that were created in the first step. Here we were looking for strong concept descriptions, so we set the parameters minsup $=$ 5% and minconf $=$ 0.9. Then the number of 4ft meta-rules and frequent cedents is shown together with the relative size of the corresponding rule set with respect to the 4ft rules created in the first step. So the relative size is computed as the number of rules in the respective rule set divided by the number of rules in the 4ft rule set. To make the numbers comparable, we set minconf $=$ 0.9 and minsup between 1% and 5% for the second step (i.e., for creating 4ft meta-rules and frequent cedents). With the exception of Tic-Tac-Toe data and 4ft meta-rules for Iris data, we always reduced the number of original 4ft rules to less than one half.
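The relative size used in Table 8 can be recomputed directly from the rule counts. The sketch below copies the counts from Table 8; the dictionary layout and the function name `relative_size` are choices made for this illustration.

```python
# (4ft rules, 4ft meta-rules, frequent cedents) taken from Table 8.
table8 = {
    "Agaricus": (2072, 12, 22),
    "Australi": (1734, 226, 25),
    "Breast": (93, 16, 8),
    "Diab": (106, 44, 19),
    "Iris": (58, 36, 16),
    "JapCred": (777, 23, 14),
    "Kr-vs-kp": (972, 74, 47),
    "Monk1": (34, 14, 13),
    "Tic-tac-toe": (18, 12, 9),
    "Vote": (5554, 25, 9),
}


def relative_size(n_derived, n_rules):
    """Number of derived rules divided by the number of original 4ft rules."""
    return n_derived / n_rules


for name, (rules, meta, cedents) in table8.items():
    print(f"{name}: meta-rules {relative_size(meta, rules):.1%}, "
          f"cedents {relative_size(cedents, rules):.1%}")

# Data sets where the meta-rules do not reduce the rule set below one half.
not_halved = [n for n, (r, m, c) in table8.items() if relative_size(m, r) >= 0.5]
print(not_halved)
```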

Figure 3.

Relation between relative number of meta-rules and frequent cedents on UCI data.

The chart in Fig. 3 shows the relationship between the relative number of meta-rules (horizontal axis) and the relative number of cedents (vertical axis). We can see that the experimental results can be divided into two groups. In the first group, the percentage of both meta-rules and cedents is very low (below 20% for meta-rules and below 10% for cedents); in the second group, the percentage of meta-rules and cedents is close to 50%. The percentage of meta-rules is higher than the percentage of frequent cedents in all the experiments except Agaricus (see Table 8). The correlation between these two measures is 0.9174.
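As a rough check, the correlation can be recomputed from the rounded percentages in Table 8. The sketch assumes that the reported figure is a Pearson correlation over these ten pairs, which the text does not state explicitly; under that assumption the result should land close to the reported value.

```python
from math import sqrt

# "Reduction to" percentages from Table 8, in table order.
meta_pct = [0.6, 13, 17, 42, 62, 3, 8, 41, 67, 0.5]
cedent_pct = [1, 1, 9, 18, 28, 2, 5, 38, 50, 0.2]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(meta_pct, cedent_pct)
print(round(r, 4))
```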

6. Experiments with atherosclerosis risk data set

While we were interested only in quantitative assessment of our method when carrying out the experiments on the UCI benchmark data, here we will also assess the results qualitatively, by interpreting the found meta-rules. To do this, we will use data for which domain experts are available.

In the early 1970s, a project of extensive epidemiological study of atherosclerosis primary prevention was developed under the name “National Preventive Multifactor Study of Heart Attacks and Strokes” in the former Czechoslovakia. The aims of the study were:

1.
to identify atherosclerosis risk factors prevalence in the population considered to be the most endangered by possible atherosclerosis complications (i.e., middle-aged men),
2.
to follow the development of these risk factors and their impact on the examined men’s health, especially with respect to atherosclerotic cardiovascular diseases (CVD),
3.
to study the impact of complex risk factors intervention on their development and cardiovascular morbidity and mortality,
4.
10–12 years into the study, to compare risk factors profile and health of selected men who originally did not show any atherosclerosis risk factors with a group of men showing risk factors from the beginning of the study.

Atherosclerosis is a slow, complex disease that typically starts in childhood and often progresses as people grow older. In some people it progresses rapidly, even in their third decade of age. Many scientists think it begins with damage to the innermost layer of the artery. Atherosclerosis involves the slow buildup of deposits of fatty substances, cholesterol, body cellular waste products, calcium, and fibrin (a clotting material in the blood) in the inside lining of an artery. The buildup (referred to as a plaque), together with the formation of a blood clot (thrombus) on the surface of the plaque, can partially or totally block the flow of blood through the artery. If either of these events occurs and blocks the entire artery, a heart attack, stroke or other life-threatening event may result. Research shows the benefits of reducing the controllable risk factors for atherosclerosis: high blood cholesterol (level of LDL cholesterol over 100 mg/dL), cigarette smoking and exposure to tobacco smoke, high blood pressure (blood pressure over 140/90 mm Hg), diabetes mellitus, obesity (BMI over 25), and physical inactivity.

The study included data of 1417 men born between 1926 and 1937 and living in the center of Prague. The men were divided according to the presence of risk factors (RF), overall health conditions and ECG result into the following three groups: normal (a group of men showing no RF defined above), risk (a group of men with at least one RF defined above) and pathological (a group of men with a manifested cardiovascular disease). Long-term observation of patients was based on following the men from the normal and risk groups (randomly divided into intervened risk group – RGI and control risk group – RGC). The men from the pathological group were excluded from further observation. Table 9 shows the distribution of men in the initial groups.

Table 9
Number of men in different groups

Group n %

Normal 277 19.5

Risk 861 60.8

Pathological 114 8.0

Non classifiable 165 11.6

Total 1417 100

STULONG is the data set concerning this longitudinal study of the risk factors of atherosclerosis. Four data files were created when transforming the collected data into electronic form:¹

¹ The study was realized at the 2nd Department of Medicine, 1st Faculty of Medicine of Charles University and Charles University Hospital, U nemocnice 2, Prague 2 (head. Prof. M. Aschermann, MD, SDr, FESC), under the supervision of Prof. F. Boudník, MD, ScD, with collaboration of M. Tomečková, MD, PhD and Ass. Prof. J. Bultas, MD, PhD. The data were transferred to the electronic form by the European Centre of Medical Informatics, Statistics and Epidemiology of Charles University and Academy of Sciences (head. Prof. RNDr. J. Zvárová, DrSc).

1.
the file ENTRY consists of a data table of 1417 objects (patients) and values of 224 attributes obtained from entry examinations,
2.
the file CONTROL contains results of observation of 66 attributes recorded during the follow-up control examinations that took almost 20 years (10572 records),
3.
the file LETTER contains some additional information (values of 62 attributes) about the health status of 403 men that was collected by a postal questionnaire,
4.
the file DEATH contains data about causes of death of 389 patients who died during the study (values of 5 attributes).

Table 10
Summary of the ENTRY data table

Group of attributes No. of attributes

Identification data 2

Social characteristics 5

Physical activity 4

Smoking 3

Drinking of alcohol 10

Sugar, coffee, tea 3

Family history 160

Personal history 18

Chest pain, lower limbs pain, asthma 3

Physical examination 8

Biochemical examination 3

Risk factors 5

We use only the file ENTRY to test our approach to creating a comprehensive concept description. Table 10 shows the summary of the attributes present in this file. This data has been analyzed using LISp-Miner with the aim of finding interesting relationships between attributes from different groups [9]. Now we are interested in concept description, where the concept to be described is the group a man belongs to, i.e., normal group, risk group or pathological group. So we run the 4ft-Miner procedure repeatedly for different concepts set to be the succedent of the rules and we allow the antecedent to be composed of any of the remaining attributes. We however restrict the maximal length of the antecedent to 4, and let the minimal confidence vary between 0.8 and 0.9. We then run the subsequent meta-learning step on the found rules. Table 11 summarizes the results in the same way as described in Section 4. But we also try to interpret the created meta-rules.

Table 11
Results on ENTRY data table

Concept	4ft rules (No. rules)	4ft meta-rules (No. rules)	4ft meta-rules (reduction to)	Frequent cedents (No. rules)	Frequent cedents (reduction to)
Pathological	768	45	6%	28	4%
Risky	681	16	2%	21	3%
Normal	581	40	7%	44	8%

The most frequently occurring cedents for the pathological group are related to long-term smoking, high diastolic blood pressure, diabetes, and chest pain of angina pectoris type. Surprisingly, ictus was typically not diagnosed in this group. If an original 4ft rule refers to chest pain of angina pectoris type, it also refers to high diastolic blood pressure, or to moderate physical activity or to moderate alcohol consumption.

The most frequently occurring cedents for the risky group are related to a high level of cholesterol and heavy smoking, while asthma, diabetes or myocardial infarction were typically not diagnosed. Interestingly, normal values of chest pain and asthma are related to a high level of cholesterol in the original 4ft rules.

The most frequently occurring cedents for the normal group are related to low diastolic blood pressure, low systolic blood pressure, short-time smoking or non-smoking, university education, no hypertension, no hyperlipidemia, no ictus or no myocardial infarction. If an original 4ft rule refers to no smoking or short-time smoking, it also refers to low diastolic pressure or university education or no hypertension.

7. Conclusions

In association rule mining, the interpretation of the found rules can be very frustrating for the domain experts, as they have to inspect and consider every rule. Hence, different post-processing techniques have been proposed to simplify this step. The post-processing typically has the form of filtering out redundant rules or “merging” rules into more general ones. The post-processing can either be based only on the found association rules or can exploit some domain knowledge. We present a novel method for post-processing association rules that were created for concept description. Our approach is based on the idea of meta-learning: a subsequent association rule mining step is applied to the results of “standard” association rule mining. So the rules found during the first step are used as data for subsequent learning. Doing this, we obtain “rules about rules” that, in a condensed form, represent the knowledge found using the association rules generated in the first step. No domain knowledge is used in the post-processing itself; the participation of a domain expert is of course crucial for interpreting the post-processing results but the meta-learning runs without any intervention of the expert.

We carried out several experiments on both benchmark and real data sets. We focused on qualitative meta-rules and frequent cedents as we think that these types of meta-rules can be more informative for the domain experts. The reported experiments support our working hypothesis that the number of the meta-rules will be significantly smaller than the number of original rules. Thus the interpretation of the meta-rules by a domain expert will be significantly less time-consuming and less difficult compared to the interpretation of the original association rules. Additionally, the meta-rules can give the user a summarized interpretation of the original rules. Our experimental results also show that there is no straightforward relationship between the size of the analyzed data (number of objects, number of attributes), the number of 4ft rules and the number of 4ft meta-rules. A high relative size of the set of 4ft meta-rules just means low redundancy in the set of 4ft rules, and a low relative size of the set of 4ft meta-rules means high redundancy in the set of 4ft rules. Moreover, low redundancy in the set of 4ft rules is usually related to a smaller size of this set of rules. Finally, the correlation between the reduction rate for the number of 4ft meta-rules and that for frequent cedents was high.

Our future work will focus on other types of rules (more complex syntax of cedents, and other types of relationship between Ant and Suc) that can be generated by the 4ft-Miner procedure when mining for concept descriptions, both during the standard mining step and the meta-learning step.

Acknowledgments

This paper was prepared with the support of Institutional Funds for Support of Long-Term Development of Science and Research at the Faculty of Informatics and Statistics of University of Economics, Prague.

References

1. Agrawal, Imielinski and Swami, Mining association rules between sets of items in large databases, in: SIGMOD Conference, 1993, pp. 207–216.

2. Khan and Huang, Objective and subjective algorithms for grouping association rules, in: Third IEEE Conference on Data Mining, ICDM’03, 2003, pp. 477–480.

3. Ashrafi M.Z., Taniar and Smith, A New Approach of Eliminating Redundant Association Rules, in: 15th International Conference on Database and Expert Systems Applications, DEXA 2004, Galindo et al., Springer, LNCS 3180, 2004, pp. 465–474.

4. Baesens, Viaene and Vanthienen, Post-processing of association rules, in: The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’2000, Boston, Massachusetts, 2000.

5. Bauer and Kohavi, An empirical comparison of voting classification algorithms: bagging, boosting, and variants, Machine Learning 36(1/2) (1999), 105–139.

6. Berka, Learning compositional decision rules using the KEX algorithm, Intelligent Data Analysis 16(4) (2012), 665–681.

7. Berka, Towards Comprehensive Concept Description Based on Association Rules, in: Advances in Intelligent Data Analysis XII, 12th Int. Symposium IDA 2013, Tucker, Hoppner, Siebes, Swift, eds, Springer LNCS 8207, 2013, pp. 80–91.

8. Berka and Rauch, Meta-learning for Post-processing of Association Rules, in: 12th Int. Conf. Data Warehousing and Knowledge Discovery DaWaK 2010, Pedersen, Mohaia, Tjoa, eds, Springer, LNCS 6263, 2010, pp. 251–262.

9. Berka, Rauch and Tomečková, Data Mining in the Atherosclerosis Risk Factor Data, in: Data Mining and Medical Knowledge Management: Cases and Applications, Berka, Rauch and Zighed D.A., eds, IGI Global, 2009, pp. 376–397.

10. Chapman, Clinton, Kerber, Khabaza, Reinartz, Shearer and Wirth, CRISP-DM 1.0 Step-by-step data mining guide. SPSS Inc., 2000.

11. Domingues M.A. and Rezende S.O., Using Taxonomies to Facilitate the Analysis of the Association Rules, in: Second International Workshop on Knowledge Discovery and Ontologies, KDO’05, ECML/PKDD, Porto, 2005.

12. Duda R.O. and Gasching J.E., Model design in the Prospector consultant system for mineral exploration, in: Expert Systems in the Micro Electronic Age, Michie, ed, Edinburgh University Press, UK, 1979.

13. Hájek and Havránek, Mechanising Hypothesis Formation – Mathematical Foundations for a General Theory, Springer, 1978.

14. Han, Pei and Yin, Mining Frequent Patterns without Candidate Generation, in: Proc. ACM-SIGMOD Int. Conf. on Management of Data, 2000.

15. Jorge, Poças and Azevedo P.J., Post-processing Operators for Browsing Large Sets of Association Rules, Discovery Science, 2002, pp. 414–421.

16. The LISp-Miner Project, http://lispminer.vse.cz/.

17. Marinica and Guillet, Knowledge-based interactive postmining of association rules using ontologies, IEEE Trans. on Knowledge and Data Engineering 22(6) (2010), 784–797.

18. Rauch, Logic of association rules, Applied Intelligence 22 (2005), 9–28.

19. Rauch, Observational calculi and association rules, Studies in Computational Intelligence, Vol. 469, Springer, 2013.

20. Rauch and Šimůnek, An Alternative Approach to Mining Association Rules, in: Proc. Foundations of Data Mining and Knowledge Discovery, Lin, Ohsuga, Liau and Tsumoto, eds, Springer-Verlag, 2005.

21. Sigal, Exploring interestingness through clustering, in: Proc. of the IEEE Int. Conf. on Data Mining, ICDM 2002, Maebashi City, 2002.

22. Šimůnek, Academic KDD Project LISp-Miner, in: Advances in Soft Computing Intelligent Systems Design and Applications, Abraham, Franke and Koppen, eds, Springer, 2003, pp. 263–272.

23. Toivonen, Klementinen, Roikainen, Hatonen and Mannila, Pruning and grouping discovered association rules, in: Workshop Notes of the ECML-95 Workshop on Statistics, Machine Learning and Knowledge Discovery in Databases, Heraklion, 1995, pp. 47–52.

24. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/.

25. Zaki M.J., Generating non-redundant association rules, in: Proc. of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 2000, pp. 34–43.
