Apriori and GUHA – Comparing two approaches to data mining with association rules

Abstract

Two approaches to data mining with association rules are compared: the apriori algorithm and the ASSOC procedure. The former was developed for market basket analysis at the beginning of the 1990s; there, an association rule is understood as an implication between conjunctions of attribute-value pairs. The ASSOC procedure is an implementation of the GUHA method of mechanizing hypothesis formation, developed since the 1960s. ASSOC deals with association rules understood as general relations between two general Boolean attributes. Two systems are discussed and compared: arules, a computational environment for mining association rules based on apriori, and the 4ft-Miner procedure, an implementation of the ASSOC procedure. It is shown that the arules approach to missing information does not correspond to Kleene’s approach and that this can lead to a large number of misleading rules. It is also shown that a secured completion developed for the ASSOC procedure avoids this problem.

Keywords

Association rules, apriori, GUHA, arules, 4ft-Miner

1. Introduction

Association rules were introduced in the early 1990s with the goal of better understanding the purchasing behaviour of customers in supermarkets [3]. An association rule is an expression $X\rightarrow Y$ , where $X$ and $Y$ are sets of items. The rule $X\rightarrow Y$ means that transactions containing a set $X$ of items tend to contain a set $Y$ of items. There are two basic measures of intensity of an association rule – confidence and support. The association rule discovery task is the task of finding all the association rules $X\rightarrow Y$ such that the confidence and support of $X\rightarrow Y$ are at least the user-defined thresholds minConf and minSup, respectively. The apriori algorithm [3] was introduced in 1994 to solve this task.

The idea of association rules was later generalised to data in a tabular, attribute-value form. The association rule is then understood as an expression $Ant\rightarrow Con$ , where both $Ant$ and $Con$ are conjunctions of attribute-value pairs. Additional measures of interestingness of association rules have been introduced [11]. They are usually used to filter out uninteresting rules. The arules package for R [12, 29] is a computational environment that provides a widely used implementation of the apriori algorithm.

However, the concept of association rules has been introduced and studied since the 1960s in the framework of the development of the GUHA method [17]. Monograph [15] introduces a general theory of logic of discovery based on mathematical logic and statistics. Association rules introduced in [15] are general relations $\varphi\approx\psi$ between general Boolean attributes $\varphi$ and $\psi$ derived from columns of an analysed data matrix (i.e. not only conjunctions of attribute-value pairs). The symbol $\approx$ is a 4ft-quantifier. It defines a condition concerning a contingency table of $\varphi$ and $\psi$ . An association rule $\varphi\approx\psi$ is true in a data matrix $\cal M$ if the condition given by $\approx$ is satisfied for the contingency table of $\varphi$ and $\psi$ in $\cal M$ . The term association rules has been used for such patterns $\varphi\approx\psi$ since the association rules were introduced. We will sometimes use the term GUHA association rules to emphasise that we deal with association rules $\varphi\approx\psi$ – general relations between general Boolean attributes.

The ASSOC procedure [13, 14] was invented to mine for GUHA association rules. The ASSOC procedure does not use the apriori algorithm; it uses depth-first walking of the search space and is based on suitable data structures built from strings of bits [36]. It was implemented several times and applied to many real-world data sets [18]. The core feature of the ASSOC procedure is its ability to mine for GUHA association rules satisfying given syntactical patterns. This means that one run of the ASSOC procedure can correspond to a large family of apriori runs. An additional important feature of the ASSOC procedure is a specific handling of missing information, i.e. secured completion (sometimes called pessimistic) and optimistic completion [15, 38].

The 4ft-Miner procedure introduced in [41] is an implementation of an enhancement of the ASSOC procedure. It provides very fine possibilities to define a relevant set of GUHA association rules and to find all relevant rules true in a given data matrix. This makes it possible to solve analytical questions, which cannot be solved by the apriori algorithm. The 4ft-Miner procedure also provides a version of the apriori algorithm based on strings of bits. The 4ft-Miner is a part of the LISp-Miner system, freely available at [24].

The goal of this paper is to compare the apriori algorithm to the GUHA approach for mining association rules. We have two implementations of the apriori algorithm – one in the arules package for R and one in the 4ft-Miner procedure. We also have the original GUHA approach provided by the 4ft-Miner procedure. Basic information on association rules and the arules package is provided in Section 2. The GUHA association rules as well as the main features of the 4ft-Miner procedure are introduced in Section 3. A comparison of the two apriori implementations and the GUHA approach is given in Section 4. Section 4 also shows that the arules approach to missing information can lead to the output of a relatively large number of rules that are not true in many completions of the analysed data. Specific features of the 4ft-Miner procedure are introduced in Section 5. Related work is discussed in Section 6; conclusions and further work are in Section 7.

2. Association rules and apriori

A commonly used definition of the association rule comes from [3], see also [2]. Let $I=\{I_{1},\dots,I_{K}\}$ be a set of literals – items of goods. Let ${\cal D}=\{b_{1},\dots,b_{n}\}$ be a set of transactions (market baskets), where $b_{i}\subseteq I$ for $i=1,\dots,n$ , see the left part of Fig. 1. Let us note that ${\cal D}$ can also be seen as a Boolean data matrix ${\cal M_{D}}$ with columns $\overline{I}_{1},\dots,\overline{I}_{K}$ , see the right part of Fig. 1. Here, $\overline{I}_{j}$ is a Boolean attribute saying that the item $I_{j}$ belongs to a basket, $j=1,\dots,K$ . The association rule is an expression of the form $X\rightarrow Y$ , where $X\subset I$ , $Y\subset I$ and $X\cap Y=\emptyset$ .

Figure 1.

Example of a set ${\cal D}$ of transactions.

Two basic measures of interestingness of association rules are used. The confidence is defined as

$\displaystyle conf(X\rightarrow Y)=\frac{\text{number of baskets containing }X\cup Y}{\text{number of baskets containing }X}$

and the support is defined as

$\displaystyle supp(X\rightarrow Y)=\frac{\text{number of baskets containing }X\cup Y}{n}.$

The task of mining association rules is the task of finding all association rules $X\rightarrow Y$ satisfying $conf(X\rightarrow Y)\geqslant minC$ and $supp(X\rightarrow Y)\geqslant minS$ in a given set $\cal D$ of transactions. Here, $minC$ and $minS$ are the user-specified minimum confidence and minimum support.

An itemset is each set $U$ of items satisfying $U\subset I$ . A support $supp(U)$ of the itemset $U$ in $\cal D$ is defined as

$\displaystyle supp(U)=\frac{\text{number of baskets containing }U}{n}.$

It holds $supp(X\rightarrow Y)=supp(X\cup Y)$ and

$\displaystyle conf(X\rightarrow Y)=\frac{supp(X\cup Y)}{supp(X)}.$
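The definitions above can be illustrated with a minimal Python sketch. The baskets are hypothetical toy data (they are not the transactions of Fig. 1), and the function names `supp` and `conf` are chosen here only to mirror the notation of the text.

```python
from fractions import Fraction

# Hypothetical toy transactions (not taken from Fig. 1).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]

def supp(itemset, baskets):
    """Relative support: share of baskets containing every item of `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= b for b in baskets), len(baskets))

def conf(x, y, baskets):
    """Confidence of X -> Y computed as supp(X u Y) / supp(X)."""
    return supp(set(x) | set(y), baskets) / supp(x, baskets)

print(supp({"bread", "milk"}, baskets))    # 1/2
print(conf({"bread"}, {"milk"}, baskets))  # 2/3
```

Exact fractions are used so that comparisons with thresholds are not affected by floating-point rounding.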

The task of finding all association rules $X\rightarrow Y$ satisfying $conf(X\rightarrow Y)\geqslant minC$ and $supp(X\rightarrow Y)\geqslant minS$ in a given set of market baskets ${\cal D}$ is solved in two steps:

• In the first step, all itemsets $U$ satisfying $supp(U)\geqslant minS$ are generated.

• In the second step, all rules $X\rightarrow Y$ satisfying $X\cup Y=U$ , $X\cap Y=\emptyset$ , $conf(X\rightarrow Y)\geqslant minC$ are generated for each itemset $U$ resulting from the first step.

This is realised by the apriori algorithm, which has been described many times, see e.g. [2, 3]. The apriori algorithm relies on the simple fact that if $U, V$ are itemsets satisfying $supp(U)<minS$ and $U\subsetneq V$ , then $supp(V)<minS$ also holds.
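The two steps and the pruning fact can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the arules or 4ft-Miner implementation: the names `frequent_itemsets` and `rules` and the toy baskets are hypothetical, and supports are recomputed by scanning all baskets rather than with an optimised data structure.

```python
from fractions import Fraction
from itertools import combinations

def frequent_itemsets(baskets, min_s):
    """Step 1: level-wise search for all itemsets U with supp(U) >= min_s."""
    n = len(baskets)
    supp = lambda u: Fraction(sum(u <= b for b in baskets), n)
    items = sorted({i for b in baskets for i in b})
    level = [frozenset([i]) for i in items if supp(frozenset([i])) >= min_s]
    found = {}
    while level:
        found.update({u: supp(u) for u in level})
        size = len(level[0]) + 1
        # Join: unions of two frequent itemsets that are exactly one item larger.
        candidates = {u | v for u in level for v in level if len(u | v) == size}
        # Prune: a candidate with an infrequent subset cannot be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in found for s in combinations(c, size - 1))
                 and supp(c) >= min_s]
    return found

def rules(found, min_c):
    """Step 2: split each frequent U into X -> Y with conf >= min_c."""
    out = []
    for u, s in found.items():
        for k in range(1, len(u)):
            for x in map(frozenset, combinations(u, k)):
                c = s / found[x]          # conf = supp(U) / supp(X)
                if c >= min_c:
                    out.append((set(x), set(u - x), s, c))
    return out

baskets = [{"bread", "milk"}, {"bread", "butter", "milk"},
           {"bread", "butter"}, {"milk"}]
found = frequent_itemsets(baskets, Fraction(1, 2))
found_rules = rules(found, Fraction(3, 5))
```

In this toy run the candidate {bread, butter, milk} is discarded in the prune step without counting its support, because its subset {butter, milk} is not frequent.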

The algorithm implemented in the arules package deals with transaction data in a form of Boolean data matrices introduced in the right part of Fig. 1. The Boolean data matrices are represented in a sparse representation. This means that a vector of indices of the non-zero elements (row-wise starting with the first row), and pointers where each row starts are stored. Furthermore, data represented as a data matrix with non-binary attributes can be mined. An example is a data matrix $\cal M$ with columns – attributes $A_{1},\dots,A_{K}$ shown in the left part of Fig. 2. The value $A_{1}[o_{1}]$ of the attribute $A_{1}$ for object $o_{1}$ is 1, the value $A_{2}[o_{1}]$ of the attribute $A_{2}$ for object $o_{1}$ is 7 etc. Attributes $A_{1},A_{2},A_{K}$ have categories $\{1,2,3\},\{1,\dots,7\},\{1,\dots,6\}$ respectively.

Figure 2.

Representation of data matrix $\cal M$ with non-binary attributes.

The data matrix $\cal M$ is transformed to a Boolean data matrix $\cal M_{B}$ with columns – Boolean attributes $A_{1}(1),A_{1}(2),A_{1}(3),\dots,A_{K}(6)$ . Each of these Boolean attributes corresponds to a couple attribute-category. The attribute $A_{1}(1)$ is true in a row $o_{i}$ of data matrix $\cal M_{B}$ if and only if the value of attribute $A_{1}$ in a row $o_{i}$ of data matrix $\cal M$ is 1 for $i=1,\dots,n$ . This is the same for additional columns of $\cal M_{B}$ , see the right part of Fig. 2. If a value of attribute $A_{1}$ for an object $o$ is missing, then values of attributes $A_{1}(1),A_{1}(2),A_{1}(3)$ for the object $o$ are set to 0; this is the same for additional attributes of the data matrix $\cal M$ and corresponding columns of the data matrix $\cal M_{B}$ .
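The transformation described above, including the treatment of missing values, can be sketched as follows. The function name `to_boolean_matrix`, the use of `None` for a missing value, and the two small attributes are assumptions made for this illustration only.

```python
def to_boolean_matrix(rows, categories):
    """Turn attribute-value rows into Boolean columns named 'A(v)'.

    A missing value (None) yields 0 in every column of that attribute.
    """
    columns = [(a, v) for a, values in categories.items() for v in values]
    return [
        {f"{a}({v})": int(row.get(a) == v) for a, v in columns}
        for row in rows
    ]

# Hypothetical attributes and rows; None stands for a missing value.
categories = {"A1": [1, 2, 3], "A2": [1, 2]}
rows = [{"A1": 1, "A2": 2}, {"A1": None, "A2": 1}]
boolean_rows = to_boolean_matrix(rows, categories)
```

For the second row, all three columns derived from `A1` are 0, so the row is indistinguishable from one in which `A1` is known to lie outside every listed category.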

The arules package deals with data matrix $\cal M_{B}$ in a sparse representation introduced above. A resulting set of association rules is available for further analysis. However, the arules package deals with association rules $X\rightarrow Y$ where $Y$ is a set containing one item only. Examples of arules applications are in Section 4.

An association rule $X\rightarrow Y$ concerning a data matrix $\cal M$ can be seen as an expression $A^{\prime}_{1}(a^{\prime}_{1})\land\dots\land A^{\prime}_{u}(a^{\prime}_{u})\rightarrow A^{\prime\prime}_{1}(a^{\prime\prime}_{1})\land\dots\land A^{\prime\prime}_{v}(a^{\prime\prime}_{v})$ , i.e. $X_{A}\rightarrow Y_{A}$ , where the conjunctions $X_{A}=A^{\prime}_{1}(a^{\prime}_{1})\land\dots\land A^{\prime}_{u}(a^{\prime}_{u})$ and $Y_{A}=A^{\prime\prime}_{1}(a^{\prime\prime}_{1})\land\dots\land A^{\prime\prime}_{v}(a^{\prime\prime}_{v})$ are derived columns of the Boolean data matrix ${\cal M_{B}}$ . Let us note that it holds

$\displaystyle conf(X\rightarrow Y)=\frac{\text{number of rows of }{\cal M_{B}}\text{ satisfying }X_{A}\land Y_{A}}{\text{number of rows of }{\cal M_{B}}\text{ satisfying }X_{A}}$ (1)

and

$\displaystyle supp(X\rightarrow Y)=\frac{\text{number of rows of }{\cal M_{B}}\text{ satisfying }X_{A}\land Y_{A}}{n}\text{.}$ (2)

3. GUHA association rules and 4ft-Miner

The GUHA association rules are introduced in Section 3.1. Dealing with missing information in the ASSOC procedure is introduced in Section 3.2. Main features of the 4ft-Miner procedure are presented in Section 3.3.

3.1 GUHA association rules

The GUHA association rule is an expression $\varphi\approx\psi$ where $\varphi$ and $\psi$ are Boolean attributes derived from columns of an analysed data matrix. The data matrix ${\cal M}$ in the left part of Fig. 2 is an example of a data matrix suitable for application of the GUHA association rules. The rule $\varphi\approx\psi$ means that the Boolean attributes $\varphi$ and $\psi$ are associated as given by the symbol $\approx$ . $\varphi$ is called antecedent and $\psi$ is succedent (consequent) and the symbol $\approx$ is a 4ft-quantifier.

Basic Boolean attributes are created first. A basic Boolean attribute is an expression $A(\alpha)$ where $\alpha\subset\{a_{1},\dots,a_{t}\}$ and $\{a_{1},\dots,a_{t}\}$ is the set of all categories of the attribute $A$ . The set $\alpha$ is the coefficient of the basic Boolean attribute $A(\alpha)$ . A basic Boolean attribute $A(\alpha)$ is true in a row $o$ of $\cal M$ if $A[o]\in\alpha$ . If $A[o]\not\in\alpha$ , then $A(\alpha)$ is false in the row $o$ . Boolean attributes $\varphi$ and $\psi$ are derived from basic Boolean attributes using the propositional connectives $\vee$ , $\wedge$ and $\lnot$ in the usual way. Expressions $A_{1}(1)$ and $A_{2}(4,5)$ in Fig. 3 are examples of basic Boolean attributes and expressions $A_{1}(1)\land A_{2}(4,5)$ and $\neg A_{K}(6)$ are examples of Boolean attributes. Basic Boolean attribute $A_{1}(1)$ is true in row $o_{1}$ because $A_{1}[o_{1}]=1$ and $1\in\{1\}$ . Basic Boolean attribute $A_{2}(4,5)$ is false in row $o_{1}$ because $A_{2}[o_{1}]=7$ and $7\not\in\{4,5\}$ (pedantically, we should write $A_{1}(\{1\})$ and $A_{2}(\{4,5\})$ etc.; however, we will not do this).
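The evaluation of basic and derived Boolean attributes in a row can be sketched as follows. The helper names `basic`, `conj`, `disj` and `neg` are hypothetical. The values of $A_{1}$ and $A_{2}$ in row $o_{1}$ follow the text; the value of $A_{K}$ is an assumption, since the text does not state it.

```python
def basic(attribute, *coefficient):
    """Predicate for the basic Boolean attribute A(alpha)."""
    alpha = set(coefficient)
    return lambda row: row[attribute] in alpha

def conj(*parts):
    return lambda row: all(p(row) for p in parts)

def disj(*parts):
    return lambda row: any(p(row) for p in parts)

def neg(part):
    return lambda row: not part(row)

# A1[o1] = 1 and A2[o1] = 7 as in the text; AK[o1] = 4 is assumed.
o1 = {"A1": 1, "A2": 7, "AK": 4}

a1 = basic("A1", 1)            # A1(1)
a2 = basic("A2", 4, 5)         # A2(4,5)
print(a1(o1))                  # True, because 1 is in {1}
print(a2(o1))                  # False, because 7 is not in {4, 5}
print(conj(a1, a2)(o1))        # False
print(neg(basic("AK", 6))(o1)) # True under the assumed value of AK
```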

The 4ft-quantifier $\approx$ is related to a condition concerning 4ft-tables. A 4ft-table $4ft(\varphi,\psi,{\cal M})$ of Boolean attributes $\varphi$ and $\psi$ in a data matrix ${\cal M}$ is a quadruple $4ft(\varphi,\psi,{\cal M})=\langle a,b,c,d\rangle$ where $a$ is the number of rows of ${\cal M}$ satisfying both $\varphi$ and $\psi$ , $b$ is the number of rows satisfying $\varphi$ and not satisfying $\psi$ , etc., see Fig. 4.

A condition related to a 4ft-quantifier $\approx$ concerning 4ft-tables is understood as a $\{0,1\}$ -function $F_{\approx}$ defined for all quadruples $\langle a,b,c,d\rangle$ of non-negative integers satisfying $a+b+c+d>0$ . The function $F_{\approx}$ is the associated function of the 4ft-quantifier $\approx$ . We say that a GUHA association rule $\varphi\approx\psi$ is true in a data matrix ${\cal M}$ if $F_{\approx}(a,b,c,d)=1$ where $4ft(\varphi,\psi,{\cal M})=\langle a,b,c,d\rangle$ . If $F_{\approx}(a,b,c,d)=0$ , then we say that the GUHA association rule $\varphi\approx\psi$ is false in the data matrix ${\cal M}$ . Examples of 4ft-quantifiers and their associated functions are given in Table 1, see [3, 15, 38].

Table 1
Examples of 4ft-quantifiers

#	$\approx$ – name	$F_{\approx}(a,b,c,d)=1$ if and only if
1	$\rightarrow_{p,s}$ – confidence-support [3], i.e. supported p-implication	$\frac{a}{a+b}\geqslant p\land\frac{a}{a+b+c+d}\geqslant s$
2	$\Rightarrow_{p,B}$ – founded p-implication [15]	$\frac{a}{a+b}\geqslant p\land a\geqslant B$
3	$\Rightarrow^{!}_{p,\alpha}$ – lower critical implication [15]	$\sum_{i=a}^{a+b}{a+b\choose i}p^{i}(1-p)^{a+b-i}\leqslant\alpha$
4	$\leftrightarrow_{p,s}$ – supported double implication	$\frac{a}{a+b+c}\geqslant p\land\frac{a}{a+b+c+d}\geqslant s$
5	$\Leftrightarrow_{p,B}$ – founded double implication	$\frac{a}{a+b+c}\geqslant p\land a\geqslant B$
6	$\equiv_{p,s}^{\odot}$ – supported equivalence	$\frac{a+d}{a+b+c+d}\geqslant p\land\frac{a}{a+b+c+d}\geqslant s$
7	$\equiv_{p,B}$ – founded equivalence	$\frac{a+d}{a+b+c+d}\geqslant p\land a\geqslant B$
8	$\sim_{q,s}$ – supported lift	$\frac{a(a+b+c+d)}{(a+b)(a+c)}\geqslant q\land\frac{a}{a+b+c+d}\geqslant s$
9	$\approx_{q,B}$ – founded lift	$\frac{a(a+b+c+d)}{(a+b)(a+c)}\geqslant q\land a\geqslant B$
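A 4ft-table and a few of the associated functions from Table 1 can be sketched in Python as follows. The function names are hypothetical. The inequalities are written in cross-multiplied form, which is equivalent to the fractions in Table 1 whenever the denominators are positive; with this form a rule is simply evaluated as false when $a=0$, which is a choice made for this sketch.

```python
def fourft(phi, psi):
    """4ft-table <a, b, c, d> of two equally long 0/1 columns."""
    a = sum(p and q for p, q in zip(phi, psi))
    b = sum(p and not q for p, q in zip(phi, psi))
    c = sum(not p and q for p, q in zip(phi, psi))
    d = sum(not p and not q for p, q in zip(phi, psi))
    return a, b, c, d

def confidence_support(p, s):            # row 1 of Table 1
    return lambda a, b, c, d: a >= p * (a + b) and a >= s * (a + b + c + d)

def founded_implication(p, base):        # row 2
    return lambda a, b, c, d: a >= p * (a + b) and a >= base

def supported_double_implication(p, s):  # row 4
    return lambda a, b, c, d: a >= p * (a + b + c) and a >= s * (a + b + c + d)

def supported_lift(q, s):                # row 8
    return lambda a, b, c, d: (a * (a + b + c + d) >= q * (a + b) * (a + c)
                               and a >= s * (a + b + c + d))

# Hypothetical 4ft-table: confidence 0.75, support 0.3, Jaccard 0.5, lift 1.5.
table = (30, 10, 20, 40)
print(confidence_support(0.7, 0.25)(*table))  # True
print(confidence_support(0.8, 0.25)(*table))  # False
print(supported_lift(1.4, 0.25)(*table))      # True
```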

Figure 3.

Data matrix $\cal M$ and examples of Boolean attributes.

Figure 4.

Data matrix ${\cal M}$ and 4ft-table $4ft(\varphi,\psi,{\cal M})$ of $\varphi$ and $\psi$ in ${\cal M}$ .

We assume $0<p\leqslant 1$ , $0<s\leqslant 1$ , $B>0$ and $q>0$ . The 4ft-quantifiers are defined in various papers; an overview of about forty-five 4ft-quantifiers is in [38]. Most 4ft-quantifiers are defined by applying suitable thresholds to various measures of interestingness of association rules. These measures are published under various names in various papers, see e.g. [11, 15]. We use Fig. 5 to clarify the relation between the usual definitions of measures of interestingness and definitions based on the 4ft-table $4ft(\varphi,\psi,{\cal M})=\langle a,b,c,d\rangle$ . In [11], measures of interestingness are defined for a rule $A\rightarrow B$ instead of $\varphi\approx\psi$ . This means that the 4ft-table $4ft(A,B,{\cal M})$ is used instead of $4ft(\varphi,\psi,{\cal M})$ , see Fig. 5. Here, $n(AB)$ corresponds to $a$ , $n(A{\overline{B}})$ corresponds to $b$ , $n(A)=n(AB)+n(A{\overline{B}})=a+b=r$ , etc., see Fig. 5. Frequencies (e.g. $n(AB)$ , $n(A)$ ) or probabilities and conditional probabilities are usually used in definitions of the interestingness measures [11]. An example of probability is $P(A)=\frac{n(A)}{N}$ , which denotes the probability of $A$ . An example of conditional probability is $P(B|A)=\frac{n(AB)}{n(A)}$ , which denotes the conditional probability of $B$ , given $A$ .

Figure 5.

$4ft(A,B,{\cal M})$ and $4ft(\varphi,\psi,{\cal M})$ .

The 4ft-quantifiers $\leftrightarrow_{p,s}$ and $\Leftrightarrow_{p,B}$ are based on the Jaccard measure defined as

$\displaystyle P(AB)/(P(A)+P(B)-P(AB))=\frac{\frac{a}{n}}{\frac{a+b}{n}+\frac{a+c}{n}-\frac{a}{n}}=\frac{a}{a+b+c}$

and the 4ft-quantifiers $\equiv_{p,s}^{\odot}$ and $\equiv_{p,B}$ are based on the measure called accuracy or success rate and defined as $P(AB)+P(\lnot A\lnot B)=\frac{a}{a+b+c+d}+\frac{d}{a+b+c+d}=\frac{a+d}{a+b+c+d}.$ The 4ft-quantifiers $\sim_{q,s}$ and $\approx_{q,B}$ are based on the measure lift defined as

$\displaystyle P(B|A)/P(B)=\frac{\frac{a}{a+b}}{\frac{a+c}{a+b+c+d}}=\frac{a(a+b+c+d)}{(a+b)(a+c)}.$

There are also important 4ft-quantifiers based on statistical hypothesis tests [15, 38], see the 4ft-quantifier $\Rightarrow^{!}_{p,\alpha}$ in row 3 of Table 1 and Note 3.1.

Note 3.1. The condition $\sum_{i=a}^{a+b}{a+b\choose i}p^{i}(1-p)^{a+b-i}\leqslant\alpha$ corresponds to the statistical test (on the level $\alpha$ ) of the null hypothesis $H_{0}:P(\psi|\varphi)\leqslant p$ against the alternative one $H_{1}:P(\psi|\varphi)>p$ . Here, $P(\psi|\varphi)$ is the conditional probability of the validity of $\psi$ under the condition $\varphi$ [15, 38].
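The condition of Note 3.1 is an upper tail of a binomial distribution and can be evaluated directly. The following sketch uses only the Python standard library; the names `binomial_upper_tail` and `lower_critical_implication` and the example frequencies are assumptions made for illustration.

```python
from math import comb

def binomial_upper_tail(a, b, p):
    """Sum over i = a, ..., a+b of C(a+b, i) * p**i * (1-p)**(a+b-i)."""
    r = a + b
    return sum(comb(r, i) * p**i * (1 - p)**(r - i) for i in range(a, r + 1))

def lower_critical_implication(p, alpha):
    """Associated function of the quantifier in row 3 of Table 1."""
    return lambda a, b, c, d: binomial_upper_tail(a, b, p) <= alpha

quantifier = lower_critical_implication(0.5, 0.05)
print(quantifier(9, 1, 0, 0))  # True: the tail is 11/1024, about 0.011
print(quantifier(6, 4, 0, 0))  # False: the tail is 386/1024, about 0.377
```

The frequencies $c$ and $d$ do not enter the condition, which is why they are set to 0 in the example.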

Recall the data matrices $\cal M$ and $\cal M_{B}$ introduced in Fig. 2. The columns of $\cal M_{B}$ can be seen as basic Boolean attributes with one-category coefficients. Thus each association rule $X\rightarrow Y$ concerning a data matrix $\cal M$ can be seen as a GUHA association rule $A^{\prime}_{1}(a^{\prime}_{1})\land\dots\land A^{\prime}_{u}(a^{\prime}_{u})\approx A^{\prime\prime}_{1}(a^{\prime\prime}_{1})\land\dots\land A^{\prime\prime}_{v}(a^{\prime\prime}_{v})$ , i.e. $X_{A}\approx Y_{A}$ , where $X_{A}=A^{\prime}_{1}(a^{\prime}_{1})\land\dots\land A^{\prime}_{u}(a^{\prime}_{u})$ , $Y_{A}=A^{\prime\prime}_{1}(a^{\prime\prime}_{1})\land\dots\land A^{\prime\prime}_{v}(a^{\prime\prime}_{v})$ and $\approx$ is a suitable 4ft-quantifier. Let us assume $4ft(X_{A},Y_{A},{\cal M})=\langle a,b,c,d\rangle$ . Then we get relations $conf(X\rightarrow Y)=conf(X_{A}\rightarrow Y_{A})=\frac{a}{a+b}$ and also $supp(X\rightarrow Y)=supp(X_{A}\rightarrow Y_{A})=\frac{a}{a+b+c+d}$ , see also Eqs (1) and (2) at the end of Section 2. This means that the association rule $X\rightarrow Y$ can be seen as the GUHA association rule $X_{A}\rightarrow_{minC,minS}Y_{A}$ .

This way, each association rule $X\rightarrow Y$ can be seen as a GUHA association rule. However, there are GUHA association rules, which cannot be seen as association rules without transformations of original columns of the analysed data matrix. Examples are in Section 5.

3.2 GUHA association rules and missing information

3.2.1 Secured X-extension

Missing information is a serious problem in data mining. A specific approach called secured X-extension is developed in [15, 35] and applied to the ASSOC procedure. Each attribute can get a value $X$ in addition to its regular values. The $X$ value is interpreted as "the value of the attribute is not known". This means that we have to deal with a data matrix ${\cal M}^{X}$ with missing information. An example of such a data matrix is in the left part of Fig. 6. A completion of a data matrix with missing values is a data matrix in which all $X$ values are replaced by possible regular values. The data matrix ${\cal M}$ in the right part of Fig. 6 is an example of a completion of ${\cal M}^{X}$ .

Figure 6.

Data matrix ${\cal M}^{X}$ with missing information and its completion ${\cal M}$ .

It is natural that in some cases the values of Boolean attributes derived from columns of a data matrix ${\cal M}^{X}$ with missing information cannot be known, and the same is true for association rules. However, in some cases, we can determine the value of a Boolean attribute without the danger of a mistake. This means that we can unmistakably say that a Boolean attribute is true or false for a row $o$ of all completions of ${\cal M}^{X}$ . If we cannot do so, then we say that the value of the Boolean attribute for the row $o$ is $X$ . A value of an association rule in ${\cal M}^{X}$ is determined similarly. This means:

• An association rule $\varphi\approx\psi$ is true in a data matrix ${\cal M}^{X}$ with missing information if it is true in all possible completions of ${\cal M}^{X}$ ;

• An association rule $\varphi\approx\psi$ is false in a data matrix ${\cal M}^{X}$ with missing information if it is false in all possible completions of ${\cal M}^{X}$ ;

• Otherwise, the value of $\varphi\approx\psi$ is $X$ .

This approach is introduced in [15] as the principle of secured X-extension, see also [35, 38]. The main features of this approach are summarized below.

Boolean attributes can be assigned values from $\{0,1,X\}$ . We refer to such Boolean attributes as Boolean attributes with missing values or three-valued attributes. According to the principle of secured X-extension, it holds for a value $A(\alpha)[o]$ of a basic Boolean attribute $A(\alpha)$ in a row $o$ of a data matrix ${\cal M}^{X}$ with missing information: $A(\alpha)[o]=1$ if $A[o]\in\alpha$ , $A(\alpha)[o]=0$ if $A[o]\not\in\alpha$ and $A(\alpha)[o]=X$ if $A[o]=X$ . Values of derived three-valued attributes are introduced in Fig. 7. Let us note that the secured extension of values of derived Boolean attributes leads to the same values as Kleene’s logic [22]. Examples of values of derived three-valued attributes are in Fig. 8.
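Since the secured extension leads to the same values as Kleene's logic, the three-valued connectives can be sketched directly. In this illustration the value $X$ is represented by Python's `None`, and the function names are hypothetical.

```python
X = None  # stands for the value X ("not known")

def k_not(v):
    return X if v is X else 1 - v

def k_and(u, v):
    if u == 0 or v == 0:      # one false operand decides the conjunction
        return 0
    if u is X or v is X:
        return X
    return 1

def k_or(u, v):
    if u == 1 or v == 1:      # one true operand decides the disjunction
        return 1
    if u is X or v is X:
        return X
    return 0

def basic(value, alpha):
    """Value of the basic Boolean attribute A(alpha) for A[o] = value."""
    return X if value is X else int(value in alpha)

print(k_and(0, X))            # 0: false in every completion
print(k_or(1, X))             # 1: true in every completion
print(k_and(1, X))            # None, i.e. X
print(basic(X, {1, 2}))       # None, i.e. X
```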

Figure 7.

Values of derived three-valued attributes.

Figure 8.

Three-valued attributes in data matrix ${\cal M}^{X}$ with missing information.

This means that we deal with association rules concerning three-valued attributes with possible values $\{0,1,X\}$ . There are nine possible combinations of values $\{0,1,X\}$ of two three-valued attributes and, thus, we have to deal with a nine-fold table $9ft(\varphi,\psi,{\cal M}^{X})$ of three-valued attributes $\varphi$ and $\psi$ in a data matrix ${\cal M}^{X}$ with missing information instead of a 4ft-table. An example of $9ft(\varphi,\psi,{\cal M}^{X})$ is in the left part of Fig. 9.

Figure 9.

$9ft(\varphi,\psi,{\cal M}^{X})$ and $4ft(\varphi,\psi,{\cal M})$ where ${\cal M}$ is a completion of ${\cal M}^{X}$ .

Here, $f_{1,1}$ is the number of rows $o$ of ${\cal M}^{X}$ satisfying both $\varphi[o]=1$ and $\psi[o]=1$ , $f_{1,X}$ is the number of rows satisfying both $\varphi[o]=1$ and $\psi[o]=X$ , etc. We also write $9ft(\varphi,\psi,{\cal M}^{X})=\langle f_{1,1},f_{1,X},f_{1,0},f_{X,1},f_{X,X},f_{X,0},f_{0,1},f_{0,X},f_{0,0}\rangle$ . If $\cal M$ is a completion of ${\cal M}^{X}$ , then the 4ft-table $4ft(\varphi,\psi,{\cal M})$ can be derived from $9ft(\varphi,\psi,{\cal M}^{X})$ in the way outlined in the right part of Fig. 9. Here, $f_{1,X,a}$ , $f_{1,X,b}$ , $f_{X,1,a}$ , $f_{X,1,c}$ , $f_{X,0,b}$ , $f_{X,0,d}$ , $f_{0,X,c}$ , $f_{0,X,d}$ , $f_{X,X,a}$ , $f_{X,X,b}$ , $f_{X,X,c}$ , $f_{X,X,d}$ are non-negative integers satisfying $f_{1,X,a}+f_{1,X,b}=f_{1,X}$ , $f_{X,1,a}+f_{X,1,c}=f_{X,1}$ , $f_{X,0,b}+f_{X,0,d}=f_{X,0}$ , $f_{0,X,c}+f_{0,X,d}=f_{0,X}$ , and $f_{X,X,a}+f_{X,X,b}+f_{X,X,c}+f_{X,X,d}=f_{X,X}$ . This means that $f_{1,X,a}$ is the number of rows $o$ that satisfy $\varphi[o]=1$ and $\psi[o]=X$ in ${\cal M}^{X}$ as well as $\varphi[o]=1$ and $\psi[o]=1$ in ${\cal M}$ ; similarly for the additional frequencies $f_{X,1,a}$ , $f_{X,1,c}$ , …, $f_{X,X,d}$ .

We are interested in whether a given association rule $\varphi\approx\psi$ is true in a given data matrix ${\cal M}^{X}$ with missing information. This means that we have to test if $F_{\approx}(a,b,c,d)=1$ holds for each 4ft-table $\langle a,b,c,d\rangle=4ft(\varphi,\psi,{\cal M})$ , where ${\cal M}$ is a completion of ${\cal M}^{X}$ . There is a unique nine-fold table $9ft(\varphi,\psi,{\cal M}^{X})$ assigned to given $\varphi,\psi$ and ${\cal M}^{X}$ . It is crucial that for important 4ft-quantifiers, it is possible to get a secured four-fold table $\langle a_{s},b_{s},c_{s},d_{s}\rangle$ for each nine-fold table $9ft(\varphi,\psi,{\cal M}^{X})$ of three-valued attributes $\varphi,\psi$ and a data matrix ${\cal M}^{X}$ [38]. A 4ft-table $\langle a_{s},b_{s},c_{s},d_{s}\rangle$ is a secured four-fold table for a nine-fold table $9ft(\varphi,\psi,{\cal M}^{X})$ and a 4ft-quantifier $\approx$ if $F_{\approx}(a_{s},b_{s},c_{s},d_{s})=1$ holds if and only if the association rule $\varphi\approx\psi$ is true in all completions of the data matrix ${\cal M}^{X}$ . Examples of secured four-fold tables for 4ft-quantifiers introduced in Table 1 and $9ft(\varphi,\psi,{\cal M}^{X})=\langle f_{1,1},f_{1,X},f_{1,0},f_{X,1},f_{X,X},f_{X,0},f_{0,1},f_{0,X},f_{0,0}\rangle$ follow:

• $\langle a_{s},b_{s},c_{s},d_{s}\rangle=\langle f_{1,1},f_{1,0}+f_{1,X}+f_{X,X}+f_{X,0},f_{0,1}+f_{X,1},f_{0,0}+f_{0,X}\rangle$ for 4ft-quantifiers $\rightarrow_{p,s}$ , $\Rightarrow_{p,B}$ , and $\Rightarrow^{!}_{p,\alpha}$ .

• $\langle a_{s},b_{s},c_{s},d_{s}\rangle=\langle f_{1,1},f_{1,0}+f_{1,X}+f_{X,X}+f_{X,0},f_{0,1}+f_{X,1}+f_{0,X},f_{0,0}\rangle$ for 4ft-quantifiers $\leftrightarrow_{p,s}$ , $\Leftrightarrow_{p,B}$ , $\equiv_{p,s}^{\odot}$ , $\equiv_{p,B}$ .

• For 4ft-quantifiers $\sim_{q,s}$ and $\approx_{q,B}$ it holds $\langle a_{s},b_{s},c_{s},d_{s}\rangle=\langle f_{1,1},f_{1,0}+f_{1,X}+f_{X,X,b}+f_{X,0},f_{0,1}+f_{X,1}+f_{X,X,c}+f_{0,X},f_{0,0}\rangle$ , where $f_{X,X,b}\geqslant 0$ , $f_{X,X,c}\geqslant 0$ , $f_{X,X,b}+f_{X,X,c}=f_{X,X}$ and $f_{X,X,b},f_{X,X,c}$ are chosen such that $|b_{s}-c_{s}|$ is minimal.

Let us note that in many cases there is more than one secured 4ft-table for a given nine-fold table $9ft(\varphi,\psi,{\cal M}^{X})$ and 4ft-quantifier $\approx$ . Proofs and additional details are available in [38].
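The first two secured tables listed above can be sketched in Python and compared with a brute-force search over completions. This is an illustration under simplifying assumptions: the columns `phi` and `psi` are hypothetical, the function names are chosen here, and every $X$ value of $\varphi$ and $\psi$ is completed independently, which ignores possible dependencies between derived attributes.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

X = "X"

def ninefold(phi, psi):
    """Frequencies f_{u,v} of the nine value combinations of (phi, psi)."""
    return Counter(zip(phi, psi))

def secured_implication(f):
    """Secured table for the implication quantifiers (first bullet above)."""
    return (f[1, 1],
            f[1, 0] + f[1, X] + f[X, X] + f[X, 0],
            f[0, 1] + f[X, 1],
            f[0, 0] + f[0, X])

def secured_double_implication(f):
    """Secured table for the second bullet above."""
    return (f[1, 1],
            f[1, 0] + f[1, X] + f[X, X] + f[X, 0],
            f[0, 1] + f[X, 1] + f[0, X],
            f[0, 0])

def completions(values):
    """All ways of replacing each X in a three-valued column by 0 or 1."""
    slots = [i for i, v in enumerate(values) if v == X]
    for fill in product((0, 1), repeat=len(slots)):
        done = list(values)
        for i, v in zip(slots, fill):
            done[i] = v
        yield done

phi = [1, 1, X, 0, X, 1]   # hypothetical three-valued columns
psi = [1, X, 1, X, X, 0]
a, b, c, d = secured_implication(ninefold(phi, psi))

# Lowest confidence a/(a+b) over all completions of phi and psi.
worst = min(
    Fraction(sum(p == q == 1 for p, q in zip(cp, cq)), sum(cp))
    for cp in completions(phi) for cq in completions(psi)
)
print((a, b, c, d), worst)   # (1, 3, 1, 1) 1/4
```

In this example the confidence computed from the secured table equals the lowest confidence found among all completions, so a threshold $p$ met by the secured table is met in every completion.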

3.2.2 Additional ways of dealing with missing information

There are two additional ways of dealing with missing information related to the ASSOC procedure: the optimistic completion and the deleting of missing information [16, 38]. If we use the optimistic completion, then an association rule $\varphi\approx\psi$ is considered true in a data matrix ${\cal M}^{X}$ with missing information if there is a completion $\cal M$ of ${\cal M}^{X}$ such that $\varphi\approx\psi$ is true in $\cal M$ . A 4ft-table $\langle a_{o},b_{o},c_{o},d_{o}\rangle$ is an optimistic four-fold table for a nine-fold table $9ft(\varphi,\psi,{\cal M}^{X})$ and a 4ft-quantifier $\approx$ if $F_{\approx}(a_{o},b_{o},c_{o},d_{o})=1$ holds if and only if there is a completion of ${\cal M}^{X}$ in which the rule $\varphi\approx\psi$ is true. For important 4ft-quantifiers, it is possible to get an optimistic four-fold table for each nine-fold table $9ft(\varphi,\psi,{\cal M}^{X})$ of three-valued attributes $\varphi,\psi$ and a data matrix ${\cal M}^{X}$ [38]. An optimistic four-fold table for $9ft(\varphi,\psi,{\cal M}^{X})=\langle f_{1,1},f_{1,X},f_{1,0},f_{X,1},f_{X,X},f_{X,0},f_{0,1},f_{0,X},f_{0,0}\rangle$ and 4ft-quantifiers $\rightarrow_{p,s}$ , $\Rightarrow_{p,B}$ , and $\Rightarrow^{!}_{p,\alpha}$ (see Table 1) is each 4ft-table $\langle f_{1,1}+f_{1,X}+f_{X,1}+f_{X,X},f_{1,0},f_{0,1}+f_{0,X,c},f_{0,0}+f_{0,X,d}+f_{X,0}\rangle$ , where $f_{0,X,c}$ and $f_{0,X,d}$ are as above.

If we use the deleting of missing information, then a value of an association rule $\varphi\approx\psi$ in a data matrix ${\cal M}^{X}$ with missing information is defined as $F_{\approx}(f_{1,1},f_{1,0},f_{0,1},f_{0,0})$ if $f_{1,1}+f_{1,0}+f_{0,1}+f_{0,0}>0$ and as $X$ otherwise. More information concerning dealing with missing information in the ASSOC procedure is in [38].
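The optimistic completion and the deleting of missing information can be sketched in the same style. The nine-fold frequencies, the function names and the parameter `f0x_to_c` (how many of the $f_{0,X}$ rows are counted in $c$; any split is allowed) are assumptions made for this illustration.

```python
from collections import Counter

X = "X"

def optimistic_implication(f, f0x_to_c=0):
    """Optimistic table for the implication quantifiers of Table 1."""
    return (f[1, 1] + f[1, X] + f[X, 1] + f[X, X],
            f[1, 0],
            f[0, 1] + f0x_to_c,
            f[0, 0] + (f[0, X] - f0x_to_c) + f[X, 0])

def value_after_deleting(quantifier, f):
    """F(f11, f10, f01, f00) if some row is fully known, otherwise X."""
    known = (f[1, 1], f[1, 0], f[0, 1], f[0, 0])
    return int(bool(quantifier(*known))) if sum(known) > 0 else X

# Hypothetical nine-fold table with 100 rows; unlisted frequencies are 0.
f = Counter({(1, 1): 40, (1, X): 5, (1, 0): 10,
             (X, X): 5, (0, X): 10, (0, 0): 30})

# Founded implication with p = 0.75 and B = 20, in cross-multiplied form.
founded = lambda a, b, c, d: a >= 0.75 * (a + b) and a >= 20

print(optimistic_implication(f))            # (50, 10, 0, 40)
print(founded(*optimistic_implication(f)))  # True: 50/60 is at least 0.75
print(value_after_deleting(founded, f))     # 1: 40/50 is at least 0.75
```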

Let us note that arules uses an approach to missing information that is different from each of those introduced above (secured completion, optimistic completion and deleting). The arules approach can be described as follows: if $A$ is an attribute of a data matrix ${\cal M}^{X}$ with missing information and $A[o]=X$ for a row $o$ of ${\cal M}^{X}$ , then $A(\alpha)[o]=0$ for each basic Boolean attribute $A(\alpha)$ , see Section 2. The idea behind this approach is that the $X$ -value does not carry information. This approach is also implemented in the 4ft-Miner procedure under the name ignoring of missing information.

However, such ignoring of missing information means that the absence of information about the value of a Boolean attribute $A(a)$ is replaced by the assertion that $A(a)$ is false. This leads to situations like the following. Consider a data matrix ${\cal M}^{X}$ with 100 rows and attributes (columns) $A_{1}$ and $A_{2}$ with values as in Fig. 10 (i.e. missing values for rows 51–100). Let us deal with the association rule $A_{1}(2)\rightarrow A_{2}(2)$ . If we use the arules representation of ${\cal M}^{X}$ shown in Fig. 10, we get $conf(A_{1}(2)\rightarrow A_{2}(2))=\frac{50}{50}=1$ and $supp(A_{1}(2)\rightarrow A_{2}(2))=\frac{50}{50+50}=0.5$ . However, if we consider the secured completion ${\cal M}_{S}$ also shown in Fig. 10, then we get $conf(A_{1}(2)\rightarrow A_{2}(2))=\frac{50}{100}=0.5$ and again $supp(A_{1}(2)\rightarrow A_{2}(2))=0.5$ . This means that the rule $A_{1}(2)\rightarrow A_{2}(2)$ will be a part of the apriori output for all tasks with input ${\cal M}^{X}$ , $minC>0.5$ and $minS=0.5$ even though there is a completion ${\cal M}_{S}$ of ${\cal M}^{X}$ for which the rule $A_{1}(2)\rightarrow A_{2}(2)$ is false for $minC>0.5$ and $minS=0.5$ .
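The numbers in this example can be reproduced with a short sketch. Fig. 10 is not reproduced here, so the data are an assumption consistent with the reported values: both attributes equal 2 in rows 1–50 and both are missing in rows 51–100. The category 1 used to complete $A_{2}$ is also assumed; any value other than 2 gives the same table.

```python
X = None  # missing value

A1 = [2] * 50 + [X] * 50   # assumed reading of Fig. 10
A2 = [2] * 50 + [X] * 50

def table(phi, psi):
    """4ft-table <a, b, c, d> of two 0/1 columns."""
    pairs = list(zip(phi, psi))
    return tuple(pairs.count(k) for k in [(1, 1), (1, 0), (0, 1), (0, 0)])

def is_two(column):
    """Column of the basic Boolean attribute A(2); a missing value gives 0."""
    return [int(v == 2) for v in column]

# Ignoring: every missing value makes A1(2) and A2(2) false.
ignoring = table(is_two(A1), is_two(A2))

# Completion M_S: each missing A1 becomes 2, each missing A2 becomes 1.
M_S = ([2 if v is X else v for v in A1], [1 if v is X else v for v in A2])
secured = table(is_two(M_S[0]), is_two(M_S[1]))

for name, (a, b, c, d) in [("ignoring", ignoring), ("M_S", secured)]:
    print(name, "conf =", a / (a + b), "supp =", a / (a + b + c + d))
# ignoring conf = 1.0 supp = 0.5
# M_S conf = 0.5 supp = 0.5
```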

Figure 10.

Ignoring and secured completion.

Let us emphasise that we introduce only those approaches to missing information that are implemented in arules or in the 4ft-Miner procedure. General approaches to missing information in data mining are presented in [2]; additional approaches to dealing with missing information in data mining with association rules are introduced in [21, 30].

3.3 4ft-Miner procedure

3.3.1 4ft-Miner – a part of the LISp-Miner system

The 4ft-Miner procedure is an implementation of the enhanced ASSOC procedure. It is one of nine GUHA procedures implemented in the LISp-Miner system. LISp-Miner is an academic system for knowledge discovery in databases (KDD for short) developed as rich and possibly an easy-to-use environment for the education of basic and advanced courses and for research in KDD [46, 47]. System architecture, an internal metadata storage model and user interface have been designed with this in mind. The LM Workspace module especially is meant for interactive use and repeated iterations among data preprocessing, modelling and result interpretation phases of the KDD process.

The aim was to offer rich possibilities to formulate analytical tasks, based on the GUHA method approach, with nine GUHA procedures currently available. There are also functions for further processing of found results, namely to compare them with consequences of already known items of domain knowledge and to filter them accordingly [44]. But there are also means implemented for a more automated approach to KDD using script language and specially designed modules for solving KDD tasks in the background [45, 48].

The goal of this paper is to compare the apriori and ASSOC approaches to mining with association rules. We first have to outline the 4ft-Miner options for defining a set of relevant association rules and to introduce the principles of the ASSOC implementation. This is done in Sections 3.3.2 and 3.3.3. A slightly modified version of the apriori algorithm was implemented in the 4ft-Miner procedure for mining simple association rules, so we can compare the GUHA approach of depth-first walking through the searched space of potentially interesting relevant patterns with the breadth-first walking of apriori. This implementation is briefly introduced in Section 3.3.4. Additional features of the 4ft-Miner and the LISp-Miner approach to mining with association rules are then briefly introduced in Section 5.3.

3.3.2 Defining set of relevant GUHA association rules

The 4ft-Miner procedure mines for the GUHA association rules $\varphi\approx\psi$ and also for conditional association rules introduced in Section 5.3.2. Input of the procedure consists of definitions of a set $\Phi$ of relevant antecedents $\varphi$ and a set $\Psi$ of relevant succedents $\psi$ , a definition of a relevant 4ft-quantifier $\approx$ , a definition of a way of dealing with missing information and of additional parameters specifying output details.

The derived Boolean attributes $\varphi$ and $\psi$ are called cedents. Each cedent is a conjunction of partial cedents. Each partial cedent is a conjunction or a disjunction of basic Boolean attributes. Attributes – columns of a data matrix are usually structured in several groups with a natural interpretation. Let us have a set ${\cal G}=\{G_{1},\dots,G_{u}\}$ of mutually disjoint groups $G_{1},\dots,G_{u}$ of attributes. If we are interested in Boolean characteristics of the set ${\cal G}$ , it is reasonable to consider each such Boolean characteristic as a conjunction $\omega_{\cal G}=\omega_{1}\land\dots\land\omega_{u}$ of partial cedents $\omega_{1},\dots,\omega_{u}$ – Boolean characteristics of the groups $G_{1},\dots,G_{u}$ respectively.

Let us consider a group $G=\{A_{1},\dots,A_{v}\}$ of attributes $A_{1},\dots,A_{v}$ . Then each conjunction $A_{1}(\alpha_{1})\land\dots\land A_{v}(\alpha_{v})$ and each disjunction $A_{1}(\alpha_{1})\lor\dots\lor A_{v}(\alpha_{v})$ of basic Boolean attributes can be considered as a partial cedent, i.e. a Boolean characteristic of the group $G$ . A definition of a set of relevant partial cedents consists of a set $G=\{A_{1},\dots,A_{v}\}$ of attributes; definitions of sets ${\cal B}(A_{1})$ , …, ${\cal B}(A_{v})$ of basic Boolean attributes to be derived from the particular attributes $A_{1},\dots,A_{v}$ respectively; a minimal length $l_{min}$ and a maximal length $l_{max}$ satisfying $1\leqslant l_{min}\leqslant l_{max}\leqslant v$ ; a parameter specifying whether we are interested in conjunctions or in disjunctions; and additional parameters that make it possible to fine-tune the syntax of partial cedents.

There are also very rich possibilities to define a set ${\cal B}(A)$ of basic Boolean attributes to be derived from a given attribute $A$ . The most important are subsets of categories and sequences of categories of an ordinal attribute $A$ . A definition of a set of relevant cedents – Boolean characteristics of a set ${\cal G}$ of groups $G_{1},\dots,G_{u}$ of attributes consists of definitions of sets of relevant partial cedents corresponding to groups $G_{1},\dots,G_{u}$ . Minimal and maximal lengths of a cedent can be defined. Several examples of such definitions are in Section 5, see also [41, 39].

Examples of 4ft-quantifiers are in Table 1; additional 4ft-quantifiers are in [15, 38]. It is important that a relevant 4ft-quantifier can be defined as a conjunction of several simple 4ft-quantifiers. A detailed description of the 4ft-quantifiers implemented in the 4ft-Miner procedure is beyond the scope of this paper and is available in [46]. There are four ways to deal with missing information: secured completion, optimistic completion, deleting and ignoring of missing information. Descriptions of the additional parameters specifying details of the output are also beyond the scope of this paper.

The 4ft-Miner procedure generates and verifies all GUHA association rules $\varphi\approx\psi$ such that $\varphi\in\Phi$ , $\psi\in\Psi$ and $\varphi$ and $\psi$ have no common attributes. Output of the ASSOC procedure consists of all rules $\varphi\approx\psi$ true in an analysed data matrix and satisfying the additional parameters. There are various examples of the 4ft-Miner input parameters and applications, see namely [38, 40, 41, 44, 45].

3.3.3 Principles of ASSOC implementation

Dealing with general Boolean attributes and with data with incomplete information, together with the need to compute 4ft-tables and nine-fold tables quickly, led to the use of bitstrings, as already described in [36], see also [41]. Each attribute $A$ with categories $c_{1},\dots,c_{u}$ is represented by $u+1$ strings of $n$ bits ${\cal C}(A(c_{1})),\dots,{\cal C}(A(c_{u})),{\cal C}(A(X))$ , where $n$ is the number of rows of the analysed data matrix $\cal M$ . There is $1$ in the $i$-th bit of ${\cal C}(A(c_{j}))$ if and only if $A[o_{i}]=c_{j}$ , where $o_{i}$ is the $i$-th row of $\cal M$ , $i=1,\dots,n$ , $j=1,\dots,u$ ; analogously for missing information $X$ . The string ${\cal C}(A(c_{j}))$ is called the card of the category $c_{j}$ of $A$ for $j=1,\dots,u$ and ${\cal C}(A(X))$ is called the X-card of $A$ . Examples of such cards for an attribute $A_{1}$ with categories $1,2,3$ in a data matrix $\cal M$ with missing information are in Fig. 11.

Figure 11.

Cards of categories and X-card of the attribute $A_{1}$ .

Cards of Boolean attributes $\varphi$ and $\psi$ are used to compute frequencies from 4ft-tables and nine-fold tables. The card of the Boolean attribute $\varphi$ is denoted by ${\cal C}(\varphi)$ . The card ${\cal C}(\varphi)$ is a string of bits that is analogous to a card of a category. Each row of the data matrix corresponds to one bit of ${\cal C}(\varphi)$ and there is “1” in the $i$-th bit if and only if $\varphi$ is true in row $o_{i}$ . It is evident that ${\cal C}(\varphi\land\psi)={\cal C}(\varphi)\dot{\land}{\cal C}(\psi)$ , ${\cal C}(\varphi\lor\psi)={\cal C}(\varphi)\dot{\lor}{\cal C}(\psi)$ , ${\cal C}(\lnot\varphi)=\dot{\lnot}{\cal C}(\varphi)$ . Here, ${\cal C}(\varphi)\dot{\land}{\cal C}(\psi)$ is the bit-wise conjunction of the bitstrings ${\cal C}(\varphi)$ and ${\cal C}(\psi)$ , analogously for $\dot{\lor}$ and $\dot{\lnot}$ . Moreover, ${\cal C}(A_{1}(1,2))={\cal C}(A_{1}(1))\dot{\lor}{\cal C}(A_{1}(2))$ holds for the basic Boolean attribute $A_{1}(1,2)$ , etc.

It is important that the bit-wise Boolean operations $\dot{\land}$ , $\dot{\lor}$ and $\dot{\lnot}$ are carried out by very fast processor instructions. An optimised algorithm is also used to carry out the bitstring function $Count(\xi)$ , which returns the number of values “1” in the bitstring $\xi$ . This function is used to compute the frequencies of the 4ft-table $4ft(\varphi,\psi,{\cal M})=\langle a,b,c,d\rangle$ . We have $a=Count({\cal C}(\varphi)\dot{\land}{\cal C}(\psi))$ , $b=Count({\cal C}(\varphi))-a$ , $c=Count({\cal C}(\psi))-a$ , $d=n-a-b-c$ , where $n$ is the total number of rows in the data matrix ${\cal M}$ .
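
The card representation and the computation of $\langle a,b,c,d\rangle$ can be illustrated by a minimal Python sketch. It is not the 4ft-Miner implementation: arbitrary-precision integers stand in for bitstrings, missing values are encoded as `None`, and the function names and the small data columns are hypothetical. The X-cards are built but not used in the 4ft-table, so the nine-fold table is outside the scope of this sketch.

```python
def cards(column, categories):
    """One bitstring (Python int) per category plus an X-card for missing values.

    Bit i is set in the card of category c iff row i has value c;
    bit i is set in the X-card iff the value in row i is missing (None).
    """
    result = {c: 0 for c in categories}
    result["X"] = 0
    for i, value in enumerate(column):
        key = "X" if value is None else value
        result[key] |= 1 << i
    return result


def count(bits):
    """Number of 1s in a bitstring (the Count function of the text)."""
    return bin(bits).count("1")


def four_fold(card_phi, card_psi, n):
    """4ft-table <a, b, c, d> computed from two cards, as in the formulas above."""
    a = count(card_phi & card_psi)
    b = count(card_phi) - a
    c = count(card_psi) - a
    d = n - a - b - c
    return a, b, c, d


# Hypothetical columns with 8 rows; None marks missing information.
A1 = [1, 2, 2, 3, None, 1, 3, 3]
A2 = [1, 1, 2, 2, 2, None, 1, 3]
cards_a1 = cards(A1, [1, 2, 3])
cards_a2 = cards(A2, [1, 2, 3])

card_phi = cards_a1[1] | cards_a1[2]   # card of A1(1,2) as a bit-wise disjunction
card_psi = cards_a2[2]                 # card of A2(2)
print(four_fold(card_phi, card_psi, len(A1)))
```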

The relevant partial cedents are generated in a depth-first way. This can be outlined by a toy example concerning a set of relevant partial cedents created from attributes $A, B$ , each with categories $1,2,3$ . We assume that only basic Boolean attributes $A(\alpha)$ , $B(\beta)$ , where $\alpha,\beta\subsetneq\{1,2,3\}$ , can be used and that we are interested in conjunctions. Then the partial cedents are generated in the following order: $A(1)$ , $A(1)\land B(1)$ , $A(1)\land B(1,2)$ , $A(1)\land B(1,3)$ , $A(1)\land B(2)$ , $A(1)\land B(2,3)$ , $A(1)\land B(3)$ , $A(1,2)$ , …, $A(1,2)\land B(3)$ , $A(1,3)$ , …, $A(1,3)\land B(3)$ , $A(2)$ , …, $A(2)\land B(3)$ , $A(2,3)$ , …, $A(2,3)\land B(3)$ , $A(3)$ , …, $A(3)\land B(3)$ , $B(1)$ , $B(1,2)$ , $B(1,3)$ , $B(2)$ , $B(2,3)$ , $B(3)$ .
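
The order of generation in the toy example can be reproduced by a short recursive sketch in Python. It only illustrates the depth-first order; it is not the 4ft-Miner generator, the function names are hypothetical, and coefficients are assumed to be non-empty proper subsets of the categories, as in the list above.

```python
from itertools import combinations


def coefficients(categories):
    """Non-empty proper subsets of the categories in lexicographic order."""
    subsets = []
    for size in range(1, len(categories)):
        subsets.extend(combinations(categories, size))
    return sorted(subsets)


def literal(name, subset):
    """Textual form of a basic Boolean attribute, e.g. A(1,2)."""
    return f"{name}({','.join(map(str, subset))})"


def depth_first(attributes, categories, prefix=(), start=0):
    """Yield conjunctions so that each cedent is followed by its own extensions."""
    for idx in range(start, len(attributes)):
        for subset in coefficients(categories):
            cedent = prefix + (literal(attributes[idx], subset),)
            yield " & ".join(cedent)
            # descend before moving to the next coefficient (depth-first)
            yield from depth_first(attributes, categories, cedent, idx + 1)


generated = list(depth_first(["A", "B"], [1, 2, 3]))
print(len(generated))
print(generated[:8])
```

For two attributes with three categories each, this gives 6 cedents of the form $A(\alpha)$ , 36 conjunctions $A(\alpha)\land B(\beta)$ and 6 cedents of the form $B(\beta)$ .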

A more detailed description of the 4ft-Miner algorithm and implementation is beyond the scope of this paper. The basic ideas come from [36]; for more details see [41]. Let us note that there is an analogy to the important fact used in the apriori algorithm, i.e. if $U, V$ are itemsets satisfying $supp(U)<minS$ and $U\subsetneq V$ , then $supp(V)<minS$ also holds. The ASSOC procedure as well as the 4ft-Miner procedure use an analogous optimisation criterion. Let $Fr(\omega)$ denote the number of rows of a data matrix $\cal M$ satisfying a Boolean attribute $\omega$ . If $Fr(\varphi)<B$ for a given threshold $B$ , then also $Fr(\varphi\land\psi)<B$ , where $\varphi$ and $\psi$ are Boolean attributes.
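
The pruning criterion follows from the fact that a bit-wise conjunction can never contain more 1s than either of its operands. A minimal Python sketch, with integers as bitstrings and hypothetical function names, shows how a whole branch of extensions can be skipped once $Fr(\varphi)<B$ :

```python
def count(bits):
    """Number of 1s in a bitstring represented as a Python int."""
    return bin(bits).count("1")


def frequent_extensions(card_phi, candidate_cards, base):
    """Cards of phi & psi reaching the threshold `base`.

    If Fr(phi) < base, no conjunction phi & psi can reach the threshold,
    so the candidates are not examined at all.
    """
    if count(card_phi) < base:
        return []
    extensions = []
    for card_psi in candidate_cards:
        card = card_phi & card_psi
        if count(card) >= base:
            extensions.append(card)
    return extensions


phi = 0b00111                      # Fr(phi) = 3
candidates = [0b00110, 0b11000]    # two hypothetical cards of psi
print(frequent_extensions(phi, candidates, base=2))
print(frequent_extensions(phi, candidates, base=4))
```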

3.3.4 4ft-Miner and apriori

A slightly modified version of the apriori algorithm was implemented in the 4ft-Miner procedure for mining simple association rules. Therefore, we can compare the GUHA approach of depth-first walking through the searched space of potentially interesting relevant patterns with the breadth-first walking introduced in [3]. The format of association rules mined by the apriori algorithm is simpler than the GUHA format, so the apriori algorithm can be used only when all of the following conditions hold:

• Handling of missing information is set to ignore,
• Coefficients of attributes consist of exactly one category,
• No additional options to define the set of relevant rules are used (see Section 5.3.1),
• Only the logical operation of conjunction is used between Boolean attributes and no negation of Boolean attributes is possible, and
• The minimal support criterion is included in the task quantifiers.

The already implemented memory representation of data by bitstrings was used to quickly compute frequencies of Boolean attributes and their conjunctions. A sorted working list of bitstrings is initialised by inserting all cards of categories fulfilling the minimal support criterion. The bitstrings are sorted in descending order by number of 1’s in each of them. While the working list is not empty, the most frequent card at the top of the list is repeatedly removed from the list and used:

• To prepare an antecedent corresponding to the currently processed bitstring (i.e. either a single literal, or a conjunction of two or more literals). This antecedent is combined progressively with every possible succedent (apriori allows for single literals in succedent only). All rules true in data are written to the output;
• To create bitstrings for all possible longer conjunctions consisting of the currently processed bitstring together with all the more frequent bitstrings representing one category and still in the working list. Newly-created bitstrings fulfilling the minimal support criterion are inserted at a proper position in the list to maintain sorting by frequencies.

To compensate for the large memory footprint of using bitstrings to compute frequencies of itemsets, a packed bitstring variant was implemented to represent sparse values in large datasets (more than 10 million rows). The apriori implementation in LISp-Miner allows specifying not only the maximal length of the antecedent but also its minimal length. Moreover, it is used separately for each cedent (in the antecedent, the succedent and/or the condition, see Section 5.3.2), so even conditional association rules can be generated. Both approaches can also be used simultaneously to generate a single association rule (the GUHA approach to generate the part of the rule with a richer syntax and apriori to generate the simple part).

To preserve other LISp-Miner functionalities (e.g. partitioning the searched space of a task into smaller pieces to be processed in parallel or the pruning of results from logically derived rules), some memory structures had to be maintained even if the apriori branch is taken. This makes this implementation of apriori a little slower, see Section 4.3.

4. Comparing apriori to GUHA

The goal of this section is to compare the apriori and GUHA approaches when solving the same tasks. We use two implementations of apriori: the first one is provided by the arules package for R introduced in Section 2 and the second one is implemented in the LISp-Miner system (specifically in the 4ft-Miner procedure) as described in Section 3.3.4. The GUHA approach is introduced in Sections 3.3.2 and 3.3.3.

Principles of the comparison are in Section 4.1. We use the Adult data set, also used in [12]; a brief introduction is in Section 4.2. Solution times and the numbers of found rules for several association rule discovery tasks are presented in Section 4.3 to demonstrate the scalability of all considered approaches to mining association rules. The results show remarkable differences between apriori and GUHA when dealing with missing information. Comments on this fact are in Section 4.4. An important characteristic of a tool for data mining with association rules is its ability to deal with additional measures of the interestingness of rules. The obligatory use of both the minimal support and the minimal confidence in the arules package can lead to a loss of interesting rules, see Section 4.5.

4.1 Principles of comparison

To obtain meaningful results, we need to compare only the algorithm cores, excluding unrelated steps (i.e. data preparation steps or user-interface related steps like progress reporting) and excluding time spent communicating with other systems (e.g. a DBMS). Thus, the data preprocessing phase was completed in advance both in the R console and in the LISp-Miner environment. Both compared systems support logging with timestamps, which gives the most precise times. Both 32-bit and 64-bit versions are available in both systems. We compare the 32-bit versions.

We compare solution times of the apriori function in the R console with solution times of command-line modules in LISp-Miner with all progress reporting and other user-interface features disabled. LISp-Miner offers both single-thread processing of tasks (by the LM TaskPooler module) and multi-thread parallel processing (by the LM SamePooler module). We provide both solution times.

Both systems have options to store results into a text file. Furthermore, LISp-Miner could store them into any DBMS accessible through ODBC. Anyway, storing of results was excluded from solution times because its duration is a pure function of HDD speed (in the first case) or the speed of the chosen DBMS and its ODBC database driver (in the second case).

4.2 Benchmark Adult data

We use the Adult data matrix to compare the core algorithms in terms of speed. This data matrix is also used in [12]. The Adult data matrix is a result of transformations of an original data matrix from the UCI machine learning repository. The Adult data matrix has 48 842 rows and 15 columns, i.e. attributes. We use nine categorical attributes and four metric attributes. Four metric attributes were transformed to ordinal attributes, see Table 2. Seven categorical attributes are nominal, see Table 3. Two categorical attributes are ordinal, see Table 4. If an attribute has missing values, then the number of missings is given in the corresponding row.

Table 2
Ordinal attributes created from metric attributes

	Attribute	Categories/frequencies
1	Age	Young $\langle 15;25\rangle$ /9 627, Middle-aged $\langle 26;45\rangle$ /24 671, Senior $\langle 46;65\rangle$ /12 741, Old $\langle 66;99\rangle$ /1 803
2	Hours_per_week	Part-time $\langle 1;25\rangle$ /5 913, Full-time $\langle 26;40\rangle$ /28 577, Over-time $\langle 41;60\rangle$ /12 676,
		Workaholic $\langle 61;168\rangle$ /1 676
3	Capital_gain	None/44 807, Low $\langle 1;7268\rangle$ /2 345, High $\langle 7269;99999\rangle$ /1 690
4	Capital_loss	None/46 560, Low $\langle 1;1887\rangle$ /1 166, High $\langle 1888;4500\rangle$ /1 116

Table 3

Nominal attributes – columns of the Adult data matrix

	Attribute	Categories/frequencies
1	Sex	Female/16 192, Male/32 650
2	Marital_status	Divorced/6 633, Married-AF-spouse/37, Married-civ-spouse/22 379,
	7 categories	Married-spouse-absent/628, Never-married/16 117, Separated/1 530, Widowed/1 518
3	Relationship	Husband/19 716, Not-in-family/12 583, Other-relative/1 506, Own-child/7 581,
	6 categories	Unmarried/5 125, Wife/2 331
4	Workclass	Federal-gov/1 432, Local-gov/3 136, Without-pay/21, Never-worked/10, Private/33 906,
	8 categories	Self-emp-inc/1 695, Self-emp-not-inc/3 862, State-gov/1 981
	missings: 2 799
5	Occupation	Adm-clerical/5 611, Craft-repair/6 112, Exec-managerial/6 086, Machine-op-inspct/3 022,
	14 categories	Other-service/4 923, Prof-specialty/6 172, Sales/5 504,
	missings: 2 809	$+$ 7 occupations with frequency $<$ 3 000, their total frequency $=$ 8 603
6	Race	Amer-Indian-Eskimo/470, Asian-Pac-Islander/1 519, Black/4 685, Other/406,
	5 categories	White/41 762
7	Native_country	United-States/43 832, Mexico/951, Philippines/295, Germany/206, Puerto-Rico/184,
	41 categories	Canada/182, El-Salvador/155, India/151, $+$ 33 countries with frequency $<$ 150,
	missings: 857	their total frequency $=$ 1 172

Table 4

Ordinal attributes – columns of the Adult data matrix

	Attribute	Categories/frequencies
1	Education	Preschool/83, 1st-4th/247, 5th-6th/509, 7th-8th/955
	16 categories	9th/756, 10th/1 389, 11th/1 812, 12th/657, HS-grad/15 784, Some-college/10 878, Assoc-voc/2 061,
		Assoc-acdm/1 601, Bachelors/8 025, Masters/2 657, Prof-school/834, Doctorate/594
2	Income	Small/24 720, Large/7 841
	missings: 16 281

4.3 Comparison of results and solution times

The testing platform was an HP ProDesk with an Intel i5-4590S (4 cores at 3 GHz), 4 GB RAM and a Toshiba 500 GB DT01ACA050 HDD, running Windows 7 Professional x64 with SP1. We used version 3.2.3 of the R system and version 27.00.01 of the LISp-Miner system.

Results of the comparison are in Table 5. Solution times and the numbers of found association rules are presented for the apriori approach implemented in the arules package, for the apriori implemented in the 4ft-Miner and for the GUHA approach with secured completion of missing information implemented in the 4ft-Miner. There are two columns for each of the 4ft-Miner implementations: the first one, marked ST, is for a single-thread solution using one processor and the second one, marked MT, is for a multi-thread parallel solution using all available processor cores.

Table 5
Comparing R-arules and 4ft-Miner in the Adult data matrix

Parameters		Apriori				Secured		
$minS$	$minC$	R arules time [s]	4ft-Miner ST time [s]	4ft-Miner MT time [s]	number of rules	number of rules	4ft-Miner ST time [s]	4ft-Miner MT time [s]
A	B	C	D	E	F	G	H	I
0.96	0.9	0.2	0.2	0.7	0	0	0.2	0.7
0.95	0.9	0.2	0.2	0.7	1	1	0.3	0.7
0.9	0.9	0.2	0.2	0.7	2	2	0.3	0.7
0.8	0.9	0.2	0.2	0.7	7	7	0.2	0.7
0.7	0.9	0.2	0.2	0.7	17	17	0.2	0.7
0.6	0.9	0.2	0.2	0.7	26	26	0.2	0.7
0.5	0.9	0.2	0.2	0.7	52	50	0.2	0.7
0.4	0.9	0.2	0.3	0.7	103	98	0.3	0.7
0.3	0.9	0.2	0.3	0.7	326	309	0.3	0.7
0.2	0.9	0.2	0.3	0.7	845	788	0.6	1.7
0.1	0.9	0.3	1.0	0.9	4 122	3 726	3.5	1.5
0.05	0.9	0.4	3.0	1.6	14 012	11 725	12.1	4.5
0.04	0.9	0.5	4.8	2.5	20 344	16 688	20.4	7.0
0.03	0.9	0.5	8.6	3.5	31 456	25 287	39.1	13.2
0.02	0.9	0.6	16.4	6.3	57 907	45 358	85.6	27.3
0.01	0.9	0.8	42.6	15.9	143 535	107 317	271.0	81.9
0.01	0.6	0.9	42.6	17.2	276 443	240 845	274.0	83.3

Tasks inspired by the arules application described in [12] are used. This means that the Adult data matrix is analysed and association rules with antecedents of length from 1 to 10 are mined. Remember that we use 13 columns of the Adult, i.e. attributes introduced in Tables 2–4.

There are 17 variants of minimal support minS and minimal confidence minC used, see columns A and B. The rest of Table 5 presents solution times in seconds and the number of found rules for each combination of minS and minC and of a particular algorithm used. Solution times for apriori implementation in the arules package for R are in column C, solution times for 4ft-Miner apriori implementation are in column D (single-thread task solving) and in column E (multi-thread parallel task solving). The number of found rules is in column F (this number is the same for both apriori implementations). The number of found rules with secured completion used is in column G and is significantly different from column F, see Section 4.4. Solution times for the GUHA approach with secured completion are in column H (single-thread task solving) and in column I (multi-thread parallel task solving).

We can see that apriori implemented in the arules package (column C) is faster than apriori implemented in the 4ft-Miner (columns D and E). The higher the number of output rules, the larger the difference between the two implementations. However, the time used by the 4ft-Miner is not much longer, and it is only a fraction of the whole time necessary to solve a real task.

Regarding solution times for the apriori algorithm in columns D and E, we have to stress that the implementation in the LISp-Miner system is meant as add-on functionality provided just for the few cases when a task description is suitable for apriori (i.e. really simple, without missing information and not using any of the rich-syntax possibilities the GUHA approach offers). It was implemented at the level of partial cedents only, so no existing functionality of the LISp-Miner system is lost. Therefore, some optimisations regarding the whole association rule could not be used and some time is spent on preparing additional memory structures. This results in a time overhead compared to the apriori implementation in the arules package.

Solution times for the secured completion are longer because the GUHA approach of depth-first walking has to be used and, at the same time, it is necessary to maintain additional information to compute nine-fold tables. Thus, solution times in columns H and I are remarkably higher than in columns C, D and E. However, we state that the arules results are misleading in the case of ignoring missing information, see Section 4.4. The time consumed by the secured completion is still acceptable, and it is only a fraction of the time necessary to solve the whole analytical task concerning real data, which usually does contain missing information. Moreover, the GUHA approach makes it possible to use not only the secured completion but any of the available types of handling of missing information (see Section 3.2.2), as well as to deal with general Boolean attributes and with conditional association rules. These additional features can hardly be realised by apriori, see Section 5.

The secured completion of missing information seems to be useful even when solving a task that can be solved by the apriori algorithm. This is because some rules produced by arules can be misleading. Last but not least, the secured completion reduces the usually very high number of output rules to only those that are certainly true in any completion of the analysed data. In the last three rows of Table 5, we can see that the number of rules decreases by 13–25 percent. See Section 4.4 for more details.

4.4 Missing information

We can see a big difference between apriori and secured completion of missing information in columns F and G of Table 5. There are many extra rules produced by the apriori algorithm. Each of these extra rules has a problem: there is a completion of the analysed data matrix Adult in which the rule is not valid. Let us consider $minS=0.5$ and $minC=0.9$ . One of the problematic rules is the rule $\text{{\em Capital\_loss}(None)}\land\text{{\em Native\_country}(United-States)}\land\text{{\em Sex}(Male)}\rightarrow\text{{\em Race}(White)}\text{.}$ We denote Sex(Male) $\land$ Capital_loss(None) $\land$ Native_country(United-States) as Ant and Race(White) as Suc. A nine-fold table $9ft$ (Ant, Suc, Adult) for the rule in question as well as a 4ft-table $4ft\_Ignoring$ resulting from the arules approach are in Fig. 12. Both tables are included in a comprehensive protocol produced by the 4ft-Miner procedure for each output rule.

Figure 12.

Tables $9ft$ (Ant, Suc, Adult) and $4ft\_Ignoring$ .

A secured four-fold table $4ft\_Secured$ for the rule in question is defined as $4ft\_Secured=\langle 24976,2675+181,16390+396,4224\rangle$ , see Section 3.2.1. Let us note that the frequency 181 in the last column of $9ft$ (Ant, Suc, Adult) in Fig. 12 can generate 182 4ft-tables $\langle 24976,2675+i,16786,4224+181-i\rangle$ , where $0\leqslant i\leqslant 181$ . These tables belong to at least 182 mutually distinct completions of the Adult data matrix. For the rule to be true, we need $\frac{24976}{48842}\geqslant 0.5\land\frac{24976}{24976+2675+i}\geqslant 0.9$ , which requires $0\leqslant i\leqslant 100$ . Thus, the values $101\leqslant i\leqslant 181$ give at least 81 mutually distinct completions of the Adult data matrix in which the rule in question is false. In addition, let us note that the confidence of the rule $\text{{\em Capital\_loss}(None)}\land\text{{\em Native\_country}(United-States)}\land\text{{\em Sex}(Male)}\rightarrow\text{{\em Race}(White)}$ in the secured completion is $\frac{24976}{24976+2675+181}=0.897<0.9$ and, thus, this rule is not part of the output of any run of the 4ft-Miner with secured completion presented in Table 5.
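
The counting argument can be checked with a few lines of Python. The sketch below uses only the frequencies quoted in this paragraph; the variable and function names are hypothetical, and exact fractions are used to avoid rounding issues at the thresholds.

```python
from fractions import Fraction

N = 48842          # rows of the Adult data matrix
A = 24976          # rows satisfying both Ant and Suc
B_KNOWN = 2675     # rows known to satisfy Ant and not Suc
UNDECIDED = 181    # rows that a completion may move to b or to d

MIN_S = Fraction(1, 2)
MIN_C = Fraction(9, 10)


def rule_is_true(i):
    """Rule truth in a completion where i undecided rows fall into b."""
    supp = Fraction(A, N)
    conf = Fraction(A, A + B_KNOWN + i)
    return supp >= MIN_S and conf >= MIN_C


true_for = [i for i in range(UNDECIDED + 1) if rule_is_true(i)]
false_for = [i for i in range(UNDECIDED + 1) if not rule_is_true(i)]
secured_conf = Fraction(A, A + B_KNOWN + UNDECIDED)

print(max(true_for), len(false_for), round(float(secured_conf), 3))
```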

In Table 5, we deal with the 4ft-quantifier $\rightarrow_{p,s}$ (confidence-support) introduced in Table 1. However, if we deal with the lift measure, the problem with missing values is even worse. Let us use the arules instructions

    > rules <- apriori(Adult, parameter = list(support = 0.01, confidence = 0.6))
    > Lift4 <- subset(rules, subset = rhs %in% "income=large" & lift >= 4)

This results in a set Lift4 of 88 rules with lift in the range $\langle 4.117980;4.266398\rangle$ , and the instruction

    > inspect(head(sort(Lift4, by = "lift"), n = 1))

outputs the rule shown in Fig. 13.

Figure 13.

The rule with highest lift in the set Lift4 of rules.

However, if we use the secured approach for the lift, there is no rule with lift $\geqslant 4$ and the lift is in the range $\langle 1.343493;1.396369\rangle$ . Let us denote the rule in Fig. 13 as $AntL\rightarrow SucL$ . Using the 4ft-Miner, we can get the nine-fold table $9ft$ (AntL, SucL, Adult) and the secured 4ft-table $\langle a_{s},b_{s},c_{s},d_{s}\rangle$ for $9ft$ (AntL, SucL, Adult) and the 4ft-quantifier $\sim_{q,s}$ of supported lift, see Fig. 14.

Figure 14.

Tables $9ft$ (AntL, SucL, Adult) and $\langle a_{s},b_{s},c_{s},d_{s}\rangle$ .

We have $\langle a_{s},b_{s},c_{s},d_{s}\rangle=\langle 763,8+343+11,7054+24+15927,24712\rangle$ , see Section 3.2.1; i.e. $\langle a_{s},b_{s},c_{s},d_{s}\rangle=\langle 763,362,23005,24712\rangle$ . This means that for each completion of the Adult data matrix, the $lift$ of the rule $AntL\rightarrow SucL$ satisfies $lift\geqslant\frac{763(763+362+23005+24712)}{(763+362)(763+23005)}=1.393711$ . It is easy to show that the maximal value of lift among all completions of the Adult data matrix is equal to $\frac{(763+343+24+11)(763+343+8+24+11+7054+15927+24712)}{(763+343+24+11+8)(763+343+24+11+7054)}=5.918479$ . This means that the lift of the rule $AntL\rightarrow SucL$ ranges from 1.393711 to 5.918479 if we consider all possible completions of the Adult data matrix. Thus, the value 4.266398 reported by arules as the lift of the rule in question can be considered confusing.
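
The lift bounds can be recomputed directly from the frequencies quoted above. The Python sketch below is only a numerical check. The secured and the lift-maximising tables are taken from the formulas in the text. The table for the ignoring approach is an assumption of this sketch: it treats the 343 rows with $AntL$ true and $SucL$ missing as not satisfying $SucL$ , and the 24 rows with $AntL$ undecided and $SucL$ true as not satisfying $AntL$ ; under this reading it reproduces the arules value of about 4.2664.

```python
def lift(a, b, c, d):
    """lift = a * n / ((a + b) * (a + c)) with n = a + b + c + d."""
    return a * (a + b + c + d) / ((a + b) * (a + c))


N = 48842  # rows of the Adult data matrix

# Secured 4ft-table as given in the text.
secured = (763, 8 + 343 + 11, 7054 + 24 + 15927, 24712)

# Completion maximising lift: all undecided rows that can support the rule do so;
# d is inferred here as the remainder of the N rows.
a_max, b_max, c_max = 763 + 343 + 24 + 11, 8, 7054
optimistic = (a_max, b_max, c_max, N - a_max - b_max - c_max)

# Ignoring missing information (assumed reading, see the lead-in).
a_ign, b_ign, c_ign = 763, 8 + 343, 7054 + 24
ignoring = (a_ign, b_ign, c_ign, N - a_ign - b_ign - c_ign)

for name, table in (("secured", secured), ("maximal", optimistic), ("ignoring", ignoring)):
    print(name, round(lift(*table), 6))
```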

We can conclude that:

• The association rule mining process usually suffers from the problem of a large number of output rules,
• The ignoring-of-missing-information approach used in the arules package produces rules that do not satisfy the constraints (i.e. are false) in some completions of the analysed data matrix; the number of such rules can be relatively large,
• A value of lift of a rule produced by the arules package for a data matrix with missing information can be confusing, and
• The secured completion produces a lower number of rules, all of which certainly satisfy the constraints (i.e. are true) in all possible completions of the analysed data matrix.

A more detailed description of dealing with missing information in the 4ft-Miner procedure is beyond the scope of this paper; for details see in particular [15, 38].

4.5 Loss of some interesting rules

We present an example of the loss of rules that are interesting because of a high value of the lift measure. The loss is related to the obligatory use of minimal values of both support and confidence in the arules package. The following instructions for R

    > rules <- apriori(Adult, parameter = list(support = 0.05, confidence = 0.9))
    > inspect(head(sort(rules, by = "lift"), n = 1))

result in 14 012 rules; the highest lift is 2.956264, and it belongs to the rule age=Young, relationship=Own-child, sex=Male, capital-loss=none, native-country=United-States => marital-status=Never-married , see Fig. 15. However, if we apply the 4ft-Miner procedure to search for all rules satisfying $support\geqslant 0.05\land lift\geqslant 2.96$ , even with secured completion of missing information, we get 261 rules. Details concerning the rule with the highest lift are in Fig. 16. The confidence of this rule is $\frac{2449}{2449+477}=0.84$ . An overview of all rules with $lift\geqslant 2.96$ not produced by the apriori call above (support = 0.05, confidence = 0.9) is in Table 6. A workaround in arules would be to set confidence to a very low artificial value, but this would result in an explosion of the number of mined rules and in problems with processing such a large set later.

Table 6
Rules with $support\geqslant 0.05$ , $lift\geqslant 2.96$ and $confidence<0.9$

Confidence	Number of rules	Lift min	Lift max
$(0.85,0.90)$	0	–	–
$(0.80,0.85\rangle$	7	4.189	4.246
$(0.75,0.80\rangle$	40	3.814	3.955
$(0.70,0.75\rangle$	44	3.555	3.724
$(0.65,0.70\rangle$	41	3.388	3.553
$(0.60,0.65\rangle$	49	3.109	4.175
$(0.55,0.60\rangle$	37	3.599	3.852
$(0.50,0.55\rangle$	34	3.236	3.542
$(0.45,0.50\rangle$	9	2.963	3.168

Figure 15.

The highest lift for apriori; support $=$ 0.05, confidence $=$ 0.9.

Figure 16.

The rule with the highest $lift$ for $support\geqslant 0.05$ .

We can conclude that the obligatory use of minimal values of both support and confidence in the arules package can lead to a remarkable loss of rules that are interesting for their high values of lift. It is practically impossible to estimate the degree of such loss. The same holds for additional measures of interestingness. This problem does not occur in the 4ft-Miner procedure, where conditions concerning various measures of interestingness can be freely combined.
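
The figures quoted in this section are mutually consistent, which a short Python check makes explicit. The sketch only re-adds the counts from Table 6 and recomputes the confidence quoted for the rule in Fig. 16; the variable names are hypothetical.

```python
# (confidence interval, number of rules) as listed in Table 6
table6 = [
    ("(0.85,0.90)", 0),
    ("(0.80,0.85>", 7),
    ("(0.75,0.80>", 40),
    ("(0.70,0.75>", 44),
    ("(0.65,0.70>", 41),
    ("(0.60,0.65>", 49),
    ("(0.55,0.60>", 37),
    ("(0.50,0.55>", 34),
    ("(0.45,0.50>", 9),
]

# All rules in Table 6 have confidence below the arules threshold of 0.9.
lost_rules = sum(number for _, number in table6)

# Confidence of the rule with the highest lift, from the frequencies in the text.
best_rule_confidence = 2449 / (2449 + 477)

print(lost_rules, round(best_rule_confidence, 2))
```

The total equals the 261 rules found by the 4ft-Miner, i.e. under these settings every rule with $support\geqslant 0.05$ and $lift\geqslant 2.96$ has confidence below 0.9.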

5. 4ft-Miner, LISp-Miner and association rules

The goal of this section is to present 4ft-Miner specific features as well as to point to possibilities of the LISp-Miner system to deal with association rules. An example of 4ft-Miner application, which cannot be practically solved by the arules package, is located in Section 5.1. This example is based on dealing with basic Boolean attribute $A(\alpha)$ , where $\alpha$ is a subset with at least two categories of the attribute $A$ . Additional examples of reasonable applications of the 4ft-Miner procedure, which cannot be easily substituted by the arules package, are in Section 5.2. These examples are based on conjunctions and disjunctions of basic Boolean attributes in succedents (i.e. consequents) of rules. Several additional possibilities of the LISp-Miner system to deal with the GUHA association rules are briefly introduced in Section 5.3.

5.1 Applying general basic Boolean attributes

First we show that the 4ft-Miner procedure can be used to search for segments of clients with a high chance to have a maximal gain. This is done in Section 5.1.1. In Section 5.1.2, we show that it is very difficult to solve this task by the arules package in the same way.

5.1.1 Segments of clients with extreme values of gain

An example of input parameters of the 4ft-Miner is in Fig. 17.

Figure 17.

4ft-Miner input example.

These parameters mean that we search for all rules $\varphi\approx_{6,118}\psi$ true in the Adult data matrix such that $\varphi\in\Phi$ and $\psi\in\Psi$ . Here, $\Phi$ is a set of Boolean attributes specified in the column ANTECEDENT in Fig. 17 and $\Psi$ is a set of Boolean attributes specified in the column SUCCEDENT. Remember that $\approx_{6,118}$ is the 4ft-quantifier of supported lift, see Table 1. An association rule $\varphi\approx_{6,118}\psi$ is true in the Adult data matrix if it holds $\frac{a(a+b+c+d)}{(a+b)(a+c)}\geqslant 6\land a\geqslant 118$ , where $\langle a,b,c,d\rangle=4ft(\varphi,\psi,\text{\em Adult})$ . The 4ft-quantifier $\approx_{6,118}$ is specified in the column QUANTIFIERS by rows Base p=118 Abs. and AAD p=5.000; it holds $Lift=\text{\tt AAD}+1$ .
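
The truth condition of the supported lift quantifier is simple to state in code. The following Python sketch assumes only the definition above; the function names and the three 4ft-tables are hypothetical and serve as illustration, not as results from the Adult data matrix.

```python
def lift(a, b, c, d):
    """lift = a * (a + b + c + d) / ((a + b) * (a + c))."""
    return a * (a + b + c + d) / ((a + b) * (a + c))


def aad(a, b, c, d):
    """The AAD parameter of the text, using Lift = AAD + 1."""
    return lift(a, b, c, d) - 1


def supported_lift(a, b, c, d, min_lift=6, base=118):
    """True iff lift >= min_lift and a >= base.

    The lift condition is tested in multiplied-out form to stay in integers.
    """
    n = a + b + c + d
    return a * n >= min_lift * (a + b) * (a + c) and a >= base


# Hypothetical 4ft-tables, each summing to 48 842 rows.
print(supported_lift(120, 80, 400, 48242))     # high lift, a >= 118
print(supported_lift(100, 20, 100, 48622))     # high lift, but a < 118
print(supported_lift(200, 1800, 3000, 43842))  # a >= 118, but lift < 6
```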

The set $\Phi$ is specified in the column ANTECEDENT. The row Client Con, 1-10 means that each $\varphi\in\Phi$ is a conjunction of 1–10 basic Boolean attributes specified in the remaining rows of the column ANTECEDENT. The row Age_ed5(seq), 1-4 B,pos specifies a set ${\cal B}_{1}={\cal B}(\text{\em Age\_ed5})$ of basic Boolean attributes generated from the attribute Age_ed5. The sets ${\cal B}_{2}={\cal B}(\text{\em Education})$ , …, ${\cal B}_{10}={\cal B}(\text{\em Workclass})$ of additional basic Boolean attributes are specified similarly. It holds $\varphi=\varphi_{i_{1}}\land\dots\land\varphi_{i_{k}}$ , where $1\leqslant k\leqslant 10$ , $\varphi_{i_{j}}\in{\cal B}_{i_{j}}$ for $j=1,\dots,k$ , and $i_{1},\dots,i_{k}$ are mutually distinct. The set $\Psi$ is specified in the column SUCCEDENT as a set ${\cal B}(\text{\em Capital\_gain\_exp})$ .

The attribute Age_ed5 was created by means of the LISp-Miner as a discretisation of the attribute Age into equidistant intervals of length 5. It has 15 categories $\langle 15;20\rangle$ , $(20;25\rangle$ , …, $(85;90\rangle$ , see Fig. 18. The row Age_ed5(seq), 1-4 B,pos specifies the set ${\cal B}_{1}={\cal B}(\text{\em Age\_ed5})$ as a set of all basic Boolean attributes $\text{\em Age\_ed5}(\alpha)$ , where $\alpha$ is a sequence of 1–4 consecutive categories of the attribute Age_ed5. This means that ${\cal B}_{1}={\cal B}(\text{\em Age\_ed5})$ consists of basic Boolean attributes $\text{\em Age\_ed5}\langle 15;20\rangle$ , …, $\text{\em Age\_ed5}\langle 85;90\rangle$ , $\text{\em Age\_ed5}\langle 15;25\rangle$ , …, $\text{\em Age\_ed5}\langle 80;90\rangle$ , $\text{\em Age\_ed5}\langle 15;30\rangle$ , …, $\text{\em Age\_ed5}\langle 75;90\rangle$ , $\text{\em Age\_ed5}\langle 15;35\rangle$ , …, $\text{\em Age\_ed5}\langle 70;90\rangle$ , where we write $\text{\em Age\_ed5}\langle 15;20\rangle$ instead of $\text{\em Age\_ed5}(\langle 15;20\rangle)$ and $\text{\em Age\_ed5}\langle 15;25\rangle$ instead of $\text{\em Age\_ed5}(\langle 15;20\rangle,(20;25\rangle)$ , etc. There are 15 $+$ 14 $+$ 13 $+$ 12 $=$ 54 basic Boolean attributes $\text{\em Age\_ed5}(\alpha)$ in the set ${\cal B}_{1}={\cal B}(\text{\em Age\_ed5})$ .

Figure 18.

Categories of the attribute Age_ed5 and their frequencies.

The attribute Education has 16 categories, see Table 4. This means that the row Education(seq),1-4 B,pos specifies 16 $+$ 15 $+$ 14 $+$ 13 $=$ 58 basic Boolean attributes $\text{\em Education}(\alpha)$ in a way similar to the way in which the row Age_ed5(seq), 1-4 B,pos specifies basic Boolean attributes $\text{\em Age\_ed5}(\alpha)$ . Boolean attributes Education(Preschool) and Education(Masters, Prof-school, Doctorate) belong to ${\cal B}(\text{Education})$ .
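
The counts 54 and 58 can be reproduced by enumerating all runs of consecutive categories. This is an illustrative sketch only; the helper name `sequences` and the simplified interval labels are assumptions, and the Education categories are listed in the order given in Table 4.

```python
def sequences(categories, min_len, max_len):
    """All runs of min_len..max_len consecutive categories, as tuples."""
    return [
        tuple(categories[start:start + length])
        for length in range(min_len, max_len + 1)
        for start in range(len(categories) - length + 1)
    ]


# 15 interval labels for Age_ed5 (bracket types simplified for illustration).
age_ed5 = [f"({lo};{lo + 5}>" for lo in range(15, 90, 5)]

education = [
    "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th",
    "12th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm",
    "Bachelors", "Masters", "Prof-school", "Doctorate",
]

print(len(sequences(age_ed5, 1, 4)))    # 15 + 14 + 13 + 12 = 54
print(len(sequences(education, 1, 4)))  # 16 + 15 + 14 + 13 = 58
```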

The attribute Hours_per_week has four categories, see Table 4. The row Hours_per_week(subset),1-1 B,pos specifies the set ${\cal B}_{3}={\cal B}(\text{\em Hours\_per\_week})$ as a set of all basic Boolean attributes Hours_per_week( $\alpha$ ), where $\alpha$ is a subset of 1–1 categories of the set of all categories of the attribute Hours_per_week. Four basic Boolean attributes are generated: Hours_per_week(Part-time), Hours_per_week(Full-time), Hours_per_week(Over-time), Hours_per_week(Workaholic).

Categories of the attribute Capital_gain_exp and their frequencies are in Fig. 19.

Figure 19.

Categories of the attribute Capital_gain_exp.

The row Capital_gain_exp(rcut), 1-8 B,pos in the column SUCCEDENT specifies a set ${\cal B}(\text{\em Capital\_gain\_exp})$ as a set of all basic Boolean attributes $\text{\em Capital\_gain\_exp}(\alpha)$ , where $\alpha$ is a right cut of a length of 1–8 categories. The right cut of the length $k$ is a sequence of $k$ categories such that the last category of the sequence is also the last category of the attribute. This means that eight basic Boolean attributes are generated: $\text{\em Capital\_gain\_exp}\langle 98001;100000\rangle$ , $\text{\em Capital\_gain\_exp}\langle 20001;100000\rangle$ , …, $\text{\em Capital\_gain\_exp}\langle 2001;100000\rangle$ , $\text{\em Capital\_gain\_exp}\langle 1;100000\rangle$ . Let us note that using the right cut option means that we are interested in extreme values of the attribute in question.
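
A right cut can be sketched as a suffix of the ordered list of categories. The code below is an assumption-laden illustration: the function name `right_cuts` and the placeholder labels `c1`, …, `c8` are hypothetical, since the actual categories of Capital_gain_exp are given only in Fig. 19.

```python
def right_cuts(categories, min_len, max_len):
    """Suffixes of the ordered category list with min_len..max_len elements."""
    longest = min(max_len, len(categories))
    return [tuple(categories[-k:]) for k in range(min_len, longest + 1)]


# Placeholder ordered categories; "none" stands for clients without capital gain.
capital_gain_exp = ["none", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]

cuts = right_cuts(capital_gain_exp, 1, 8)
print(len(cuts))   # 8 right cuts
print(cuts[0])     # shortest cut: only the last (most extreme) category
```

Every cut ends with the last category, which is why this option targets extreme values of the attribute.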

The run of the 4ft-Miner with parameters according to Fig. 17 yields 686 rules. We have used secured completion of missing information. Five rules with the highest lift are displayed in a short form in Fig. 20.

The rule with the highest lift, 12.787, is

$\text{{\em Education}(Prof-school, Doctorate)}\land\text{{\em Sex}(Male)}\approx_{12.787,118}\text{\em Capital\_gain\_exp}(\geqslant 20001)\text{.}$

There are details for each rule analogous to those in Fig. 16. Basic information concerning the run of the 4ft-Miner is in the upper left part of Fig. 20. The 4ft-Miner procedure has generated and verified 30 831 256 rules and skipped many more, see Section 3.3.3. The run took 25 minutes and 26 seconds. The minimal confidence in the output set of 686 rules introduced above is 0.048.

5.1.2 Segments of clients with extreme values of gain and arules

The task solved in the previous section can, in theory, also be solved by the R package arules if we accept the ignoring approach to missing information. However, there is still a problem concerning basic Boolean attributes with more than one category, $\text{\em Age\_ed5}(\langle 15;20\rangle,(20;25\rangle)$ and Education(Preschool, 1st-4th, 5th-6th, 7th-8th) being examples. The apriori algorithm deals only with attribute-category Boolean attributes. This means that we have to prepare three families of attributes with suitable categories in advance to substitute sets of basic Boolean attributes specified by the rows Age_ed5(seq), 1-4 B,pos, Education(seq), 1-4 B,pos and Capital_gain_exp(rcut), 1-8 B,pos in Fig. 17.

We can use the attribute Age_ed5 to get all sequences of categories of length 1. Then we need attributes $\text{\em Age\_ed5}_{2,1}$ and $\text{\em Age\_ed5}_{2,2}$ to get all sequences of categories of length 2, attributes $\text{Age\_ed5}_{3,1}$ , $\text{\em Age\_ed5}_{3,2}$ , $\text{\em Age\_ed5}_{3,3}$ , to get all sequences of categories of length 3 and attributes $\text{\em Age\_ed5}_{4,1}$ , $\text{\em Age\_ed5}_{4,2}$ , $\text{\em Age\_ed5}_{4,3}$ , and $\text{\em Age\_ed5}_{4,4}$ to get all sequences of length 4. Altogether, we have a family ${\cal F}(\text{\tt Age\_ed5(seq), 1-4})$ of 10 attributes which can substitute the set of basic Boolean attributes specified by the row Age_ed5(seq), 1-4 B,pos. All these attributes and their categories are listed in Table 7.

Table 7
Family ${\cal F}(\text{\tt Age\_ed5(seq), 1-4})$ of 10 attributes

Attribute	Categories
Age_ed5	$\langle 15;20\rangle$ , $(20;25\rangle$ , $(25;30\rangle$ , …, $(75;80\rangle$ , $(80;85\rangle$ , $(85;90\rangle$
$\text{\em Age\_ed5}_{2,1}$	$\langle 15;25\rangle$ , $(25;35\rangle$ , $(35;45\rangle$ , $(45;55\rangle$ , $(55;65\rangle$ , $(65;75\rangle$ , $(75;85\rangle$
$\text{\em Age\_ed5}_{2,2}$	$(20;30\rangle$ , $(30;40\rangle$ , $(40;50\rangle$ , $(50;60\rangle$ , $(60;70\rangle$ , $(70;80\rangle$ , $(80;90\rangle$
$\text{\em Age\_ed5}_{3,1}$	$\langle 15;30\rangle$ , $(30;45\rangle$ , $(45;60\rangle$ , $(60;75\rangle$ , $(75;90\rangle$
$\text{\em Age\_ed5}_{3,2}$	$(20;35\rangle$ , $(35;50\rangle$ , $(50;65\rangle$ , $(65;80\rangle$
$\text{\em Age\_ed5}_{3,3}$	$(25;40\rangle$ , $(40;55\rangle$ , $(55;70\rangle$ , $(70;85\rangle$
$\text{\em Age\_ed5}_{4,1}$	$\langle 15;35\rangle$ , $(35;55\rangle$ , $(55;75\rangle$
$\text{\em Age\_ed5}_{4,2}$	$(20;40\rangle$ , $(40;60\rangle$ , $(60;80\rangle$
$\text{\em Age\_ed5}_{4,3}$	$(25;45\rangle$ , $(45;65\rangle$ , $(65;85\rangle$
$\text{\em Age\_ed5}_{4,4}$	$(30;50\rangle$ , $(50;70\rangle$ , $(70;90\rangle$
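
The construction of such a family can be sketched as follows: for each run length, one derived attribute is created per starting offset, and its categories are disjoint runs of that length. This is a hypothetical helper (`shifted_binnings`), not part of arules or LISp-Miner, and it represents each interval only by its pair of bounds.

```python
def shifted_binnings(categories, max_len):
    """Map (length, offset) to disjoint merged runs of `length` consecutive bins.

    `categories` is an ordered list of (low, high) bounds; offsets are 1-based.
    """
    family = {}
    for length in range(1, max_len + 1):
        for offset in range(length):
            family[(length, offset + 1)] = [
                (categories[i][0], categories[i + length - 1][1])
                for i in range(offset, len(categories) - length + 1, length)
            ]
    return family


age_bins = [(lo, lo + 5) for lo in range(15, 90, 5)]  # 15 bins of width 5
family = shifted_binnings(age_bins, 4)

print(len(family))                                   # 1 + 2 + 3 + 4 = 10 attributes
print(sum(len(runs) for runs in family.values()))    # 54 categories in total
print(family[(3, 2)])                                # runs of three bins starting at 20
```

The total number of categories over the whole family equals the number of basic Boolean attributes generated by the row Age_ed5(seq), 1-4 B,pos.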

Figure 20.

4ft-Miner output example.

A similar family ${\cal F}(\text{\tt Education(seq), 1-4})$ of 10 attributes, corresponding to the set of basic Boolean attributes specified by the row Education(seq), 1-4 B,pos, is introduced in Table 8. Remember that categories of the attribute Education are Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-school, Doctorate, see Table 4.

Table 8

Family ${\cal F}(\text{\tt Education(seq), 1-4})$ of 10 attributes

Attribute	Categories
Education	Preschool, 1st-4th, …, Masters, Prof-school, Doctorate
$\text{\em Education}_{2,1}$	Preschool#1st-4th, 5th-6th#7th-8th, 9th#10th, …, Bachelors#Masters, Prof-school#Doctorate
$\text{\em Education}_{2,2}$	1st-4th#5th-6th, 7th-8th#9th, 10th#11th, …, Assoc-acdm#Bachelors, Masters#Prof-school
$\text{\em Education}_{3,1}$	Preschool#1st-4th#5th-6th, 7th-8th#9th#10th, …, Bachelors#Masters#Prof-school
$\text{\em Education}_{3,2}$	1st-4th#5th-6th#7th-8th, 9th#10th#11th, …, Masters#Prof-school#Doctorate
$\text{\em Education}_{3,3}$	5th-6th#7th-8th#9th, 10th#11th#12th, …, Assoc-acdm#Bachelors#Masters
$\text{\em Education}_{4,1}$	Preschool#1st-4th#5th-6th#7th-8th, …, Bachelors#Masters#Prof-school#Doctorate
$\text{\em Education}_{4,2}$	1st-4th#5th-6th#7th-8th#9th, …, Some-college#Assoc-voc#Assoc-acdm#Bachelors
$\text{\em Education}_{4,3}$	5th-6th#7th-8th#9th#10th, …, Assoc-voc#Assoc-acdm#Bachelors#Masters
$\text{\em Education}_{4,4}$	7th-8th#9th#10th#11th, …, Assoc-acdm#Bachelors#Masters#Prof-school

The category Preschool#1st-4th of the attribute $\text{\em Education}_{2,1}$ is defined such that the attribute $\text{\em Education}_{2,1}$ has the value Preschool#1st-4th if and only if the attribute Education has the value Preschool or the value 1st-4th. Thus, the basic Boolean attributes $\text{\em Education}_{2,1}(Preschool\#1st-4th)$ and $\text{\em Education}(Preschool,1st-4th)$ are equivalent. Analogous relations are valid for additional attributes and categories introduced in Table 8.

The row Capital_gain_exp(rcut), 1-8 B,pos in the column SUCCEDENT specifies a set of eight basic Boolean attributes $\text{\em Capital\_gain\_exp}\langle 98001;100000\rangle$ , …, $\text{\em Capital\_gain\_exp}\langle 2001;100000\rangle$ , $\text{\em Capital\_gain\_exp}\langle 1;100000\rangle$ . This set of basic Boolean attributes can be substituted by a family ${\cal F}(\text{\tt Capital\_gain\_exp(rcut), 1-8})$ of eight attributes $\text{\em Capital\_gain\_exp}_{1},\dots,\text{\em Capital\_gain\_exp}_{8}$ . Each of these attributes has two categories. One of these categories is specified as an interval $\langle U;100000\rangle$ in Table 9. The second category can be seen as a union $\{\text{none}\}\cup\langle 0,U)$ .

Table 9

Family ${\cal F}(\text{\tt Capital\_gain\_exp(rcut), 1-8})$ of 8 attributes

Attribute	$\langle U;100000\rangle$	Attribute	$\langle U;100000\rangle$
$\text{\em Capital\_gain\_exp}_{1}$	$\langle 98001;100000\rangle$	$\text{\em Capital\_gain\_exp}_{5}$	$\langle 6001;100000\rangle$
$\text{\em Capital\_gain\_exp}_{2}$	$\langle 20001;100000\rangle$	$\text{\em Capital\_gain\_exp}_{6}$	$\langle 4001;100000\rangle$
$\text{\em Capital\_gain\_exp}_{3}$	$\langle 10001;100000\rangle$	$\text{\em Capital\_gain\_exp}_{7}$	$\langle 2001;100000\rangle$
$\text{\em Capital\_gain\_exp}_{4}$	$\langle 8001;100000\rangle$	$\text{\em Capital\_gain\_exp}_{8}$	$\langle 1;100000\rangle$

Having prepared all 28 attributes belonging to the families ${\cal F}(\text{\tt Age\_ed5(seq), 1-4})$ , ${\cal F}(\text{\tt Education(seq), 1-4})$ , ${\cal F}(\text{\tt Capital\_gain\_exp(rcut), 1-8})$ , we can try to simulate the run of the 4ft-Miner procedure specified in Fig. 17 by the arules package. First, we have to solve the problem, described in Section 4.5, of losing rules that are interesting because of a high value of the lift measure. The loss is related to the obligatory use of minimal values of both support and confidence in the arules package. The minimal value of support follows from the 4ft-quantifier of supported lift $\approx_{6,118}$ used in the run of the 4ft-Miner procedure specified in Fig. 17. We have minsup $=$ $\frac{118}{48842}$ , in other words minsup $=$ 118/(number of rows of the Adult data matrix). The minimal value of confidence must be set to a very low artificial number, which would again result in an explosion of resulting rules. (Let us note that from the application of the 4ft-Miner specified in Fig. 17 we know that the minimal confidence among the resulting 686 rules is 0.048.)

Then we can use 28 attributes introduced in Tables 7–9 together with eight additional attributes, see rows Hours_per_week(subset), 1-1, …, Workclass(subset), 1-1 in the column ANTECEDENT in Fig. 17. However, this leads to a huge number of redundant rules. The rule $\text{\em Age}\langle 35;55\rangle\land\text{\em Education}(\text{Prof-school})\approx\text{\em Capital\_gain}(\geqslant 10001)$ (the third rule in Fig. 20) can be used to demonstrate this problem. The problem arises from the fact that the Boolean attribute Education(Prof-school) is equivalent to conjunctions $\text{\em Ed}(\text{P-s})\land\text{\em Ed}_{2,1}\text{(P-s\#Doctorate)}$ , $\text{\em Ed}(\text{P-s})\land\text{\em Ed}_{2,2}\text{(Masters\#P-s)}$ , $\text{\em Ed}(\text{P-s})\land\text{\em Ed}_{3,1}\text{(Bachelors\#Masters\#P-s)}$ , $\text{\em Ed}(\text{P-s})\land\text{\em Ed}_{3,2}\text{(Masters\#P-s\#Doctorate)}$ , $\text{\em Ed}(\text{P-s})\land\text{\em Ed}_{4,1}\text{(Bachelors\#Masters\#P-s\#Doctorate)}$ , $\text{\em Ed}(\text{P-s})\land\text{\em Ed}_{4,4}\text{(Assoc-acdm\#Bachelors\#Masters\#P-s)}$ , $\text{\em Ed}(\text{P-s})\land\text{\em Ed}_{2,1}\text{(P-s\#Doctorate)}\land\text{\em Ed}_{2,2}\text{(Masters\#P-s)}$ , …, where we write “Ed” instead of “Education” and “P-s” instead of “Prof-school”. There are $2^{6}-1=63$ such conjunctions. This means that there are 63 redundant output association rules related to the rule $\text{\em Age}\langle 35;55\rangle\land\text{\em Education}(\text{Prof-school})\approx\text{\em Capital\_gain}(\geqslant 10001)$ . The problem of additional redundant output rules in the arules output concerns all rules produced by the 4ft-Miner with input specified in Fig. 17.
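
The count of 63 can be reproduced by enumerating all non-empty subsets of the six derived categories that contain Prof-school. The sketch below is only an illustration; the string labels and the function name `redundant_conjunctions` are hypothetical.

```python
from itertools import combinations

# The six derived categories from Table 8 that contain Prof-school (P-s).
implied = [
    "Ed_2,1(P-s#Doctorate)",
    "Ed_2,2(Masters#P-s)",
    "Ed_3,1(Bachelors#Masters#P-s)",
    "Ed_3,2(Masters#P-s#Doctorate)",
    "Ed_4,1(Bachelors#Masters#P-s#Doctorate)",
    "Ed_4,4(Assoc-acdm#Bachelors#Masters#P-s)",
]


def redundant_conjunctions(base, implied_items):
    """Conjunctions of `base` with every non-empty subset of `implied_items`."""
    return [
        (base,) + subset
        for size in range(1, len(implied_items) + 1)
        for subset in combinations(implied_items, size)
    ]


conjunctions = redundant_conjunctions("Ed(P-s)", implied)
print(len(conjunctions))  # 2**6 - 1 = 63
```

Each of these conjunctions selects exactly the same rows as Ed(P-s) alone, so each yields a rule with the same 4ft-table.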

The problem of redundant rules in arules can be avoided by using a family of runs of the arules package on data matrices with attributes $\text{\em Age\_ed5}_{A}$ , $\text{\em Education}_{E}$ , Hours_per_week, Marital_status, Native_country, Occupation, Race, Relationship, Sex, Workclass, $\text{\em Capital\_gain\_exp}_{G}$ , with suitable minimal values of confidence and support. We assume $\text{\em Age\_ed5}_{A}\in{\cal F}(\text{\tt Age\_ed5(seq), 1-4})$ , $\text{\em Education}_{E}\in{\cal F}(\text{\tt Education(seq), 1-4})$ , and $\text{\em Capital\_gain\_exp}_{G}\in{\cal F}(\text{\tt Capital\_gain\_exp(rcut), 1-8})$ . All combinations of $\text{\em Age\_ed5}_{A}$ , $\text{\em Education}_{E}$ , and $\text{\em Capital\_gain\_exp}_{G}$ must be used. However, there are $10\times 10\times 8=800$ such combinations.

The minimal confidence in the output set of 686 rules introduced above is 0.048 and the minimal support is 0.002. A run of the arules package task with these minimal values in the Adult data matrix results in 2 783 322 rules and requires 11 seconds. Thus, 800 runs of similar tasks require more than 146 minutes. Moreover, it is necessary to prepare the families of attributes ${\cal F}(\text{\tt Age\_ed5(seq), 1-4})$ , ${\cal F}(\text{\tt Education(seq), 1-4})$ , and ${\cal F}(\text{\tt Capital\_gain\_exp(rcut), 1-8})$ .
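
The estimate can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch under the assumption that every run takes about as long as the single measured run of 11 seconds; the variable names are illustrative.

```python
from math import prod

# Sizes of the three attribute families (Tables 7, 8 and 9).
family_sizes = {"Age_ed5": 10, "Education": 10, "Capital_gain_exp": 8}

runs = prod(family_sizes.values())       # one arules run per combination
seconds_per_run = 11                     # measured for a single run (see text)
total_minutes = runs * seconds_per_run / 60

min_support = 118 / 48842                # Base 118 over the rows of Adult

print(runs)                              # 800
print(round(total_minutes, 1))           # 146.7
print(round(min_support, 3))             # 0.002
```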

An additional possibility is to try to combine both the approaches outlined above. We can conclude that a substitution of the run of the 4ft-Miner procedure specified in Fig. 17 by applications of the arules is very laborious.

5.2 Conjunctions and disjunctions in succedent

We show two examples of additional reasonable applications of the 4ft-Miner procedure, which can hardly be substituted by the arules package. We use the following two analytical questions Q1 and Q2.

Q1:
Are there any segments of clients specified by combinations of attributes Age, Education, Hours_per_week, Marital_status, Native_country, Occupation, Race, Relationship, Sex, Workclass such that clients from this segment have both large income and extreme capital gain?
Q2:
Are there any segments of clients specified by combinations of attributes Age, Education, Hours_per_week, Marital_status, Native_country, Occupation, Race, Relationship, Sex, Workclass such that clients from this segment have large income or extreme capital gain?

We consider both Q1 and Q2 reasonable since segments of clients having both large income and extreme capital gain and also segments of clients having large income or extreme capital gain are interesting segments.

Question Q1 can be solved by finding the set of all GUHA association rules $\varphi\approx\psi$ that are true in the Adult data matrix and satisfy the following:

•
$\varphi\in\Phi$ , where $\Phi$ is a set of Boolean attributes specified in the column ANTECEDENT in Fig. 17; i.e. $\varphi$ defines a relevant segment of clients.
•
$\psi$ is a conjunction Income(large) $\land\gamma$ , where $\gamma$ is one of eight right cuts $\text{\em Capital\_gain\_exp}\langle 98001;100000\rangle$ , …, $\text{\em Capital\_gain\_exp}\langle 1;100000\rangle$ specified by the row Capital_gain_exp(rcut), 1-8 B,pos in the column SUCCEDENT in the left part of Fig. 21; i.e. $\psi$ means that a client has both large income and extreme capital gain.
•
$\approx$ is the 4ft-quantifier $\approx_{4,118}$ , i.e. we are interested in rules with lift of at least 4.

Figure 21.
Definitions of relevant succedents for Q1 and Q2.

The corresponding run of the 4ft-Miner resulted in a set of 190 rules. The rule Education(Prof-school) $\approx_{6.64,121}\text{{\em Capital\_gain}}(\geqslant 10001)\land\text{{\em Income}(large)}$ has the highest lift of 6.64. This rule has confidence 0.15. Thus, we used an additional run of the 4ft-Miner with the 4ft-quantifier $\Rightarrow_{0.14,118}$ , i.e. we were interested in rules with confidence of at least 0.14. The run resulted in a set of 2 472 rules. The highest confidence, 0.181, is possessed by the rule $Ant\Rightarrow_{0.181,127}\text{{\em Capital\_gain}}(\geqslant 2001)\land\text{{\em Income}(large)}$ , where $Ant=\text{\em Age}(35;55\rangle\land\text{{\em Education}(Prof-school, Doctorate)}\land\text{{\em Sex}(Male)}$ .

Question Q2 can be solved in a similar way; we deal with GUHA association rules $\varphi\approx\psi$ true in the Adult data matrix, which differ from the rules for question Q1 only in that we use succedents Income(large) $\lor\gamma$ , with $\gamma$ meaning the same as in the case of Q1. We used the 4ft-Miner with the 4ft-quantifier $\approx_{1.5,118}$ ; i.e. we were interested in rules with lift of at least 1.5. The set of relevant antecedents was defined as in the column ANTECEDENT in Fig. 17 and the set of relevant succedents was defined as in the column SUCCEDENT in the right part of Fig. 21.

The run of the 4ft-Miner resulted in a set of 174 rules. The highest lift, 1.535, is possessed by the rule $Ant_{1}\approx_{1.535,121}\text{{\em Capital\_gain}}(\geqslant 6001)\lor\text{{\em Income}(large)}$ , where $Ant_{1}=\text{{\em Age}}(35;50\rangle\land\text{{\em Sex}(Male)}\land\text{\em Education}\text{(Prof-school)}\land\text{{\em Hours}(Over-time)}\land\text{{\em Marital\_status}(Married-civ-spouse)}\land\text{{\em Occupation}(Prof-specialty)}\land\text{{\em Relationship}(Husband)}$ . This rule has confidence 0.76. Thus, we used an additional run of the 4ft-Miner with the 4ft-quantifier $\Rightarrow_{0.75,118}$ ; i.e. we were interested in rules with confidence of at least 0.75. The run resulted in a set of 198 rules. The highest confidence, 0.76, is possessed by the rule $Ant_{1}\approx_{1.535,121}\text{{\em Capital\_gain}}(\geqslant 6001)\lor\text{{\em Income}(large)}$ introduced above, which can thus also be written with the 4ft-quantifier $\Rightarrow_{0.76,121}$ : $Ant_{1}\Rightarrow_{0.76,121}\text{{\em Capital\_gain}}(\geqslant 6001)\lor\text{{\em Income}(large)}$ .

The results of the four above-introduced runs of the 4ft-Miner can be transformed into answers to questions Q1 and Q2. We used the secured completion of missing information, which gives more precise results than ignoring missing information as in the arules package. However, even if we use the ignoring approach implemented in the arules package, it is not easy to get equivalent results. The only way is to use suitable families of attributes and/or arules tasks similar to those introduced in Section 5.1.2.

5.3 LISp-Miner and GUHA association rules

We briefly introduce some additional features of the 4ft-Miner procedure to deal with association rules. We outline several not yet mentioned options to tune the set of relevant rules in Section 5.3.1. Then we briefly introduce conditional association rules in Section 5.3.2, which can also be mined by the 4ft-Miner.

The 4ft-Miner procedure is a part of the LISp-Miner system which involves eight other GUHA procedures [18, 46]. Two of them deal with interesting couples of GUHA association rules. The procedure SD4ft-Miner [42] mines for couples of conditional GUHA association rules expressing differences between two subsets of rows of the analysed data matrix. The procedure Ac4ft-Miner [43] mines for very general action rules [6]. However, a more detailed description of these procedures is out of the scope of this paper.

5.3.1 Additional options to define set of relevant rules

There are several additional options to tune a set of relevant rules. One possibility is to define coefficients of the type cyclical sequences. Cyclical sequences can be used for days of the week and similar attributes with cyclical categories. If we have an attribute Day with categories Sun, Mon, Tue, Wed, Thu, Fri, Sat, then there are the following basic Boolean attributes – sequences of length 3: Day(Sun,Mon,Tue), Day(Mon,Tue,Wed), Day(Tue,Wed,Thu), Day(Wed,Thu,Fri), Day(Thu,Fri,Sat). However, there are two additional basic Boolean attributes if we ask to generate cyclical sequences of length 3: Day(Fri,Sat,Sun) and Day(Sat,Sun,Mon).
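
The difference between plain and cyclical sequences can be sketched with index arithmetic modulo the number of categories. The function `cyclical_sequences` is a hypothetical illustration, not the LISp-Miner implementation, and it assumes the requested length is smaller than the number of categories.

```python
def cyclical_sequences(categories, length):
    """Runs of `length` consecutive categories, wrapping around the end."""
    n = len(categories)
    return [
        tuple(categories[(start + i) % n] for i in range(length))
        for start in range(n)
    ]


days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

plain = [tuple(days[i:i + 3]) for i in range(len(days) - 3 + 1)]
cyclic = cyclical_sequences(days, 3)

# The sequences that exist only in the cyclical variant.
extra = [seq for seq in cyclic if seq not in plain]
print(extra)  # [('Fri', 'Sat', 'Sun'), ('Sat', 'Sun', 'Mon')]
```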

An additional possibility is to automatically generate negations of basic Boolean attributes. This way, attributes $\lnot$ Age(Young), $\lnot$ Age(Middle-aged), $\lnot$ Age(Senior), $\lnot$ Age(Old) can be generated.

There is also a possibility to define sets of equivalent attributes. Then, a maximum of one attribute from each set of equivalence can occur in a rule. The set {Age, Age_ed5} is an example of a useful set of equivalence.

We also mention a possibility to mark each attribute as basic or remaining. It is important that each partial cedent (see Section 3.3.2) has to contain at least one basic attribute. If we are interested in rules, where the attribute Education is always present, then we can mark this attribute as basic and all other attributes as remaining. This can be applied e.g. to the partial cedent Client, see Fig. 17.

5.3.2 Conditional association rules

The 4ft-Miner procedure also mines for conditional GUHA association rules. A conditional GUHA association rule is an expression $\varphi\approx\psi/\chi$ , where $\varphi$ , $\psi$ , and $\chi$ are Boolean attributes. A rule $\varphi\approx\psi/\chi$ is true in a data matrix $\cal M$ if a rule $\varphi\approx\psi$ is true in a data matrix ${\cal M}/\chi$ . The data matrix ${\cal M}/\chi$ consists of all rows of ${\cal M}$ satisfying $\chi$ . It is easy to show that for rules with minC and minS only it holds that:

•
if there is a row of a data matrix ${\cal M}$ satisfying $\chi$ and the rule $\varphi\land\chi\rightarrow_{p,s}\psi$ is true in $\cal M$ , then the rule $\varphi\rightarrow_{p,s}\psi$ is true in a data matrix ${\cal M}/\chi$ .

Figure 22.
Data matrices ${\cal M}_{A}$ , ${\cal M}_{B}$ and relevant 4ft-tables

However, there is no general relation between truthfulness of a rule $\varphi\land\chi\approx_{p,s}\psi$ and truthfulness of a conditional rule $\varphi\approx_{p,s}\psi/\chi$ in a data matrix ${\cal M}$ . Let us have data matrices ${\cal M}_{A}$ and ${\cal M}_{B}$ according to Fig. 22 where relevant 4ft-tables are also shown. We show that both (A) and (B) are true:
(A)
there are rules $\varphi\land\chi\approx_{p,s}\psi$ and $\varphi\approx_{p,s}\psi$ such that $\varphi\land\chi\approx_{p,s}\psi$ is true in ${\cal M}_{A}$ and $\varphi\approx_{p,s}\psi$ is false in ${\cal M}_{A}/\chi$ .
(B)
there are rules $\varphi\land\chi\approx_{p,s}\psi$ and $\varphi\approx_{p,s}\psi$ such that $\varphi\land\chi\approx_{p,s}\psi$ is false in ${\cal M}_{B}$ and $\varphi\approx_{p,s}\psi$ is true in ${\cal M}_{B}/\chi$ .

A simple proof follows.
(A)
Let us have rules $\varphi\land\chi\approx_{2,0.3}\psi$ and $\varphi\approx_{2,0.3}\psi$ . Then it holds:

–
for $\varphi\land\chi\approx_{2,0.3}\psi$ in the data matrix ${\cal M}_{A}$ : $lift=\frac{100(100+1+1+199)}{(100+1)(100+1)}=2.95$ and $support=\frac{100}{301}=0.33$ , which means that the rule $\varphi\land\chi\approx_{2,0.3}\psi$ is true in ${\cal M}_{A}$ ;
–
for $\varphi\approx_{2,0.3}\psi$ in the data matrix ${\cal M}_{A}/\chi$ : $lift=\frac{100(100+1+1+100)}{(100+1)(100+1)}=1.98$ and $support=\frac{100}{202}=0.50$ , thus the rule $\varphi\approx_{2,0.3}\psi$ is false in ${\cal M}_{A}/\chi$ .

(B)
Let us have rules $\varphi\land\chi\approx_{2,0.005}\psi$ and $\varphi\approx_{2,0.005}\psi$ . Then it holds:

–
for $\varphi\land\chi\approx_{2,0.005}\psi$ in the data matrix ${\cal M}_{B}$ : $lift=\frac{1(1+1+100+100)}{(1+1)(1+100)}=1.00$ and $support=\frac{1}{202}=0.00495$ , which means that the rule $\varphi\land\chi\approx_{2,0.005}\psi$ is false in ${\cal M}_{B}$ ;
–
for $\varphi\approx_{2,0.005}\psi$ in the data matrix ${\cal M}_{B}/\chi$ : $lift=\frac{1(1+1+1+100)}{(1+1)(1+1)}=25.75$ and $support=\frac{1}{103}=0.0097$ , thus the rule $\varphi\approx_{2,0.005}\psi$ is true in ${\cal M}_{B}/\chi$ .
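
The four evaluations in the proof can be re-computed mechanically. The sketch below assumes that the 4ft-tables are exactly the frequencies appearing in the lift formulas above; the function names are hypothetical.

```python
def lift_and_support(a, b, c, d):
    """Lift a*n/((a+b)*(a+c)) and relative support a/n of a 4ft-table."""
    n = a + b + c + d
    return a * n / ((a + b) * (a + c)), a / n


def rule_holds(table, min_lift, min_support):
    lift, support = lift_and_support(*table)
    return lift >= min_lift and support >= min_support


# 4ft-tables <a, b, c, d> read off the formulas in the proof.
m_a, m_a_chi = (100, 1, 1, 199), (100, 1, 1, 100)
m_b, m_b_chi = (1, 1, 100, 100), (1, 1, 1, 100)

# (A): the rule with the extended antecedent holds, the conditional rule does not.
print(rule_holds(m_a, 2, 0.3), rule_holds(m_a_chi, 2, 0.3))      # True False
# (B): the rule with the extended antecedent fails, the conditional rule holds.
print(rule_holds(m_b, 2, 0.005), rule_holds(m_b_chi, 2, 0.005))  # False True
```

In case (A) the conditional rule fails only on lift (about 1.98), whereas in case (B) the unconditional rule fails on both lift and support.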

We can conclude that mining with conditional association rules can produce additional interesting results.

6. Related works

Association rules were introduced as expressions $X\rightarrow Y$ , where $X$ and $Y$ are sets of items [3]. The apriori algorithm [3] was developed to mine for such association rules. It has been modified and implemented many times. Also, additional algorithms for finding all large itemsets were developed. Their overview is in book [1].

There are various attempts to generalise the association rules $X\rightarrow Y$ introduced in [3]. A more general form of association rules is introduced in [4]. The association rule is an expression $Ant\rightarrow Con$ , where both $Ant$ and $Con$ are conjunctions of attribute-value pairs. An example of such an association rule used in [4] is

$\displaystyle(\text{age}=40)\land(\text{Salary}=50000)\rightarrow(\text{own home}=\text{yes}).$

This can be formally written as $A_{1}(a_{1})\land\dots\land A_{u}(a_{u})\rightarrow C_{1}(c_{1})\land\dots\land C_{v}(c_{v})$ , where $a_{1}$ is one of the possible values of attribute $A_{1}$ , etc.

The association rules of the form $A_{1}(a_{1})\land\dots\land A_{u}(a_{u})\rightarrow C_{1}(c_{1})\land\dots\land C_{v}(c_{v})$ convey information about co-occurrence relations between items. Various approaches have been used to enhance association rules to express additional relations among items. An overview of such approaches is given in [19], where a definition of generalised association rules is introduced. A Boolean expression built from items is defined first. Propositional connectives $\land,\lor,\lnot$ as well as parentheses can be used in the usual way. Expressions $A\land(B\lor C)$ and $(D\lor E)\land\lnot F$ are examples of Boolean attributes built from the toy set $I_{\cal T}=\{A,B,C,D,E,F\}$ of items. Let $\cal I$ be a set of items and let $\{x_{1},\dots,x_{u}\}\subset{\cal I}$ , $\{y_{1},\dots,y_{u}\}\subset{\cal I}$ , and $\{x_{1},\dots,x_{u}\}\cap\{y_{1},\dots,y_{u}\}=\emptyset$ . Then a generalised association rule is each expression $\rho(x_{1},\dots,x_{u})\rightarrow\nu(y_{1},\dots,y_{u})$ , where $\rho(x_{1},\dots,x_{u})$ and $\nu(y_{1},\dots,y_{u})$ are Boolean expressions. The expression $A\land(B\lor C)\rightarrow(D\lor E)\land\lnot F$ is an example of a generalised association rule for the toy set $I_{\cal T}$ . Paper [19] then discusses problems of mining generalised association rules. Association rules with negations are also discussed in [10].

An important problem is mining association rules with numerical data. Association rules concerning data with numerical attributes are defined in [9]. Let $R$ be a (database) relation. Primitive conditions are defined first. Primitive conditions for a Boolean attribute $A$ are $A=yes$ and $A=no$ . Primitive conditions for a numerical attribute $A$ are $A=v$ and $A\in\langle v1,v2\rangle$ . Let $t$ be a tuple in $R$ , and let $t[A]$ denote the value of $t$ for the attribute $A$ . Then $t$ meets $A=v$ if $t[A]$ is equal to $v$ and $t$ meets $A\in\langle v1,v2\rangle$ if $t[A]\in\langle v1,v2\rangle$ . Conjunctions of primitive conditions are used to describe more complicated conditions. An association rule is then an expression $C_{1}\rightarrow C_{2}$ , where $C_{1}$ and $C_{2}$ are conjunctions of primitive conditions. Paper [9] then discusses problems of mining such association rules. An overview of additional approaches to mining association rules with numerical data is in [26]. Optimised gain rules [5] and optimised support rules [34] are special cases of association rules with numerical data.

The above-introduced enhancement of association rules leads to rules, which can be understood as GUHA association rules. Let us also mention an additional recent implementation of the ASSOC procedure, which mines for very general GUHA association rules, see [31].

There is also a paper comparing the performance of the arules implementation of apriori and the 4ft-Miner procedure [52]. However, this comparison is flawed because of several principal faults. First of all, an unsuitable module of the LISp-Miner was chosen for a classification task, the aim of which is to produce many (millions of) association rules without any regard to their confidence. The 4ft-Miner is a GUHA procedure for exploratory analysis, where users (interactively) interpret the found results. Task parameters should therefore have been set to get a reasonable number (tens or hundreds) of association rules in the results, or the results should have been filtered in the sense of Section 7.2. Secondly, the solution times presented in [52] are hugely skewed by the (inappropriately) chosen MS Access as a DBMS to store results through ODBC. This is incomparable to storing results to plain text files as chosen for the arules. The presented solution times for 4ft-Miner thus do not measure the efficiency of the algorithm, but the efficiency of the ODBC and the DBMS used. Further, the comparison in [52] does not take into account the proper handling of missing information, despite the used data matrix having a large number of missing values. It compares the arules (completely ignoring missing values by design) to 4ft-Miner with X-categories enabled and thus undertaking all the necessary steps for a proper handling of missing information as introduced in Section 3.2.1 and discussed in Section 4.4. Finally, the command-line modules should have been used and compared to the arules run from the command line (and not the interactive LM Workspace module with much time spent in progress reporting and other user-interface related functions).

There are additional ways to generalise association rules, which do not lead to GUHA association rules and are not relevant to this paper. Let us mention generalised association rules based on taxonomies introduced in [50] and their enhancement using the fuzzy approach [20]. An approach to provide a notion of importance or weight to individual items of association rules is introduced in [27]. A short overview of such additional approaches is also available in [27].

Important implementations of the GUHA ASSOC procedure are based on bit-strings (sometimes called bitmaps). There are also several approaches to using bitmaps in data mining with association rules. An application of bitmaps in the apriori algorithm is described in [23], where the original apriori introduced in [3] is used. An overview of additional attempts to apply bitmap techniques in association rule algorithms is also provided in [23]. In [53], an application of bitmaps to mining maximal weighted frequent patterns is described. There are also various approaches to using bitmaps for building indexes used in mining maximal frequent itemsets, see e.g. [28, 49]. However, all these approaches differ from that used in the implementation of the ASSOC procedure.
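The general idea of a bit-string representation can be illustrated by a minimal Python sketch. It is not the LISp-Miner implementation (its principles are described in Section 3.3.3); the function names `bitstring` and `four_fold_table` and the toy data are assumptions introduced only for this illustration. Each category of an attribute is encoded as one integer whose bit i is set when row i has that category, and the four frequencies a, b, c, d of a 4ft-table are then obtained by bitwise operations and bit counting.

```python
def bitstring(column, category):
    """Return an int whose bit i is 1 iff row i of `column` equals `category`."""
    bits = 0
    for i, value in enumerate(column):
        if value == category:
            bits |= 1 << i
    return bits


def four_fold_table(ant, suc, n_rows):
    """Frequencies (a, b, c, d) for two Boolean attributes given as bit-strings.

    a: rows satisfying both ant and suc
    b: rows satisfying ant but not suc
    c: rows satisfying suc but not ant
    d: rows satisfying neither
    """
    mask = (1 << n_rows) - 1                 # keeps only the n_rows valid bits
    a = bin(ant & suc).count("1")
    b = bin(ant & ~suc & mask).count("1")
    c = bin(~ant & suc & mask).count("1")
    d = n_rows - a - b - c
    return a, b, c, d


# Hypothetical toy data matrix with two columns and six rows.
sex = ["F", "M", "F", "F", "M", "M"]
income = ["high", "low", "high", "low", "high", "low"]

ant = bitstring(sex, "F")          # Boolean attribute Sex(F)
suc = bitstring(income, "high")    # Boolean attribute Income(high)
table = four_fold_table(ant, suc, len(sex))
print(table)
```

A conjunction of Boolean attributes corresponds to a bitwise AND of their bit-strings and a disjunction to a bitwise OR, so the frequencies of derived attributes can be computed without rescanning the data matrix.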

Let us also emphasise that there are papers [32, 33, 51] concerning applications of constraint programming to itemset mining, which lead to itemsets satisfying various constraints. Paper [8] deals with rules based on first order logic and mathematical statistics. A confirmation measure is defined in that paper, and a new 4ft-quantifier can be defined on its basis. Let us note that observational logical calculi are introduced in [15]. They can be seen as modifications of a predicate calculus – only finite data structures are allowed and generalised quantifiers are added. GUHA association rules are formulas of an observational calculus of association rules [38]. An additional relevant topic is mining association rules in multiple relations [7]. This is related to mining GUHA association rules in many-sorted observational calculi, see Section 16.8 in [38].

7. Conclusions and further work

7.1 Conclusions

Two approaches to data mining with association rules are compared. The first one is based on the apriori algorithm introduced in [3] and the second is the ASSOC procedure [15, 38] dealing with GUHA association rules. The conclusions follow.

7.1.1 Definitions and available theoretical background

The definitions of association rules used in both approaches are introduced, see Sections 2 and 3.1. It can be concluded that GUHA association rules are substantially more general than the association rules related to the apriori algorithm.

There are many papers and books dealing with the association rules introduced in [3], their enhancements and the related apriori algorithm. Several of them are mentioned in Section 6. The concept of GUHA association rules comes from the 1960s, and there is a solid theoretical background for it. The theory deals in particular with classes of association rules, deduction rules and missing information, see [15, 38] and Sections 3.1, 3.2. Section 7.1.3 summarises some results on the secured completion of missing information developed in [15].

7.1.2 Performance of algorithms

Performance of three algorithms for mining with association rules is discussed. The first one is apriori implemented in the arules package and described in Section 2. The second one is the 4ft-Miner procedure which implements the ASSOC procedure, see Sections 3.3.1–3.3.3. The third one is the apriori implementation, which is part of the 4ft-Miner, see Section 3.3.4. We can conclude that the ASSOC implementation in the 4ft-Miner is slower than apriori implemented in the arules. However, this is related to the fact that the 4ft-Miner keeps information necessary to apply secured completion to missing information [15, 38].

It is important that the time consumed by the 4ft-Miner is still acceptable and it is only a fraction of time necessary to solve the whole analytical task concerning real data, which usually contain missing information. In addition, the apriori implemented in the arules package produces misleading rules when dealing with missing information, see Section 7.1.3.

7.1.3 Dealing with missing information

The handling of missing information used in the arules implementation of the apriori algorithm produces a large number of confusing rules (about 10–25% in some of the solved tasks). Such rules are surely false in at least one possible completion of the analysed data matrix, and in many cases they are false in many more than one completion. Moreover, the value of lift of a particular rule ranges over a large interval if all possible completions of the analysed data are considered. The same is true for confidence and for additional measures of interest. Several examples and more details are in Section 4.4.

The core of the problem is that arules treats the situation in which we have no information about the value of a Boolean attribute $A(a)$ as if $A(a)$ were false. This is not the case for the secured completion of missing information. It computes nine-fold tables of couples of {0,1,X}-valued attributes and can be seen as a generalisation of Kleene’s approach [22].

The 4ft-Miner uses the secured completion of missing information and does not produce the confusing rules produced by the arules.
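The effect described in this section can be illustrated on a toy example. The following Python sketch is not the secured completion of [15] and not the code of arules; it only contrasts the "missing means false" treatment with a brute-force enumeration of all completions. The data, the threshold 0.7 and all function names are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from itertools import product

X = None  # marker for a missing value


def nine_fold_table(ant, suc):
    """Count the pairs of values of two {0,1,X}-valued attributes."""
    return Counter(zip(ant, suc))


def confidence(ant, suc):
    """Confidence a / (a + b); rows with a missing value never match 1."""
    a = sum(1 for x, y in zip(ant, suc) if x == 1 and y == 1)
    r = sum(1 for x in ant if x == 1)
    return a / r if r else None


def as_false(values):
    """Treat every missing value as 0 (the behaviour criticised above)."""
    return [0 if v is X else v for v in values]


def completions(values):
    """Yield every way of replacing the missing values by 0 or 1."""
    slots = [i for i, v in enumerate(values) if v is X]
    for fill in product((0, 1), repeat=len(slots)):
        done = list(values)
        for i, v in zip(slots, fill):
            done[i] = v
        yield done


def confidence_range(ant, suc):
    """Smallest and largest confidence over all completions (brute force)."""
    vals = [confidence(a, s) for a in completions(ant) for s in completions(suc)]
    vals = [v for v in vals if v is not None]
    return min(vals), max(vals)


# Hypothetical data: eight rows, three of them with a missing value.
ant = [1, 1, 1, 1, X, X, 0, 0]
suc = [1, 1, 1, X, 0, 1, 0, 1]

naive = confidence(as_false(ant), as_false(suc))
low, high = confidence_range(ant, suc)
print(nine_fold_table(ant, suc))
print(naive, low, high)
```

With a hypothetical threshold of 0.7, the rule is accepted when missing values are treated as false (confidence 0.75), although its confidence drops to 0.6 in at least one completion of the data. The brute-force enumeration grows exponentially with the number of missing values and is meant only to make the notion of "false in at least one completion" concrete.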

7.1.4 Obligatory minimal confidence and support

Minimal values of both support and confidence are obligatory parameters in the arules implementation of apriori. This can result in the loss of some rules that are interesting because of their high lift, see Section 4.5. This is not a concern for the 4ft-Miner procedure, where conditions on measures of interestingness can be freely combined.
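A small worked example may make this point concrete. The following Python sketch computes support, confidence and lift from the four frequencies a, b, c, d of a 4ft-table and applies two different filters. The frequencies are hypothetical; the thresholds 0.05, 0.9 and 2.96 are borrowed from the caption of Table 6, and the function names are assumptions made for this illustration.

```python
def measures(a, b, c, d):
    """Support, confidence and lift of Ant -> Con from a 4ft-table.

    a: rows satisfying Ant and Con, b: Ant only, c: Con only, d: neither.
    """
    n = a + b + c + d
    support = a / n
    confidence = a / (a + b)
    lift = a * n / ((a + b) * (a + c))   # confidence divided by (a + c) / n
    return support, confidence, lift


def keep(table, min_sup, min_conf=None, min_lift=None):
    """Return True if the rule passes all thresholds that are given."""
    sup, conf, lift = measures(*table)
    if sup < min_sup:
        return False
    if min_conf is not None and conf < min_conf:
        return False
    if min_lift is not None and lift < min_lift:
        return False
    return True


# Hypothetical rule: 1000 rows, the antecedent holds in 200, the consequent in 100.
rule = (60, 140, 40, 760)
print(measures(*rule))

# A filter that insists on a minimal confidence drops the rule ...
print(keep(rule, min_sup=0.05, min_conf=0.9))
# ... while a filter combining support with lift only keeps it.
print(keep(rule, min_sup=0.05, min_lift=2.96))
```

The hypothetical rule has support 0.06 and lift 3.0, so its consequent is three times more frequent among rows satisfying the antecedent than in the whole data, yet its confidence of 0.3 is far below a high minimal confidence.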

7.1.5 Set of relevant GUHA association rules

There are very large sets of GUHA association rules, which can be derived from attributes – columns of an analysed data matrix. The 4ft-Miner procedure has efficient options to specify a set of relevant rules to be generated and verified. Among these options there are partial cedents and various types of coefficient $\alpha$ to be generated for relevant basic Boolean attributes $A(\alpha)$ . Sequences and cuts for ordinal attributes are very useful. Cyclical sequences of categories for attributes with a cyclical nature (e.g. days of the week) are also very efficient. Let us note that it is also possible to deal with disjunctions of basic Boolean attributes.
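The difference between sequences and cyclical sequences of categories can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions, not the 4ft-Miner implementation; the function name `sequences`, its parameters and the choice of lengths 1 to 2 are hypothetical. Each generated tuple plays the role of one coefficient $\alpha$ of a basic Boolean attribute $A(\alpha)$.

```python
def sequences(categories, min_len, max_len, cyclical=False):
    """Generate runs of consecutive categories with length min_len..max_len.

    With cyclical=True a run may wrap around from the last category to the
    first one; max_len is then assumed to be smaller than len(categories),
    otherwise the same set of categories would be produced repeatedly.
    """
    n = len(categories)
    result = []
    for length in range(min_len, max_len + 1):
        starts = range(n) if cyclical else range(n - length + 1)
        for start in starts:
            run = tuple(categories[(start + k) % n] for k in range(length))
            result.append(run)
    return result


days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

plain = sequences(days, 1, 2)                 # 7 single days + 6 pairs
cyclic = sequences(days, 1, 2, cyclical=True) # additionally ("Sun", "Mon")

print(len(plain), len(cyclic))
print(sorted(set(cyclic) - set(plain)))
```

The only coefficient added by the cyclical variant in this example is the pair Sunday and Monday, which a plain sequence over an ordered list of days cannot express.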

Several examples are in Sections 5.1.1, 5.2 and 5.3.1. Section 5.1.2 shows that some possibilities of defining a set of relevant GUHA association rules can be theoretically substituted by ASSOC families of apriori tasks. However, preparation and solving of such families is very time consuming, even if done by a script.

7.1.6 Conditional rules and couples of rules

The general form of GUHA association rules makes it meaningful to deal with conditional association rules, see Section 5.3.2. The 4ft-Miner procedure also mines for conditional association rules. The 4ft-Miner procedure is a part of the LISp-Miner system, which involves eight additional GUHA procedures [18, 46]. Two of them deal with interesting couples of GUHA association rules. In this way it is possible to obtain answers to new types of analytical questions.

7.2 Further work

A crucial problem of data mining with association rules is the very large number of output rules. Most of the output rules are uninteresting consequences of already known items of domain knowledge. This problem is even greater when dealing with syntactically richer GUHA association rules. However, the syntactical richness makes it possible to formulate useful deduction rules, which can be used to reduce the number of output GUHA association rules. Various theoretical results have been achieved; they can be seen as a logic of association rules [38]. An original approach to the use of domain knowledge in mining association rules has been introduced in [37] and tested in [44]. The formal FOFRADAR framework has been developed in order to make it possible to formally describe the process of data mining with domain knowledge and GUHA association rules [39]. The idea is to deal with formalised items of domain knowledge corresponding to more complex patterns than single association rules. The following principles are used:

• Each given item of domain knowledge is mapped into a set of simple association rules in co-operation with a domain expert.

• This set is further expanded using logical deduction into a set of all association rules, which can be considered as consequences of the given item of knowledge.

• Resulting sets of rules – consequences of given items of domain knowledge – are then used to interpret results of a data mining procedure.

All necessary steps of dealing with such items of domain knowledge are formally described by FOFRADAR and are supported by the LISp-Miner system, which also includes the 4ft-Miner procedure [41, 44]. However, their application involves elaborate operations in several modules of the LISp-Miner system. Thus the LMCL (LISp-Miner Control Language) scripting language has been developed [45, 48]. The LMCL makes it possible to describe the necessary operations and run them automatically. First experiments with this approach are described in [45].

Further work is planned as a continuation of the above-introduced activities in the field of automation of dealing with domain knowledge in data mining with GUHA association rules.

Acknowledgments

The work described here has been supported by funds of institutional support for long-term conceptual development of science and research at the Faculty of Informatics and Statistics of the University of Economics, Prague and by the internal grant agency of UEP under IGA 26/2011.

References

1. Aggarwal C.C., Han et al., Frequent Pattern Mining, Springer, Berlin, 2014.

2. Aggarwal C.C., Data Mining, Springer, Berlin, 2015.

3. Agrawal, Imielinski and Swami, Mining Associations between Sets of Items in Large Databases, in: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Buneman and Jajodia, eds, ACM Press, Fort-Collins, 1993, pp. 207–216.

4. Brian, Swami A.N. and Widom, Clustering association rules, in: Proceedings of the Thirteenth International Conference on Data Engineering, Gray and Larson, eds, IEEE Computer Society, 1997, pp. 220–231.

5. Brin, Rastogi and Kyuseok, Mining optimized gain rules for numeric attributes, Knowledge and Data Engineering 15 (2003), 324–338.

6. Dardzinska, Action Rules Mining, Springer, Berlin, 2013.

7. Dehaspe and De Raedt, Mining association rules in multiple relations, in: Inductive Logic Programming, Lavrač and Dzeroski, eds, Springer-Verlag, Berlin Heidelberg, 1997, pp. 125–132.

8. Flach P.A. and Lachiche, Confirmation-guided discovery of first-order rules with Tertius, Machine Learning 42 (2001), 61–95.

9. Fukuda, Morimoto, Morishita and Tokuyama, Mining Optimized Association Rules for Numeric Attributes, Journal of Computer and System Sciences 58 (1999), 1–12.

10. Gasmi, Ben Yahia, Nguifo E.M. and Bouker, Extraction of Association Rules Based on Literalsets, in: Data Warehousing and Knowledge Discovery, Song I.Y., Eder and Nguyen T.M., eds, Springer, 2007, pp. 293–302.

11. Geng and Hamilton H.J., Interestingness Measures for Data Mining: A survey, ACM Computing Surveys (CSUR) 38 (2006), 1–32.

12. Hahsler, Buchta, Gruen and Hornik, arules: Mining Association Rules and Frequent Itemsets. R package version 1.3-1, http://CRAN.R-project.org/package=arules, cited 12 Feb. 2016.

13. Hájek (guest ed.), special issue on GUHA, International Journal of Man-Machine Studies 10 (1978).

14. Hájek, The new version of the GUHA procedure ASSOC, in: Proceedings COMPSTAT 1984, Havránek, Šidák and Novák, eds, Springer-Verlag, Berlin Heidelberg, 1984, pp. 360–365.

15. Hájek and Havránek, Mechanising Hypothesis Formation – Mathematical Foundations for a General Theory, Springer-Verlag, Berlin Heidelberg New York, 1978, http://www.cs.cas.cz/hajek/guhabook/, cited 12 Feb. 2016.

16. Hájek, Havránek and Chytil, GUHA Method, Academia, Praha, 1983 (in Czech).

17. Hájek, Havel and Chytil, The GUHA method of automatic hypotheses determination, Computing 1 (1966), 293–308.

18. Hájek, Holeňa and Rauch, The GUHA method and its meaning for data mining, Journal of Computer and System Sciences 76 (2010), 34–48.

19. Hamrouni, Sadok B.Y. and Engelbert, Generalization of association rules through disjunction, Annals of Mathematics and Artificial Intelligence 59 (2010), 201–222.

20. Hong T.P., Lin K.Y. and Wang S.L., Fuzzy data mining for interesting generalized association rules, Fuzzy Sets and Systems 138 (2003), 255–269.

21. Luo, Wang and Tong, Mining association rules in incomplete information systems, Journal of Central South University of Technology 151 (2008), 733–737.

22. Kleene S.C., Introduction to Metamathematics, Van Nostrand, Princeton, NJ, 1950.

23. Lin T.Y. and Louie, A fast association rule algorithm based on bitmap and granular computing, in: Proceedings The 12th IEEE International Conference on Fuzzy Systems (Volume:1), Nasraoui, Frigui and Keller J.M., eds, IEEE, Piscataway, NJ, 2003, pp. 678–683.

24. LISp-Miner system webpage, URL: lispminer.vse.cz, cited 12 Feb. 2016.

25. Mansingh, Osei-Bryson K.M. and Reichgelt, Using ontologies to facilitate post-processing of association rules by domain experts, Information Sciences 181 (2011), 419–434.

26. Minaei-Bidgoli, Barmaki and Nasiri, Mining numerical association rules via multi-objective genetic algorithms, Information Sciences 233 (2013), 15–24.

27. Pears, Koh Y.S., Dobbie and Yeap, Weighted association rule mining via a graph based connectivity model, Information Sciences 218 (2013), 61–84.

28. Qiao and Zhang, Efficiently matching frequent patterns based on bitmap inverted files built from closed itemsets, International Journal on Artificial Intelligence Tools 21 (2012), 1–19.

29. The R Project for Statistical Computing, https://www.R-project.org/, cited 12 Feb. 2016.

30. Ragel and Cremilleux, Treatment of Missing Values for Association Rules, in: 2nd Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining, Wu, Kotagiri and Korb K.B., eds, Springer-Verlag, Berlin Heidelberg, 1998, pp. 258–279.

31. Ralbovský and Kuchař, Using Disjunctions in Association Mining, in: Advances in Data Mining – Theoretical Aspects and Applications, Perner, eds, Springer-Verlag, Berlin Heidelberg, 2007, pp. 339–351.

32. De Raedt, Tias and Nijssen, Constraint programming for itemset mining, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Li, Liu and Sarawagi, eds, ACM New York, NY, 2008, pp. 204–212.

33. De Raedt and Zimmermann, Constraint-Based Pattern Set Mining, in: 7th SIAM International Conference on Data Mining, Apte, Liu and Parthasarathy, eds, Cambridge University Press, Cambridge, 2007, pp. 237–248.

34. Rastogi and Shim, Mining optimized support rules for numeric attributes, Information Systems 26 (2001), 425–444.

35. Rauch, Ein Beitrag zu der GUHA method in der dreivertigen logic, Kybernetika 11 (1975), 101–113.

36. Rauch, Some Remarks on Computer Realizations of GUHA Procedures, International Journal of Man Machine Studies 10 (1978), 23–28.

37. Rauch, Considerations on Logical Calculi for Dealing with Knowledge in Data Mining, in: Advances in Data Management, Ras Z.W. and Dardzinska, eds, Springer-Verlag, Berlin Heidelberg, 2009, pp. 177–202.

38. Rauch, Observational Calculi and Association Rules, Springer-Verlag, Berlin Heidelberg, 2013.

39. Rauch, Formal Framework for Data Mining with Association Rules and Domain Knowledge – Overview of an Approach, Fundamenta Informaticae 137 (2015), 171–217.

40. Rauch and Šimůnek, Mining for 4ft Rules, in: Discovery Science, Third International Conference, Arikawa and Morishita, eds, Springer-Verlag, Berlin Heidelberg, 2000, pp. 268–272.

41. Rauch and Šimůnek, An Alternative Approach to Mining Association Rules, in: Data Mining: Foundations, Methods, and Applications, Lin T.Y., eds, Springer-Verlag, Berlin Heidelberg, 2005, pp. 219–238.

42. Rauch and Šimůnek, Dealing with Background Knowledge in the SEWEBAR Project, in: Knowledge Discovery Enhanced with Semantic and Social Information, Berendt, eds, Springer-Verlag, Berlin Heidelberg, 2009, pp. 89–106.

43. Rauch and Šimůnek, Action Rules and the GUHA Method: Preliminary Considerations and Results, in: Foundations of Intelligent Systems, Rauch, eds, Springer-Verlag, Berlin Heidelberg, 2009, pp. 76–87.

44. Rauch and Šimůnek, Applying Domain Knowledge in Association Rules Mining Process – First Experience, in: Proceedings Foundations of Intelligent Systems, Kryszkiewicz, eds, Springer-Verlag, Berlin Heidelberg, 2011, pp. 113–122.

45. Rauch and Šimůnek, Learning Association Rules from Data through Domain Knowledge and Automation, in: Proceedings Rules on the Web. From Theory to Applications, Bikakis, Fodor and Roman, eds, Springer-Verlag, Berlin Heidelberg, 2014, pp. 266–280.

46. Rauch and Šimůnek, Knowledge Discovery in Databases, Lisp-Miner and GUHA, Oeconomica, Prague, 2014 (in Czech).

47. Šimůnek, Academic KDD Project LISp-Miner, in: Proceedings Advances in Soft Computing and Intelligent Systems – Design and Applications, Abraham, Franke and Koppen, eds, Springer-Verlag, Berlin Heidelberg, 2003, pp. 263–272.

48. Šimůnek, LISp-Miner Control Language description of scripting language implementation, Journal of Systems Integration 5 (2014), 28–44, URL: http://www.si-journal.org/index.php/JSI/article/viewFile/193/140.

49. Song, Yang and Xu, Index-maxminer: a new maximal frequent itemset mining algorithm, International Journal on Artificial Intelligence Tools 17 (2008), 303–320.

50. Srikant and Agrawal, Mining generalized association rules, Future Generation Computer Systems 13 (1997), 161–180.

51. Tias, Nijssen and De Raedt, Itemset mining: A constraint programming perspective, Artificial Intelligence 175 (2011), 1951–1983.

52. Vojíř, Zeman, Kuchař and Kliegr, EasyMiner/R Preview: Towards a Web Interface for Association Rule Learning and Classification in R, in: Proceedings of the RuleML 2015 Challenge, Bassiliades, eds, 2015, http://ceur-ws.org/Vol-1417/paper10.pdf.

53. Yun, Shin, Ryu and Yoon, An efficient mining algorithm for maximal weighted frequent patterns in transactional databases, Knowledge-Based Systems 33 (2012), 53–64.

