Abstract
Two approaches to data mining with association rules are compared – the apriori algorithm and the ASSOC procedure. The first one was developed for market basket analysis at the beginning of 1990s. An association rule is understood as an implication between conjunctions of attribute-value pairs. The ASSOC procedure is an implementation of the GUHA method of mechanizing hypothesis formation developed since the 1960s. ASSOC deals with association rules – general relations of two general Boolean attributes. Arules – a computational environment for mining association rules based on apriori and the 4ft-Miner procedure – an implementation of the ASSOC procedure are discussed and compared. It is shown that the arules approach to missing information does not correspond to Kleene’s approach and this can lead to a large number of misleading rules. It is also shown that a secured completion developed for the ASSOC procedure avoids this problem.
Introduction
Association rules were introduced in the early 1990s with a goal to better understand purchasing behaviour of customers in supermarkets [3]. An association rule is an expression
The idea of association rules has been later generalised to data in a tabular, attribute-value form. The association rule is further understood as an expression
However, the concept of association rules was introduced and studied since the 1960s in the framework of the GUHA method development [17]. Monograph [15] introduces a general theory of logic of discovery based on mathematical logic and statistics. Association rules introduced in [15] are general relations
The ASSOC procedure [13, 14] was invented to mine for GUHA association rules. The ASSOC procedure does not use the apriori algorithm; it uses depth-first walking of the search space and is based on suitable data structures built from strings of bits [36]. It was implemented several times and applied to many real-world data sets [18]. The core feature of the ASSOC procedure is its ability to mine for GUHA association rules satisfying given syntactical patterns. This means that one run of the ASSOC procedure can correspond to a large family of apriori runs. An additional important feature of the ASSOC procedure is a specific handling of missing information, i.e. secured completion (sometimes called pessimistic) and optimistic completion [15, 38].
The 4ft-Miner procedure introduced in [41] is an implementation of an enhancement of the ASSOC procedure. It provides very fine possibilities to define a relevant set of GUHA association rules and to find all relevant rules true in a given data matrix. This makes it possible to solve analytical questions, which cannot be solved by the apriori algorithm. The 4ft-Miner procedure also provides a version of the apriori algorithm based on strings of bits. The 4ft-Miner is a part of the LISp-Miner system, freely available at [24].
The goal of this paper is to compare the apriori algorithm to the GUHA approach for mining association rules. We have two implementations of the apriori algorithm – one in the arules package for R and one in the 4ft-Miner procedure. Finally, we also have the original GUHA approach provided by the 4ft-Miner procedure. Basic information on association rules and the arules package is provided in Section 2. The GUHA association rules as well as the main features of the 4ft-Miner procedure are introduced in Section 3. Comparison of two apriori algorithms and the GUHA approach is located in Section 4. Section 4 also shows that the arules approach to missing information can lead to the output of a relatively large amount of rules not true in many completions of the analysed data. Specific features of the 4ft-Miner procedure are introduced in Section 5. Related works are discussed in Section 6, conclusions and further work are in Section 7.
Association rules and apriori
A commonly used definition of the association rule comes from [3], see also [2]. Let
Example of a set 
Two basic measures of interestingness of association rules are used. The confidence is defined as
and the support is defined as
A task of mining association rules is a task of finding all association rules
An itemset is each set
It holds
The task of finding all association rules
In the first step, all itemsets In the second step, all rules
This is realised by the apriori algorithm, which has been described many times, see e.g. [2, 3]. The apriori algorithm substantially uses a simple fact that if
The algorithm implemented in the arules package deals with transaction data in a form of Boolean data matrices introduced in the right part of Fig. 1. The Boolean data matrices are represented in a sparse representation. This means that a vector of indices of the non-zero elements (row-wise starting with the first row), and pointers where each row starts are stored. Furthermore, data represented as a data matrix with non-binary attributes can be mined. An example is a data matrix
Representation of data matrix 
The data matrix
The arules package deals with data matrix
An association rule
and
The GUHA association rules are introduced in Section 3.1. Dealing with missing information in the ASSOC procedure is introduced in Section 3.2. Main features of the 4ft-Miner procedure are presented in Section 3.3.
GUHA association rules
The GUHA association rule is an expression
Basic Boolean attributes are created first. A basic Boolean attribute is an expression
The 4ft-quantifier
A condition related to a 4ft-quantifier
Examples of 4ft-quantifiers
Examples of 4ft-quantifiers
Data matrix 
Data matrix 
We assume

The 4ft quantifiers
and the 4ft-quantifiers
There are also important 4tft-quantifiers based on statistical hypothesis tests [15, 38], see 4ft-quantifier
Remember the data matrices
This way, each association rule
Secured X-extension
Missing information is a serious problem of data mining. A specific approach called secured X-extension is developed in [15, 35]and applied to the ASSOC procedure. Each attribute can get a value
Data matrix 
It is natural that in some cases the values of Boolean attributes derived from columns of a data matrix
An association rule An association rule Otherwise, the value of
This approach is introduced in [15] as the principle of secured X-extension, see also [35, 38]. The main features of this approach are summarized below.
Boolean attributes can be assigned values from
Values of derived three-valued attributes.
Three-valued attributes in data matrix 
This means that we deal with association rules concerning three-valued attributes with possible values

Here,
We are interested if a given association rule
for 4ft-quantifiers
Let us note that in many cases there are more than one secured 4ft-tables for a given nine-fold table
There are two additional ways of dealing with missing information related to the ASSOC procedure: the optimistic completion and the deleting of missing information [16, 38]. If we use the optimistic completion, then an association rule
If we use the deleting of missing information, then a value of an association rule
Let us note that the arules uses an approach to missing information that is different from each of the above-introduced (secured completion, optimistic completion and deleting). The arules approach can be described such that if
However, such ignoring of missing information means that no information about the value of a Boolean attribute
Ignoring and secured completion.
Let us emphasise that we introduce only those approaches to missing information, which are implemented in the arules or in the 4ft-Miner procedure. General approaches to missing information in data mining are presented in [2], additional approaches to dealing with missing information in data mining with association rules are introduced in [21, 30].
4ft-Miner – a part of the LISp-Miner system
The 4ft-Miner procedure is an implementation of the enhanced ASSOC procedure. It is one of nine GUHA procedures implemented in the LISp-Miner system. LISp-Miner is an academic system for knowledge discovery in databases (KDD for short) developed as rich and possibly an easy-to-use environment for the education of basic and advanced courses and for research in KDD [46, 47]. System architecture, an internal metadata storage model and user interface have been designed with this in mind. The LM Workspace module especially is meant for interactive use and repeated iterations among data preprocessing, modelling and result interpretation phases of the KDD process.
The aim was to offer rich possibilities to formulate analytical tasks, based on the GUHA method approach, with nine GUHA procedures currently available. There are also functions for further processing of found results, namely to compare them with consequences of already known items of domain knowledge and to filter them accordingly [44]. But there are also means implemented for a more automated approach to KDD using script language and specially designed modules for solving KDD tasks in the background [45, 48].
The goal of this paper is to compare the apriori and ASSOC approach to mining with association rules. We first have to outline 4ft-Miner options for the definition of a set of relevant association rules and to introduce principles of the ASSOC implementation. This is done in Sections 3.3.2 and 3.3.3. A slightly modified version of the apriori algorithm was implemented into the 4ft-Miner procedure for mining simple association rules. So we can compare the GUHA approach of depth-first walking through the searched space of potentially interesting relevant patterns to apriori breadth-first walking. This implementation is shortly introduced in Section 3.3.4. Additional features of the 4ft-Miner and the LISp-Miner approach to mining with association rules are then shortly introduced in Section 5.3.
Defining set of relevant GUHA association rules
The 4ft-Miner procedure mines for the GUHA association rules
The derived Boolean attributes
Let us consider a group
There are also very rich possibilities to define a set
Examples of 4ft-quantifiers are in Table 1, additional 4ft-quantifiers are in [15, 38]. It is important that a relevant 4ft-quantifier can be defined as a conjunction of several simple 4ft-quantifiers. A detailed description of 4ft-quantifiers implemented in the 4ft-Miner procedure is out of the range of this paper and is available in [46]. There are four possibilities to deal with missing information: secured completion, optimistic completion, deleting and ignoring of missing information. Descriptions of additional parameters specifying details of an output are out of the range of this paper.
The 4ft-Miner procedure generates and verifies all GUHA association rules
Principles of ASSOC implementation
Dealing with general Boolean attributes and with data with incomplete information, together with a necessity to quickly compute 4ft-tables and nine-fold tables, led to the employment of bitstrings as already described in [36], see also [41]. Each attribute
Cards of categories and X-card of the attribute 
Cards of Boolean attributes
It is important that the bit-wise Boolean operations
The relevant partial cedents are generated in a depth-first way. This can be outlined by a toy example concerning a set of relevant partial cedents created from attributes
A more detailed description of the 4ft-Miner algorithm and implementation is out of the range of this paper. The basic ideas come from [36], for more details see [41]. Let us note that there is an analogy to the important fact used in the apriori algorithm, i.e. if
A slightly modified version of the apriori algorithm was implemented into the 4ft-Miner procedure for mining simple association rules. Therefore, we can compare the GUHA approach of depth-first walking through the searched space of potentially interesting relevant patterns to breadth-first walking introduced in [3]. The format of association rules mined for by the apriori algorithm is more simple than the GUHA format, so the apriori algorithm could be only used in the following scenarios:
Handling of missing information is set to ignore, Coefficients of attributes consist of exactly one category, No other additional options to define set of relevant rules are used (see Section 5.3.1), Only the logical operation of conjuction is used between Boolean attributes and no negation of Boolean attributes is possible, and Minimal support criterion is included into task quantifiers.
The already implemented memory representation of data by bitstrings was used to quickly compute frequencies of Boolean attributes and their conjunctions. A sorted working list of bitstrings is initialised by inserting all cards of categories fulfilling the minimal support criterion. The bitstrings are sorted in descending order by number of 1’s in each of them. While the working list is not empty, the most frequent card at the top of the list is repeatedly removed from the list and used:
To prepare an antecedent corresponding to the currently processed bitstring (i.e. either a single literal, or a conjunction of two or more literals). This antecedent is combined progressively with every possible succedent (apriori allows for single literals in succedent only). All rules true in data are written to the output; To create bitstrings for all possible longer conjunctions consisting of the currently processed bitstring together with all the more frequent bitstrings representing one category and still in the working list. Newly-created bitstrings fulfilling the minimal support criterion are inserted at a proper position in the list to maintain sorting by frequencies.
To compensate for a large memory footprint of using bitstrings to compute frequencies of itemsets, a packed bitstring variant was implemented to represent sparse values in large datasets (more than 10 million rows). The apriori algorithm implementation in LISp-Miner allows to specify not only the maximal length of the antecedent, but its minimal length as well. Moreover, it is used separately for each cedent (in antecedent, succedent and/or in condition, see Section 5.3.2), so even conditional association rules could be generated. Moreover, the both approaches could be used simoultaneously to generate a single association rule (GUHA approach to generate part of the rule with a richer syntax and apriori to generate the simple part).
To preserve other LISp-Miner functionalities (e.g. partitioning the searched space of a task into smaller pieces to be processed in parallel or the pruning of results from logically derived rules), some memory structures had to be maintained even if the apriori branch is taken. This makes this implementation of apriori a little slower, see Section 4.3.
The goal of this section is to compare the apriori to GUHA approaches when solving the same tasks. We use two implementations of apriori – the first one provided by the arules package for R introduced in Section 2 and the second one implemented in the LISp-Miner system (and the 4ft-Miner procedure especially) as described in Section 3.3.4. The GUHA approach is introduced in Sections 3.3.2 and 3.3.3.
Principles of the comparison are in Section 4.1. We use the Adult data set also used in [12], a brief introduction is in Section 4.2. Solution times and the number of found rules for several association-rules-discovery tasks are presented in Section 4.3 to demonstrate scalability of all considered approaches for mining association rules. The results show remarkable differences between apriori and GUHA when dealing with missing information. Comments to this fact are located in Section 4.4. An important characteristic of a tool for data mining with association rules is its ability to deal with additional measures of the interestingness of rules. Obligatory use of both the minimal support and the minimal confidence in the arules package can lead to loss of interesting rules, see Section 4.5.
Principles of comparison
To obtain meaningful results, we need to compare only the algorithm cores, excluding unrelated steps (i.e. data preparations steps or user-interface related steps like progress-reporting) and excluding times spent in communications with other systems (e.g. DBMS). Thus, the data preprocessing phase was completed in advance both in the R console and in the LISp-Miner environment. Both compared systems supports logging with timestamps to get the most precise times. Both 32-bit and 64-bit versions are available in both systems. We compare the 32-bit versions.
We compare solutions times of the apriori function in the R console to solution times of command-line modules in LISp-Miner with all the progress-reporting and other user-interface features disabled. LISp-Miner offers both single-thread processing of tasks (by the LM TaskPooler module) and multi-thread parallel processing (by the LM SamePooler module). We provide both solution times.
Both systems have options to store results into a text file. Furthermore, LISp-Miner could store them into any DBMS accessible through ODBC. Anyway, storing of results was excluded from solution times because its duration is a pure function of HDD speed (in the first case) or the speed of the chosen DBMS and its ODBC database driver (in the second case).
Benchmark Adult data
We use the Adult data matrix to compare the core algorithms in terms of speed. This data matrix is also used in [12]. The Adult data matrix is a result of transformations of an original data matrix from the UCI machine learning repository. The Adult data matrix has 48 842 rows and 15 columns, i.e. attributes. We use nine categorical attributes and four metric attributes. Four metric attributes were transformed to ordinal attributes, see Table 2. Seven categorical attributes are nominal, see Table 3. Two categorical attributes are ordinal, see Table 4. If an attribute has missing values, then the number of missings is given in the corresponding row.
Ordinal attributes created from metric attributes
Ordinal attributes created from metric attributes
Nominal attributes – columns of the Adult data matrix
Ordinal attributes – columns of the Adult data matrix
The testing platform was the HP ProDesk with Intel i5-4590S (4 cores at 3 GHz), 4 GB RAM and HDD Toshiba 500 GB DT01ACA050 running Windows 7 Professional x64 with SP1. We have used the R system version of 3.2.3 and LISp-Miner system version of 27.00.01.
Results of comparison are in Table 5. Solutions times and the numbers of found association rules are presented for the apriori approach implemented in the arules package, for the apriori implemented in the 4ft-Miner and for the GUHA approach with secured completion of missing information implemented in the 4ft-Miner. There are two columns for the 4ft-Miner implementations – the first one for a single-thread solution using one processor marked ST and the second one for a multi-thread parallel solution using all available processor cores, which is marked MT.
Comparing R-arules and 4ft-Miner in the Adult data matrix
Comparing R-arules and 4ft-Miner in the Adult data matrix
Tasks inspired by the arules application described in [12] are used. This means that the Adult data matrix is analysed and association rules with antecedents of length from 1 to 10 are mined. Remember that we use 13 columns of the Adult, i.e. attributes introduced in Tables 2–4.
There are 17 variants of minimal support minS and minimal confidence minC used, see columns A and B. The rest of Table 5 presents solution times in seconds and the number of found rules for each combination of minS and minC and of a particular algorithm used. Solution times for apriori implementation in the arules package for R are in column C, solution times for 4ft-Miner apriori implementation are in column D (single-thread task solving) and in column E (multi-thread parallel task solving). The number of found rules is in column F (this number is the same for both apriori implementations). The number of found rules with secured completion used is in column G and is significantly different from column F, see Section 4.4. Solution times for the GUHA approach with secured completion are in column H (single-thread task solving) and in column I (multi-thread parallel task solving).
We can see that apriori implemented in the arules package (column C) is faster than apriori implemented in the 4ft-Miner (columns D and E). The higher number of output rules, the higher difference between both implementations. However, the time used by the 4ft-Miner is not much too longer, and it is only a fraction of the whole time necessary to solve a real task.
Regarding solutions times for the apriori algorithm in columns D and E, we have to stress that the implementation in the LISp-Miner system is meant as an add-on functionality provided just for a few cases when a task description is suitable for apriori (i.e. really simple, without missing information and does not use any of the rich-syntax possibilities the GUHA approach offers). It was implemented on the partial cedents level only, so no existing functionality of the LISp-Miner system is lost. Therefore, some optimisations regarding the whole association rule could not be used and some time is spent on preparing additional memory structures. It results in time overhead compared to the apriori implementation in the arules package.
Solution times for the secured completion are longer because the GUHA approach of the depth-first walking have to be used and simultaneously, it is necessary to maintain additional information to compute nine-fold tables. Thus, solution times in columns H and I are remarkably higher than in columns C, D and E. However, we state that arules results are misleading in the case of ignoring missing information, see Section 4.4. The time consumed by the secured completion is still acceptable, and it is only a fraction of time necessary to solve the whole analytical task concerning real data, which usually does contain missing information. Moreover, the GUHA approach makes it possible to use not only the secured completion but any of the available different types of handling of missing information (see Section 3.2.2), as well as dealing with general Boolean attributes and with conditional association rules. These additional features can be only hardly realised by apriori, see Section 5.
The secured completion of missing information seems to be useful even when solving a task that can be solved by the apriori algorithm. This is because some rules produced by the arules can be misleading. Last but not least, the secured completion decreases the usually very high number of output rules to only ones, which are surely true in any completion of the analysed data. In the last three rows of Table 5, we can see that the number of rules decreases by 13–25 percent. See Section 4.4 for more details.
We can see a big difference between apriori and secured completion of missing information in columns F and G of Table 5. There are many extra rules produced by the apriori algorithm. Each of these extra rules has a problem – there is a completion of the analysed data matrix Adult in which the rule is not valid. Let us consider minsup
Tables 
A secured four-fold table
In Table 5, we deal with the 4ft-quantifier
The rule with highest lift in the set Lift4 of rules.
However, if we use the secured approach for the lift, there is no rule with lift
Tables 
It holds
We can conclude that:
The association rule mining process usually suffers from a problem of a large number of output rules, The ignoring-of-missing-information approach used in the arules package produces rules, which do not satisfy the constraints (i.e. are false) in some completions of the analysed data matrix, the number of such rules can be relatively large, A value of lift of a rule produced by the arules package for a data matrix with missing information can be confusing, and The secured completion produces a lower number of rules, which surely satisfy the constraints (i.e. are true) in all possible completions of the analysed data matrix.
A more detailed description of dealing with missing information in the 4ft-Miner procedure is out of the scope of this paper, for details see namely [15, 38].
We present an example of loss of rules, which are interesting because of the high value of the lift measure. The loss is related to obligatory use of minimal values of both support and confidence in the arules package. The following instructions for R>rules <- apriori(Adult,parameter = list(support=0.05, confidence=0.9))>inspect(head(sort(rules, by = "lift"), n = 1)) result to 14 012 rules, the highest lift is 2.956264, and it belongs to the rule age=Young, relationship=Own-child, sex=Male, capital-loss=none, native-country=United-States => marital-status=Never-married , see Fig. 15. However, if we apply the 4ft-Miner procedure to search all rules satisfying
Rules with
,
and
Rules with
The highest lift for apriori; support 
The rule with the highest 
We can conclude that the obligatory use of minimal values of both support and confidence applied in the arules package can lead to remarkable loss of rules interesting for their high values of lift. It is practically not possible to estimate a degree of such loss. This is the same for additional measures of interestingness. This problem does not occur in the 4ft-Miner procedure, where conditions concerning various measures of interestingness can be freely combined.
The goal of this section is to present 4ft-Miner specific features as well as to point to possibilities of the LISp-Miner system to deal with association rules. An example of 4ft-Miner application, which cannot be practically solved by the arules package, is located in Section 5.1. This example is based on dealing with basic Boolean attribute
Applying general basic boolean attributes
First we show that the 4ft-Miner procedure can be used to search for segments of clients with a high chance to have a maximal gain. This is done in Section 5.1.1. In Section 5.1.2, we show that it is very difficult to solve this task by the arules package in the same way.
Segments of clients with extreme values of gain
An example of input parameters of the 4ft-Miner is in Fig. 17.
4ft-Miner input example.
These parameters mean that we search for all rules
The set
The attribute Age_ed5 was created by means of the LISp-Miner as a discretisation of the attribute Age into equidistant intervals of length 5. It has 15 categories
Categories of the attribute Age_ed5 and their frequencies.
The attribute Education has 16 categories, see Table 4. This means that the row Education(seq),1-4 B,pos specifies 16
The attribute Hours_per_week has four categories, see Table 4. The row Hours_per_week(subset),1-1 B,pos specifies the set
Categories of the attribute Capital_gain_exp and their frequencies are in Fig. 19.
Categories of the attribute Capital_gain_exp.
The row Capital_gain_exp(rcut), 1-8 B,pos in the column SUCCEDENT specifies a set
The run of the 4ft-Miner with parameters according to Fig. 17 gets 686 rules. We have used secured completion of missing information. Five rules with the highest lift are displayed in a short form in Fig. 20.
The highest lift – 12.787 has the rule
There are details for each rule analogous to that in Fig. 16. Basic information concerning the run of the 4ft-Miner are in the upper left part of Fig. 20. The 4ft-Miner procedure has generated and verified 30 831 256 rules and skipped many more, see Section 3.3.3. The run took 25 minutes and 26 seconds. The minimal confidence in the output set of 686 rules introduced above is 0.048.
The task solved in the previous section can also be theoretically solved by the R package arules if we accept the ignoring approach to missing information. However, there is still a problem concerning basic Boolean attributes with more than one category,
We can use the attribute Age_ed5 to get all sequences of categories of length 1. Then we need attributes
Family
of 10 attributes
Family
4ft-Miner output example.
A similar family
Family
The category Preschool#1st-4th of the attribute
The row Capital_gain_exp(rcut), 1-8 B,pos in the column SUCCEDENT specifies a set of eight basic Boolean attributes
Family
Having prepared all 28 attributes belonging to families
Then we can use 28 attributes introduced in Tables 7–9 together with eight additional attributes, see rows Hours_per_week(subset), 1-1, …, Workclass(subset), 1-1 in the column ANTECEDENT in Fig. 17. However, this leads to a huge number of redundant rules. The rule
The problem of redundant rules in the arules can be avoided such that we use a family of runs of the arules package on data matrices with attributes
The minimal confidence in the output set of 686 rules introduced above is 0.048 and the minimal support is 0.002. A run of the arules package task with these minimal values in the Adult data matrix results in 2 783 322 rules and requires 11 seconds. Thus, 800 runs of similar tasks requires more than 146 minutes. Moreover, it is necessary to prepare families of attributes
An additional possibility is to try to combine both the approaches outlined above. We can conclude that a substitution of the run of the 4ft-Miner procedure specified in Fig. 17 by applications of the arules is very laborious.
We show two examples of additional reasonable applications of the 4ft-Miner procedure, which can only hardly be substituted by the arules package. We use the two following analytical questions Q1 and Q2.
Are there any segments of clients specified by combinations of attributes Age, Education, Hours_per_week, Marital_status, Native_country, Occupation, Race, Relationship, Sex, Workclass such that clients from this segment have both large income and extreme capital gain?
Are there any segments of clients specified by combinations of attributes Age, Education, Hours_per_week, Marital_status, Native_country, Occupation, Race, Relationship, Sex, Workclass such that clients from this segment have large income or extreme capital gain?
We consider both Q1 and Q2 reasonable since segments of clients having both large income and extreme capital gain and also segments of clients having large income or extreme capital gain are interesting segments.
Question Q1 can be solved such that we find a set of all GUHA association rules
Definitions of relevant succedents for Q1 and Q2.
The corresponding run of the 4ft-Miner resulted in a set of 190 rules. The rule Education(Prof-school)
Question Q2 can be solved in a similar way; we deal with GUHA association rules
The run of the 4ft-Miner resulted in a set of 174 rules. The highest lift 1.535 has the rule
The results of the four above-introduced runs of the 4ft-Miner can be transformed into answers to questions Q1 and Q2. We used the secured completion of missing information, which gives more precise results than ignoring missing information as in the arules package. However, even if we use the ignoring approach implemented in the arules package, it is not easy to get equivalent results. The only way is to use suitable families of attributes and/or arules tasks similar to those introduced in Section 5.1.2.
We briefly introduce some additional features of the 4ft-Miner procedure to deal with association rules. We outline several, yet not mentioned, options to tune the set of relevant rules in Section 5.3.1. Then we briefly introduce conditional association rules in Section 5.3.2, which can be also mined by the 4ft-Miner.
The 4ft-Miner procedure is a part of the LISp-Miner system which involves eight other GUHA procedures [18, 46]. Two of them deal with interesting couples of GUHA association rules. The procedure SD4ft-Miner [42] mines for couples of conditional GUHA association rules expressing differences between two subsets of rows of the analysed data matrix. The procedure Ac4ft-Miner [43] mines for very general action rules [6]. However, a more detailed description of these procedures is out of the scope of this paper.
Additional options to define set of relevant rules
There are several additional options to tune a set of relevant rules. There is a possibility to define coefficients – cyclical sequences. Cyclical sequences can be used for days of the weak and similar attributes with cyclical categories. If we have an attribute Day with categories Sun, Mon, Tue, Wed, Thu, Fri, Sat, then there are the following basic Boolean attributes – sequences of length 3: Day(Sun,Mon,Tue), Day(Mon,Tue,Wed), Day(Tue,Wed,Thu), Day(Wed,Thu,Fri), Day(Thu,Fri,Sat). However, there are two additional basic Boolean attributes if we ask to generate cyclical sequences of length 3: Day(Fri,Sat,Sun) and Day(Sat,Sun,Mon).
An additional possibility is to automatically generate negations of basic Boolean attributes. This way, attributes
There is also a possibility to define sets of equivalent attributes. Then, a maximum of one attribute from each set of equivalence can occur in a rule. The set {Age, Age_ed5} is an example of a useful set of equivalence.
We also mention a possibility to mark each attribute as basic or remaining. It is important that each partial cedent (see Section 3.3.2) has to contain at least one basic attribute. If we are interested in rules, where the attribute Education is always present, then we can mark this attribute as basic and all other attributes as remaining. This can be applied e.g. to the partial cedent Client, see Fig. 17.
Conditional association rules
The 4ft-Miner procedure also mines for conditional GUHA association rules. A conditional GUHA association rule is an expression
if there is a row of a data matrix
Data matrices 
However, there is no general relation between truthfulness of a rule
there are rules
there are rules
A simple proof follows.
Let us have rules
for for
Let us have rules
for for
We can conclude that mining with conditional association rules can produce additional interesting results.
Association rules were introduced as expressions
There are various attempts to generalise the association rules
This can be formally written as
The association rules of the form
An important problem is mining association rules with numerical data. Association rules concerning data with numerical attributes are defined in [9]. Let
The above-introduced enhancement of association rules leads to rules, which can be understood as GUHA association rules. Let us also mention an additional recent implementation of the ASSOC procedure, which mines for very general GUHA association rules, see [31].
There is also a paper comparing performance of the arules implementation of apriori and the 4ft-Miner procedure [52]. However, this comparison is flawed because of several principal faults. First of all, an unsuitable module of the LISp-Miner was chosen for a classification task, the aim of which is to produce many (millions) of association rules without any regard to its confidence. The 4ft-Miner is a GUHA procedure for exploratory analysis, where users (interactively) interpret the found results. Task parameters should have been therefore used to get a reasonable number (tens or hundreds) of association rules in the results or results should have been filtered in the sense of Section 7.2. Secondly, the solution times presented in [52] are hugely skewed by the (inappropriately) chosen MS Access as a DBMS to store results through ODBC. This is incomparable to storing results to plain text files as chosen for the arules. So presented solution times for 4ft-Miner do not measure the algorithm efficiency, but efficiency of the ODBC and the DMBS used. Further, the comparison in [52] does not take into account the proper handling of missing information, despite the used data matrix having a large number of missing values. It compares the arules (completely ignoring missing values by design) to 4ft-Miner with X-categories enabled and thus undertaking all the necessary steps for a proper handling of missing information as introduced in Section 3.2.1 and discussed in Section 4.4. Finally, the command-line modules should have been used and compared to the arules run from the command-line (and not the interactive LM Workspace module with much time spent in progress-reportings and other user-interface related functions).
There are additional ways to generalise association rules, which do not lead to GUHA association rules and are not relevant to this paper. Let us mention generalised association rules based on taxonomies introduced in [50] and their enhancement using the fuzzy approach [20]. An approach to provide a notion of importance or weight to individual items of association rules is introduced in [27]. A short overview of such additional approaches is also available in [27].
Important implementations of the GUHA ASSOC procedure are based on bit-strings (sometimes called bitmaps). There are also several approaches to using bitmaps in data mining with association rules. An application of bitmaps in the apriori algorithm is described in [23], the original apriori introduced in [3] is used. An overview of additional attempts to applying bitmap techniques in the association rule algorithms is provided in [23]. In [53], an application of bitmaps to mining maximal weighted frequent patterns is described. There are also various approaches to using bitmaps for building indexes used in mining maximal frequent itemsets, see e.g. [28, 49]. However, all these approaches differ from that used in the implementation of the ASSOC procedure.
Let us also emphasise that there are papers [32, 33, 51] concerning applications of constraint programming for itemset mining, which leads to itemsets satisfying various constrains. Paper [8] deals with rules based on first order logic and mathematical statistics. The confirmation measure is defined in this paper, a new 4ft-quantifier can be defined on the basis of the confirmation measure. Let us note that the observational logical calculi are introduced in [15]. They can be seen as modifications of a predicate calculus – only finite data structures are allowed and generalised quantifiers are added. GUHA association rules are formulas of an observational calculus of association rules [38]. An additional relevant topic is mining association rules in multiple relations [7]. This is related to mining GUHA association rules in many sorted observational calculi, see Section 16.8 in [38].
Conclusions and further work
Conclusions
Two approaches to data mining with association rules are compared. The first one is based on the apriori algorithm introduced in [3] and the second is the ASSOC procedure [15, 38] dealing with GUHA association rules. The conclusions follow.
Definitions and available theoretical background
The definitions of association rules used in both approaches are introduced, see Sections 2 and 3.1. It can be concluded that GUHA association rules are substantially more general than the association rules related to the apriori algorithm.
There are many papers and books dealing with association rules introduced in [3], their enhancements and the related apriori algorithm. Several of them are mentioned in Section 6. The concept of GUHA association rules comes from the 1960s. There is a solid theoretical background for GUHA association rules. The theory deals namely with classes of association rules, deduction rules and dealing with missing information, see [15, 38] and Sections 3.1, 3.2. Section 7.1.3 summarizes some results on the secured completion of missing information developed in [15].
Performance of algorithms
Performance of three algorithms for mining with association rules is discussed. The first one is apriori implemented in the arules package and described in Section 2. The second one is the 4ft-Miner procedure which implements the ASSOC procedure, see Sections 3.3.1–3.3.3. The third one is the apriori implementation, which is part of the 4ft-Miner, see Section 3.3.4. We can conclude that the ASSOC implementation in the 4ft-Miner is slower than apriori implemented in the arules. However, this is related to the fact that the 4ft-Miner keeps information necessary to apply secured completion to missing information [15, 38].
It is important that the time consumed by the 4ft-Miner is still acceptable and it is only a fraction of time necessary to solve the whole analytical task concerning real data, which usually contain missing information. In addition, the apriori implemented in the arules package produces misleading rules when dealing with missing information, see Section 7.1.3.
Dealing with missing information
Handling missing information used in the arules implementation of apriori algorithm produces a large number of rules (about 10–25% in some of solved tasks), which are confusing. It means that such rules are surely false in at least one possible completion of the analysed data matrix. However, in many cases there are many more than just one completion in which the produced rules are false. Moreover, a value of lift of a particular rule ranges in a large interval if we consider all possible completions of the analysed data. The same is true for confidence and also for additional measures of interest. Several examples and more details are in Section 4.4.
The core of the problem is related to the fact that arules treates the situation when we have no information about the value of a Boolean attribute
The 4ft-Miner uses the secured completion of missing information and does not produce the confusing rules produced by the arules.
Obligatory minimal confidence and support
Minimal values of both the support and confidence are obligatory parameters in the arules implementation of apriori. This can result in a loss of some rules interesting by high values of lift, see Section 4.5. This is not a concern for the 4ft-Miner procedure, where conditions for measures of interestingness can be freely combined.
Set of relevant GUHA association rules
There are very large sets of GUHA association rules, which can be derived from attributes – columns of an analysed data matrix. The 4ft-Miner procedure has efficient options to specify a set of relevant rules to be generated and verified. Among these options there are partial cedents and various types of coefficient
Several examples are in Sections 5.1.1, 5.2 and 5.3.1. Section 5.1.2 shows that some possibilities of defining a set of relevant GUHA association rules can be theoretically substituted by ASSOC families of apriori tasks. However, preparation and solving of such families is very time consuming, even if done by a script.
Conditional rules and couples of rules
The general form of the GUHA association rules makes sense to deal with conditional association rules, see Section 5.3.2. The 4ft-Miner procedure mines also for conditional association rules. The 4ft-Miner procedure is a part of the LISp-Miner system, which involves eight additional GUHA procedures [18, 46]. Two of them deal with interesting couples of GUHA association rules. It is possible in this way to obtain answers to new types of analytical questions.
Further work
A crucial problem of data mining with association rules is a very large number of output rules. Most of output rules are uninteresting consequences of already known items of domain knowledge. This problem is even greater when dealing with syntactically richer GUHA association rules. However, the syntactical richness makes it possible to formulate useful deduction rules, which can be used to reduce the number of output GUHA association rules. Various theoretical results have been achieved, they can be seen as a logic of association rules [38]. An original approach to the use of domain knowledge in mining association rules has been introduced in [37] and tested in [44]. The formal FOFRADAR framework has been developed in order to make it possible to formally describe the process of data mining with domain knowledge and GUHA association rules [39]. The idea is to deal with formalised items of domain knowledge corresponding to more complex patterns than single association rules. The following principles are used:
Each given item of domain knowledge is mapped into a set of simple association rules in co-operation with a domain expert. This set is further expanded using logical deduction into a set of all association rules, which can be considered as consequences of the given item of knowledge. Resulting sets of rules – consequences of given items of domain knowledge – are then used to interpret results of a data mining procedure.
All necessary steps of dealing with such items of domain knowledge are formally described by FOFRADAR and are supported by the LISp-Miner system, part of which is also the 4ft-Miner procedure [41, 44]. However, their applications involve elaborate operations in several modules of the LISp-Miner system. Thus the LMCL (LISp-Miner Control Language) scripting language has been developed [45, 48]. The LMCL makes it possible to describe necessary operations and run them automatically. First experiments with this approach are described in [45].
Further work is planned as a continuation of the above-introduced activities in the field of automation of dealing with domain knowledge in data mining with GUHA association rules.
Footnotes
Acknowledgments
The work described here has been supported by funds of institutional support for long-term conceptual development of science and research at the Faculty of Informatics and Statistics of the University of Economics, Prague and by the internal grant agency of UEP under IGA 26/2011.
