TFI-Apriori : Using new encoding to optimize the apriori algorithm

Abstract

In this paper we propose a new optimization for Apriori-based association rule mining algorithms where the frequency of items can be encoded and treated in a special manner drastically increasing the efficiency of the frequent itemset mining process. An efficient algorithm, called TFI-Apriori, is developed for mining the complete set of frequent itemsets. In the preprocessing phase of the proposed algorithm, the most frequent items from the database are selected and encoded. The TFI-Apriori algorithm then takes advantage of the encoded information to decrease the number of candidate itemsets generated in the mining process, and consequently drastically reduces execution time in candidate generation and support counting phases. Experimental results on actual datasets – databases coming from applications with very frequent items – demonstrate how the proposed algorithm is an order of magnitude faster than the classical Apriori approach without any loss in generation of the complete set of frequent itemsets. Additionally, TFI-Apriori has a smaller memory requirement than the traditional Apriori-based algorithms and embedding this new optimization approach in well-known implementations of the Apriori algorithm allows reuse of existing processing flows.

Keywords

Frequent pattern mining association rule mining apriori knowledge discovery data mining

1. Introduction

Data mining is a process of discovering previously unknown, though useful, knowledge from data in large databases. Discovery of association rules is an important database mining problem. The problem of association rules mining (ARM) over basket data was introduced in [1]. An association rule is a set of transactions, each transaction being a set of literals (called items), is an expression of the form $X\to Y$ , where $X$ and $Y$ are each a set of items. The intuitive meaning of such a rule is that transactions of the database containing $X$ tend to contain $Y$ as well. For example, “30% of transactions that contain milk also contain cheese; 2% of all transactions contain both items” is equivalent to association rule: Buy (milk) $\to$ Buy (cheese). Here 30% and 2% are referred to as the confidence and the support of the rule, respectively. The goal is to find all association rules that satisfy the user-specified minimum support and minimum confidence.

The problem of discovering all association rules can be decomposed into two sub-problems: first, finding the frequent itemsets, and second, generating the association rules. The support for an itemset is the number of transactions that contain the itemset. Itemsets with a support value above the minimum support are called the frequent itemsets. The main time-consuming task of every ARM algorithm is to discover the sets of items that frequently appear together (i.e., the frequent itemsets). In addition to finding association rules, mining frequent items is a fundamental and crucial step employed to solve problems in areas such as correlations [26], sequential patterns [27, 28], classification [29], maximal frequent patterns [30, 31, 32], and frequent closed patterns [33, 34, 35].

The Apriori algorithm [2] is one of the most popular data mining approaches for finding frequent itemsets from transactional datasets. The Apriori algorithm is the main basis and genesis of many other well-known algorithms and implementations. It is worth noting that the main challenge encountered by researchers in frequent itemset mining has been to reduce the execution time.

Many algorithms have been proposed to find frequent itemsets from very large databases. The number of database scans required for the task has been reduced from a number equal to the size of the largest itemset in Apriori [2], to typically just a single scan in modern ARM algorithms such as FP-Growth [7]. Frequent patterns discovered via mining processes are not only interesting themselves but also useful to other data analysis and mining tasks, including associative classification, clustering, cube computation and analysis, gradient mining, and multi-dimensional discriminant analysis [18].

Bodon [14] published one of the best implementations of the Apriori algorithm regarding efficiency by showing that the use of efficient data structures and an effective implementation is very important in improving the performance of the Apriori algorithm. He proposed a fast Apriori implementation using the trie data structure [23] instead of a hash tree which was used in classical approaches [41]. In work presented in this paper, we have adapted Bodon’s trie-based Apriori source code as the base of our works and implementation with the classical Apriori algorithm being optimized to reduce execution time. In our optimization, the top frequent items from the dataset are selected and encoded.

Item ordering has been considered in some recent works [14, 39] by showing that the performance of the Apriori algorithm may be faster if the items are sorted during the first iteration. For example, Bodon [14] shows that the search is faster if the order of items corresponds to the frequency order. As the exact frequency order is known only after the first scan of the whole database, everything is at hand for building the trie with the frequency ordering, and all this accomplished without imposing any overhead on the original algorithm.

In [39], Goethals shows that it is advantageous to recode frequent items according to the ascending order of their frequencies (i.e., the inverse of the frequency codes) because candidate generation is faster this way. In [40], Bayardo has considered dynamic ordering in his work. In dynamic ordering, and in each step of the algorithm, items are rearranged to improve the efficiency,

In our proposed algorithm, the set of most frequent items have been considered and are encoded to improve the efficiency of the Apriori algorithm. The proposed algorithm takes advantage of the encoded information for its mining process. Using this approach, the execution time can be reduced drastically. Experimental results show the efficiency of our new algorithm (called TFI-Apriori) in real datasets.

The paper is organized as follows. Section 2 provides a summary of previous well-known algorithms for association rule mining. Section 3 presents our proposed algorithm, TFI-Apriori. Section 4 presents the experimental results of our implementation and Section 5 concludes the paper.

2. Backgrounds and related works

Association rule mining was first introduced in 1993 [1], and since then many algorithms with different approaches have been developed for the ARM problem. In this section, some of prior works are concisely mentioned and, moreover, those algorithms that form the basis for our algorithm are reviewed.

2.1 Association rule mining algorithms

Hipp et. al. [4] provides a general survey on efficient mining of association rules in transactional and relational databases. AIS [21], SETM [22] and Apriori [2] can be considered as the first generation of association rule mining algorithms with the Apriori algorithm being by far the most well-known of all. Extending the basic Apriori approach, in AprioriTID [2] (an extension of the Apriori) Instead of relying on the raw database, each transaction is internally represented by current candidates it contains. In AprioriHybrid both Apriori and AprioriTID approaches are combined [2].

Some optimizations have been made on the Apriori algorithm, yielding DHP [24], a hash-based technique reducing the size of candidate itemsets where each itemset is hashed into a corresponding bucket by using an appropriate hash function. DHP efficiently controls the number of candidate 2-itemsets and uses the transaction trimming idea to lower the size of the dataset at each phase of the Apriori-based frequent itemset mining.

DIC [6] is a further variation of the Apriori Algorithm, softening the strict separation between counting and generating candidates. Generally, after the $k_{th}$ interval, the $(k+1)$ -candidate itemsets are generated and counted together with the previous itemsets in the later intervals. Read the database from the end to the beginning and counts those candidate itemsets that were not completely counted. The actual number of database scans is based on the interval size, meaning that for small intervals, the generation of all itemsets is done in first scan and final counting will be done in the second scan.

The Partition algorithm [5] is an Apriori-like algorithm that exploits set intersections to determine support values. The Partition algorithm divides the database into several partitions overcoming the memory problem. Each partition processed independently in main memory (RAM). When the frequent itemsets for all database parts are determined, another extra scan is done to remove those locally frequent itemsets which are not globally frequent.

Another algorithm that reduces the database activity substantially is the sampling method [3] whose idea is picking a random sample, using which to find all association rules, and then verify the results with the rest of the database. In [8, 9] some other variants and revisions of sampling methods have been proposed.

One of the most outstanding algorithms to find frequent itemsets is the FP-growth [7] (Frequent Pattern Growth) that succeeded in eliminating the candidate generation process [25]. In a preprocessing step, FP-growth derives a highly condensed representation of the transaction data, the FP-tree, generated by counting occurrences. FP-tree-based mining adopts pattern fragment growth and partitioning-based divide-and-conquer methods to avoid generation of too many candidate sets and scale down the search space respectively. Next, to decompose the mining task into some smaller ones for mining in conditional databases, a partitioning-based divide-and-conquer method is used [25].

To enhance the FP-growth algorithm regarding performance, Lin et al. conceived the IFP-growth [42] algorithm, distinguishing itself from FP-growth in exploiting an address-table structure by lowering the complexity of forming the entire FP-tree. Second, it makes use of a new structure by the name of FP-tree+ to eclipse the need for building conditional FP-trees recursively. Lastly, through employing the address-table and the FP-tree+, the proposed algorithm requires less memory and has better performance compared to FP-tree-based algorithms.

As Han et al. stated “Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in a horizontal data format (i.e., {TID: itemset}), where TID is a transaction. Alternatively, mining can also be performed with data presented in a vertical data format (i.e., {item: TID_set}).” in [18]. Zaki [17] conducted work based on [18] and proposed the Equivalence CLASS Transformation (Eclat) algorithm. Derived from the mentioned Eclat algorithm are the two new methods presented by Sandy Moens [43] by the names of Dist-Eclat [43] and Big-FIM [43] specifically tailored for mining large datasets.

The CBAR algorithm proposed by Tsay and Chiang uses a cluster table to load the databases into main memory [36]. By performing the support count of the cluster table, CBAR saves time and the need to scan all the transactions stored in the cluster table.

Dong and Han proposed an effective algorithm entitled BitTableFI [37] employing the BitTable data structure to compress database horizontally for quick candidate itemsets generation and vertically to increase support count speed. A fresh way to go about the database was presented by Wu et al. who treated the database as a stream of data and then found frequent patterns by scanning the database only once [38].

In recent years, several algorithms and implementations have been conceived making use of different concepts such as the concept of non-drivable items proposed in [11], TBAR (Tree-based association rule mining) introduced in [20], and other efficient implementations of Apriori proposed by Borgelt [10] and Bodon [14, 15, 16].

Another recent work is a new structure called LP-tree [45], which is composed of array forms accessed linearly rather than exploiting pointers. With this applied to the mining process through the suggested algorithm, it improves the efficiency of frequent pattern mining.

Even with all mentioned by far, the line of innovations to mine frequent patterns does not end here, whether by incorporating a new data structure to optimize the process or by applying Genetic algorithms such as follows.

To overcome time, space, and computational complexity, One-Time Scan plus Pyramid (OTSP) [46] scans the database only once, compartmentalizing the data items with respect to the degree of combination, and then storing them along with the occurrence frequency rates.

To handle data mining in distributed systems by exploiting genetic algorithms and semantic ontology, Kannan et al. [47] proposed the approach of maintaining an agent container at all locations. On query submission, the method identifies a set of locations where the information is available and generates a group of agents to fetch the information from the different nodes within the network that points to the retrieved results that are then evaluated with genetic algorithms for relevancy of query submission.

In recent years, association rule mining algorithms are widely used on social media datasets [49]. Gole and Tidke proposed a new method, ClustBigFIM, which works on a MapReduce framework for mining large datasets to extract meaningful information from social media data in the form of associations and sequential patterns [49].

2.2 The apriori algorithm

The Apriori algorithm proposed by Agrawal in [2] is the basis for many other (FIM) algorithms.

In the first pass of the Apriori algorithm, the occurrence of each item is counted, and the items with insufficient support counts are removed to create $L_{1}$ , i.e., the collection of large (frequent) 1-itemsets.

A following pass, $k$ , consists of two steps:

•
The frequent $(k-1)$ -itemsets collection $L_{(k-1)}$ found in the previous pass is used to generate $C_{k}$ , i.e. the list of candidates $k$ -itemsets; which is a superset of the set of all large $k$ -itemsets. A candidate is generated from each two large $(k-1)$ -itemsets which are similar in their first $k$ -2 items. Then, the candidates that have infrequent subsets are removed from the set of candidates.
•
The database is scanned, and the support count for each candidate itemset in $C_{k}$ is determined. By removing items with support counts less than the minimum support from $C_{k}$ , the large k-itemsets ( $L_{k})$ are generated.

Efficiently finding all the k-subsets of a potential candidate itemset requires that all frequent itemsets in $L_{k}$ are stored in a hash-table; whereas candidate itemsets are stored in a hash-tree. A node of the hash-tree either contains a list of itemsets (a leaf node) or a hash table (an interior node). Further information about the hash-tree data structure is presented in [2].
2.3 Trie-based apriori

Another commonly used data structure that is more efficient than the hash-tree is the trie (or prefix-tree) [23]. The trie data structure was originally introduced by [12, 13] as a rooted labeled tree. As Ansari et al. stated at [44], “In the FIM (Frequent itemset mining) process, each label is considered as an item. The root is defined to be at depth 0 and a node at depth $d$ can point to nodes at depth $d+1$ . A pointer is also referred to as an edge or a link. Each node represents an item sequence that is the concatenation of labels of the edges that are on the path from the root to the node so that a path from the root to each node represents an itemset. In this implementation, the value of each node is the support count for the itemset it represents. In Fig. 1 a trie is displayed alongside the path to the node representing itemset {A, B} whose support count is 7.

Figure 1.

A sample trie.

For each transaction record $T$ in the database, the trie (containing the candidate itemsets) is recursively traversed and the value of each leaf node will be incremented if $T$ contains the itemset represented by that node. The traverse of the trie is driven by elements of $T$ . At the end, nodes with a support count less than the required minimum support will be pruned. In the candidate generation phase, a leaf node is added to its left siblings to create new valid candidates, eliminating the need for further processing. Figure 2 shows a trie structure before and after the new candidates have been generated.

Figure 2.

Candidate generation on a trie data structure.

3. TFI-Apriori

In many real “market basket” databases, some items appear much more frequently than others such as bread and mineral water in customers’ purchases of a supermarket or the high-speed violation fine in a database of driving fines. Given how these high-frequency items would be more involved in the data mining process they are to be manipulated in a special manner. In this paper, we propose a new efficient algorithm called TFI-Apriori. The two important and time-consuming phases of the Apriori algorithm are the candidate generation and support counting phases. In our proposed algorithm, we try to reduce the number of candidate itemsets, and adapt a faster method for counting the support of the candidate itemsets, thus scaling down the run-time of the algorithm considerably. TFI-Apriori is majorly concerned with selecting and decoding the top frequent items (TFIs), then combining them to yield itemsets (i.e., TFITs).

TFI-Apriori consists of two phases: the preparation phase and the Apriori-based phase. In the preparation phase, the most frequent items are found, extracted and encoded in a bit vector so their occurrences can be checked by fast bit manipulations.

In the Apriori algorithm, the recurrence of most frequent items in candidate itemsets in each phase can be considerable compared to the rest of the items the proposed algorithm ignores said frequent items in the candidate itemset generation phase and adapts the support counting phase to count the frequency of itemsets with and without those frequent items in a single iteration. This way, we can save some processing time and memory with a smaller trie (number of candidate itemsets) without losing any results. In some real datasets, the efficiency rate of the TFI-Apriori algorithm exceeds 80% of that in the original Apriori.

Based on Ansari et al. definition, the following defines terms used in TFI-Apriori:

“Let $I=$ { $i_{1}$ , $i_{2}$ , $\ldots$ , $i_{m}$ } be the items in a certain domain. An itemset is a subset of $I$ and a $k$ -itemset is an itemset with $k$ items from $I$ . A database (DB) is a list of transactions where each transaction $T$ is also a subset of $I$ . For any itemset $X$ and transaction $T$ we say $T$ contains X if and only if $X\subseteq T$ . For any itemset, $X$ Support(X), is the number of transactions in DB which contain $X$ . For a given minimum support threshold $s$ , $X$ is large (or frequent) if Support(X) $\geqslant$ s $\times D$ , where $D$ is the number of transactions in database DB; In the following, $L$ denotes the frequent itemsets in DB, and $L_{(k)}$ denotes the large $k$ -itemsets in $L$ . The main task of the association rule mining algorithm is to find the globally frequent itemsets $L$ ” [48].

NS – Number of Selected Items Parameter: The NS parameter represents the number of top most frequent items that must be selected from the database in the preparation phase.

TF – Top Frequent Items List: The NS parameter represents the number of frequent Items that have appeared more frequently in comparison with others within the database. The items within the database are sorted by their frequency, and the first NS items are selected to create the TF list. Each item in the TF list is called a Top Frequent Item or a TFI.

TI –List of Itemsets that can be created from TF: TI represents a list consisting of all the combinations of the itemsets that can be created from TF. The length of each itemset within the TI is between 0 (an empty itemset) to NS (maximal itemset), where NS is the number of selected items. In this paper, we call each member of TI, a TFIT (itemset from TFIs). The number of TFITs is 2 ${}^{NS}$ . As NS is usually small, the number of TFITs (size of TI) is limited.

For example, if NS $=$ 4, and if 3, 5, 7, 10 are items with a higher support threshold, then each of these items 3, 5, 7 and 10 is a TFI, hence TF $=$ {3, 5, 7, 10}. Each itemset that can be created by TFIs is a TFIT (e.g., {3, 5, 7} and {3, 10}). As we know, we can create 2 ${}^{4}$ itemsets from TF, hence, $|$ TI $|$ $=$ 2 ${}^{4}$ $=$ 16. Moreover, TI $=$ {{}, {3}, {5}, {7}, {10}, {3,5}, {3,7}, {3,10}, {5,7}, {5,10}, {7,10}, {3,5,7}, {3,5,10}, {3,7,10}, {5,7,10}, {3,5,7,10}}.

Figure 3.

Example 1. (a) Original database and support count for items. (b) Reduced database after the preparation phase, TF and TI lists.

RDB – Reduced Database: The database created in the preparation phase of TFI-Apriori. In the reduced database, TFIs are replaced with a bit vector representing their existence in each transaction. To create RDB, we remove the top frequent items from each transaction in the original database and store their presence into a single attribute as a bit vector. This new field called SNT (Specific Number for Transaction) is the power-sum of the indices of the corresponding TFIs within the TF list. For example, if TF $=$ {{4}, {7}, {8}, {11}} and the items 4, 8 and 11 are included in transaction $t$ , then these items will be deleted from the transaction $t$ and just 2 ${}^{0}$ $+$ 2 ${}^{2}$ $+$ 2 ${}^{3}=$ 13 will be stored in the SNT field of the new transaction $t$ in RDB. In the support-counting phase, we use the SNT field to find out which items of the TF list are included in the transactions. For example, if in a transaction $t$ , $t$ . SNT $=$ 15, this means that transaction $t$ includes the items at indices 0, 1, 2 and 3 in the TF list. Example 1 gives an illustration for building the RDB along with the rest of the preparation phase (first phase of TFI-Apriori).

Example 1: Figure 3a represents a transactional database DB. In the preparation phase, the algorithm counts the support of 1-itemsets and selects the corresponding TFIs. Considering NS parameter 2, the algorithm creates the TF and TI lists and subsequently scans the DB to create the new reduced transactional database RDB. Figure 3b shows RDB, TF and TI lists for this example.

Figure 4.

An example of the TFISup array representation.

In the second phase of TFI-Apriori, using RDB instead of the original database and separating TFIs from the rest of the items will have effects in candidate generation and support counting phases.

In candidate generation, as TFIs are not included in the database anymore, they will not be added to candidate itemsets. Instead, we add a new support counter list in each node of the trie data structure called TFISup to keep track of TFIs and not lose any results in comparison with the original Apriori. Each TFISup entity represents the combination of the respected itemset and one of the TFITs where TFITs are all possible combinations of TFIs. Figure 4 shows the use of the TFISup array. In the support-counting phase, the support for each candidate itemset and its combinations with TFITs will be calculated together and stored in the TFISup array.

In the support-counting phase, we do not need to determine the existence of TFIs for each part of scanning the database. In other words, the existence of TFIs in each transaction is calculated once in the preparation phase and will be used every time. As we expect the TFIs to be frequent in transactions, this will save a lot of processing time in database scanning.

Sections 3.1 and 3.2 describe phases 1 and 2 of the proposed algorithm.

3.1 Preparation phase

The preparation phase is described by Algorithm 1. In this algorithm, after counting and calculating the support of 1-itemset (frequency of items), NS frequent items that appear more than others are selected and TF is constructed (Lines 1–4). In the next step, TI is constructed (Lines 5–11). TI will contain the set of all possible itemsets that can be produced from TF.

In the next step, RDB is built by replacing the TFIs contained in each transaction with a coded number that describes the existence of the corresponding TFIs (Lines 12–21). Here, the transactional database is read again, and for each transaction $t$ , the TF list is searched for the TFIs that are contained in $t$ . Then the corresponding TFIs are deleted from $t$ , and the index information is added to the new record under the SNT field. The index information is the sum of the indices of the corresponding TFIs within the TF list.

To calculate the index information, a bit vector is designed (Line 18). By the end of the preparation phase, a new reduced database (RDB) will be constructed.

Algorithm 1: Preparation Phase

Inputs:

NS: Number of Selected Items parameter

DB: transactional database

Outputs:

RDB: Reduced transactional database

1) Read DB and calculate support of items, and sort them anticlimactic.

2) TF

=

{}

3) FOR (i

=

0; i

<

NS; i

++

)

4) TF [i]

=

items[i]

5) TI

=

{}

6) For (i

=

0; i

<

power (2, NS); i

++

) {

7) TITemp

=

{}

8) For (j

=

0; j

<

NS; j

++

)

9) If (power (2, j) & i)

10) TITemp.add (TF[j])

11) TI.add (TITemp)}

12) RDB

=

{};

13) For every transaction t in DB {

15) For all items

x

in t

16) for (i

=

0; i

<

NS; i

++

)

17) if (TI[i]

==

x

){

18) t.SNT

+=

power (2, i)

19) t.Delete (x)}

21) RDB.add (t)}

3.2 Apriori-based phase

This section describes the second phase of our proposed algorithm, TFI-Apriori, which is similar to the Apriori algorithm and is a modified version of classic Apriori. However, in our proposed algorithm, a reduced database (RDB) is mined for the frequent itemsets in contrast with the Apriori algorithm which uses a transactional database. Similar to the Apriori algorithm, TFI-Apriori mines the complete set of all the frequent itemsets. TFI-Apriori is represented in pseudo code style by Algorithm 2. TFI-Apriori uses the following trivial lemma to find all the frequent itemsets.

Lemma 1: If transaction t contains itemset A and itemset B, t contains itemset {A $+$ B}. Where, {A $+$ B}, means the composition of itemset A and B.

By using Lemma 1, after being assured that candidate itemset $c$ is included in transaction $t$ and furthermore, transaction $t$ also contains TI [i], then it can be concluded that itemset {c $+$ TI [i]} will also be included in $t$ . TFI-Apriori uses this lemma to find all the frequent itemsets.

The first part of the Apriori-based phase is the support counting part. In this part, an array called the TFISup with size 2 ${}^{NS}$ has been used to save the support count of each candidate Itemset $c$ . Each entry of the TFISup array will indicate the support count of itemset { $c$ $+$ one of the TFITs}, where a TFIT is an itemset that can be created from the top frequent items (TFs). Using the “and” bit vector operation, the algorithm can easily determine which entry of the TFISup array belongs to which itemset of TI (i.e., the TFISup array represents the TI list). Therefore, c.TFISup[0] stores the support count of itemset $c$ , and c.TFISup[2 ${}^{NS}$ ] stores the support count of itemset { $c+$ TI [2 ${}^{NS}$ ]}. Although the size of TFISup is 2 ${}^{NS}$ , and TFISup is added to each trie node (candidate itemset), this would not increase the actual memory footprint of TFI-Apriori. First, NS is a small number, and second, by removing TFIs from trie (made possible by using the TFISup array) the number of trie nodes will be decreased drastically, and this compensates for the larger node sizes.

In each pass $K$ of the support counting phase of Algorithm 2 (Lines 5–9), when candidate itemset $c$ is included in transaction $t$ , for every itemset $x$ (where $x$ is the $i_{th}$ member of TI), if $x$ is included in $t$ , then c.TFISup[i] is increased by one. The t.SNT value calculated from the preparation phase (i.e., Algorithm 1) is used to indicate which TFIT (TI members) is included in t.

Algorithm 2: TFI-Apriori Phase:

Inputs:

RDB: created in preparation phase

TI: created in preparation phase

NS: Number of Selected Items parameter

MinSupp: Minimum support threshold

Output:

Frequent itemsets

1) TISize

=

pow(2, NS)

2) For (k

=

1; L

{}_{k}

=

{}; k++){

3) C

{}_{k}=

apriori_generation (L

{}_{k-1})

// if k==1 C

{}_{k}=

frequent items

4) L

{}_{k}=

{}

5) For each transaction t in RDB {

6) For each itemset c from C

{}_{k}

that is included in t

7) for (i

=

0; i

<

TISize; i++)

8) if ((t.SNT & i)

==

9) c.TFISup[i]++}

10) For all itemsets

c

C_{k}

do {

11) For (i

=

0; i

<

TISize; i++)

12) If (c.TFISup[i]

>

MinSup)

13) Display (c

+

TI [i])

14) If (c.TFISup[0]

>

MinSup)

15) L

{}_{k}=

{}_{k}+

c}}

16) For each itemset

I

in TI list

17) if (Support(

I)

>

MinSup)

18) Display (

I)

In the next step of pass K of Algorithm 2 (Lines 10–14), first the frequent itemsets calculated in this pass are displayed, and then the infrequent itemsets are pruned from C ${}_{k}$ , and thus generating L ${}_{k}$ . In displaying frequent itemsets phase, for each itemset $c$ we must consider the whole of array c.TFISup. For every entry of this array (Lines 11–13), if c.TFISup[i] $>$ MinSup, the algorithm will display itemset { $c$ $+$ TI[i]} in its output because it is frequent.

The pruning step of candidate itemsets in the TFI-Apriori algorithm is similar to classic Apriori. Only, for candidate itemset $c$ , the pruning is performed by considering $c$ .TFISup[0] and not all of the c.TFISup array. Therefore if c.TFISup[0] $<$ MinSup, $c$ is deleted, because if $c$ is not a frequent itemset then neither is {c $+$ TI[i]} (Lines 14–15). In the proposed algorithm, there is no need to consider TFIs in $L_{k}$ , as they have already been considered in the previous phase. Therefore, as can be seen in Algorithm 2, by adding frequent items to $L$ (frequent itemset list) the TIs are ignored.

Finishing pass $K$ , the algorithm increments $K$ by one and continues its process again until there is no frequent itemset in L ${}_{k}$ .

Up to this point, itemsets in TI are not considered yet. In Lines 16–18, the algorithm displays these itemsets if they satisfy the minimum support threshold.

Figure 5.

TFI-Apriori frequent itemset mining with minimum support at 50% on the dataset in Fig. 3.

Other parts of our algorithm are the same as the original Apriori and have the same characteristics. Example 2 shows an illustration of the new Algorithm by using the RDB generated in Example 1.

Example 2: This example shows the operation of TFI-Apriori by using the RDB, TF, and TI represented in Example 1. The support threshold is set to 50% in this example. In the first pass, the TFI-Apriori algorithm creates the candidate itemset C ${}_{1}=$ {B}, {D}, {E}, {F}, {G}, {H}.

Then by deploying the transaction Specific numbers (SNT) calculated from the preparation phase, the algorithm calculates the support of each candidate itemset and the support of its combination with the TI list. For example, for candidate itemset {B}, the support counts of {B}, {B, A}, {B, C} and {B, A, C} are calculated, where TI $=$ {{}, {A}, {C}, {A, C}}. Figure 5a shows the result for this phase at $k=$ 1. All itemsets that have frequency bigger than the support threshold will be displayed. So, the itemsets that are displayed at the end of pass 1 are: {B}, {B, A}, {B, C}, {F}, {F, A}, {F, C}, {F, A, C}, {G}, {G, A}.

After completing pass 1 of the TFI-Apriori algorithm, L ${}_{1}$ will contain {B}, {F}, and {G}.

In pass 2 of the TFI-Apriori algorithm, C ${}_{2}$ is created. In our example C ${}_{2}=$ {B, F}, {B, G}, {F, G}. Then, the support counting phase is performed and the candidate frequent items are generated (Fig. 5)). The displayed itemsets at the end of pass 2 include: {B, F}, {B, F, A}, {B, G}, {B, G, A}, {F, G}, {F, G, A}. And L ${}_{2}=$ {B, F}, {B, G}, {F, G}. In pass 3, C ${}_{3}=$ {B, F, G} is created. And after the support counting phase, the data in Fig. 5c will be generated. The displayed itemsets in this pass are {B, F, G}, {B, F, G, A}, and L ${}_{3}=$ {B, F, G}. Eventually the algorithm will terminate in this pass and Itemsets {A}, {B}, and {A, B} that are in TI will be displayed finally.

As explained earlier, TFI-Apriori uses the assumption that TFIs appear very often in the database and subsequently in frequent itemsets to offer improvements over the classical Apriori in two aspects. First, retrieving TFIs from the database becomes more efficient using RDB. Second and more importantly, by ignoring TFIs in the candidate itemset generation, the number of itemsets that need to be generated and tested becomes significantly smaller. This reduction is illustrated in Fig. 6. Figure 6a shows the trie constructed by classic Apriori for the data of Example 1. Figure 6b shows the trie constructed by TFI-Apriori for the same database generating the same results as classic Apriori ultimately. In both representations, infrequent items from pass 1 have been ignored.

We may generate and test some itemsets that we already know are infrequent from previous phases. Consider the following example: In Fig. 5, itemset {B, A, C} is not frequent, and this fact can be seen in the results of phase 1. There is no need to calculate the support count of itemsets {B, F, A, C} and {B, G, A, C} in phase 2 and {B, F, G, A, C} in phase 3 because they contain infrequent itemset {B, A, C} and therefore are infrequent. Algorithm 2 presented above does not regard this fact. We could, however, apply the following changes to the apriori_generation function and Algorithm 2 to deal with this matter,

a) Change the pruning part of the apriori_generation function that has been used in Algorithm 2 to:

1) for each itemset c from Ck 2) for all (k-1)-subsets s of c do 3) If (sLk-1) then 4) Delete c from C ${}_{k}$ 5) else for (i $=$ 0; i $<$ TISize; i++) 6) If (s.TFISup[i] $<$ minsup) 7) c.TFISup[i] $=$ $-$ 1; 8) else 9) C1.TFISup[i] $=$ 0;

b) Line 8 of Algorithm 2 must be rewritten as follows:

8) if (c.TFISup[i] $>$ $=$ 0 AND (t.SNT & i) $==$ i)

This would not provide any enhancement because the extra support counting for the useless itemsets is not an expensive operation (it is just a conditional increment operation for each transaction) and replacing that with a conditional statement would not offer any difference in speed and just makes the algorithm more complicated.

3.3 Memory Usage of the new data structure

As seen, a new data structure has been used in the TFI-Apriori algorithm. Instead of using the original and ordinary trie, the algorithm uses a trie with an array along each node referred to as the TFISup Array. Each TFISup entity represents the combination of the respected itemset and one of the TFITs. An example of the TFISup array has been shown in Fig. 4. As described in the previous sections, in the support-counting phase, the support for each candidate itemset and its combinations with TFITs will be calculated together and stored in the TFISup array.

At first, it may seem that using this array will increase the size of trie nodes and consequently the size of used data structure because each node of the trie, instead of saving a value for its respective itemset, saves and uses an array with size 2 ${}^{NS}$ , resulting in 2 ${}^{NS}$ values for each trie node instead of 1. For example, in Fig. 4 when NS $=$ 2, a trie node has 3 extra values of support counts in comparison with the old data structure.

However, by deleting TFIs from the trie, the size of the trie is reduced enormously, and the required information about TFIs and their combination with other itemsets are stored in the TFISup array. As evident in Fig. 6, some itemsets (i.e., some trie nodes) have been ignored in the new data structure, thus leading us to have a smaller trie in comparison with the original trie. But, as described before, each node must have a 2 ${}^{2}$ array instead of a value for each node. Therefore, each node of the trie has an array of size 4.

Figure 6.

Created Tries from represented database in Example 1. (a) With original Apriori. (b) With the TFI-Apriori algorithm. Infrequent Items (1-itemsets) have been ignored in Trie.

As seen in Fig. 6a, we need 27 trie nodes for frequent itemset mining. In Fig. 6b the needed trie nodes are equal to 7. This new data structure has 20 fewer trie nodes than the original data structure. However, each node in the new data structure needs to have an array of size 4 to save frequencies, and this leads to 7 $\times$ 3 $=$ 21 extra entries and for this example TFI-Apriori used 21 extra integers for the new trie data structure to do complete support counting, and has saved memory by using 20 fewer nodes.

We now must consider whether the new data structure is more memory efficient than the original trie. The answer to this question is not determinate and strongly depends on the specification of our database and the value of NS. In TFI-Apriori, memory efficiency behaves the same as time efficiency, and both depend on the database and selection of the NS parameter (it will be demonstrated in the experimental result phase). For example, if TFIs have been repeated very much in the database and consequently infrequent itemsets the new data structure will help us to save some memory in comparison with the original trie data structure as in this case, the high number of nodes will be ignored and will be embedded in the TFISup Array as described before. On the other hand, if the size of NS becomes very large, we end up having a big array for each trie node, thus wasting an immense amount of memory. In Section 4.2 there is an experimental study on memory usage of the new algorithm.

4. Experimental results

The TFI-Apriori and Apriori algorithms have been both implemented based on Bodon’s proposed data structure [14]. We have used Bodon’s source code and implementation of the original Apriori algorithm (ver. 2.4.9, date of issue: 11. March 2005)1

¹
http://www.cs.bme.hu/ $\sim$ bodon/en/apriori/.

and modified it to implement the TFI-Apriori algorithm. Both algorithms have been executed on different well-known real and synthetic datasets [19] with various minimum support thresholds. To evaluate the effect of the NS parameter on the efficiency of the proposed algorithm, for each dataset, TFI-Apriori was executed with different values for the NS p.

Figure 7.

Execution time comparison with different minimum supports.

Figure 8.

Memory usage with different minimum supports.

Figure 7 compares the execution time of the original Apriori with two different NS parameters of the TFI-Apriori. Figure 8 shows the memory consumption for two different test cases. Figures 9 and 10 present the effects of the NS parameter on the TFI-Apriori algorithm. Tables A.1–A.4 exhibit the run-time of our experiments on different datasets. In each table, the last column shows the best efficiency between different NS parameters in comparison with the original Apriori algorithm. Tables A.5–A.7 show the memory consumption of the proposed TFI-Apriori and the original Apriori algorithm on some datasets for different minimum supports and NS parameters. All tables are in Appendix A at the end of this paper.

Figure 9.

Effect of the NS parameter on run-time.

Note that in all the tables and figures NS $=$ 0 represents the original Apriori algorithm.

4.1 Performance

As seen in Fig. 7a–c on some real datasets, TFI-Apriori had an outstanding performance in comparison with the original Apriori algorithm and, in most cases, the best results belong to a NS parameter between 5 and 7. For smaller minimum support values, the efficiency of the algorithm was more apparent. As can be seen from these results, the efficiency reached up to 90% in some cases. However, in the synthetic dataset T40I10D100K, there was not much gain in efficiency from the proposed algorithm. This lack of improvement was because, in the well-known synthetic datasets, there are no high-frequency items and this is shown in Fig. 7d. For example, and based on gained statistics, the frequency of the most frequent item in the synthetic dataset T40I10D100K is 7828, which is about 8%. This value is too small and negligible to be considered. The frequency of the most frequent itemset in the accident dataset is about 100% meaning that there is one item that is included in all transactions. Second, third, and fourth TFIs (Top frequent items after the first TFI) in the accident dataset have a frequency higher than 99%. The frequency of TFI in dataset pumsb and mushroom are more than 99%. The frequency of TFI in dataset pumsb-star is about 75%.

Figure 10.

Effect of NS parameter on memory usage.

4.2 Memory requirements

Figure 8 shows the memory consumption of the TFI-Apriori in comparison with the original Apriori, demonstrating that not only does the execution of the proposed algorithm not require extra memory allocation, but it also decreased in most cases (as described in Section 4.3). In the execution of TFI-Apriori, there is approximately a uniform relation between the memory consumption and the run-time efficiency.

4.3 Effect of the NS parameter

In test cases on real datasets, the NS parameter was an important variable to consider for the best performance. As evident in Fig. 9, in specific datasets, the NS parameter had a uniform impact on various minimum supports values. For example, for different minimum supports, the best result for dataset accident is NS $=$ 5. A simple heuristic, which we used to determine the best value for the NS, was used to run the algorithm on big minimum support values. This method is effective because the big minimum support values require less execution run-time.

Furthermore, the NS parameter is not only an important factor on the runtime of the algorithm, but also has a substantial effect on the amount of memory usage. Figure 10 shows the amount of memory required by the algorithm for different NS parameters, on various datasets. These datasets are the same datasets evaluated in Fig. 9. By comparison of Figs 9 and 10, it can be seen that the behavior of the NS parameter on a dataset in runtime and memory usage are approximately similar. Therefore, reduction of memory usage within the algorithm means a reduction on the number of trie nodes and consequently reduction of the processing time required to find frequent itemsets. Thus, we can say that by reducing the memory usage in any NS parameter, the runtime may be reduced.

4.4 Comparison with other implementations

Comparing TFI-Apriori along with its implementation to other well-known algorithms, such as FP-Growth and Eclat, would not be fair or have a benefit. It is known that an algorithm would perform differently given different implementations. Thus, a fair comparison of algorithms requires identical implementation, framework, and structure.

Also, TFI-Apriori is not a brand-new algorithm and is a modification and a revised version of Apriori. This new modification has been performed on Bodon’s implementation that is one of the best implementations of Apriori. So, the best comparison for our work is to compare the performance of TFI-Apriori with the original Apriori when two algorithms are implemented with the same framework. This comparison and performance evaluation has been provided in Section 4 of this paper.

5. Conclusions

In most databases, there are some items that appear much more frequently than in others. Changing our perspective on these very frequent items can lead to some major improvements in our data mining tasks. In this paper, we have proposed a new optimization that can be adopted in all Apriori-based frequent itemset mining implementations. The stated optimization has been performed on a trie-based implementation. In the preparation phase of our proposed algorithm, the most frequent items from the database are selected and encoded. Then the proposed TFI-Apriori algorithm takes advantage of the encoded information for its mining process. Experimental results show that major improvements in execution time have been gained for frequent itemsets mining in real datasets. However, in datasets where the difference between item frequencies is less notable (i.e., in cases where there are no items with very high frequencies in comparison with other items - usually not in real world), the proposed algorithm is not that much effective. Moreover, the experimental results show that not only does the execution of the proposed algorithm not require extra memory allocation in comparison with the Apriori-based algorithms, but it has reduced the required memory for its execution. The experimental results show a reduction in execution time without an increase in memory allocation in comparison to the Apriori-based algorithms. It is noted that memory required was reduced for this algorithm.

In the presented algorithm, the criterion for TFI selection has the largest frequency. This selection method leads to some significant results for real databases. We did not, however, consider other criteria for selection of TFIs, there may be other criteria for TFI selection that can be considered in the future. Such a study may also give more insight into finding the optimum NS values for any given input database and minimum support threshold. Moreover, we may be able to apply the presented idea to other data mining tasks and optimize them by treating some special items in a more efficient manner.

Footnotes

Appendix A – complete test results

Table A.1

Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset Accident with different minimum supports

[height=0.8cm,width=2.4cm]Min SupNS	0 (original)	1	2	3	4	5	6	7	Best Opt.
0.61731	20.390	19.985	19.797	19.515	19.828	19.984	20.218	20.687	4.30%
0.49973	30.734	25.625	22.937	21.750	22.015	22.313	23.156	25.578	29.2%
0.41154	107.281	66.437	45.640	35.906	34.843	34.844	38.219	51.500	67.5%
0.36450	202.609	116.859	73.093	51.968	48.125	46.671	49.610	67.000	77.0%
0.33511	276.640	155.593	93.890	64.093	57.375	54.203	56.625	74.625	80.4%
0.30571	458.296	216.078	126.218	82.188	71.813	64.656	70.766	86.093	85.9%
0.29513	565.109	247.937	141.500	90.609	78.359	69.796	75.406	90.953	87.6%

Table A.2

Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset mushroom with different minimum supports

[height=0.6cm,width=2.2cm]Min SupNS	0 (original)	1	2	3	4	5	6	7	8	Best Opt.%
0.14771	12.046	7.421	5.203	3.953	3.609	3.546	3.656	3.796	4.437	70.6
0.11078	23.015	12.718	8.406	6.438	5.828	5.178	5.110	5.703	6.531	77.8
0.04924	596.531	351.859	228.735	165.968	137.796	120.468	109.406	113.656	120.375	81.7
0.02954	1284.94	791.531	535.500	408.781	351.856	319.046	294.313	294.860	308.984	77.1
0.02462	2,151.86	1,365.77	949.703	748.093	625.875	596.531	564.328	561.750	578.406	73.9

Table A.3

Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset pumsb with different minimum supports

[height=0.6cm,width=2.2cm]Min SupNS	0 (original)	1	2	3	4	5	6	7	8	Best Opt.%
0.83594	9.265	8.671	8.500	8.265	8.343	8.359	8.343	8.281	8.515	10.8
0.77478	29.25	25.218	16.250	14.375	14.110	13.735	13.656	13.562	13.906	53.6
0.74419	78.843	50.765	36.718	34.468	31.765	29.046	23.750	23.500	23.812	70.2
0.71361	301	177.485	119.171	86.203	73.265	65.406	58.843	56.687	57.203	81.2
0.69322	624.921	365.343	238.109	169.640	140.765	123.156	107.718	101.515	102.000	83.8
0.67283	946.281	549.796	360.062	257.110	215.891	188.625	167.125	156.687	156.734	83.4
0.65652	1,423.16	819.875	529.265	375.437	312.390	271.578	238.515	224.953	225.406	84.2

Table A.4

Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset pumsb_star with different minimum supports

[height=0.6cm,width=2.2cm]Min SupNS	0 (original)	1	2	3	4	5	6	7	8	Best Opt.%
0.46895	5.859	5.812	5.734	5.703	5.890	5.906	6.109	6.625	7.356	2.7
0.40778	13.531	13.500	10.328	10.421	10.281	9.953	10.437	13.296	16.781	26.4
0.36700	44.125	44	28.843	28.406	27.375	26.375	29.171	48.187	74.156	40.2
0.34661	78.781	76.360	40.703	41.062	37.084	33.594	35.125	53.875	80.781	57.4
0.30584	296.140	296.593	154.468	169.812	125.968	89.015	72.734	86.203	117.328	75.4
0.28545	725.031	700.269	368.875	409.468	308.593	230.031	183.140	183.756	222.734	74.7

Table A.5

Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset T40I10D100K with different minimum supports

[height=0.8cm,width=2.4cm]Min SupNS	0 (original)	1	2	3	4	Best Opt.
0.03	8.796	9.078	9.328	10.015	11.734	$-$ 3.2%
0.01	31.031	31.547	30.062	32.125	38.656	3.1
0.009	98.125	100.421	72.906	79.281	95.000	25.7
0.008	126.984	124.375	100.546	108.250	129.531	20.8
0.0058	248.078	228.906	201.218	213.734	243.656	18.9
0.0037	449.297	419.812	393.641	407.453	460.703	12.4
0.00215	1,413.880	1,306.810	1,231.450	1,210.300	1,315.300	14.4

Table A.6

Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset T10I4D100K with different minimum supports

[height=0.8cm,width=2.4cm]Min SupNS	0 (original)	1	2	3	Best Opt.
0.00010	62.171	51.750	51.421	51.750	17.3
0.00007	106.562	91.125	90.250	90.984	15.3
0.00004	267.843	237.156	231.344	MEM_EXC	13.6
0.00003	415.156	492.765	MEM_EXC	MEM_EXC	$-$ 18.7

Table A.7

Comparison of memory usage (MB) for Original Apriori and TFI-Apriori on dataset Accident with different minimum supports

[height=0.8cm,width=2.4cm]Min SupNS	0 (original)	1	2	3	4	5	6	7
0.61731	4,424	4,304	4,200	4,104	4,028	3,968	3,920	3,892
0.49973	9,592	9,280	8,984	8,716	8,484	8,308	8,140	8,028
0.41154	21,928	21,112	20,412	19,768	19,212	18,744	18,336	17,976
0.36450	25,456	24,320	23,376	22,620	21,948	21,368	20,936	20,560
0.33511	26,140	24,696	23,636	22,768	22,092	21,512	21,072	20,764
0.30571	27,416	25,632	24,280	23,284	22,540	21,976	21,516	21,288
0.29513	27,908	25,976	24,468	23,432	22,632	22,080	21,612	21,424

Table A.8

Comparison of memory usage (MB) for Original Apriori and TFI-Apriori on dataset pumsb_star with different minimum supports

[height=0.8cm,width=2.4cm]Min SupNS	0 (original)	1	2	3	4	5	6	7	8
0.46895	2,588	2,576	2,548	2,544	2,536	2,552	2,596	2,712	2,960
0.40778	4,136	4,200	3,992	4,044	3,980	3,984	4,092	4,332	4,916
0.36700	6,380	6,548	6,012	6,208	6,172	6,436	7,200	8,836	12,364
0.34661	7,540	7,584	6,736	7,084	6,820	6,256	7,748	9,380	12,880
0.30584	12,000	12,676	10,100	10,708	9,468	9,520	10,160	11,752	15,304
0.28545	18,444	19,036	14,296	14,679	14,048	13,484	14,220	16,688	21,924

Table A.9

Comparison of memory usage (MB) for Original Apriori and TFI-Apriori on dataset pumsb with different minimum supports

[height=0.8cm,width=2.4cm]Min SupNS	0 (original)	1	2	3	4	5	6	7	8
0.83594	2,524	2,236	2,032	1,968	1,920	1,660	1,984	2,104	2,304
0.77478	8,576	5,876	4,520	3,712	3,352	3,353	3,332	3,532	3,924
0.74419	17,820	11,612	8,228	6,148	5,464	5,208	5,056	5,336	5,620
0.71361	37,712	24,072	15,888	11,656	9,556	8,796	8,572	8,592	9,548
0.69322	61,176	37,016	24,536	17,356	13,916	12,652	11,848	12,200	12,480
0.67283	91,700	56,548	35,608	24,880	19,216	17,328	16,165	16,004	16,980
0.65652	124,860	74,736	47,812	32,196	25,344	22,056	20,716	20,096	21,132

References

Agrawal

Imielinski

and Swami

, Mining association rules between sets of items in large databases, in: Proceedings 1993 ACM SIGMOD Intl. Conf. on Management of Data, Washington, DC, May 1993, pp. 207–216.

Agrawal

and Srikant

, Fast algorithms for mining association rules, in: Proceedings 20th Intl. Conf. on Very Large Data Bases, Santiago, Chile, September 1994, pp. 487–499.

Toivonen

Vijayaraman

T.M.

Buchmann

A.P.

Mohan

and Sarda

N.L.

, Sampling large databases for association rules, in: proc. 22nd International Conference on Very Large Data Bases, 1996, 134–145.

Hipp

Güntzer

Nakhaeizadeh

, Algorithms for association rule mining – a general survey and comparison, ACM SIGKDD Explorations Newsletter 2(1) (2000), 58–64.

Savasere

Omiecinski

and Navathe

S.B.

, An Efficient Algorithm for Mining Association Rules in Large Databases, in: Proceedings of the 21th International Conference on Very Large Data Bases, 11–15 September 1995, pp. 432–444.

Brin

Motwani

Ullman

J.D.

and Tsur

, Dynamic itemset counting and implication rules for market basket data, in: Proceedings 1997 ACM SIGMOD Intl. Conf. on Management of Data, Tucson, AZ, May 1997, pp. 255–264.

Han

Pei

Yin

and Mao

, Mining frequent pattern without candidate generation: a frequent pattern tree approach, Data Mining and Knowledge Discovery 8 (2004), 53–87.

Chen

Haas

and Scheuermann

, A new two-phase sampling based algorithm for discovering association rules, in: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002, pp. 462–468.

Parthasarathy

, Efficient progressive sampling for association rules, in: Proceeding of 2002 IEEE International Conference on Data Mining, 2002, pp. 354–361.

10.

Borgelt

, Efficient implementations of Apriori and éclat, in: Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI’03), volume 90 of CEUR Workshop proceedings, Melbourne, Florida, USA, 19 November 2003.

11.

Calders

and Goethals

, Mining all non-derivable frequent itemsets, in: Proc. Principles and Practice of Knowledge Discovery in Databases PKDD’02, volume 2431 of LNAI, Helsinki, FIN, Aug. 2002, Springer-Verlag, pp. 74–85.

12.

de la Briandais

, File searching using variable-length keys, in: Western Joint Computer Conference, March 1959, pp. 295–298.

13.

Fredkin

, Trie memory, Communications of the ACM 3(9) (1960), 490–499.

14.

Bodon

, A Fast Apriori Implementation, IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI’03), Melbourne, Florida, USA, 2003.

15.

Bodon

, Surprising Results of Trie-based FIM Algorithm, in: Goethals

Zaki

M.J.

and Bayardo

, editors, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Vol. 90 of CEUR Workshop Proceedings, 2004.

16.

Bodon

, A Survey on Frequent Itemset Mining, Technical Report, Budapest University of Technology and Economic, 2006.

17.

Zaki

M.J.

, Scalable algorithms for association mining, IEEE Transaction on Knowledge Data Engineering 12 (2000), 372–390.

18.

Han

Cheng

Xin

and Yan

, Frequent pattern mining: current status and future directions, Data Mining and Knowledge Discovery Volume 15, 2007, 55–86.

19.

FIMI, Frequent Itemset Mining Dataset Repository, http://fimi.cs.helsinki.fi/data/.

20.

Berzal

Cubero

J.-C.

Marın

and Serrano

J.-M.

, TBAR: an efficient method for association rule mining in relational databases, Data & Knowledge Engineering 37 (2001), 47–64.

21.

Agrawal

Imielinski

and Swami

, Mining association rules between sets of items in large databases, in: Proceedings 1993 ACM SIGMOD Intl. Conf. on Management of Data, Washington, DC, May 1993, pp. 207–216.

22.

Houtsma

and Swami

, Set-oriented data mining in relational databases, Data & Knowledge Engineering 17(3) (December 1995), 245–262.

23.

Bodon

and R’onyai

, Trie: An alternative data structure for data mining algorithms, Mathematical and Computer Modelling 38(7) (October 2003), 739–751(13).

24.

Park

J.S.

Chen

M.-S.

and Yu

P.S.

, Using a Hash-Based Method with Transaction Trimming for Mining Association Rules 9(5) (September 1997), 813–825.

25.

Kumar

Quinlan

J.R.

Ghosh

Yang

Motoda

McLachlan

G.J.

Liu

P.S.

and Zhou

Z.-H.

, Michael Steinbach, David J. Hand, Dan Steinberg, Top 10 algorithms in data mining, Knowledge and Information Systems 14(1) (2008), 1–37.

26.

Brin

Motwani

and Silverstein

, Beyond market baskets: generalizing association rules to correlations, in: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, 1997, pp. 265–276.

27.

Agrawal

and Srikant

, Mining sequential patterns, in: Proceedings of the 11th International Conference on Data Engineering, 1995, pp. 3–14.

28.

Srikant

and Agrawal

, Mining sequential patterns: generalizations and performance improvements, in: Proceedings of the Fifth International Conference on Extending Database Technology, 1996, pp. 3–17.

29.

Meretakis

Fragoudis

and Likothanassis

, Scalable association-based text classification, in: Proceedings of the Ninth International Conference on Information and Knowledge Management, 2000, pp. 5–11.

30.

Agarwal

R.C.

Aggarwal

C.C.

and Prasad

V.V.V.

, Depth first generation of long patterns, in: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 108–118.

31.

Burdick

Calimlim

and Gehrke

, MAFIA: a maximal frequent itemset algorithm for transactional databases, in: Proceedings of the 17th International Conference on Data Engineering, 2001, pp. 443–452.

32.

Bayardo

R.J.

, Efficiently mining long patterns from databases, in: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998, pp. 85–93.

33.

Pei

Han

and Mao

, CLOSET: an efficient algorithm for mining frequent closed itemsets, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000, pp. 11–20.

34.

Zaki

M.J.

and Hsiao

C.J.

, CHARM: an efficient algorithm for closed itemset mining, in: Proceedings of the Second SIAM International Conference on Data Mining, 2002.

35.

Tseng

F.C.

Hsu

C.C.

and Chen

, Mining frequent closed itemsets with the frequent pattern list, in: Proceedings of the IEEE International Conference on Data Mining, 2001, pp. 653–654.

36.

Tsay

Y.J.

and Chiang

J.Y.

, CBAR: an efficient method for mining association rules, Knowledge-Based Systems 18(2–3) (2005), 99–105.

37.

Dong

and Han

, BitTableFI: An efficient mining frequent itemsets algorithm, Knowledge-Based Systems 20(4) (May 2007), 329–335.

38.

Chiang

S.-W.

and Lin

J.-R.

, A new approach to mine frequent patterns using item-transformation methods, Information Systems 32(7) (November 2007), 1056–1072.

39.

Goethals

, Survey on frequent pattern mining, Technical report, Helsinki Institute for Information Technology, 2003.

40.

Bayardo

R.J.

, Efficiently mining long patterns from databases, Proceedings of ACM SIGMOD International Conference on Management of Data, ACM Press, Seattle, Washington, June 1998, pp. 85–93.

41.

Goethals

and Van den Bussche

, A priori versus a posteriori filtering of association rules, 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 1999.

42.

Lin

K.-C.

Liao

I.-E.

and Chen

Z.-S.

, An Improved frequent pattern growth method for mining association rules, Expert Systems with Applications 38 (2011), 5154–5161.

43.

Moens

Aksehirli

and Goethals

, Frequent Itemset Mining for Big Data, Universiteit Antwerpen, Belgium, IEEE Big Data, 2013.

44.

Ansari

Dastghaibifard

Keshtkaran

and Kaabi

, Distributed frequent itemset mining using trie data structure, IAENG International Journal of Computer Science 35(3) (2008), 377–381.

45.

Pyun

Yun

and Ryu

K.H.

, Efficient frequent pattern mining based on Linear Prefix tree Knowledge-Based Systems, vol. 55, 2014, pp. 125–139.

46.

Jeya Sheela

and Krishnaveni

S.H.

, A Novel Frequent Pattern Mining Approach with OTSP, 2015.

47.

Syed Kousar Niasi

and Kannan

, Multi agent Approach for Evolving Data Mining in Parallel and Distributed Systems using Genetic Algorithms and Semantic Ontology, 2014.

48.

Ansari

Sadreddini

M.H.

and Dastghaybifard

G.H.

, Distributed frequent Itemset Mining using Trie Data Structure, World Applied Sciences Journal 9(12) (2010), 1353–1358.

49.

Gole

and Tidke

, Frequent Itemset Mining for Big Data in social media using ClustBigFIM algorithm, Pervasive Computing (ICPC) 2015 International Conference on, 2015, pp. 1–6.

TFI-Apriori : Using new encoding to optimize the apriori algorithm

Abstract

Keywords

1. Introduction

2. Backgrounds and related works

2.1 Association rule mining algorithms

2.2 The apriori algorithm

3.2 Apriori-based phase

1 http://www.cs.bme.hu/ ∼ bodon/en/apriori/.

4.3 Effect of the NS parameter

4.4 Comparison with other implementations

5. Conclusions

Footnotes

Appendix A – complete test results

References

¹
http://www.cs.bme.hu/ $\sim$ bodon/en/apriori/.