Abstract
In this paper we propose a new optimization for Apriori-based association rule mining algorithms where the frequency of items can be encoded and treated in a special manner drastically increasing the efficiency of the frequent itemset mining process. An efficient algorithm, called TFI-Apriori, is developed for mining the complete set of frequent itemsets. In the preprocessing phase of the proposed algorithm, the most frequent items from the database are selected and encoded. The TFI-Apriori algorithm then takes advantage of the encoded information to decrease the number of candidate itemsets generated in the mining process, and consequently drastically reduces execution time in candidate generation and support counting phases. Experimental results on actual datasets – databases coming from applications with very frequent items – demonstrate how the proposed algorithm is an order of magnitude faster than the classical Apriori approach without any loss in generation of the complete set of frequent itemsets. Additionally, TFI-Apriori has a smaller memory requirement than the traditional Apriori-based algorithms and embedding this new optimization approach in well-known implementations of the Apriori algorithm allows reuse of existing processing flows.
Introduction
Data mining is a process of discovering previously unknown, though useful, knowledge from data in large databases. Discovery of association rules is an important database mining problem. The problem of association rules mining (ARM) over basket data was introduced in [1]. An association rule is a set of transactions, each transaction being a set of literals (called items), is an expression of the form
The problem of discovering all association rules can be decomposed into two sub-problems: first, finding the frequent itemsets, and second, generating the association rules. The support for an itemset is the number of transactions that contain the itemset. Itemsets with a support value above the minimum support are called the frequent itemsets. The main time-consuming task of every ARM algorithm is to discover the sets of items that frequently appear together (i.e., the frequent itemsets). In addition to finding association rules, mining frequent items is a fundamental and crucial step employed to solve problems in areas such as correlations [26], sequential patterns [27, 28], classification [29], maximal frequent patterns [30, 31, 32], and frequent closed patterns [33, 34, 35].
The Apriori algorithm [2] is one of the most popular data mining approaches for finding frequent itemsets from transactional datasets. The Apriori algorithm is the main basis and genesis of many other well-known algorithms and implementations. It is worth noting that the main challenge encountered by researchers in frequent itemset mining has been to reduce the execution time.
Many algorithms have been proposed to find frequent itemsets from very large databases. The number of database scans required for the task has been reduced from a number equal to the size of the largest itemset in Apriori [2], to typically just a single scan in modern ARM algorithms such as FP-Growth [7]. Frequent patterns discovered via mining processes are not only interesting themselves but also useful to other data analysis and mining tasks, including associative classification, clustering, cube computation and analysis, gradient mining, and multi-dimensional discriminant analysis [18].
Bodon [14] published one of the best implementations of the Apriori algorithm regarding efficiency by showing that the use of efficient data structures and an effective implementation is very important in improving the performance of the Apriori algorithm. He proposed a fast Apriori implementation using the trie data structure [23] instead of a hash tree which was used in classical approaches [41]. In work presented in this paper, we have adapted Bodon’s trie-based Apriori source code as the base of our works and implementation with the classical Apriori algorithm being optimized to reduce execution time. In our optimization, the top frequent items from the dataset are selected and encoded.
Item ordering has been considered in some recent works [14, 39] by showing that the performance of the Apriori algorithm may be faster if the items are sorted during the first iteration. For example, Bodon [14] shows that the search is faster if the order of items corresponds to the frequency order. As the exact frequency order is known only after the first scan of the whole database, everything is at hand for building the trie with the frequency ordering, and all this accomplished without imposing any overhead on the original algorithm.
In [39], Goethals shows that it is advantageous to recode frequent items according to the ascending order of their frequencies (i.e., the inverse of the frequency codes) because candidate generation is faster this way. In [40], Bayardo has considered dynamic ordering in his work. In dynamic ordering, and in each step of the algorithm, items are rearranged to improve the efficiency,
In our proposed algorithm, the set of most frequent items have been considered and are encoded to improve the efficiency of the Apriori algorithm. The proposed algorithm takes advantage of the encoded information for its mining process. Using this approach, the execution time can be reduced drastically. Experimental results show the efficiency of our new algorithm (called TFI-Apriori) in real datasets.
The paper is organized as follows. Section 2 provides a summary of previous well-known algorithms for association rule mining. Section 3 presents our proposed algorithm, TFI-Apriori. Section 4 presents the experimental results of our implementation and Section 5 concludes the paper.
Backgrounds and related works
Association rule mining was first introduced in 1993 [1], and since then many algorithms with different approaches have been developed for the ARM problem. In this section, some of prior works are concisely mentioned and, moreover, those algorithms that form the basis for our algorithm are reviewed.
Association rule mining algorithms
Hipp et. al. [4] provides a general survey on efficient mining of association rules in transactional and relational databases. AIS [21], SETM [22] and Apriori [2] can be considered as the first generation of association rule mining algorithms with the Apriori algorithm being by far the most well-known of all. Extending the basic Apriori approach, in AprioriTID [2] (an extension of the Apriori) Instead of relying on the raw database, each transaction is internally represented by current candidates it contains. In AprioriHybrid both Apriori and AprioriTID approaches are combined [2].
Some optimizations have been made on the Apriori algorithm, yielding DHP [24], a hash-based technique reducing the size of candidate itemsets where each itemset is hashed into a corresponding bucket by using an appropriate hash function. DHP efficiently controls the number of candidate 2-itemsets and uses the transaction trimming idea to lower the size of the dataset at each phase of the Apriori-based frequent itemset mining.
DIC [6] is a further variation of the Apriori Algorithm, softening the strict separation between counting and generating candidates. Generally, after the
The Partition algorithm [5] is an Apriori-like algorithm that exploits set intersections to determine support values. The Partition algorithm divides the database into several partitions overcoming the memory problem. Each partition processed independently in main memory (RAM). When the frequent itemsets for all database parts are determined, another extra scan is done to remove those locally frequent itemsets which are not globally frequent.
Another algorithm that reduces the database activity substantially is the sampling method [3] whose idea is picking a random sample, using which to find all association rules, and then verify the results with the rest of the database. In [8, 9] some other variants and revisions of sampling methods have been proposed.
One of the most outstanding algorithms to find frequent itemsets is the FP-growth [7] (Frequent Pattern Growth) that succeeded in eliminating the candidate generation process [25]. In a preprocessing step, FP-growth derives a highly condensed representation of the transaction data, the FP-tree, generated by counting occurrences. FP-tree-based mining adopts pattern fragment growth and partitioning-based divide-and-conquer methods to avoid generation of too many candidate sets and scale down the search space respectively. Next, to decompose the mining task into some smaller ones for mining in conditional databases, a partitioning-based divide-and-conquer method is used [25].
To enhance the FP-growth algorithm regarding performance, Lin et al. conceived the IFP-growth [42] algorithm, distinguishing itself from FP-growth in exploiting an address-table structure by lowering the complexity of forming the entire FP-tree. Second, it makes use of a new structure by the name of FP-tree+ to eclipse the need for building conditional FP-trees recursively. Lastly, through employing the address-table and the FP-tree+, the proposed algorithm requires less memory and has better performance compared to FP-tree-based algorithms.
As Han et al. stated “Both the Apriori and FP-growth methods mine frequent patterns from a set of transactions in a horizontal data format (i.e., {TID: itemset}), where TID is a transaction. Alternatively, mining can also be performed with data presented in a vertical data format (i.e., {item: TID_set}).” in [18]. Zaki [17] conducted work based on [18] and proposed the Equivalence CLASS Transformation (Eclat) algorithm. Derived from the mentioned Eclat algorithm are the two new methods presented by Sandy Moens [43] by the names of Dist-Eclat [43] and Big-FIM [43] specifically tailored for mining large datasets.
The CBAR algorithm proposed by Tsay and Chiang uses a cluster table to load the databases into main memory [36]. By performing the support count of the cluster table, CBAR saves time and the need to scan all the transactions stored in the cluster table.
Dong and Han proposed an effective algorithm entitled BitTableFI [37] employing the BitTable data structure to compress database horizontally for quick candidate itemsets generation and vertically to increase support count speed. A fresh way to go about the database was presented by Wu et al. who treated the database as a stream of data and then found frequent patterns by scanning the database only once [38].
In recent years, several algorithms and implementations have been conceived making use of different concepts such as the concept of non-drivable items proposed in [11], TBAR (Tree-based association rule mining) introduced in [20], and other efficient implementations of Apriori proposed by Borgelt [10] and Bodon [14, 15, 16].
Another recent work is a new structure called LP-tree [45], which is composed of array forms accessed linearly rather than exploiting pointers. With this applied to the mining process through the suggested algorithm, it improves the efficiency of frequent pattern mining.
Even with all mentioned by far, the line of innovations to mine frequent patterns does not end here, whether by incorporating a new data structure to optimize the process or by applying Genetic algorithms such as follows.
To overcome time, space, and computational complexity, One-Time Scan plus Pyramid (OTSP) [46] scans the database only once, compartmentalizing the data items with respect to the degree of combination, and then storing them along with the occurrence frequency rates.
To handle data mining in distributed systems by exploiting genetic algorithms and semantic ontology, Kannan et al. [47] proposed the approach of maintaining an agent container at all locations. On query submission, the method identifies a set of locations where the information is available and generates a group of agents to fetch the information from the different nodes within the network that points to the retrieved results that are then evaluated with genetic algorithms for relevancy of query submission.
In recent years, association rule mining algorithms are widely used on social media datasets [49]. Gole and Tidke proposed a new method, ClustBigFIM, which works on a MapReduce framework for mining large datasets to extract meaningful information from social media data in the form of associations and sequential patterns [49].
The apriori algorithm
The Apriori algorithm proposed by Agrawal in [2] is the basis for many other (FIM) algorithms.
In the first pass of the Apriori algorithm, the occurrence of each item is counted, and the items with insufficient support counts are removed to create
A following pass,
The frequent The database is scanned, and the support count for each candidate itemset in
Efficiently finding all the k-subsets of a potential candidate itemset requires that all frequent itemsets in
Another commonly used data structure that is more efficient than the hash-tree is the trie (or prefix-tree) [23]. The trie data structure was originally introduced by [12, 13] as a rooted labeled tree. As Ansari et al. stated at [44], “In the FIM (Frequent itemset mining) process, each label is considered as an item. The root is defined to be at depth 0 and a node at depth
A sample trie.
For each transaction record
Candidate generation on a trie data structure.
In many real “market basket” databases, some items appear much more frequently than others such as bread and mineral water in customers’ purchases of a supermarket or the high-speed violation fine in a database of driving fines. Given how these high-frequency items would be more involved in the data mining process they are to be manipulated in a special manner. In this paper, we propose a new efficient algorithm called TFI-Apriori. The two important and time-consuming phases of the Apriori algorithm are the candidate generation and support counting phases. In our proposed algorithm, we try to reduce the number of candidate itemsets, and adapt a faster method for counting the support of the candidate itemsets, thus scaling down the run-time of the algorithm considerably. TFI-Apriori is majorly concerned with selecting and decoding the top frequent items (TFIs), then combining them to yield itemsets (i.e., TFITs).
TFI-Apriori consists of two phases: the preparation phase and the Apriori-based phase. In the preparation phase, the most frequent items are found, extracted and encoded in a bit vector so their occurrences can be checked by fast bit manipulations.
In the Apriori algorithm, the recurrence of most frequent items in candidate itemsets in each phase can be considerable compared to the rest of the items the proposed algorithm ignores said frequent items in the candidate itemset generation phase and adapts the support counting phase to count the frequency of itemsets with and without those frequent items in a single iteration. This way, we can save some processing time and memory with a smaller trie (number of candidate itemsets) without losing any results. In some real datasets, the efficiency rate of the TFI-Apriori algorithm exceeds 80% of that in the original Apriori.
Based on Ansari et al. definition, the following defines terms used in TFI-Apriori:
“Let
NS – Number of Selected Items Parameter: The NS parameter represents the number of top most frequent items that must be selected from the database in the preparation phase.
TF – Top Frequent Items List: The NS parameter represents the number of frequent Items that have appeared more frequently in comparison with others within the database. The items within the database are sorted by their frequency, and the first NS items are selected to create the TF list. Each item in the TF list is called a Top Frequent Item or a TFI.
TI –List of Itemsets that can be created from TF: TI represents a list consisting of all the combinations of the itemsets that can be created from TF. The length of each itemset within the TI is between 0 (an empty itemset) to NS (maximal itemset), where NS is the number of selected items. In this paper, we call each member of TI, a TFIT (itemset from TFIs). The number of TFITs is 2
For example, if NS
Example 1. (a) Original database and support count for items. (b) Reduced database after the preparation phase, TF and TI lists.
RDB – Reduced Database: The database created in the preparation phase of TFI-Apriori. In the reduced database, TFIs are replaced with a bit vector representing their existence in each transaction. To create RDB, we remove the top frequent items from each transaction in the original database and store their presence into a single attribute as a bit vector. This new field called SNT (Specific Number for Transaction) is the power-sum of the indices of the corresponding TFIs within the TF list. For example, if TF
An example of the TFISup array representation.
In the second phase of TFI-Apriori, using RDB instead of the original database and separating TFIs from the rest of the items will have effects in candidate generation and support counting phases.
In candidate generation, as TFIs are not included in the database anymore, they will not be added to candidate itemsets. Instead, we add a new support counter list in each node of the trie data structure called TFISup to keep track of TFIs and not lose any results in comparison with the original Apriori. Each TFISup entity represents the combination of the respected itemset and one of the TFITs where TFITs are all possible combinations of TFIs. Figure 4 shows the use of the TFISup array. In the support-counting phase, the support for each candidate itemset and its combinations with TFITs will be calculated together and stored in the TFISup array.
In the support-counting phase, we do not need to determine the existence of TFIs for each part of scanning the database. In other words, the existence of TFIs in each transaction is calculated once in the preparation phase and will be used every time. As we expect the TFIs to be frequent in transactions, this will save a lot of processing time in database scanning.
Sections 3.1 and 3.2 describe phases 1 and 2 of the proposed algorithm.
The preparation phase is described by Algorithm 1. In this algorithm, after counting and calculating the support of 1-itemset (frequency of items), NS frequent items that appear more than others are selected and TF is constructed (Lines 1–4). In the next step, TI is constructed (Lines 5–11). TI will contain the set of all possible itemsets that can be produced from TF.
In the next step, RDB is built by replacing the TFIs contained in each transaction with a coded number that describes the existence of the corresponding TFIs (Lines 12–21). Here, the transactional database is read again, and for each transaction
To calculate the index information, a bit vector is designed (Line 18). By the end of the preparation phase, a new reduced database (RDB) will be constructed.
Apriori-based phase
This section describes the second phase of our proposed algorithm, TFI-Apriori, which is similar to the Apriori algorithm and is a modified version of classic Apriori. However, in our proposed algorithm, a reduced database (RDB) is mined for the frequent itemsets in contrast with the Apriori algorithm which uses a transactional database. Similar to the Apriori algorithm, TFI-Apriori mines the complete set of all the frequent itemsets. TFI-Apriori is represented in pseudo code style by Algorithm 2. TFI-Apriori uses the following trivial lemma to find all the frequent itemsets.
By using Lemma 1, after being assured that candidate itemset
The first part of the Apriori-based phase is the support counting part. In this part, an array called the TFISup with size 2
In each pass
In the next step of pass K of Algorithm 2 (Lines 10–14), first the frequent itemsets calculated in this pass are displayed, and then the infrequent itemsets are pruned from C
The pruning step of candidate itemsets in the TFI-Apriori algorithm is similar to classic Apriori. Only, for candidate itemset
Finishing pass
Up to this point, itemsets in TI are not considered yet. In Lines 16–18, the algorithm displays these itemsets if they satisfy the minimum support threshold.
TFI-Apriori frequent itemset mining with minimum support at 50% on the dataset in Fig. 3.
Other parts of our algorithm are the same as the original Apriori and have the same characteristics. Example 2 shows an illustration of the new Algorithm by using the RDB generated in Example 1.
Then by deploying the transaction Specific numbers (SNT) calculated from the preparation phase, the algorithm calculates the support of each candidate itemset and the support of its combination with the TI list. For example, for candidate itemset {B}, the support counts of {B}, {B, A}, {B, C} and {B, A, C} are calculated, where TI
After completing pass 1 of the TFI-Apriori algorithm, L
In pass 2 of the TFI-Apriori algorithm, C
As explained earlier, TFI-Apriori uses the assumption that TFIs appear very often in the database and subsequently in frequent itemsets to offer improvements over the classical Apriori in two aspects. First, retrieving TFIs from the database becomes more efficient using RDB. Second and more importantly, by ignoring TFIs in the candidate itemset generation, the number of itemsets that need to be generated and tested becomes significantly smaller. This reduction is illustrated in Fig. 6. Figure 6a shows the trie constructed by classic Apriori for the data of Example 1. Figure 6b shows the trie constructed by TFI-Apriori for the same database generating the same results as classic Apriori ultimately. In both representations, infrequent items from pass 1 have been ignored.
We may generate and test some itemsets that we already know are infrequent from previous phases. Consider the following example: In Fig. 5, itemset {B, A, C} is not frequent, and this fact can be seen in the results of phase 1. There is no need to calculate the support count of itemsets {B, F, A, C} and {B, G, A, C} in phase 2 and {B, F, G, A, C} in phase 3 because they contain infrequent itemset {B, A, C} and therefore are infrequent. Algorithm 2 presented above does not regard this fact. We could, however, apply the following changes to the apriori_generation function and Algorithm 2 to deal with this matter,
1) for each itemset c from Ck 2) for all (k-1)-subsets s of c do 3) If (sLk-1) then 4) Delete c from C
8) if (c.TFISup[i]
This would not provide any enhancement because the extra support counting for the useless itemsets is not an expensive operation (it is just a conditional increment operation for each transaction) and replacing that with a conditional statement would not offer any difference in speed and just makes the algorithm more complicated.
As seen, a new data structure has been used in the TFI-Apriori algorithm. Instead of using the original and ordinary trie, the algorithm uses a trie with an array along each node referred to as the TFISup Array. Each TFISup entity represents the combination of the respected itemset and one of the TFITs. An example of the TFISup array has been shown in Fig. 4. As described in the previous sections, in the support-counting phase, the support for each candidate itemset and its combinations with TFITs will be calculated together and stored in the TFISup array.
At first, it may seem that using this array will increase the size of trie nodes and consequently the size of used data structure because each node of the trie, instead of saving a value for its respective itemset, saves and uses an array with size 2
However, by deleting TFIs from the trie, the size of the trie is reduced enormously, and the required information about TFIs and their combination with other itemsets are stored in the TFISup array. As evident in Fig. 6, some itemsets (i.e., some trie nodes) have been ignored in the new data structure, thus leading us to have a smaller trie in comparison with the original trie. But, as described before, each node must have a 2
Created Tries from represented database in Example 1. (a) With original Apriori. (b) With the TFI-Apriori algorithm. Infrequent Items (1-itemsets) have been ignored in Trie.
As seen in Fig. 6a, we need 27 trie nodes for frequent itemset mining. In Fig. 6b the needed trie nodes are equal to 7. This new data structure has 20 fewer trie nodes than the original data structure. However, each node in the new data structure needs to have an array of size 4 to save frequencies, and this leads to 7
We now must consider whether the new data structure is more memory efficient than the original trie. The answer to this question is not determinate and strongly depends on the specification of our database and the value of NS. In TFI-Apriori, memory efficiency behaves the same as time efficiency, and both depend on the database and selection of the NS parameter (it will be demonstrated in the experimental result phase). For example, if TFIs have been repeated very much in the database and consequently infrequent itemsets the new data structure will help us to save some memory in comparison with the original trie data structure as in this case, the high number of nodes will be ignored and will be embedded in the TFISup Array as described before. On the other hand, if the size of NS becomes very large, we end up having a big array for each trie node, thus wasting an immense amount of memory. In Section 4.2 there is an experimental study on memory usage of the new algorithm.
The TFI-Apriori and Apriori algorithms have been both implemented based on Bodon’s proposed data structure [14]. We have used Bodon’s source code and implementation of the original Apriori algorithm (ver. 2.4.9, date of issue: 11. March 2005)1
http://www.cs.bme.hu/
Execution time comparison with different minimum supports.
Memory usage with different minimum supports.
Figure 7 compares the execution time of the original Apriori with two different NS parameters of the TFI-Apriori. Figure 8 shows the memory consumption for two different test cases. Figures 9 and 10 present the effects of the NS parameter on the TFI-Apriori algorithm. Tables A.1–A.4 exhibit the run-time of our experiments on different datasets. In each table, the last column shows the best efficiency between different NS parameters in comparison with the original Apriori algorithm. Tables A.5–A.7 show the memory consumption of the proposed TFI-Apriori and the original Apriori algorithm on some datasets for different minimum supports and NS parameters. All tables are in Appendix A at the end of this paper.
Effect of the NS parameter on run-time.
Note that in all the tables and figures NS
As seen in Fig. 7a–c on some real datasets, TFI-Apriori had an outstanding performance in comparison with the original Apriori algorithm and, in most cases, the best results belong to a NS parameter between 5 and 7. For smaller minimum support values, the efficiency of the algorithm was more apparent. As can be seen from these results, the efficiency reached up to 90% in some cases. However, in the synthetic dataset T40I10D100K, there was not much gain in efficiency from the proposed algorithm. This lack of improvement was because, in the well-known synthetic datasets, there are no high-frequency items and this is shown in Fig. 7d. For example, and based on gained statistics, the frequency of the most frequent item in the synthetic dataset T40I10D100K is 7828, which is about 8%. This value is too small and negligible to be considered. The frequency of the most frequent itemset in the accident dataset is about 100% meaning that there is one item that is included in all transactions. Second, third, and fourth TFIs (Top frequent items after the first TFI) in the accident dataset have a frequency higher than 99%. The frequency of TFI in dataset pumsb and mushroom are more than 99%. The frequency of TFI in dataset pumsb-star is about 75%.
Effect of NS parameter on memory usage.
Figure 8 shows the memory consumption of the TFI-Apriori in comparison with the original Apriori, demonstrating that not only does the execution of the proposed algorithm not require extra memory allocation, but it also decreased in most cases (as described in Section 4.3). In the execution of TFI-Apriori, there is approximately a uniform relation between the memory consumption and the run-time efficiency.
Effect of the NS parameter
In test cases on real datasets, the NS parameter was an important variable to consider for the best performance. As evident in Fig. 9, in specific datasets, the NS parameter had a uniform impact on various minimum supports values. For example, for different minimum supports, the best result for dataset accident is NS
Furthermore, the NS parameter is not only an important factor on the runtime of the algorithm, but also has a substantial effect on the amount of memory usage. Figure 10 shows the amount of memory required by the algorithm for different NS parameters, on various datasets. These datasets are the same datasets evaluated in Fig. 9. By comparison of Figs 9 and 10, it can be seen that the behavior of the NS parameter on a dataset in runtime and memory usage are approximately similar. Therefore, reduction of memory usage within the algorithm means a reduction on the number of trie nodes and consequently reduction of the processing time required to find frequent itemsets. Thus, we can say that by reducing the memory usage in any NS parameter, the runtime may be reduced.
Comparison with other implementations
Comparing TFI-Apriori along with its implementation to other well-known algorithms, such as FP-Growth and Eclat, would not be fair or have a benefit. It is known that an algorithm would perform differently given different implementations. Thus, a fair comparison of algorithms requires identical implementation, framework, and structure.
Also, TFI-Apriori is not a brand-new algorithm and is a modification and a revised version of Apriori. This new modification has been performed on Bodon’s implementation that is one of the best implementations of Apriori. So, the best comparison for our work is to compare the performance of TFI-Apriori with the original Apriori when two algorithms are implemented with the same framework. This comparison and performance evaluation has been provided in Section 4 of this paper.
Conclusions
In most databases, there are some items that appear much more frequently than in others. Changing our perspective on these very frequent items can lead to some major improvements in our data mining tasks. In this paper, we have proposed a new optimization that can be adopted in all Apriori-based frequent itemset mining implementations. The stated optimization has been performed on a trie-based implementation. In the preparation phase of our proposed algorithm, the most frequent items from the database are selected and encoded. Then the proposed TFI-Apriori algorithm takes advantage of the encoded information for its mining process. Experimental results show that major improvements in execution time have been gained for frequent itemsets mining in real datasets. However, in datasets where the difference between item frequencies is less notable (i.e., in cases where there are no items with very high frequencies in comparison with other items - usually not in real world), the proposed algorithm is not that much effective. Moreover, the experimental results show that not only does the execution of the proposed algorithm not require extra memory allocation in comparison with the Apriori-based algorithms, but it has reduced the required memory for its execution. The experimental results show a reduction in execution time without an increase in memory allocation in comparison to the Apriori-based algorithms. It is noted that memory required was reduced for this algorithm.
In the presented algorithm, the criterion for TFI selection has the largest frequency. This selection method leads to some significant results for real databases. We did not, however, consider other criteria for selection of TFIs, there may be other criteria for TFI selection that can be considered in the future. Such a study may also give more insight into finding the optimum NS values for any given input database and minimum support threshold. Moreover, we may be able to apply the presented idea to other data mining tasks and optimize them by treating some special items in a more efficient manner.
Footnotes
Appendix A – complete test results
Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset Accident with different minimum supports
[height=0.8cm,width=2.4cm]Min SupNS
0 (original)
1
2
3
4
5
6
7
Best Opt.
0.61731
20.390
19.985
19.797
19.515
19.828
19.984
20.218
20.687
4.30%
0.49973
30.734
25.625
22.937
21.750
22.015
22.313
23.156
25.578
29.2%
0.41154
107.281
66.437
45.640
35.906
34.843
34.844
38.219
51.500
67.5%
0.36450
202.609
116.859
73.093
51.968
48.125
46.671
49.610
67.000
77.0%
0.33511
276.640
155.593
93.890
64.093
57.375
54.203
56.625
74.625
80.4%
0.30571
458.296
216.078
126.218
82.188
71.813
64.656
70.766
86.093
85.9%
0.29513
565.109
247.937
141.500
90.609
78.359
69.796
75.406
90.953
87.6%
Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset mushroom with different minimum supports
[height=0.6cm,width=2.2cm]Min SupNS
0 (original)
1
2
3
4
5
6
7
8
Best Opt.%
0.14771
12.046
7.421
5.203
3.953
3.609
3.546
3.656
3.796
4.437
70.6
0.11078
23.015
12.718
8.406
6.438
5.828
5.178
5.110
5.703
6.531
77.8
0.04924
596.531
351.859
228.735
165.968
137.796
120.468
109.406
113.656
120.375
81.7
0.02954
1284.94
791.531
535.500
408.781
351.856
319.046
294.313
294.860
308.984
77.1
0.02462
2,151.86
1,365.77
949.703
748.093
625.875
596.531
564.328
561.750
578.406
73.9
Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset pumsb with different minimum supports
[height=0.6cm,width=2.2cm]Min SupNS
0 (original)
1
2
3
4
5
6
7
8
Best Opt.%
0.83594
9.265
8.671
8.500
8.265
8.343
8.359
8.343
8.281
8.515
10.8
0.77478
29.25
25.218
16.250
14.375
14.110
13.735
13.656
13.562
13.906
53.6
0.74419
78.843
50.765
36.718
34.468
31.765
29.046
23.750
23.500
23.812
70.2
0.71361
301
177.485
119.171
86.203
73.265
65.406
58.843
56.687
57.203
81.2
0.69322
624.921
365.343
238.109
169.640
140.765
123.156
107.718
101.515
102.000
83.8
0.67283
946.281
549.796
360.062
257.110
215.891
188.625
167.125
156.687
156.734
83.4
0.65652
1,423.16
819.875
529.265
375.437
312.390
271.578
238.515
224.953
225.406
84.2
Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset pumsb_star with different minimum supports
[height=0.6cm,width=2.2cm]Min SupNS
0 (original)
1
2
3
4
5
6
7
8
Best Opt.%
0.46895
5.859
5.812
5.734
5.703
5.890
5.906
6.109
6.625
7.356
2.7
0.40778
13.531
13.500
10.328
10.421
10.281
9.953
10.437
13.296
16.781
26.4
0.36700
44.125
44
28.843
28.406
27.375
26.375
29.171
48.187
74.156
40.2
0.34661
78.781
76.360
40.703
41.062
37.084
33.594
35.125
53.875
80.781
57.4
0.30584
296.140
296.593
154.468
169.812
125.968
89.015
72.734
86.203
117.328
75.4
0.28545
725.031
700.269
368.875
409.468
308.593
230.031
183.140
183.756
222.734
74.7
Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset T40I10D100K with different minimum supports
[height=0.8cm,width=2.4cm]Min SupNS
0 (original)
1
2
3
4
Best Opt.
0.03
8.796
9.078
9.328
10.015
11.734
3.2%
0.01
31.031
31.547
30.062
32.125
38.656
3.1
0.009
98.125
100.421
72.906
79.281
95.000
25.7
0.008
126.984
124.375
100.546
108.250
129.531
20.8
0.0058
248.078
228.906
201.218
213.734
243.656
18.9
0.0037
449.297
419.812
393.641
407.453
460.703
12.4
0.00215
1,413.880
1,306.810
1,231.450
1,210.300
1,315.300
14.4
Comparison of Execution Times (sec) for Original Apriori and TFI-Apriori on dataset T10I4D100K with different minimum supports
[height=0.8cm,width=2.4cm]Min SupNS
0 (original)
1
2
3
Best Opt.
0.00010
62.171
51.750
51.421
51.750
17.3
0.00007
106.562
91.125
90.250
90.984
15.3
0.00004
267.843
237.156
231.344
MEM_EXC
13.6
0.00003
415.156
492.765
MEM_EXC
MEM_EXC
18.7
Comparison of memory usage (MB) for Original Apriori and TFI-Apriori on dataset Accident with different minimum supports
[height=0.8cm,width=2.4cm]Min SupNS
0 (original)
1
2
3
4
5
6
7
0.61731
4,424
4,304
4,200
4,104
4,028
3,968
3,920
3,892
0.49973
9,592
9,280
8,984
8,716
8,484
8,308
8,140
8,028
0.41154
21,928
21,112
20,412
19,768
19,212
18,744
18,336
17,976
0.36450
25,456
24,320
23,376
22,620
21,948
21,368
20,936
20,560
0.33511
26,140
24,696
23,636
22,768
22,092
21,512
21,072
20,764
0.30571
27,416
25,632
24,280
23,284
22,540
21,976
21,516
21,288
0.29513
27,908
25,976
24,468
23,432
22,632
22,080
21,612
21,424
Comparison of memory usage (MB) for Original Apriori and TFI-Apriori on dataset pumsb_star with different minimum supports
[height=0.8cm,width=2.4cm]Min SupNS
0 (original)
1
2
3
4
5
6
7
8
0.46895
2,588
2,576
2,548
2,544
2,536
2,552
2,596
2,712
2,960
0.40778
4,136
4,200
3,992
4,044
3,980
3,984
4,092
4,332
4,916
0.36700
6,380
6,548
6,012
6,208
6,172
6,436
7,200
8,836
12,364
0.34661
7,540
7,584
6,736
7,084
6,820
6,256
7,748
9,380
12,880
0.30584
12,000
12,676
10,100
10,708
9,468
9,520
10,160
11,752
15,304
0.28545
18,444
19,036
14,296
14,679
14,048
13,484
14,220
16,688
21,924
Comparison of memory usage (MB) for Original Apriori and TFI-Apriori on dataset pumsb with different minimum supports
[height=0.8cm,width=2.4cm]Min SupNS
0 (original)
1
2
3
4
5
6
7
8
0.83594
2,524
2,236
2,032
1,968
1,920
1,660
1,984
2,104
2,304
0.77478
8,576
5,876
4,520
3,712
3,352
3,353
3,332
3,532
3,924
0.74419
17,820
11,612
8,228
6,148
5,464
5,208
5,056
5,336
5,620
0.71361
37,712
24,072
15,888
11,656
9,556
8,796
8,572
8,592
9,548
0.69322
61,176
37,016
24,536
17,356
13,916
12,652
11,848
12,200
12,480
0.67283
91,700
56,548
35,608
24,880
19,216
17,328
16,165
16,004
16,980
0.65652
124,860
74,736
47,812
32,196
25,344
22,056
20,716
20,096
21,132
