A frequent itemset mining algorithm based on composite granular computing

Abstract

This paper proposes a frequent itemset mining algorithm based on the divide-and-conquer strategy in composite granular computing. The algorithm constructs composite information granules (CIGs) and finds the frequent patterns iteratively. First, atomic information granules are created. Next, atomic composite information granules are generated from the atomic information granules. Then, through the intersect operation between atomic composite information granules and a pruning step, the frequent 2-CIGs are constructed; these are used to construct the frequent 3-CIGs, and so on, until no more frequent CIGs can be found. When creating CIGs, the method speeds up computation by using binary logical operations. It avoids scanning the database repeatedly and avoids complex data structures, which reduces the I/O overhead and saves memory. It also optimizes the generation of candidate CIGs and compresses the transaction database dynamically. The experimental results show that the algorithm runs faster and uses less memory than Apriori, IG-Apriori, and FP-growth on the tested datasets.

Keywords

Frequent itemsets, granular computing, composite information granules, binary, compress

1. Introduction

Data mining, also known as knowledge discovery in databases (KDD) [1], has been a hot topic in artificial intelligence research [2, 3, 4, 5]. One important task in data mining is association rule mining, which aims to find interesting rules in a transaction database [6, 7, 8]. Apriori, one of the most classic algorithms in data mining, is a seminal algorithm for mining frequent itemsets for association rules [9]. Since Apriori was proposed by Agrawal and Srikant in 1994, many researchers have proposed algorithms for mining association rules [10, 11, 12]. According to how these algorithms generate frequent itemsets, they can be divided into two types.

(1) Mining association rules with candidate generation. Algorithms of this kind generate candidates when searching for frequent itemsets. They are based on the Apriori framework. Their advantages are a simple data structure, low memory usage, and ease of implementation. Their disadvantage is that they need to scan the database frequently and generate a large number of candidates.
(2) Mining association rules without candidate generation. Algorithms of this kind are based on the FP-growth framework. Their advantage is a fast running speed. Their disadvantage is that they need to construct a tree, a complex data structure, and therefore use a large amount of memory.

Granular computing (GrC) is a new method for simulating human thinking and solving complicated problems [13, 14]. This kind of information processing is referred to as information granulation [15]. With GrC, the same problem can be analyzed at different levels of granularity. This low-cost problem-solving method can greatly simplify complicated problems.

Rough set theory, quotient space theory, and fuzzy set theory are the three main theoretical models in GrC [16, 17, 18, 19, 20, 21]. The main tasks of GrC are representing, constructing, and processing information granules [22]. Information granules can be formalized by many different approaches [23, 24]. GrC as a methodology has been introduced in many fields [25, 26]. Li et al. [27] presented a line flow GrC approach for power flow calculation to assist the investigation of economic dispatch (ED) with line constraints, where a hierarchical method is adopted to divide the power network into multiple layers to reduce computational complexity. Kok and Chan [28] proposed a novel crowd segmentation framework based on GrC that enables the problem of crowd segmentation to be conceptualized at different levels of granularity and maps problems into computationally tractable sub-problems. Roselin and Thangavel [29] used rough entropy based on GrC to segment mammogram images.

Recently, GrC has been introduced in data mining [30, 31]. More and more researchers focus on finding association rules based on GrC [32, 33, 34]. Xu et al. [35] focused on the association rules and decision-making rules of the information system based on classical granular computing and association rules. Fang and Wu [36] proposed a method for mining frequent spatiotemporal association patterns based on granular computing. Although some algorithms describe association rule mining using formal methods, they are still constrained by the traditional mining frameworks, and their efficiency is unsatisfactory. For instance, Ju et al. [37] designed an improved frequent pattern mining algorithm based on transactional granules, but it is still based on the Apriori framework. Tsai et al. [38] found generalized negative association rules by granular computing. Its running speed is fast, but it is still based on the FP-growth framework.

In order to combine the advantages of these two classic mining frameworks and further improve the efficiency of association rule mining, this paper puts forward a frequent itemset mining algorithm based on composite granular computing (FIMCG). It uses a simple data structure and the divide-and-conquer strategy in composite granular computing to calculate the support counts of itemsets. It avoids scanning the database repeatedly and reduces the computational complexity and the I/O overhead. When FIMCG generates new CIGs, it uses binary operations to improve the computing speed. It also optimizes the generation of candidate CIGs and compresses the transaction database dynamically.

2. Related concepts and definitions of composite granular computing

In this section, we present the related concepts and definitions of composite granular computing based on the set theory.

Definition 1 [39] Take EIS as the extended information system, where $\textit{EIS}=\;<U,A,V,f,R>$ . The meaning of each element is as follows.

1) $U$ , a nonempty finite set, refers to the universe, the set of objects to be discussed. Suppose there are $n$ objects, that is, $U=\{u_{1},u_{2},...,u_{n}\}$ .
2) $A$ , a nonempty finite set, refers to the set of attributes, which includes all attributes of the objects. Suppose the number of attributes is $m$ , then the attribute set $A$ can be denoted by $A=\{a_{1},a_{2},...,a_{m}\}$ , where the type of each attribute $a_{i}(i=1,2,...,m)$ should be Boolean.
3) $V$ refers to the value set. It is the set of values corresponding to the attributes $\{a_{1},a_{2},...,a_{m}\}$ . The attribute values are Boolean. In other words, $|v_{a_{i}}|=1(i=1,2,...,m)$ and the value set $V=\{v_{a_{1}},v_{a_{2}},...,v_{a_{m}}\}$ .
4) $f$ refers to the mapping function. $f(u,a)\in v_{a}$ , where $\forall a\in A,u\in U$ .
5) $R$ refers to the correspondence between a combination of attributes and a binary sequence. For a combination of attributes $\{a_{i_{1}},a_{i_{2}},...,a_{i_{k}}\}\subseteq A$ , where $1\leqslant i_{k}\leqslant m$ and $k\leqslant m$ , the corresponding binary sequence is as follows:

$\displaystyle b=b_{1}b_{2}...b_{m},\text{ where }b_{j}=\begin{cases}1,&j\in\{i_{1},i_{2},...,i_{k}\}\\ 0,&j\notin\{i_{1},i_{2},...,i_{k}\}\end{cases},\quad j=1,2,...,m.$

Note that this EIS has one more element, $R$ , than a general information system does.

Definition 2 A composite information granule (CIG) is defined as $\textit{CIG}=\;<\delta,\delta^{\prime},g(\delta)>$ . The meaning of each element is as follows.

1) $\delta$ refers to the intension of the information granule, which is a collection of attributes. $\delta=\{a_{1},a_{2},...,a_{k}\}(a_{i}\in A,i=1,2,...,k)$ .
2) $\delta^{\prime}$ refers to the image of $\delta$ . It is the binary sequence assigned to $\delta$ by the correspondence $R$ in the EIS.
3) $g(\delta)$ refers to the extension. It is the set of objects that have the common set of attributes $\delta$ . $g(\delta)=\{u|\forall a\in\delta,f(u,a)\in v_{a}\}$ .

For example, if $I$ is the set of all attributes, $I=\{a,b,c,d,e,f,g\}$ , then the image of this combination of attributes $\{a,e,f\}$ is $(1000110)$ . If the extension of CIG, $g(\delta)$ , satisfies $|g(\delta)|\geqslant|U|\times\textit{min}\_\textit{sup}$ ( $\textit{min}\_\textit{sup}$ refers to minimum support), then this kind of CIG is referred to as frequent composite information granule (FCIG).
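The image and the frequency test above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `image` and `is_frequent` are hypothetical and do not come from the paper, and the image is represented as a string of 0s and 1s for readability.

```python
def image(attributes, subset):
    """Return the binary sequence b_1...b_m of `subset` under the correspondence R.

    `attributes` is the ordered attribute set A (or I); position j of the
    result is 1 exactly when the j-th attribute belongs to `subset`.
    """
    unknown = set(subset) - set(attributes)
    if unknown:
        raise ValueError(f"not in the attribute set: {sorted(unknown)}")
    return "".join("1" if a in subset else "0" for a in attributes)


def is_frequent(extension, universe_size, min_sup):
    """FCIG test: |g(delta)| >= |U| * min_sup."""
    return len(extension) >= universe_size * min_sup


I = ["a", "b", "c", "d", "e", "f", "g"]
print(image(I, {"a", "e", "f"}))  # 1000110
```

The printed sequence matches the image $(1000110)$ given for $\{a,e,f\}$ in the text.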

Definition 3 Atomic composite information granule (ACIG) is defined as $\textit{ACIG}=\;<\delta,\delta^{\prime},g(\delta)>(|\delta|=1)$ .

It is constructed from the frequent atomic information granule (FAIG). An ACIG is also frequent. It has one more element (the binary sequence $\delta^{\prime}$ , the image of $\delta$ ) than the FAIG, where $\textit{FAIG}=\;<\delta,g(\delta)>(|\delta|=1,|g(\delta)|\geqslant|U|\times\textit{min}\_\textit{sup})$ .

Definition 4 [39] Suppose there are two CIGs, $\textit{CIG}_{\alpha}=\;<\delta_{\alpha},\delta^{\prime}_{\alpha},g(\delta_{\alpha})>$ and $\textit{CIG}_{\beta}=\;<\delta_{\beta},\delta^{\prime}_{\beta},g(\delta_{\beta})>$ . The intersect operation between the two CIGs is $\textit{CIG}=\textit{CIG}_{\alpha}\otimes\textit{CIG}_{\beta}=\;<\delta_{\alpha}\cup\delta_{\beta},\delta^{\prime}_{\alpha}|\delta^{\prime}_{\beta},g(\delta_{\alpha})\cap g(\delta_{\beta})>$ ( $|$ is the bitwise OR operation).
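A minimal Python sketch of the intersect operation follows. The `CIG` class and the `intersect` function are hypothetical names chosen for illustration; the image is stored as an integer bit mask (an assumption of this sketch), and the two granules use the values of items a and c from the example in Section 3.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CIG:
    intension: frozenset  # delta: the set of attributes
    image: int            # delta': the binary sequence, stored as a bit mask
    extension: frozenset  # g(delta): ids of the transactions containing delta


def intersect(x, y):
    """Union the intensions, OR the images, intersect the extensions."""
    return CIG(
        x.intension | y.intension,
        x.image | y.image,
        x.extension & y.extension,
    )


a = CIG(frozenset("a"), 0b1000000, frozenset({1, 2, 4, 5, 6}))
c = CIG(frozenset("c"), 0b0010000, frozenset({1, 2, 3, 4, 6}))
ac = intersect(a, c)
print(format(ac.image, "07b"), sorted(ac.extension))  # 1010000 [1, 2, 4, 6]
```

The support count of the new granule is simply the size of its extension, so no further database scan is needed to evaluate it.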

Definition 5 Suppose there are two different $(k-1)$-CIGs, each with an intension of $(k-1)$ items: $\textit{CIG}_{\alpha}=\;<\delta_{\alpha},\delta^{\prime}_{\alpha},g(\delta_{\alpha})>$ and $\textit{CIG}_{\beta}=\;<\delta_{\beta},\delta^{\prime}_{\beta},g(\delta_{\beta})>$ . When constructing a new $k$-CIG by the intersect operation between two $(k-1)$-CIGs, we can restrict the construction in order to reduce the number of candidate CIGs. That is, the intersect operation, $\textit{CIG}=\textit{CIG}_{\alpha}\otimes\textit{CIG}_{\beta}$ , is performed only if

$\displaystyle(\textit{CIG}_{\alpha}.\delta[1]=\textit{CIG}_{\beta}.\delta[1])\wedge(\textit{CIG}_{\alpha}.\delta[2]=\textit{CIG}_{\beta}.\delta[2])\wedge...\wedge(\textit{CIG}_{\alpha}.\delta[k-2]=\textit{CIG}_{\beta}.\delta[k-2]).$

Corollary 1 Sort transactions and the items within a transaction in lexicographic order [40]. For two $(k-1)$-CIGs, $\textit{CIG}_{\alpha}$ and $\textit{CIG}_{\beta}$ , if $\delta_{\alpha}$ and $\delta_{\beta}$ cannot be merged, the intensions of all $(k-1)$-CIGs after $\delta_{\beta}$ need not be checked against $\delta_{\alpha}$ .

Proof: If transactions and the items within a transaction are sorted in lexicographic order, the intension of every $(k-1)$-CIG after $\delta_{\beta}$ satisfies $\textit{CIG}_{\beta+n}.\delta[k-2]>\textit{CIG}_{\beta}.\delta[k-2]$ , that is, $\textit{CIG}_{\alpha}.\delta[k-2]\neq\textit{CIG}_{\beta+n}.\delta[k-2]$ . So the intensions of all $(k-1)$-CIGs after $\delta_{\beta}$ cannot be merged with $\delta_{\alpha}$ , and they need not be checked.
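The join restriction of Definition 5 and the early exit of Corollary 1 can be sketched as follows. The function name `joinable_pairs` is hypothetical, intensions are represented as tuples of items, and the input list is assumed to be sorted lexicographically; the sample data are the eleven frequent 2-itemsets of the example in Section 3.1.

```python
def joinable_pairs(intensions):
    """Pair up (k-1)-intensions that share their first k-2 items.

    `intensions` must be a lexicographically sorted list of equal-length
    tuples. As soon as a later intension has a different prefix, every
    intension after it differs as well, so the inner loop stops.
    """
    pairs = []
    for i, x in enumerate(intensions):
        for y in intensions[i + 1:]:
            if x[:-1] != y[:-1]:
                break  # Corollary 1: nothing further can be merged with x
            pairs.append((x, y))
    return pairs


frequent_2 = sorted(
    tuple(s) for s in
    ["ac", "ae", "af", "ag", "bc", "be", "ce", "cf", "cg", "ef", "fg"]
)
candidates = ["".join(x + y[-1:]) for x, y in joinable_pairs(frequent_2)]
print(candidates)
```

For this input the sketch yields ten candidate 3-itemsets, from ace to cfg, without ever comparing, for example, ag against bc and everything after it.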

Definition 6 Assume $I$ is the set of all items and $L_{k}$ is the set of frequent $k$-itemsets. If a transaction $t\subseteq I$ satisfies $|t|=k$ , then $t$ can be deleted in the subsequent mining of larger itemsets.

Corollary 2 In the universe $U$ , where $U=\{t_{1},t_{2},...,t_{n}\}$ , if $|t_{i}|=k$ , we can delete $t_{i}$ when calculating the supports of $(k+1)$-CIGs.

Proof: When the algorithm generates $(k+1)$-CIGs, the intension of each $(k+1)$-CIG must satisfy $|\delta|=k+1$ . Then, if $|t_{i}|=k$ , we get $|t_{i}|<|\delta|$ and hence $\delta\not\subseteq t_{i}$ . Therefore $t_{i}$ contributes to the support of no $(k+1)$-CIG and can be deleted when we calculate the supports of $(k+1)$-CIGs.
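The compression step can be illustrated with a short Python sketch. The helper names `compress` and `support` are hypothetical, and the transactions are those of the example in Section 3.1, written as strings of items. The sketch drops every transaction with at most $k$ items, which coincides with the rule of Corollary 2 when that rule is applied at every level in turn.

```python
def compress(transactions, k):
    """Keep only the transactions that can still support a (k+1)-CIG."""
    return {tid: t for tid, t in transactions.items() if len(t) > k}


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions.values() if set(itemset) <= set(t))


D = {1: "abcdef", 2: "acefg", 3: "bce", 4: "abcefg", 5: "aef", 6: "acfg"}
D3 = compress(D, 3)
print(sorted(D3))                               # [1, 2, 4, 6]
print(support("acef", D), support("acef", D3))  # 3 3
```

Removing the two 3-item transactions leaves the support count of the 4-itemset acef unchanged, as the corollary states.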

The corresponding relations between CIG, information granule, and traditional association rules are shown in Table 1 [41].

Table 1
The corresponding relations of concepts

Traditional association rules	Information granule	CIG
Relational database	Information system (IS)	EIS
Transaction (set)	Individual (set)	Complexes
Itemset	The intension of information granule	The intension of CIG
Set of transactions containing itemset	The extension of information granule	The extension of CIG
1-itemset	The intension of atomic information granule	The intension of ACIG
Frequent itemset	The intension of feature information granule	The intension of FCIG

Note that the CIG is an extended model of the information granule. The composite information granule proposed here can compress the relational database to reduce the data, and it can quickly generate the intensions of CIGs (the candidate itemsets) through the intersect operation between two CIGs. The use of CIGs is described in more detail in Section 3.1.

3. The algorithm design based on composite granular computing

In this section, a detailed example of the use of CIGs is given, followed by a description of the algorithm for finding frequent itemsets efficiently. Then, a performance analysis and experimental results are presented to verify the efficiency of the new algorithm.

3.1 The example

According to the related concepts and definitions of composite granular computing in Section 2, the frequent itemset mining algorithm based on composite granular computing (the FIMCG algorithm) can be described as follows: according to the minimum support, find all $\textit{FCIG}=\;<\delta,\delta^{\prime},g(\delta)>$ in the extended information system EIS. The intensions $\delta$ of all FCIGs are the frequent itemsets that we want to find.

Although the FIMCG algorithm generates candidate itemsets, its way of generating them differs from that of the Apriori framework. With composite granular computing, FIMCG needs to scan the database only once to find the frequent itemsets, so it reduces the I/O overhead, as the FP-growth framework does. Unlike FP-growth, FIMCG does not need to traverse a complex frequent pattern tree, so it does not use a large amount of memory.

Suppose there are an itemset $X(X\subseteq I)$ and a transaction set $T(T\subseteq D)$ . $I=\{a,b,c,d,e,f,g\}$ , $T_{1}=\{a,b,c,d,e,f\}$ , $T_{2}=\{a,c,e,f,g\}$ , $T_{3}=\{b,c,e\}$ , $T_{4}=\{a,b,c,e,f,g\}$ , $T_{5}=\{a,e,f\}$ , $T_{6}=\{a,c,f,g\}$ . The construction process of the CIGs is as follows.

Scanning the database with the minimum support set to 50% (a support count of at least $6\times 50\%=3$ ) yields the following atomic information granules:

$\textit{AIG}_{1}=\;<\text{‘a’},\{T_{1},T_{2},T_{4},T_{5},T_{6}\}>$ , $\textit{AIG}_{2}=\;<\text{‘b’},\{T_{1},T_{3},T_{4}\}>$ , $\textit{AIG}_{3}=\;<\text{‘c’},\{T_{1},T_{2},T_{3},T_{4},T_{6}\}>$ , $\textit{AIG}_{4}=\;<\text{‘d’},\{T_{1}\}>$ (prune), $\textit{AIG}_{5}=\;<\text{‘e’},\{T_{1},T_{2},T_{3},T_{4},T_{5}\}>$ , $\textit{AIG}_{6}=\;<\text{‘f’},\{T_{1},T_{2},T_{4},T_{5},T_{6}\}>$ , $\textit{AIG}_{7}=\;<\text{‘g’},\{T_{2},T_{4},T_{6}\}>$ .

The support count of $g(d)$ is below the minimum support count, so $\textit{AIG}_{4}$ is pruned directly. For each remaining atomic information granule, the corresponding image is constructed from its intension. Finally, we get the following ACIGs:

$\textit{ACIG}_{1}=\;<\text{‘a’},(1000000),\{T_{1},T_{2},T_{4},T_{5},T_{6}\}>$ , $\textit{ACIG}_{2}=\;<\text{‘b’},(0100000),\{T_{1},T_{3},T_{4}\}>$ , $\textit{ACIG}_{3}=\;<\text{‘c’},(0010000),\{T_{1},T_{2},T_{3},T_{4},T_{6}\}>$ , $\textit{ACIG}_{5}=\;<\text{‘e’},(0000100),\{T_{1},T_{2},T_{3},T_{4},T_{5}\}>$ , $\textit{ACIG}_{6}=\;<\text{‘f’},(0000010),\{T_{1},T_{2},T_{4},T_{5},T_{6}\}>$ , $\textit{ACIG}_{7}=\;<\text{‘g’},(0000001),\{T_{2},T_{4},T_{6}\}>$ .
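The construction of the ACIGs from the six transactions can be reproduced with the following Python sketch. The variable names are illustrative, transactions are written as strings of items, and transaction ids 1 to 6 stand for $T_{1}$ to $T_{6}$.

```python
D = {1: "abcdef", 2: "acefg", 3: "bce", 4: "abcefg", 5: "aef", 6: "acfg"}
I = "abcdefg"
min_count = 0.5 * len(D)  # 50% of 6 transactions = 3

# One pass over D builds the extension g(a) of every single item a.
extension = {a: set() for a in I}
for tid, items in D.items():
    for a in items:
        extension[a].add(tid)

# Keep the frequent granules and attach the one-hot image of each item.
acigs = {
    a: (format(1 << (len(I) - 1 - I.index(a)), f"0{len(I)}b"), sorted(tids))
    for a, tids in extension.items()
    if len(tids) >= min_count
}
for a, (img, tids) in acigs.items():
    print(a, img, tids)
```

Item d occurs only in transaction 1, so it is the only item missing from `acigs`.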

Next, the new 2-CIGs are generated bottom-up by the intersect operation between ACIGs. The remark “(prune)” after an information granule means that the granule has been deleted. According to Definition 4, using the intension and the binary sequence of each ACIG, we get the following CIGs:

$\textit{CIG}_{12}=\;<\text{‘ab’},(1100000),\{T_{1},T_{4}\}>$ (prune), $\textit{CIG}_{13}=\;<\text{‘ac’},(1010000),\{T_{1},T_{2},T_{4},T_{6}\}>$ , $\textit{CIG}_{15}=\;<\text{‘ae’},(1000100),\{T_{1},T_{2},T_{4},T_{5}\}>$ , $\textit{CIG}_{16}=\;<\text{‘af’},(1000010),\{T_{1},T_{2},T_{4},T_{5},T_{6}\}>$ , $\textit{CIG}_{17}=\;<\text{‘ag’},(1000001),\{T_{2},T_{4},T_{6}\}>$ , $\textit{CIG}_{23}=\;<\text{‘bc’},(0110000),\{T_{1},T_{3},T_{4}\}>$ , $\textit{CIG}_{25}=\;<\text{‘be’},(0100100),\{T_{1},T_{3},T_{4}\}>$ , $\textit{CIG}_{26}=\;<\text{‘bf’},(0100010),\{T_{1},T_{4}\}>$ (prune), $\textit{CIG}_{27}=\;<\text{‘bg’},(0100001),\{T_{4}\}>$ (prune), $\textit{CIG}_{35}=\;<\text{‘ce’},(0010100),\{T_{1},T_{2},T_{3},T_{4}\}>$ , $\textit{CIG}_{36}=\;<\text{‘cf’},(0010010),\{T_{1},T_{2},T_{4},T_{6}\}>$ , $\textit{CIG}_{37}=\;<\text{‘cg’},(0010001),\{T_{2},T_{4},T_{6}\}>$ , $\textit{CIG}_{56}=\;<\text{‘ef’},(0000110),\{T_{1},T_{2},T_{4},T_{5}\}>$ , $\textit{CIG}_{57}=\;<\text{‘eg’},(0000101),\{T_{2},T_{4}\}>$ (prune), $\textit{CIG}_{67}=\;<\text{‘fg’},(0000011),\{T_{2},T_{4},T_{6}\}>$ .

The FIMCG algorithm executes the intersect operation repeatedly. According to Definition 5 and Corollary 1, the algorithm optimizes the generation of the intensions of CIGs. According to Definition 6 and Corollary 2, FIMCG calculates support counts, prunes some branches, and compresses the transaction database dynamically when it generates the extensions of the composite granules. The rest of the CIGs are as follows:

$\textit{CIG}_{135}=\;<\text{‘ace’},(1010100),\{T_{1},T_{2},T_{4}\}>$ , $\textit{CIG}_{136}=\;<\text{‘acf’},(1010010),\{T_{1},T_{2},T_{4},T_{6}\}>$ , $\textit{CIG}_{137}=\;<\text{‘acg’},(1010001),\{T_{2},T_{4},T_{6}\}>$ , $\textit{CIG}_{156}=\;<\text{‘aef’},(1000110),\{T_{1},T_{2},T_{4},T_{5}\}>$ , $\textit{CIG}_{157}=\;<\text{‘aeg’},(1000101),\{T_{2},T_{4}\}>$ (prune), $\textit{CIG}_{167}=\;<\text{‘afg’},(1000011),\{T_{2},T_{4},T_{6}\}>$ , $\textit{CIG}_{235}=\;<\text{‘bce’},(0110100),\{T_{1},T_{3},T_{4}\}>$ , $\textit{CIG}_{356}=\;<\text{‘cef’},(0010110),\{T_{1},T_{2},T_{4}\}>$ , $\textit{CIG}_{357}=\;<\text{‘ceg’},(0010101),\{T_{2},T_{4}\}>$ (prune), $\textit{CIG}_{367}=\;<\text{‘cfg’},(0010011),\{T_{2},T_{4},T_{6}\}>$ , $\textit{CIG}_{1356}=\;<\text{‘acef’},(1010110),\{T_{1},T_{2},T_{4}\}>$ , $\textit{CIG}_{1357}=\;<\text{‘aceg’},(1010101),\{T_{2},T_{4}\}>$ (prune), $\textit{CIG}_{1367}=\;<\text{‘acfg’},(1010011),\{T_{2},T_{4},T_{6}\}>$ .

3.2 Algorithm description

Input: 1) $D$ , a database of transactions; 2) $\textit{min}\_\textit{sup}$ , the minimum support threshold.

Output: $L$ , frequent itemsets in $D$ .

Method:

1) scan $D$ ; //Scan transaction database.

2) create $IS=\;<U,A,V,f>$ ; // Create information system.

3) $L_{\textit{AIG}}=\bigcup\textit{AIG}_{i}=\;<\delta,g(\delta)>(|g(\delta)|\geqslant\textit{min}\_\textit{sup}\times|D|)$ ; // Construct frequent atomic information granules.

4) create $\textit{EIS}=\;<U,A,V,f,R>$ ; // Create the extended information system.

5) $L_{1}=\bigcup\textit{ACIG}_{i}=\;<\delta,\delta^{\prime},g(\delta)>$ // Construct ACIGs.

6) for (int $k=2$ ; $L_{k-1}\neq null$ ; $k++$ ) {

7) $C_{k}=$ create_CIG ( $L_{k-1}$ );

8) if ( $k>$ 1) {

9) for each $T\in D$ // Compress transaction database dynamically.

10) if ( $|T|=k-1$ )

11) delete $T$ ;

12) }

13) $L_{k}=\{c\in C_{k}:|c.g(\delta)|\geqslant\textit{min}\_\textit{sup}\times|D|\}$ ;

14) }

15) return $L=\bigcup{L_{k}}$ ;

16) create_CIG ( $L_{k-1}$ ) // Construct CIGs by using a simple linear array data structure.

17) for each $X_{1}\in L_{k-1}$

18) for each $X_{2}\in L_{k-1}$ that follows $X_{1}$ in lexicographic order

19) if $((X_{1}.\delta[1]=X_{2}.\delta[1])\wedge(X_{1}.\delta[2]=X_{2}.\delta[2])\wedge...\wedge(X_{1}.\delta[k-2]=X_{2}.\delta[k-2])\wedge(X_{1}.\delta[k-1]<X_{2}.\delta[k-1]))$

20) {

21) $\textit{CIG}=X_{1}\otimes X_{2}=\;<X_{1}.\delta\cup X_{2}.\delta,X_{1}.\delta^{\prime}|X_{2}.\delta^{\prime},X_{1}.g(\delta)\cap X_{2}.g(\delta)>$ ; // Optimize the construction of candidate itemsets and do the intersect operation.

22) if (has_infrequent_CIG (CIG, $L_{k-1}$ )) // Only add CIGs whose $(k-1)$-subsets are all frequent to the candidate itemsets.

23) delete CIG;

24) else

25) add CIG to $C_{k}$ ;

26) }

27) else

28) break;

29) has_infrequent_CIG (CIG, $L_{k-1}$ ) // Check whether the intension of a $k$-CIG has an infrequent $(k-1)$-subset.

30) for each $(k-1)$-subset $n$ of $\textit{CIG}.\delta$

31) if $n\notin L_{k-1}$

32) return TRUE; // An infrequent subset was found.

33) end for

34) return FALSE; // Every $(k-1)$-subset is frequent.
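The steps above can be sketched end to end in Python. This is a sketch under stated assumptions rather than the authors' implementation (which the paper reports was written in C#): the function name `fimcg` and its data layout are hypothetical, images are integer bit masks, extensions are sets of transaction ids, `items` is assumed to be given in lexicographic order, and the support threshold is computed once from the original $|D|$.

```python
from itertools import combinations


def fimcg(transactions, items, min_sup):
    """Return {intension tuple: (image bit mask, extension)} for all FCIGs."""
    min_count = min_sup * len(transactions)
    bit = {a: 1 << (len(items) - 1 - i) for i, a in enumerate(items)}

    # Steps 1)-3): a single scan builds the extension of every item.
    ext = {a: set() for a in items}
    size = {}
    for tid, t in transactions.items():
        size[tid] = len(set(t))
        for a in set(t):
            ext[a].add(tid)

    # Steps 4)-5): ACIGs, inserted in item order so keys stay sorted.
    level = {(a,): (bit[a], frozenset(ext[a]))
             for a in items if len(ext[a]) >= min_count}
    frequent = dict(level)

    k = 2
    while level:
        # Steps 8)-12): transactions with fewer than k items cannot support
        # a k-CIG. By Corollary 2 this filter never changes a support count.
        alive = frozenset(tid for tid, n in size.items() if n >= k)
        keys = list(level)  # lexicographic order by construction
        nxt = {}
        for i, x in enumerate(keys):
            for y in keys[i + 1:]:
                if x[:-1] != y[:-1]:
                    break  # Corollary 1: later keys cannot be merged with x
                delta = x + y[-1:]
                # has_infrequent_CIG: every (k-1)-subset must be frequent.
                if any(s not in level for s in combinations(delta, k - 1)):
                    continue
                image = level[x][0] | level[y][0]         # binary OR
                g = level[x][1] & level[y][1] & alive     # intersect extensions
                if len(g) >= min_count:
                    nxt[delta] = (image, g)
        frequent.update(nxt)
        level = nxt
        k += 1
    return frequent


D = {1: "abcdef", 2: "acefg", 3: "bce", 4: "abcefg", 5: "aef", 6: "acfg"}
result = fimcg(D, list("abcdefg"), 0.5)
print(sorted("".join(d) for d in result if len(d) == 4))  # ['acef', 'acfg']
```

On the database of Section 3.1 the sketch ends with the two frequent 4-itemsets acef and acfg. Unlike the worked example, it discards aeg, ceg, and aceg in the subset check, before their extensions are computed, because eg is infrequent.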

Next, we analyze the performance of the Apriori algorithm, the improved Apriori algorithm based on transactional granule (IG-Apriori), FP-growth, and the FIMCG algorithm. $|C_{k}|$ refers to the size of the candidate $k$-itemsets.

Table 2
Performance analysis

Index	Apriori	IG-Apriori	FP-growth	FIMCG
Number of database scans	$\sum\limits_{k=1}^{n}{\left|{C_{k}}\right|}$	$\sum\limits_{k=1}^{n}{\left|{C_{k}}\right|}$	2	1
Data structure	Simple	Simple	Complex	Simple
Generates candidates?	Yes	Yes	No	Yes
Memory usage	More	Less	Much more	Much less
CPU overhead	More	More	More	Less

Table 2 compares the number of database scans, the data structure, candidate generation, memory usage, and CPU overhead of the four algorithms. It shows that FIMCG compares favorably with Apriori, IG-Apriori, and FP-growth, which suggests that FIMCG, the improved algorithm in this paper, is effective and feasible.

3.3 Experimental results

To verify that FIMCG has advantages in runtime and memory usage over Apriori, IG-Apriori, and FP-growth, the following experiments were carried out.

Experiment environment: Windows 8 (32-bit), Intel(R) Core(TM) i3-3120M CPU @ 2.50 GHz, 4 GB RAM. Software development environment: Microsoft Visual Studio 2013, C# programming language.

The test data in Figs 1 and 3 come from the Food Mart 2000 dataset in SQL Server 2000, filtered from 164558 sales records from 1998. The dataset contains 34015 records and 23 attributes. It is used to compare the performance of the algorithms with the same number of transactions but different supports.

The test data in Figs 2 and 4 come from 4 sub-datasets in ExtendedBakery. The sub-datasets contain 1000, 5000, 20000 and 75000 records respectively, and the number of attributes is 50. They were obtained from https://wiki.csc.calpoly.edu/datasets/wiki. These 4 sub-datasets are used to compare the performance of the algorithms with the same support but very different numbers of transactions.

Figure 1. Runtime of the four algorithms under different supports on the Food Mart 2000 dataset.

Figure 2. Runtime of the four algorithms under the support 0.5% on the 4 sub-datasets in ExtendedBakery.

Figure 3. Memory usage of the four algorithms under different supports on the Food Mart 2000 dataset.

Figure 4. Memory usage of the four algorithms under the support 0.5% on the 4 sub-datasets in ExtendedBakery.

Figure 1 shows that, across the tested supports, the FIMCG algorithm runs faster than the classic Apriori algorithm, FP-growth, and IG-Apriori. Figure 2 shows that the performance of Apriori, FP-growth, and IG-Apriori declines dramatically as the number of transactions increases. Particularly when there are many transactions and attributes, FIMCG is more efficient because it uses binary image operations to improve the computing speed, optimizes the generation of candidate CIGs, and compresses the transaction database dynamically.

The memory usage of the four algorithms is shown in Figs 3 and 4. Memory usage refers to the peak memory usage, including the memory used by code and temporary storage.

Figures 3 and 4 show that FIMCG uses less memory than Apriori, IG-Apriori, and FP-growth, and much less than FP-growth in particular. The main reason is that FP-growth uses a complex tree data structure, while FIMCG uses a simple linear array.

4. Conclusions

To further explore the application of GrC to association rule mining, this paper presented an improved frequent itemset mining algorithm based on composite granular computing. First, AIGs were constructed. Next, ACIGs were generated from the AIGs. Then, through the intersect operation between ACIGs and pruning, the frequent 2-CIGs were constructed; these were used to construct the frequent 3-CIGs, and so on, until no more frequent CIGs could be found. The method uses a simple data structure, avoids repeated database scans, improves the operation speed, and reduces the computational complexity, memory usage, and I/O overhead. The algorithm also optimizes the generation of candidate itemsets and dynamically compresses part of the transaction database. Experiments showed that the algorithm ran faster and used less memory than Apriori, IG-Apriori, and FP-growth.

Acknowledgments

This research is supported by Scientific and Technological Research Program of Chongqing Municipal Education Commission (Grant No. KJ1401010), Chongqing Municipal Key Laboratory of Institutions of Higher Education (Grant No. [2017]3), and Scientific and Technological Research Program of Chongqing Municipal Education Commission (Grant No. KJ1601015).

Conflict of interest

The authors declare that they have no competing interests.

References

1. Piatetsky-Shapiro (ed.), Knowledge discovery in databases, AAAI/MIT Press, 1991.

2. Zhang, Kamaha and Behera, Prediction of surface water supply sources for the District of Columbia using least squares support vector machines (LS-SVM) method, Advances in Computer Science an International Journal 4(1) (2015), 1–9.

3. Hooks and Ding, A framework for data mining on combinatorial game theory, Journal of Computational Methods in Sciences and Engineering 9(2) (2009), 91–98.

4. Gortzis L.G., Sakellaropoulos, Ilias et al., Investigating the prognostic accuracy of standardized data mining algorithms in intensive care unit, Journal of Computational Methods in Sciences and Engineering 8(4) (2008), 253–262.

5. Han, Cheng, Xin et al., Frequent pattern mining: current status and future directions, Data Mining and Knowledge Discovery 15(1) (2007), 55–86.

6. Chen, Mining top-k frequent patterns over data streams sliding window, Journal of Intelligent Information Systems 42(1) (2014), 111–131.

7. Chen, Efficiently mining recent frequent patterns over online transactional data streams, International Journal of Software Engineering and Knowledge Engineering 19(5) (2009), 707–725.

8. Chatterjee and Perrizo, Maximum likelihood function used to calculate confidence of association rules in market baskets, Journal of Computational Methods in Sciences and Engineering 12 (2012), 119–127.

9. Agrawal and Srikant, Fast algorithms for mining association rules, in: Proc. 20th VLDB Conf., Santiago, Chile, 1994, pp. 487–499.

10. Najadat, Shatnawi and Obiedat, A new perfect hashing and pruning algorithm for mining association rule, Communications of the IBIMA 2011 (2011), 4715–4725.

11. Chen, Zhang et al., Weighted FP-tree mining algorithms for conversion time data flow, International Journal of Database Theory and Application 9(1) (2016), 169–184.

12. Boutsinas, A new biclustering algorithm based on association rule mining, International Journal on Artificial Intelligence Tools 22(3) (2013), 1350017-1-13.

13. Fan and William, Comparison of discretization approaches for granular association rule mining, Canadian Journal of Electrical and Computer Engineering 37(3) (2014), 157–167.

14. Yao, Granular computing: basic issues and possible solutions, in: Proc. 5th Joint Conf. on Information Sciences, Atlantic, NJ, USA, 2000, pp. 186–189.

15. Pedrycz, Granular computing: an introduction, in: Proc. 20th NAFIPS Int. Conf., Vancouver, BC, 2001, pp. 1349–1354.

16. Zhang and Miao, Two basic double-quantitative rough set models of precision and grade and their investigation using granular computing, International Journal of Approximate Reasoning 54(8) (2013), 1130–1148.

17. Pedrycz, Hirota, Pedrycz et al., Granular representation and granular computing with fuzzy sets, Fuzzy Sets and Systems 203(21) (2012), 17–32.

18. Chiaselotti, Gentile, Infusino et al., The adjacency matrix of a graph as a data table, Annali di Matematica Pura ed Applicata 196(3) (2017), 1073–1112.

19. Pedrycz, Granular computing: analysis and design of intelligent systems, CRC Press, 2013.

20. Lin T.Y., Data mining: granular computing approach, Lecture Notes in Computer Science 1574 (1999), 24–33.

21. Chiaselotti, Ciucci, Gentile et al., The granular partition lattice of an information table, Information Sciences 373 (2016), 57–78.

22. and Li, Granular computing approach to two-way learning based on formal concept analysis in fuzzy datasets, IEEE Trans. on Cybernetics 46(2) (2016), 366–379.

23. Chen, Zhong and Yao, A hypergraph model of granular computing, in: Proc. IEEE International Conference on Granular Computing, 2008, pp. 130–135.

24. Bisi, Chiaselotti, Ciucci et al., Micro and macro models of granular computing induced by the indiscernibility relation, Information Sciences 388–389 (2017), 247–273.

25. Leung and Mi, Granular computing and knowledge reduction in formal contexts, IEEE Trans. on Knowledge and Data Engineering 21(10) (2009), 1461–1474.

26. Yao, Interpreting concept learning in cognitive informatics and granular computing, IEEE Trans. on Systems, Man, and Cybernetics 39(4) (2009), 855–866.

27. Fang, Zhang and Zhao, A line flow granular computing approach for economic dispatch with line constraints, IEEE Trans. on Power Systems, 2017, DOI: 10.1109/TPWRS.2017.2665583.

28. Kok V.J. and Chan C.S., GrCS: Granular computing-based crowd segmentation, IEEE Trans. on Cybernetics, 2016, DOI: 10.1109/TCYB.2016.2538765.

29. Roselin and Thangavel, Mammogram image segmentation using granular computing based on rough entropy, in: Int. Conf. on Pattern Recognition, Informatics and Medical Engineering (PRIME), 21–23 March, 2012, pp. 318–323.

30. Yao and Zhong, Potential applications of granular computing in knowledge discovery and data mining, in: Proc. of World Multiconference on Systemics, Cybernetics and Informatics, 1999, pp. 573–580.

31. Yao, Modeling data mining with granular computing, in: Proc. COMPSAC 2001 Conf., Chicago, IL, 2001, pp. 638–643.

32. Qiu, Chen, Liu and Huang, Granular computing approach to finding association rules in relational database, International Journal of Intelligent Systems (25) (2010), 165–179.

33. Wang, Jiang, Wang et al., Algorithm of mining association rules based on rough sets and transaction itemsets combination, Computer Science 38(11) (2011), 234–239.

34. Zhang and Yan, Association rules mining algorithm based on granular computing, Computer Engineering 35(8) (2009), 86–88.

35. Liu, Zheng et al., Decision rule acquisition algorithm based on association-characteristic information granular computing, in: IEEE International Conference on Granular Computing, San Jose, CA, USA, 14–16 Aug, 2010, pp. 812–816.

36. Fang and Wu, Frequent spatiotemporal association patterns mining based on granular computing, Informatica 37(4) (2013), 443–453.

37. and He, An improved Apriori algorithm based on transactional granule, Journal of Luoyang Normal University 34(8) (2015), 72–74.

38. Tsai L.M., Lin S.J. and Yang D.L., Efficient mining of generalized negative association rules, in: IEEE International Conference on Granular Computing, Washington DC, 2010, pp. 471–476.

39. Fang, Wang, Ying and Tang, Frequent closed itemsets mining based on granular computing, Computer Engineering and Applications 50(20) (2014), 130–134.

40. Tan and Steinbach, Introduction to data mining, Posts & Telecom Press, Beijing, 2011.

41. Fang and Wu, A general framework based on composite granules for mining association rules, International Journal on Artificial Intelligence Tools 23(5) (2014), DOI: 10.1142/S0218213014500092.
