An integrated approach for mining closed and generator high utility itemsets

Abstract

High Utility Itemset Mining (HUIM) plays an important role in extracting meaningful knowledge by finding profitable itemsets rather than relying on occurrence frequency alone. HUIM often produces a huge number of high utility itemsets (HUIs), which makes decision making difficult. An alternative is a condensed representation of HUIs that holds the utility itemsets without redundancy. Closed High Utility Itemsets (CHUIs) and High Utility Generators (HUGs) are lossless representations that are useful in recommendation systems. In this paper, we address how to extract closed and generator HUIs from a given dataset. For faster access, we propose a hash-based technique that stores each itemset with its frequency, utility value, closed flag, and generator flag. Experimental results show that it outperforms the compared approaches in time and space.

Keywords

HUIs, CHUIM, generator, hash, data mining

1. Introduction

Frequent Itemset Mining (FIM) [5] and Association Rule Mining (ARM) [1, 2, 15] play an important role in deriving hidden and important knowledge. Traditional FIM considers the occurrence of itemsets in a database rather than their importance. It has been extended to Weighted Itemset Mining and Weighted Association Rule Mining. However, such knowledge still suffers from the limitations of FIM: (i) the value of a pattern is not considered because of the frequency-based framework, and (ii) the relative importance of each item is not considered. As a result, profitable itemsets can be lost when their frequency is insufficient; such itemsets are called infrequent itemsets. Efforts to address these limitations led to the concept of High Utility Itemset Mining (HUIM) [2, 4, 18]. The goal of HUIM is to derive itemsets that have a high utility value, where the utility value is determined by a utility function. The utility function is defined as the product of the unit profit of an item and the quantity of that item in a transaction. HUIM is a more difficult task than FIM, because utility is neither monotonic nor anti-monotonic. Hence, HUIM requires heavy computation and produces a large number of utility itemsets. This motivated the adoption of condensed representations of itemsets in HUIM, which resulted in many techniques for extracting Maximal HUIs [21]. Further investigation of lossless condensed representations of utility itemsets resulted in the concepts of Closed HUIs and Generators [7]. Other kinds of knowledge have also been investigated, such as incremental mining of HUIs [24]. Consider the sample database in Table 1, where each item is recorded with the quantity purchased in a transaction. Table 2 shows the unit profit of each item.
For a utility threshold of 25, the high utility itemsets and their frequencies are shown in Table 3. It can be seen that the itemsets $<$ {a, c} $>$ and $<$ {a, c, e} $>$ have utility values $<$ 28, 31 $>$ , but the same frequency. An itemset such as {a, c}, whose frequency is the same as that of one of its supersets, is not closed and can be ignored, because its details can be derived from the superset.

Table 1
Sample quantitative database

TID	Itemsets with quantity
T0	(a, 1), (b, 5), (c, 1), (d, 3), (e, 1)
T1	(b, 4), (c, 3), (d, 3), (e, 1)
T2	(a, 1), (c, 1), (d, 1)
T3	(a, 2), (c, 6), (e, 2)
T4	(b, 2), (c, 6), (e, 2)

Table 2

Utility table

Item	Unit profit (Rs)
a	5
b	2
c	1
d	2
e	3

In this paper, we address the importance of Closed HUIs and High Utility Generators, and approaches for deriving them from quantitative databases. The main contributions of this paper are:

• A hash-based data structure for keeping the High Utility Itemsets produced by HUIM.

• Efficient approaches for deriving CHUIs and HUGs from the output of HUIM.

• A performance evaluation of the proposed approach.

The rest of the paper is organized as follows. Related work presents the highlights and limitations of the techniques related to HUIM, CHUI, and generators. The basic definitions related to the problem statement are presented next. We then describe the proposed method for CHUIs and HUGs. To examine the performance of the proposed approach, we present both theoretical results and an experimental analysis.

2. Related work

Several approaches have been proposed for HUIM. They can be classified as (i) Apriori-based approaches, (ii) tree-based approaches, (iii) projection-based approaches, and (iv) hybrid approaches.

(i) Apriori based approaches

Liu et al. [13] proposed the Two-Phase algorithm, where all the candidate high utility itemsets are explored in the first phase, and the utility of each candidate is calculated by scanning the database in the second phase. One of its main contributions is the Transaction Weighted Downward Closure (TWDC) property, a counterpart of the Apriori property, which reduces the search space by using the transaction-weighted utility as an upper bound. To exploit the downward closure property, Yao and Hamilton [22] proposed the UMining and UMining_H algorithms to discover high utility itemsets.

(ii) Tree based approaches [14]

To address the limitations of level-wise candidate-generation approaches to HUIM, tree-based and pattern-growth approaches were introduced. Tseng et al. [11, 20] introduced a more compressed tree, the UP-tree, that represents all the itemsets in a structured format, and proposed the UP-Growth algorithm to mine high utility itemsets using the UP-tree. To overcome the difficulties of UP-Growth [20], minimal node utility values were introduced in each path of the UP-tree; the resulting algorithm is named UP-Growth $+$ . UP-Growth $+$ was reported to outperform the earlier approaches in terms of both space and execution time.

(iii) Projection based approaches

To overcome a limitation of tree-based approaches, namely the recursive traversal of the same tree to derive all the high utility itemsets, Lan et al. [9, 10, 19] proposed the PBAU (Projection-Based Average Utility) approach to mine high average-utility itemsets. It uses a measure called average utility.

(iv) Hybrid approaches [12]

The above approaches derive HUIs either in K passes or in two passes. In real-time applications, the running algorithm has to keep up with the application data. This motivated researchers to propose single-phase algorithms. One such algorithm is HUI-Miner [12]. It uses a utility-list structure to keep the item occurrence information and the remaining utility information. Fournier-Viger et al. [6] proposed the FHM algorithm, which uses an estimated utility co-occurrence structure (EUCS) to avoid join operations. To discover HUIs more efficiently, the EFIM algorithm was proposed [8]. To reduce the cost of database scans, EFIM uses High-utility Database Projection (HDP) and High-utility Transaction Merging (HTM). It is reported to outperform other approaches by reducing execution time and space to linear.

3. High utility itemset mining

For a given data set $\textit{QDB}={\{}T_{1},T_{2},\ldots,T_{n}{\}}$ , each transaction is represented as a set of items associated with quantity or unit values, $T_{i}={\{}(i_{1},q_{1}),(i_{2},q_{2}),\ldots,(i_{m},q_{m}){\}}$ , and each item is also associated with an external utility (profit/quality). The utility of item $i$ is calculated as the product of its internal quantity and its external utility. The utility of itemset $X$ is calculated as the sum of the utility values of the items of $X$ . For a given minimum utility threshold value $\delta$ , the aim is to find the itemsets whose utility value is greater than or equal to $\delta$ ; this task is named High Utility Itemset Mining (HUIM). A further aim is to find HUIs without redundancy, namely closed itemsets and generators. The definitions and notations for the basic terminology are presented as follows.

Definition 1: The Internal Utility of an item $i$ is denoted as $IU(i,T_{p})$ , and defined as the quantity associated with item $i$ in a transaction $T_{p}$ . For example, item ${\{}a{\}}$ in $T_{0}$ of Table 1 is associated with quantity 1, hence ${IU}(a,T_{0})=1$ .

Definition 2: The External Utility of an item is denoted as $EU(i)$ , and defined as the unit profit associated with each item in the utility table shown in Table 2. For example, item $a$ is associated with unit profit 5, hence $EU(a)$ is 5.

Definition 3: The utility of an item $i$ in transaction $T_{p}$ is denoted as $UI(i,T_{p})$ , and defined as the product of $IU(i,T_{p})$ and $EU(i)$ . In other words, it is the product of the internal and external utility values of item $i$ in $T_{p}$ . For example, for item $a$ in $T_{0}$ , $UI(a,T_{0})=IU(a,T_{0})\times EU(a)=1\times 5=5$ .

Definition 4: The utility of an itemset $X$ in a transaction $T_{p}$ is denoted as $UT(X,T_{p})$ , and defined as the sum of the utility values of all the items of $X$ in transaction $T_{p}$ . It uses Definition 3 to find the utility of each item in $T_{p}$ . For example, for itemset $<$ ab $>$ in $T_{0}$ , $UT(<$ ab $>$ , $T_{0})=UI(a,T_{0})+UI(b,T_{0})=5+10=15$ .

Definition 5: The utility of $X$ in database is denoted as $UT(X,\textit{QDB})$ , defined as the sum of the utility values of the itemset $X$ in each transaction.

$\displaystyle UT(X,\textit{QDB})=\sum_{T_{p}\in\textit{QDB}\,\wedge\,X\subseteq T_{p}}UT(X,T_{p})$ (1)

For example, the utility of item ${\{}a{\}}$ in QDB is $UT({\{}a{\}},\textit{QDB})=UT(a,T_{0})+UT(a,T_{2})+UT(a,T_{3})=5+5+10=20$ . Definition 5 uses Definition 4 to find the utility of the itemset in each $T_{p}$ .
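Definitions 1 to 5 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the quantities are copied from Table 1 and the profits from Table 2, and transaction T2 is assumed to contain item c (the value that is consistent with the utility 28 listed for {a, c}).

```python
# Quantities from Table 1; the second item of T2 is ASSUMED to be "c".
QDB = {
    "T0": {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1},
    "T1": {"b": 4, "c": 3, "d": 3, "e": 1},
    "T2": {"a": 1, "c": 1, "d": 1},
    "T3": {"a": 2, "c": 6, "e": 2},
    "T4": {"b": 2, "c": 6, "e": 2},
}
# External utility (unit profit) of each item, from Table 2.
EU = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}


def item_utility(item, tid):
    """Definition 3: internal utility (quantity) times external utility."""
    return QDB[tid][item] * EU[item]


def itemset_utility_in_transaction(itemset, tid):
    """Definition 4: sum of the item utilities of the itemset in one transaction."""
    return sum(item_utility(item, tid) for item in itemset)


def itemset_utility(itemset):
    """Definition 5: sum over the transactions that contain the whole itemset."""
    items = set(itemset)
    return sum(
        itemset_utility_in_transaction(items, tid)
        for tid, transaction in QDB.items()
        if items.issubset(transaction)
    )


print(itemset_utility_in_transaction({"a", "b"}, "T0"))  # 5 + 10
print(itemset_utility({"a"}))                            # 5 + 5 + 10
```

Only transactions that contain every item of the itemset contribute to the sum, which is the condition written under the summation sign in Eq. (1).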

HUIM is defined as the process of discovering the itemsets whose utility is not less than the minutil given by the user. The HUIM process derives all High Utility Itemsets with the help of Definition 5, Definition 4 and the preceding definitions. It needs a large search space to maintain all possible itemsets, because utility lacks the downward closure property.

To reduce the memory space, the following Definition 6 is used to find an upper bound on the utility of an item in the database QDB.

Definition 6: The utility of a transaction $T_{i}$ is denoted as $UT(T_{i},\textit{QDB})$ , and defined as the sum of the utilities of all the items present in $T_{i}$ . In other words, it can be defined as follows

$\displaystyle UT(T_{i},\textit{QDB})=\sum_{j=1}^{m}UT(i_{j},T_{i})$ (2)

For example, $UT(T_{0},\textit{QDB})=UT(a,T_{0})+UT(b,T_{0})+UT(c,T_{0})+UT(d,T_{0})+UT(e,T_{0})$ is $1\times 5+5\times 2+1\times 1+3\times 2+1\times 3=25$ .
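The transaction utility of Definition 6 can be checked with a short sketch. The function name `transaction_utility` is hypothetical; the two transactions are copied from rows T0 and T3 of Table 1 and the profits from Table 2.

```python
# Unit profits from Table 2.
EU = {"a": 5, "b": 2, "c": 1, "d": 2, "e": 3}

# Two transactions copied from Table 1 (item -> purchased quantity).
T0 = {"a": 1, "b": 5, "c": 1, "d": 3, "e": 1}
T3 = {"a": 2, "c": 6, "e": 2}


def transaction_utility(transaction):
    """Definition 6: sum of quantity times unit profit over all items of the transaction."""
    return sum(quantity * EU[item] for item, quantity in transaction.items())


print(transaction_utility(T0))  # 1*5 + 5*2 + 1*1 + 3*2 + 1*3
print(transaction_utility(T3))  # 2*5 + 6*1 + 2*3
```

Because every itemset contained in a transaction is built from that transaction's items, its utility in that transaction cannot exceed the transaction utility, which is why this quantity serves as an upper bound.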

Definition 7: The utility of the database QDB is denoted as $UT(\textit{QDB})$ , and defined as the sum of the utilities of all the transactions in the database. In other words, it can be defined as

$\displaystyle UT(\textit{QDB})=\sum_{T\in\textit{QDB}}UT(T,\textit{QDB})$ (3)

For example, $UT(\textit{QDB})=UT(T_{0},\textit{QDB})$ (Eq. (2)) $+$ $UT(T_{1},\textit{QDB})+UT(T_{2},\textit{QDB})+UT(T_{3},\textit{QDB})+UT(T_{4},\textit{QDB})=25+32+8+22+16=103$ . Computing the utility of QDB invokes Eq. (2) $|\textit{QDB}|$ times, once for each transaction in QDB.

Definition 8: An itemset $X$ is called a high utility itemset iff it satisfies the following condition

$\displaystyle\left\{\begin{array}[]{ll}UT(X,\textit{QDB})\geqslant\textit{minutil}\times UT(\textit{QDB})&\textit{High}\\ \textit{Otherwise}&\textit{Not high}\\ \end{array}\right.$ (4)

In other words, an itemset is a high utility itemset if its utility in the database is not less than minutil, as stated in Definition 8 and Eq. (4).
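The condition of Eq. (4) can be illustrated with a brute-force sketch that enumerates every itemset and keeps those whose utility reaches minutil times the database utility. The tiny database, the profit table and all function names below are hypothetical and chosen only to keep the example small; exhaustive enumeration is exponential in the number of items and is not the mining strategy of any algorithm discussed in this paper.

```python
from itertools import combinations

# Hypothetical unit profits and transactions (item -> quantity), for illustration only.
PROFIT = {"x": 4, "y": 1, "z": 2}
DB = [
    {"x": 1, "y": 2},
    {"x": 2, "z": 1},
    {"y": 1, "z": 3},
]


def utility(itemset, db=DB):
    """Utility of an itemset over the transactions that contain all of its items."""
    items = set(itemset)
    return sum(
        sum(t[i] * PROFIT[i] for i in items)
        for t in db
        if items.issubset(t)
    )


def database_utility(db=DB):
    """Sum of the utilities of all transactions (Definition 7)."""
    return sum(q * PROFIT[i] for t in db for i, q in t.items())


def high_utility_itemsets(minutil_ratio, db=DB):
    """Return {itemset: utility} for itemsets satisfying Eq. (4)."""
    threshold = minutil_ratio * database_utility(db)
    result = {}
    items = sorted(PROFIT)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            u = utility(combo, db)
            if u >= threshold:
                result[frozenset(combo)] = u
    return result


huis = high_utility_itemsets(0.3)
for itemset, u in huis.items():
    print(sorted(itemset), u)
```

In this toy input {y} is not a high utility itemset while its superset {y, z} is, which shows concretely why utility does not obey the downward closure property.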

For large databases, HUIM produces a huge number of itemsets. One of the goals of data mining is that the result should not contain redundancies. To tackle this issue, Definitions 9 and 10 are given below.

Definition 9: A high utility itemset $X$ , that is, one with $UT(X,\textit{QDB})\geqslant\delta$ , is a Closed High Utility Itemset, denoted as CHUI, if there is no superset $Y$ such that $Y\supset X$ and $Y$ has the same frequency as $X$ . For example, if the HUIs are {{bc}, {bce}} and their frequencies are {3, 3}, then {bc} is not closed: because {bc} $\subset$ {bce} and both have frequency 3, the details of {bc} can be derived from {bce}. Hence {bce} is closed.

Definition 10: A high utility itemset $X$ , that is, one with $UT(X,\textit{QDB})\geqslant\delta$ , is a High Utility Generator, denoted as HUG, if there is no subset $Y$ such that $Y\subset X$ and $Y$ has the same frequency as $X$ . For example, if the HUIs are {{bc}, {bce}} and their frequencies are {3, 3}, then {bce} is not a generator, because it has a subset with the same frequency, namely {bce} $\supset$ {bc}. Hence {bce} is not a generator but {bc} is a generator.
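The two definitions can be written as predicates over a frequency table. The sketch below is illustrative only: the function names are hypothetical, and, as in the example above, supersets and subsets are looked for only among the listed HUIs.

```python
def is_closed(x, frequency):
    """Definition 9: no proper superset among the HUIs has the same frequency."""
    return not any(x < y and frequency[y] == frequency[x] for y in frequency)


def is_generator(x, frequency):
    """Definition 10: no proper subset among the HUIs has the same frequency."""
    return not any(y < x and frequency[y] == frequency[x] for y in frequency)


# The example used in Definitions 9 and 10: {bc} and {bce}, both with frequency 3.
frequency = {frozenset("bc"): 3, frozenset("bce"): 3}

for itemset in frequency:
    print(sorted(itemset), is_closed(itemset, frequency), is_generator(itemset, frequency))
```

The operator `<` on `frozenset` tests for a proper subset, so an itemset is never compared with itself.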

3.1 Concise representation of HUIM

The huge number of HUIs motivated researchers to reduce the size of the result without introducing redundancy. This was the reason for Closed [3, 7, 10], Maximal [14] and Generator HUIM [3]. These works extract a concise representation during HUIM, at the cost of additional time for visiting the data structure. Mai et al. [8] proposed a post-pruning technique to derive closed high utility itemsets. A lattice is used to represent HUIs and closed HUIs, which takes additional time for constructing the lattice. In this paper, we assume that the high utility itemsets are known, and determine both closed itemsets and generators from a single closure check.

4. Approaches for lossless condensed representations of HUIM

In this section, we discuss the approaches for CHUIM and generators. We first discuss the naïve approach for CHUIM and HUGM together with its limitations, and then the proposed approach.

4.1 Naïve approach for Closed Utility Itemsets

In this section, we discuss the general or naïve approach for deriving closed and generator HUIs. This approach takes the High Utility Itemsets as input and discovers the closed and generator high utility itemsets. The procedure is presented in steps as follows.

Step 1:
For each utility itemset, using Definition 9, check whether there is any itemset that is a superset of it and has the same occurrence frequency.
Step 2:
If there is such a superset, the current itemset is not considered closed. If no such superset is found, the current itemset is considered closed.

Repeat Steps 1 and 2 for all the utility itemsets.

If there are $n$ high utility itemsets, each of them needs to be compared with the other $n-1$ itemsets. Hence the time complexity is $O(n^{2})$ .

4.2 Naïve approach for generators

Step 1:
For each utility itemset, using Definition 10, check whether there is any utility itemset that is a subset of it and has the same occurrence frequency.
Step 2:
If there is such a subset, the current itemset is not considered a generator. If no such subset is found, the current itemset is considered a generator.

Repeat Steps 1 and 2 for all the utility itemsets.

If there are $n$ high utility itemsets, each of them needs to be compared with the other $n-1$ itemsets. Hence the time complexity is $O(n^{2})$ .
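The two naïve procedures of Sections 4.1 and 4.2 can be sketched as two independent quadratic passes. The function names are hypothetical, and the input frequencies are copied exactly as they are listed in Table 3; this is a minimal sketch, not the implementation used in the experiments.

```python
# Itemset -> frequency, copied from Table 3.
TABLE3 = {
    frozenset("ac"): 3, frozenset("bc"): 3, frozenset("bd"): 2,
    frozenset("be"): 3, frozenset("ce"): 4, frozenset("ace"): 3,
    frozenset("bcd"): 2, frozenset("bce"): 3, frozenset("bde"): 2,
    frozenset("bcde"): 2, frozenset("abcde"): 1,
}


def naive_closed(frequency):
    """Section 4.1: keep x unless some proper superset has the same frequency."""
    closed = []
    for x, fx in frequency.items():
        has_equal_superset = False
        for y, fy in frequency.items():      # compare with every other itemset
            if x < y and fx == fy:
                has_equal_superset = True
                break
        if not has_equal_superset:
            closed.append(x)
    return closed


def naive_generators(frequency):
    """Section 4.2: keep x unless some proper subset has the same frequency."""
    generators = []
    for x, fx in frequency.items():
        has_equal_subset = False
        for y, fy in frequency.items():      # a second, separate quadratic pass
            if y < x and fx == fy:
                has_equal_subset = True
                break
        if not has_equal_subset:
            generators.append(x)
    return generators


print(["".join(sorted(x)) for x in naive_closed(TABLE3)])
print(["".join(sorted(x)) for x in naive_generators(TABLE3)])
```

Each function scans the whole list for every itemset, so running both costs two quadratic passes over the same input.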

It can be observed from the above two approaches that the total time doubles when both closed itemsets and generators are required, because each needs its own quadratic pass. This motivated us to propose the following approach for deriving both in less time.

The idea of the proposed approach is to combine the above two approaches so that both operations are performed using a single condition, that is, either a superset check or a subset check. In addition, we avoid some unnecessary computations during superset checking by comparing an itemset only with the itemsets that are one item shorter.

5. Proposed closed and generator algorithm

In this section, we discuss the Hash Table based Data Structure, and then the proposed CG-Algorithm.

Figure 1.

Hash Table Data Structure.

5.1 Hash Table Data Structure

Each entry in the hash table is associated with five fields: Index Id, Utility value, Frequency, Closed Flag, and Generator Flag. The values stored in an entry are the name of the itemset, its utility value, its frequency value, and the values $T$ (True) or $F$ (False) for the Closed and Generator Flags. The Hash Table Data Structure format is visualized in Fig. 1.
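One possible realization of such an entry in Python is sketched below. It is an assumption made for illustration, not the authors' code: the names `Entry` and `insert` are hypothetical, a built-in `dict` keyed by a `frozenset` plays the role of the hash table and of the Index Id, and both flags are assumed to start as True.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One hash table entry; the dictionary key serves as the Index Id (the itemset)."""
    utility: int
    frequency: int
    closed_flag: bool = True      # CF, assumed to start as T
    generator_flag: bool = True   # GF, assumed to start as T


def insert(table, itemset, utility, frequency):
    """Store an itemset with its utility and frequency; average O(1) for a dict."""
    key = frozenset(itemset)
    table[key] = Entry(utility, frequency)
    return table[key]


table = {}
insert(table, "bd", 30, 2)    # values of {b, d} taken from Table 3
insert(table, "bcd", 34, 2)   # values of {b, c, d} taken from Table 3

# Lookup does not depend on the order in which the items are written.
print(table[frozenset("db")])
```

Using `frozenset` as the key makes {b, d} and {d, b} the same entry, so no canonical item ordering has to be maintained.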

Algorithm 1: CG-Algorithm
Input: HUI – High Utility Itemsets.
Output: CHUI – Closed High Utility Itemsets, HUG – High Utility Generators.
Begin:
1. For each $\textit{UI}\in\textit{HUI}$ //all HUIs are arranged in the order of their length.
2. {
3.   Hash (UI) //UI is inserted into the hash table with its frequency and utility value.
4.   If ( $\|\textit{UI}\|==1$ ) then $\textit{UI.CF}=\textit{T}$ , $\textit{UI.GF}=\textit{T}$ . //two flags for closed and generator
5.   Else If (IsSclose (UI))
6.     $\textit{UI.CF}\rightarrow\textit{T}$ ; $\textit{UI.GF}\rightarrow\textit{F}$ ; //the CF of the matching subset is set to false in IsSclose.
7. }
8. For each $\textit{UI}\in\textit{HUI}$ {
9.   if ( $\textit{UI.CF}==\textit{T}$ ) $\textit{CHUI}\leftarrow\textit{CHUI}\cup\textit{UI}$
10.   if ( $\textit{UI.GF}==\textit{T}$ ) $\textit{HUG}\leftarrow\textit{HUG}\cup\textit{UI}$
11. }
End
IsSclose (HI) //to check whether HI is a superset of a stored itemset with the same frequency.
{
1. for each $\textit{UI}\in\textit{HUI}$ //consider UI only if its length is one less than that of HI.
   {
2.   if (( $\|\textit{UI}\|==\|\textit{HI}\|-1$ ) && ( $\textit{HI}\supset\textit{UI}$ ) && ( $\textit{UI.frequency}==\textit{HI.frequency}$ ))
     {
3.     $\textit{UI.CF}\rightarrow\textit{F}$ //False
4.     return true
     }
   }
   return false
}

5.2 Closed and Generator-Algorithm (CG-Algorithm)

Here, the CG-Algorithm is proposed to find Closed High Utility Itemsets and Generators in an integrated manner, such that a single closure check is sufficient for both. The input of CG-A is a list of High Utility Itemsets, and the output is the Closed HUIs and the Generators. The input HUIs are arranged in the order of their length.

As a first step, using Definitions 9 and 10, the algorithm checks whether the current itemset is a superset of an itemset already stored in the hash table. If it is a superset with the same frequency, its CF flag is set to T and its GF flag to F, and the flags of the subset itemset are set to {F, T}. The same procedure is repeated for the rest of the utility itemsets. The CG-Algorithm is presented in Algorithm 1. Lines 1 to 7 insert each utility itemset into the hash data structure and update the flags according to the superset closure check. Line 3 inserts the utility itemset into the hash table. Line 4 initializes the CF and GF flags of 1-length utility itemsets to T and T. Line 5 checks whether the current UI is a superset by calling IsSclose(), which follows Definition 9. If the current UI is found to be a superset of an entry of the hash table and their frequencies are the same, then its CF is set to T and its GF is set to F. Lines 8 to 11 collect the closed high utility itemsets, whose CF flag is T, and the generators, whose GF flag is T.
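A compact Python sketch of the integrated check is given below. It is an interpretation of Algorithm 1 rather than the authors' code, and it rests on three labeled assumptions: the names `Entry`, `is_sclose` and `cg_algorithm` are hypothetical; both flags default to T when an entry is inserted; and, unlike the pseudocode, which returns at the first match, the check continues over all stored itemsets of length K-1 so that every matching subset has its CF cleared. The input is Table 3 with utilities and frequencies exactly as listed there.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    utility: int
    frequency: int
    cf: bool = True   # closed flag, assumed to default to T
    gf: bool = True   # generator flag, assumed to default to T


def is_sclose(hi, table):
    """Clear CF of every stored (K-1)-subset of hi with equal frequency.

    Assumption: all matches are processed instead of returning at the first one.
    """
    found = False
    for ui, entry in table.items():
        if (len(ui) == len(hi) - 1 and ui < hi
                and entry.frequency == table[hi].frequency):
            entry.cf = False
            found = True
    return found


def cg_algorithm(huis):
    """huis maps itemset -> (utility, frequency); returns (CHUI, HUG)."""
    table = {}
    # Itemsets are processed in increasing order of length.
    for itemset, (utility, frequency) in sorted(huis.items(), key=lambda kv: len(kv[0])):
        table[itemset] = Entry(utility, frequency)
        if len(itemset) > 1 and is_sclose(itemset, table):
            table[itemset].gf = False   # a same-frequency subset exists
    chui = {x for x, e in table.items() if e.cf}
    hug = {x for x, e in table.items() if e.gf}
    return chui, hug


# Itemset -> (utility, frequency), copied from Table 3.
TABLE3 = {
    frozenset("ac"): (28, 3), frozenset("bc"): (28, 3), frozenset("bd"): (30, 2),
    frozenset("be"): (31, 3), frozenset("ce"): (27, 4), frozenset("ace"): (31, 3),
    frozenset("bcd"): (34, 2), frozenset("bce"): (37, 3), frozenset("bde"): (36, 2),
    frozenset("bcde"): (40, 2), frozenset("abcde"): (25, 1),
}

chui, hug = cg_algorithm(TABLE3)
print(sorted("".join(sorted(x)) for x in chui))
print(sorted("".join(sorted(x)) for x in hug))
```

A single pass sets both flags: a match clears the CF of the shorter itemset and the GF of the longer one, so no second scan for generators is needed. The restriction to subsets that are exactly one item shorter relies on the intermediate itemsets also being present in the input list.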

5.3 CG-Algorithm explanation for the input Table 3

Table 3
List of Utility Itemsets when threshold utility value is 25

Itemset	Utility	Frequency	Itemset	Utility	Frequency
{a, c}	28	3	{b, c, d}	34	2
{b, c}	28	3	{b, c, e}	37	3
{b, d}	30	2	{b, d, e}	36	2
{b, e}	31	3	{b, c, d, e}	40	2
{c, e}	27	4	{a, b, c, d, e}	25	1
{a, c, e}	31	3

Figure 2.

a. Hash Table Data Structure after input $<$ {ce} $>$ of Table 3. b. Hash Table Data Structure after input $<$ {ace} $>$ of Table 3. c. Hash Table Data Structure after input $<$ {bcd} $>$ of Table 3. d. Hash Table Data Structure, final output for Table 3.

Table 4

Standard datasets with characteristics

Dataset	No. of items	No. of transactions	AVG-len	Max-len	Type
Chess	75	3196	37	37	Dense
Foodmart	1559	4141	4.4	14	Sparse
Mushroom	119	8124	15	22	Dense
Retail	16,470	88,162	10.3	76	Sparse
Accidents	572	340,183	45	572	Dense
Chainstore	46,086	1,112,949	7.2	170	Sparse

Table 5

Number of HUI, CHUI, HUG w.r.t various minimum utility, and memory usage of Naive, LHUI and CG-A

Dataset	Minimum utility (%)	No. of HUI	No. of CHUI	No. of HUG	Memory usage (MB): Naïve	Memory usage (MB): LHUI	Memory usage (MB): CG-A
Chess	25	6406	3550	4074	106.53	106.81	85.40
Chess	28	493	339	358	50.39	49.43	40.56
Foodmart	0.07	637	605	22	58.12	59.0	48.96
Foodmart	0.04	20766	1762	4686	405.86	405.93	320.6
Mushroom	10	9594	119	1623	160.47	172	130.5
Mushroom	13	1152	6	317	66.15	66.29	52.56
Retail	0.01	22,479	21,935	21,959	418.25	434.56	320.5
Retail	0.04	2272	2266	2266	191.46	192.39	160.3
Accidents	10	7479	7479	7479	374.56	349.57	282.2
Accidents	13	189	189	189	305.73	305.61	250.9
Chainstore	0.005	12,347	12,275	12,289	1048.39	1058.98	726.4
Chainstore	0.03	593	593	593	671.68	671.67	540.56

The working of the proposed algorithm on the example of Table 3 is presented in Fig. 2. The inputs $<$ ac $>$ , $<$ bc $>$ , $<$ bd $>$ , $<$ be $>$ , and $<$ ce $>$ are not supersets of any itemset in the hash table. As a result, their flags CF and GF are set to T, as shown in Fig. 2a. Figure 2b shows the hash table after the input $<$ ace $>$ . The hash table contains $<$ ac $>\subset<$ ace $>$ , but their frequencies are not the same, so the check returns false and the flags of $<$ ace $>$ are set to {T, T}. Figure 2c shows the hash table after the input $<$ bcd $>$ . The UI $<$ bcd $>$ is found to be a superset, $<$ bcd $>\supset<$ bd $>$ , and their frequencies are the same. Hence $<$ bcd $>$ .GF $=$ F, and $<$ bd $>$ .CF $=$ F. Figure 2d shows the hash table for the rest of Table 3, together with the set of closed itemsets and generators. The UI $<$ bcde $>\supset<$ bce $>$ and their frequencies are the same, hence $<$ bcde $>$ .CF $=$ T, $<$ bcde $>$ .GF $=$ F, and $<$ bce $>$ .CF $=$ F.

6. Theoretical results

Inserting an itemset into the hash table takes O(1), so inserting N itemsets takes O(N). Let K be the length of an itemset, N be the number of itemsets in the hash table, and P be the number of (K-1)-length subsets of a K-itemset. For superset closure checking, the CG-Algorithm uses only the subsets of length K-1, and $P<$ (N-1). For N itemsets, the time complexity is therefore O(N.P). The CG-Algorithm does not run the superset closure check for 1-length itemsets. Let L be the number of 1-length itemsets in the hash table.

Therefore the time complexity is O(N $\times$ P - L), which is lower than O(N ${}^{2}$ ).
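The saving can be made concrete by counting comparisons for the eleven itemsets of Table 3 (five of length 2, four of length 3, one of length 4 and one of length 5). The sketch below is a rough illustration under a stated assumption: each check is taken to be a linear scan over the stored itemsets that are exactly one item shorter, and the function names are hypothetical.

```python
from collections import Counter

# Lengths of the itemsets listed in Table 3.
LENGTHS = [2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5]


def naive_comparisons(lengths):
    """One naive pass: every itemset is compared with the other n - 1 itemsets."""
    n = len(lengths)
    return n * (n - 1)


def cg_comparisons(lengths):
    """CG-style pass: a K-itemset is compared only with stored (K-1)-itemsets."""
    per_length = Counter(lengths)
    return sum(per_length[k - 1] for k in lengths)


print(naive_comparisons(LENGTHS))  # 11 * 10
print(cg_comparisons(LENGTHS))     # 4 * 5 + 1 * 4 + 1 * 1
```

The naive count covers a single pass, and closed itemsets and generators each need their own pass, whereas the restricted pass sets both flags at once.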

7. Experimental results

In this section, we evaluate the performance of the Naïve approach and the proposed CG-Algorithm in terms of running time and memory usage. We implemented CG-A in Python on a machine with a Core i5 processor. We carried out the performance analysis with various threshold values for the datasets shown in Table 5. We tested CHUI and HUG mining on the standard datasets, and the results are presented in Table 5.

Chess dataset [16]: The Chess dataset contains 3196 instances, of which 1669 describe positions where white can win and 1527 describe positions where white cannot win. Each instance is recorded with 37 attribute values on average, which describe the board position and the class label. The characteristics of the Chess dataset are listed in Table 4. It can be seen that, at a minimum utility of 25%, the Chess dataset contains 6406 utility itemsets, whereas the closed itemsets and generators number 3550 and 4074. The variation in the number of HUIs, CHUIs, and HUGs for the other datasets is shown in Table 5. LHUI [9] maintains the high utility itemsets in a lattice, and its memory usage is about the same as that of Naïve. The integrated CG-A approach maintains all HUIs in the hash table, where each is associated with flags that determine whether it is closed or a generator. It reduces the search space when itemsets exhibit the properties of closed itemsets and generators. Hence it takes less space for both closed itemsets and generators. The runtime comparison of CG-A with the other approaches is shown in Figs 3 and 4.

Figure 3.

Runtime comparison of Naive, LHUI and CG-A on chess.

Figure 4.

Runtime comparison of Naive, LHUI and CG-A on chain store.

FoodMart [16]: Foodmart contains 4141 instances, where each instance describes a customer transaction from a retail store. Each instance is recorded with 4.4 out of 1559 attributes on average. Hence it is regarded as a sparse database. The HUIM results for this dataset of Table 4 are shown in Table 5. It can be seen that memory usage grows as the utility threshold decreases, and that CG-A uses less memory.

Mushroom [16]: Mushroom contains 8124 instances, where each instance is a description of a hypothetical sample corresponding to one of 23 species of gilled mushrooms in the Agaricus and Lepiota family. It includes 4208 edible and 3916 poisonous mushrooms. Each instance is described with 15 attributes on average out of 119, with maximum length 25. Hence it is regarded as dense. The HUIM results for this dataset of Table 4 are shown in Table 5. It can be seen that memory usage grows sharply as the utility threshold decreases, and that CG-A uses less memory.

Retail [16]: This dataset contains customer transactions from a Belgian retail supermarket. It is recorded with 88,162 transactions, with an average length of 10.3 and a maximum length of 76. The number of products is 16,470. Hence it is regarded as a sparse database. Since the number of products is large, it can result in a huge memory usage. As can be seen in Table 5, memory usage changes with the utility threshold, and the CG-A algorithm outperforms the other approaches. The reason is the integration of closed itemsets and generators in the hash table.

Accidents [16]: The Accidents dataset contains traffic accident data donated by Karolien Geurts, obtained from Belgian traffic accident records. It contains information on the different circumstances in which accidents have occurred. It is recorded with 340,183 accidents as transactions and considers 572 attributes, of which 45 on average are recorded for each accident. From Table 4, the Accidents dataset can be regarded as dense. Because of this density, the techniques can cause a huge memory usage, which can be seen in Table 5.

Chainstore [16]: The Chainstore dataset describes the customer transactions of a retail store and was obtained from NU-MineBench. It is recorded with 1,112,949 transactions over 46,086 products, with an average length of 7.2. Hence it is regarded as a sparse database. Figure 4 shows the running time of the proposed approach and of the other approaches as the minimum utility threshold increases. From Fig. 4, it can be seen that CG-A outperforms the other approaches.

7.1 Runtime comparison of Naïve and CG-Algorithm for mining CHUIs and HUGs

We tested Naïve and the CG-Algorithm on various datasets with various threshold values for mining closed and generator high utility itemsets. The results in Figs 3 and 4 show that the CG-Algorithm takes less time than LHUI and than the Naïve approach, which takes O (2 N ${}^{2}$ ), because of the integration of closed itemsets and generators in the hash table.

8. Conclusion

The proposed CG-Algorithm mines CHUIs and HUGs using a utility confidence framework and flags that determine whether an itemset is closed or a generator. The integration mechanism derives closed itemsets and generators from the same closure check. A hash-based data structure is used to maintain the high utility itemsets. The results show that it takes less time and less memory for deriving both kinds of itemsets on the standard datasets. It is essentially a post-pruning technique applied to HUIs. It can be further improved by deriving closed itemsets and generators while the HUIs are being determined.

References

1. Agrawal and Srikant, Fast algorithms for mining association rules in large databases, in: Proceedings of the International Conference on Very Large Data Bases, Santiago de Chile, Chile, 1994, pp. 487–499.

2. Kalyani, Chandra Sekhar Rao M.V.P., Privacy-Preserving Association Rule Mining Using Binary TLBO for Data Sharing in Retail Business Collaboration, in: Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications, Advances in Intelligent Systems and Computing, 2017, p. 515. doi: 10.1007/978-981-10-3153-3_19.

3. Dam T.-L., Fournier-Viger and Duong, CLS-Miner: Efficient and Effective Closed High utility Itemset Mining, Frontiers of Computer Science, Springer, 2018. doi: 10.1007/s11704-016-6245-4.

4. Erwin, Gopalan R.P. and Achuthan N.R., CTU-mine: An efficient high utility itemset mining algorithm using the pattern growth approach, in: Proceedings of the IEEE International Conference on Computer and Information Technology, Fukushima, Japan, 2007, pp. 71–76.

5. Fournier-Viger, Lin J.C.-W., Chi T.T., Zhang and Le H.B., A survey of itemset mining, WIREs: Data Mining and Knowledge Discovery 7(4) (2017).

6. Fournier-Viger, C.W., Zida and Tseng V.S., FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning, in: Proceedings of the International Symposium on Methodologies for Intelligent Systems, Roskilde, Denmark, 2014, pp. 83–92.

7. Fournier-Viger, C.W. and Tseng V.S., Novel Concise Representations of High Utility Itemsets using Generator Patterns, in: Proc. 10th Intern. Conf. on Advanced Data Mining and Applications, Springer, 2014, pp. 30–43.

8. Fournier-Viger, Zida, Lin C.W., C.-W. and Tseng V.S., EFIM-Closed: Fast and Memory Efficient Discovery of Closed High-Utility Itemsets, in: Proc. 12th Intern. Conf. on Machine Learning and Data Mining, Springer, 2016, pp. 199–213.

9. Lan G.C., Hong T.P. and Tseng V.S., A projection-based approach for discovering high average-utility itemsets, Journal of Information Science and Engineering 28(1) (2012), 193–209.

10. Lan G.C., Hong T.P. and Tseng V.S., An efficient projection-based indexing approach for mining high utility itemsets, Knowledge and Information Systems 38(1) (2013), 85–107.

11. Lin C.W., Hong T.P. and Lu W.H., An effective tree structure for mining high utility itemsets, Expert Systems with Applications 38(6) (2011), 7419–7424.

12. Liu and Qu, Mining high utility itemsets without candidate generation, in: Proceedings of the ACM International Conference on Information and Knowledge Management, Maui, HI, 2012, pp. 55–64.

13. Liu, Liao W.K. and Choudhary, A two-phase algorithm for fast discovery of high utility itemsets, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hanoi, Vietnam, 2005, pp. 689–695.

14. Mai and Nguyen L.T.T., A lattice-based approach for mining high utility association rules, Information Sciences 399 (2017), 81–97.

15. Padhy, Mishra and Panigrahi, The survey of data mining applications and feature scope, International Journal of Computer Science, Engineering and Information Technology (IJCSEIT) 2(3) (June 2012), 43. doi: 10.5121/ijcseit.2012.2303.

16. Fournier-Viger, Gomariz, Gueniche, Soltani, C.-W. and Tseng, SPMF: A Java open-source pattern mining library, The Journal of Machine Learning Research 15(1) (2014), 3389–3393.

17. Mai and Nguyen L.T.T., An efficient approach for mining closed high utility itemsets and generators, 2017.

18. Tseng V.S., Shie B.E., C.W. and Yu P.S., Efficient algorithms for mining high utility itemsets from transactional databases, IEEE Transactions on Knowledge and Data Engineering 25(8) (2013), 1772–1786.

19. Tseng V.S., C.W., Fournier-Viger and Yu P.S., Efficient algorithms for mining top-k high utility itemsets, IEEE Transactions on Knowledge and Data Engineering 28(1) (2016), 54–67.

20. Tseng V.S., C.W., Shie B.E. and Yu P.S., UP-Growth: An efficient algorithm for high utility itemset mining, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, 2010, pp. 253–262.

21. C.W., Fournier-Viger, J.Y. and Tseng V.S., Mining Compact High Utility Itemsets Without Candidate Generation, 2019.

22. Yao and Hamilton H.J., Mining itemset utilities from transaction databases, Data & Knowledge Engineering 59(3) (2006), 603–626.

23. Yao, Hamilton H.J. and Butz C.J., A foundational approach to mining itemset utilities from databases, in: Proceedings of the SIAM International Conference on Data Mining, Orlando, FL, 2004, pp. 211–225.

24. Yun, Ryang, Lee and Fujita, An efficient algorithm for mining high utility patterns from incremental databases with one database scan, Knowledge-Based Systems 124 (2017), 188–206.
