Incrementally updating high utility quantitative itemsets mining algorithm

Abstract

High utility quantitative itemsets (HUQI) mining is a new research topic in the field of data mining. It provides not only high utility itemsets (HUI) but also quantitative information about the individual items in each itemset, so HUQI can give decision makers information about items and their purchase quantities. However, the currently proposed HUQI mining algorithms assume that the datasets are static. To solve this problem, an incremental quantitative utility list (IQUL) data structure is proposed to store item information, including the item name, the item number, and the transaction weighted utility (TWU) of the item. Each entry in the list stores the transaction identifier, the utility and the remaining utility in the original data, the utility and the remaining utility in the incremental data, and the sums of the utility and of the remaining utility. When data is inserted, the item information is updated. Based on IQUL, an incrementally updating HUQI (IHUQI) mining algorithm is proposed to mine HUQI on incrementally updated data. Extensive experiments on real datasets show that the IHUQI algorithm can effectively mine HUQI and that it performs better on sparse datasets.

Keywords

Incremental mining; high utility quantitative itemsets; high utility itemsets; utility list; itemsets mining

1 Introduction

Frequent itemsets mining (FIM) [1] is used to discover frequent itemsets in transaction databases. However, FIM only finds itemsets that users purchase often and does not consider the profits of the itemsets. To mine itemsets with higher profits, researchers proposed the high utility itemsets mining (HUIM) algorithm [2]. Utility mining discovers high utility itemsets (HUI): an internal utility and an external utility are set for each item, and HUI are mined from them. In addition, researchers have proposed a variety of extensions of HUI to obtain more accurate information, including closed HUI [3], average HUI [4], HUI with negative utilities [5], top-k HUI [6] and other mining algorithms.

Although these HUI can find interesting itemsets, they ignore the quantity attribute of the items in the HUI. HUI may not always be high income products, because the purchased quantity of the items also affects their income. The quantities of items can provide more useful and accurate information in many applications. Thus, a high utility quantitative association rules (HUQA) mining algorithm [7] was proposed. In the framework of HUQI, an item may have different quantities in the database, and each item carrying a different quantity is regarded as a quantitative item, denoted as q-item. An itemset composed of q-items is called a q-itemset. If the utility of a q-itemset is not lower than the user-specified minimum utility threshold, the q-itemset is called a HUQI. For example, a HUQIM algorithm can find the HUQI “yogurt: 3, bread: 2, jelly: 6”, indicating that buying 3 yogurts, 2 breads and 6 jellies generates high profits. A HUQIM algorithm can also find itemsets that include a range of quantities. For example, “Cheese: 3–6, Juice: 5–7” means that buying 3–6 pieces of cheese and 5–7 bottles of juice generates high profits. Because they carry quantity information, the itemsets discovered by HUQIM algorithms are more informative than the itemsets discovered by HUIM algorithms, which can help decision-makers make more accurate decisions. However, the currently proposed HUQIM algorithms are designed only for mining static datasets.

In real-world applications, however, data is generated incrementally as customers make new purchases, so the transactions in a dataset are not static. It is therefore desirable to develop an efficient incremental mining algorithm that updates the HUQI on dynamic datasets. No existing algorithm mines HUQI on incremental datasets, so this paper proposes one.

The main contributions of this article include 3 aspects:

A novel data structure named IQUL is proposed to store item information; it contains 7 parameters and can update the information of items when new data is inserted;

An IHUQI mining algorithm based on IQUL is proposed to mine HUQI in incremental datasets;

Extensive experiments were conducted on different datasets to evaluate the proposed IHUQI algorithm. The experimental results show that the IHUQI algorithm performs better on sparse datasets.

2 Related work

This section mainly introduces the research results of HUIM algorithms, from three perspectives: HUIM algorithms, incremental HUIM algorithms, and HUQIM algorithms.

2.1 High utility quantitative itemsets mining algorithms

HUIs can only provide interesting itemsets and ignore the quantity information of a single item. To consider both the relationship of items and the quantity of items, Li et al. [8] proposed the VHUQI (Vertical mining of HUQI) algorithm, which converts the horizontal dataset into a utility list structure to accelerate the mining process. The utility list stores the utility information of the q-itemsets in the dataset, and the k-support bound is used to prune the search space. Li et al. also proposed the HUQI-Miner algorithm [9]. It uses the utility list structure to calculate the utility of the q-itemsets directly in memory, and uses the transaction weighted utility and the ER pruning technique to reduce the number of candidate q-itemsets. Nouioua et al. [10] proposed FHUQI-Miner (Faster HUQIM), which introduces two new pruning strategies, EQCPS and RQCPS. These strategies can greatly reduce the number of join operations in the search process and improve the running time of the algorithm, although the algorithm is more effective on sparse datasets.

2.2 Incremental high utility itemsets mining algorithms

With the advent of the era of big data, new data is generated all the time. A static approach has to rescan the dataset and perform the mining operation again, which is very time consuming. Therefore, efficient algorithms for processing incremental data are desired. Yun et al. [11] proposed the list based incremental high utility pattern (LIHUP) algorithm. It scans the dataset once to build utility lists that store the utility information, and it generates no candidates. Yun et al. [12] proposed the indexed list based incremental high utility pattern mining (IIHUM) algorithm, which also requires only one scan of the dataset. Kim et al. [13] proposed incremental mining of high average utility itemsets (IMHAUI) based on a tree structure. Dam et al. [14] proposed IncCHUI, the first approach to mine CHUIs from incremental datasets. Wang et al. [15] proposed the IncUSP-Miner+ algorithm to mine high utility sequence patterns (HUSPs) incrementally. This algorithm uses a tighter upper bound of the utility and a candidate pattern tree to reduce redundant computations. All of these algorithms are designed for HUI related problems and do not provide information about the quantities of items.

3 Problem definition

This section mainly introduces the basic concepts of HUQIM, and the combination method of generating Rq—itemsets.

3.1 Preliminary knowledge

Let I = {i₁, i₂, . . . , i_m} be a set of items and D = {T₁, T₂, . . . , T_n} be a quantitative transaction dataset, where the items in each transaction T_q form a subset of I. An itemset X is a subset of I. The quantity of item i_p in the transaction T_q, called its internal utility, is denoted by n (i_p, T_q). The external utility s (i_p) is the unit value of the item i_p (for example, its profit). The utility of the item i_p in the transaction T_q, denoted by u (i_p, T_q), is defined as the product of the internal utility and the external utility: u (i_p, T_q) = n (i_p, T_q) × s (i_p).

Table 1 gives an example of a quantitative transaction database D. T₁ to T₄ represent the original dataset D₁, T₅ and T₆ represent the incremental dataset D₂, T₇ and T₈ represent the incremental dataset D₃. In addition, Table 2 shows the external utility of D. Taking item a as an example, in Table 2, it can be seen that the profit of item a is 5. The number of items a in T₁ is 2, and the utility of item a in T₁ is 10.
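The item utility defined above can be checked with a few lines of Python. This is only a minimal sketch: the dictionary layout and the function name `item_utility` are illustrative assumptions, while the numbers are copied from Table 1 and Table 2.

```python
# External utility (unit profit) of each item, copied from Table 2.
EXTERNAL_UTILITY = {"a": 5, "b": 7, "c": 3, "d": 2, "e": 1, "f": 3}

# Transactions of Table 1: item -> purchased quantity (internal utility).
TRANSACTIONS = {
    "T1": {"a": 2, "b": 4, "e": 6, "f": 4},
    "T2": {"c": 2, "e": 3},
    "T3": {"a": 1, "b": 3, "c": 1, "d": 2, "e": 1},
    "T4": {"b": 4, "c": 2, "e": 7},
    "T5": {"a": 2, "b": 1},
    "T6": {"b": 2, "d": 1, "f": 3},
    "T7": {"c": 1, "d": 2, "e": 1},
    "T8": {"a": 3, "b": 2, "c": 3, "d": 1, "e": 2},
}


def item_utility(item, tid):
    """u(i_p, T_q) = n(i_p, T_q) * s(i_p); returns 0 if the item is absent."""
    quantity = TRANSACTIONS[tid].get(item, 0)
    return quantity * EXTERNAL_UTILITY[item]


print(item_utility("a", "T1"))  # 10, as in the example for item a
```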

Table 1
Sample dataset

Dataset	TID	Transaction	Transaction utility
D₁	T1	(a, 2) (b, 4) (e, 6) (f, 4)	56
	T2	(c, 2) (e, 3)	9
	T3	(a, 1) (b, 3) (c,1) (d, 2) (e, 1)	34
	T4	(b, 4) (c,2) (e, 7)	41
D₂	T5	(a, 2) (b, 1)	17
	T6	(b, 2) (d, 1) (f, 3)	25
D₃	T7	(c, 1) (d, 2) (e, 1)	8
	T8	(a, 3) (b, 2) (c, 3) (d, 1) (e, 2)	42

Table 2

External utility values

Item	a	b	c	d	e	f
External utility values	5	7	3	2	1	3

Quantitative items include exact quantitative items and range quantitative items, collectively referred to as quantitative items, denoted as: q—item [7].

Definition 1. (Exact quantitative items) [7]. An exact quantitative item, denoted as Eq-item, is an item whose quantity is an exact number. An Eq-item x is defined as (i, q), where i ∈ I and q is the quantity of the item. For example, the transaction T₁ in Table 1 contains the Eq-items (a, 2), (b, 4), (e, 6) and (f, 4).

Definition 2. (Range quantitative items) [7]. A range quantitative item, denoted as Rq-item, is an item whose quantity is a range. An Rq-item x is defined as a triplet (i, l, u), where i ∈ I, l is the minimum quantity of the item i, and u is the maximum quantity of the item i. For example, (a, 1, 2) is an Rq-item of the item a in D₁ in Table 1.

Definition 3. (Quantitative itemset) [7]. A quantitative itemset is denoted as q-itemset X. A k-q-itemset contains k different q-items. If X only contains Eq-items, then X is an exact quantitative itemset, denoted as Eq-itemset; if the number of Rq-items in the itemset is greater than or equal to 1, then it is a range quantitative itemset, denoted as Rq-itemset. For example, [(a, 2) (b, 4) (e, 6) (f, 4)] is a 4-Eq-itemset and [(a, 1, 2), (b, 4)] is a 2-Rq-itemset.

Definition 4. (The utility of q-item in transaction) [7]. The utility of an Eq-item x = (i, q) in the transaction T_d is defined as u (x, T_d) = s (i) × q, where s (i) is the external utility of the item i. The utility of an Rq-item x in a transaction T_d is the sum of the utilities of all Eq-items contained in x.

For example, u ((a, 2) , T₁) =2 × 5 =10, u ((a, 1, 2) , T₃) = u ((a, 1) , T₃) + u ((a, 2) , T₃) =5 + 0 =5.

Definition 5. (The utility of q-itemset in transaction/dataset) [7]. Given a q-itemset X = [x₁, x₂, . . . , x_k], the utility of X in the transaction T_d is $u (X, T_{d}) = \sum_{j = 1}^{k} u (x_{j}, T_{d})$ . The utility of X in the dataset D is defined as $u (X) = \sum_{T_{d} \in D} u (X, T_{d})$ .

For example:

u ([(a, 1, 2) (b, 3, 4)] , T₃) = u ((a, 1, 2) , T₃) + u ((b, 3, 4) , T₃) =5 + 21 = 26.
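The two examples for Definitions 4 and 5 can be reproduced with the sketch below. The tuple encoding of q-items and the function names are illustrative assumptions. The sketch also assumes, as is usual in HUQI mining but not stated explicitly in Definition 5, that a q-itemset contributes utility in a transaction only when every one of its q-items occurs there.

```python
# Unit profits from Table 2 and transaction T3 from Table 1.
EXTERNAL_UTILITY = {"a": 5, "b": 7, "c": 3, "d": 2, "e": 1, "f": 3}
T1 = {"a": 2, "b": 4, "e": 6, "f": 4}
T3 = {"a": 1, "b": 3, "c": 1, "d": 2, "e": 1}


def matches(q_item, transaction):
    """True if the transaction holds the item with a quantity inside the q-item.

    An Eq-item is encoded as (item, q) and an Rq-item as (item, l, u).
    """
    item, low, high = q_item[0], q_item[1], q_item[-1]
    quantity = transaction.get(item)
    return quantity is not None and low <= quantity <= high


def q_item_utility(q_item, transaction):
    """Definition 4: utility of an Eq-item or Rq-item in one transaction."""
    if not matches(q_item, transaction):
        return 0
    return transaction[q_item[0]] * EXTERNAL_UTILITY[q_item[0]]


def q_itemset_utility(q_itemset, transaction):
    """Definition 5: sum of member utilities (0 if a member is missing: assumption)."""
    if not all(matches(x, transaction) for x in q_itemset):
        return 0
    return sum(q_item_utility(x, transaction) for x in q_itemset)


print(q_item_utility(("a", 2), T1))                           # 10
print(q_item_utility(("a", 1, 2), T3))                        # 5
print(q_itemset_utility([("a", 1, 2), ("b", 3, 4)], T3))      # 26
```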

Definition 6. (Transaction utility) [7]. The utility of the transaction T_d is defined as $TU (T_{d}) = \sum_{i = 1}^{k} u (y_{i}, T_{d})$ , where y₁, y₂, . . . , y_k are the q-items contained in T_d.

For example:

TU (T₁) = u ((a, 2) , T₁) + u ((b, 4) , T₁) + u ((e, 6) , T₁) + u ((f, 4) , T₁) =10 + 28 + 6 +12 = 56.

Definition 7. (The utility of the dataset) [7]. The utility of the whole dataset, denoted as σ, is the sum of the utilities of all transactions in the original dataset and in the incremental datasets: $\sigma = \sum_{T_{d} \in D_{1}} TU (T_{d}) + \sum_{T_{d} \in D_{2}} TU (T_{d}) + \ldots + \sum_{T_{d} \in D_{n}} TU (T_{d})$ .

For example, the utility of the whole dataset shown in Table 1 is σ = TU (T₁)+ TU (T₂) + TU (T₃) + TU (T₄) + TU (T₅) + TU (T₆)+ TU (T₇) + TU (T₈) =56 + 9 +34 + 41 + 17 + 25 + 8 +42 = 232.
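As a quick check of Definitions 6 and 7, the following sketch recomputes the transaction utilities and σ for the sample data. The batch layout and function names are illustrative assumptions; the quantities and profits come from Table 1 and Table 2.

```python
# Unit profits from Table 2.
EXTERNAL_UTILITY = {"a": 5, "b": 7, "c": 3, "d": 2, "e": 1, "f": 3}

# Table 1, grouped by original dataset D1 and incremental datasets D2, D3.
BATCHES = {
    "D1": [
        {"a": 2, "b": 4, "e": 6, "f": 4},
        {"c": 2, "e": 3},
        {"a": 1, "b": 3, "c": 1, "d": 2, "e": 1},
        {"b": 4, "c": 2, "e": 7},
    ],
    "D2": [{"a": 2, "b": 1}, {"b": 2, "d": 1, "f": 3}],
    "D3": [{"c": 1, "d": 2, "e": 1}, {"a": 3, "b": 2, "c": 3, "d": 1, "e": 2}],
}


def transaction_utility(transaction):
    """TU(T_d): sum of quantity * unit profit over the items of T_d."""
    return sum(q * EXTERNAL_UTILITY[i] for i, q in transaction.items())


def dataset_utility(batches):
    """sigma: sum of TU over the original and all incremental batches."""
    return sum(transaction_utility(t) for batch in batches.values() for t in batch)


print([transaction_utility(t) for t in BATCHES["D1"]])  # [56, 9, 34, 41]
print(dataset_utility(BATCHES))                          # 232
```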

Definition 8. (Quantitative related coefficient) [7]. The width of the interval between the minimum and the maximum quantity of an Rq-item (i, l, u) is called qrc and is defined as (u - l + 1).

Definition 9. (Candidate quantitative itemset) [7]. Given a user-defined minimum utility threshold θ and a quantitative related coefficient qrc (qrc > 0), if the utility of a q-itemset X satisfies θ/qrc ≤ u (X) ≤ θ, then X is a candidate quantitative itemset.

Definition 10. (Combined restrictions) [7]. Consider two q-itemsets $X = [(x_{1}, l_{1}, u_{1}), (x_{2}, l_{2}, u_{2}), \ldots, (x_{k}, l_{k}, u_{k})]$ and $Y = [(y_{1}, l_{1}^{'}, u_{1}^{'}), (y_{2}, l_{2}^{'}, u_{2}^{'}), \ldots, (y_{k}, l_{k}^{'}, u_{k}^{'})]$ that have the same prefix and differ only in their last q-item. If $x_{k} = y_{k}$ , $l_{k}^{'} = u_{k} + 1$ and $u_{k}^{'} - l_{k} \leq qrc$ , then X and Y can be combined to form one Rq-itemset $Z = [(x_{1}, l_{1}, u_{1}), (x_{2}, l_{2}, u_{2}), \ldots, (x_{k}, l_{k}, u_{k}^{'})]$ .
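The combination restriction can be expressed as a small predicate. The sketch below is a literal reading of Definition 10 and not the authors' implementation; the function name `combine` and the convention of writing every q-item as a triple (item, l, u), with l = u for an Eq-item, are assumptions made for illustration.

```python
def combine(x_itemset, y_itemset, qrc):
    """Return the merged Rq-itemset Z of Definition 10, or None if not allowed.

    Every q-item is a triple (item, low, high); an Eq-item has low == high.
    """
    *x_prefix, (x_item, x_low, x_high) = x_itemset
    *y_prefix, (y_item, y_low, y_high) = y_itemset
    # Same prefix and same last item are required.
    if x_prefix != y_prefix or x_item != y_item:
        return None
    # The ranges must be adjacent (l'_k = u_k + 1) and satisfy u'_k - l_k <= qrc.
    if y_low != x_high + 1 or y_high - x_low > qrc:
        return None
    return x_prefix + [(x_item, x_low, y_high)]


X = [("a", 2, 2), ("b", 4, 4), ("c", 1, 1)]
Y = [("a", 2, 2), ("b", 4, 4), ("c", 2, 2)]
print(combine(X, Y, qrc=4))  # [('a', 2, 2), ('b', 4, 4), ('c', 1, 2)]
```

Non-adjacent ranges, such as (c, 1, 1) and (c, 3, 3), are rejected by the same check.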

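Definition 11. (Remaining utility) [8]. The remaining utility of a q-itemset X in the transaction T_d, denoted as Rutil (X, T_d), is the sum of the utilities of all q-items of T_d that follow X in the sort order ≺. It is defined as $Rutil (X, T_{d}) = \sum_{x \in T_{d} / X} u (x, T_{d})$ , where T_d/X denotes the q-items of T_d that come after all q-items of X according to ≺.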
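A minimal sketch of the remaining utility follows. Because the order ≺ is not fixed by the definition, the caller passes it explicitly; the order used in the example is a hypothetical one that follows the P* listing of the case analysis in Section 4.3 restricted to T₁, and the function name is illustrative.

```python
# Unit profits from Table 2.
EXTERNAL_UTILITY = {"a": 5, "b": 7, "c": 3, "d": 2, "e": 1, "f": 3}


def remaining_utility(q_itemset, transaction, order):
    """Rutil(X, T_d): utility of the q-items of T_d that follow X under `order`.

    transaction: list of Eq-items (item, quantity)
    order:       list of q-items giving the processing order (assumed input)
    """
    rank = {q_item: pos for pos, q_item in enumerate(order)}
    last = max(rank[x] for x in q_itemset)
    return sum(
        quantity * EXTERNAL_UTILITY[item]
        for (item, quantity) in transaction
        if rank[(item, quantity)] > last
    )


T1 = [("a", 2), ("b", 4), ("e", 6), ("f", 4)]
ORDER = [("b", 4), ("f", 4), ("a", 2), ("e", 6)]  # hypothetical order for T1

print(remaining_utility([("b", 4)], T1, ORDER))            # 28
print(remaining_utility([("b", 4), ("f", 4)], T1, ORDER))  # 16
```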

3.2 q-itemset combination method

This section introduces two combination methods, namely Combine_All [9] and Combine_Min [23] methods.

The Combine_All method combines candidate q-items, candidate q-itemsets and Rq-itemsets, and outputs all possible high utility Rq-itemsets. Suppose there are 6 candidate itemsets {[(a, 2), (b, 4), (d, 3)], [(a, 2), (b, 4), (c, 1)], [(a, 2), (b, 4), (c, 2)], [(a, 2), (b, 4), (c, 3)], [(a, 2), (b, 4), (c, 4)], [(a, 2), (b, 4), (c, 5)]} and qrc is set to 4. The process of combining these candidates with the Combine_All method is shown in Fig. 1.

Fig. 1

Combine process.

The Combine_Min method only outputs the high utility Rq-itemsets with the minimum interval. It uses the same traversal process as the Combine_All method, but after generating a range HUQI it immediately stops combining the current q-itemset with the remaining ones and moves directly to the next candidate q-itemset. In the example of Fig. 1, if the generated [(a, 2), (b, 4), (c, 1, 2)] is a HUQI, the Combine_Min method moves directly to the next q-itemset [(a, 2), (b, 4), (c, 2)] and combines it with [(a, 2), (b, 4), (c, 3)] to generate [(a, 2), (b, 4), (c, 2, 3)].

4 Research based on IHUQI algorithm

In this section, an incremental quantitative utility list IQUL and an incrementally updated HUQI mining algorithm IHUQI are proposed. The algorithm uses the utility information stored in IQUL to mine HUQI. This section first introduces the IQUL structure, and then introduces the IHUQI algorithm.

Definition 13. (High utility quantitative itemset) [8]. Given a user-defined minimum utility threshold θ and a q-itemset X, if the utility of X is not less than θ, then X is a HUQI. If the utility of X is less than θ, then X is a low utility quantitative itemset (LUQI).

Definition 14. (Incremental updating HUQIM). Let the database be D = D₁ ∪ D₂ ∪ ⋯ ∪ D_n, where D₁ is the original dataset and each D_i (i > 1) is a non-empty transaction set that updates it. Let minutil, D_i and qrc be the user-specified minimum utility threshold, the incremental data and the quantitative correlation coefficient, respectively. The problem of incremental updating HUQIM is then defined as finding the HUQI in D with the given minutil, D_i and qrc.

Algorithm description: The first step is to scan the database and calculate the TWU of each q-item. Because there is incremental data, a q-item that is not a HUQI in the original dataset may become a HUQI later when the incremental data is added, so an IQUL and a TQCS structure must be constructed for each q-item. The second step is to check whether each q-item belongs to the HUQI set H, the candidate itemsets C, or the explored itemsets E. Then a combination method is used to combine the candidate itemsets into the initial high utility Rq-items HR, and H, E and HR are merged into QIs. The third step is to call the mining algorithm to mine HUQIs recursively.

4.1 Incremental quantitative utility list structure IQUL

IHUQI uses IQUL to store the utility information of the original data and the incremental data. The item information stored by IQUL includes the q-item, the TWU of the q-item, IutilD, RutilD, IutilDP and RutilDP. IutilD and IutilDP represent the utility of the itemset in a transaction in the original dataset and in the incremental data, respectively. RutilD and RutilDP represent the remaining utility of the itemset in a transaction in the original dataset and in the incremental data, respectively. IQUL also stores the utility sums of the itemset, SumIutilD, SumRutilD, SumIutilDP and SumRutilDP, which respectively represent the sum of IutilD and the sum of RutilD in the original dataset, and the sum of IutilDP and the sum of RutilDP in the incremental data.

Taking the q-item (a, 2) in the dataset as an example, its IQUL is shown in Fig. 2. When the dataset D₁ is scanned, it is regarded as an incremental dataset, and only T₁ contains (a, 2) in D₁. The utility information of the q-item in the scanned dataset is stored in the IQUL. When the dataset D₂ is inserted, the utility information IutilDP and RutilDP of the dataset D₁ is passed to IutilD and RutilD and is treated as the original dataset, and then the dataset D₂ is scanned to store its utility information. When the dataset D₃ is inserted, the utility information of the dataset D₂ is added to the utility information of the original dataset, and then the new incremental utility information is added.
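The update behaviour described above can be sketched as a small Python class. This is an illustrative assumption about how an IQUL could be laid out, not the authors' data structure: the class and method names are hypothetical, and the remaining-utility values in the example depend on the order ≺, so they are illustrative (6 assumes that only (e, 6) follows (a, 2) in T₁, and 0 assumes that (b, 1) precedes (a, 2) in T₅).

```python
from dataclasses import dataclass, field


@dataclass
class IQUL:
    """Sketch of an incremental quantitative utility list for one q-item."""

    q_item: tuple
    twu: int = 0
    entries_d: list = field(default_factory=list)   # (tid, IutilD, RutilD)
    entries_dp: list = field(default_factory=list)  # (tid, IutilDP, RutilDP)

    def add(self, tid, iutil, rutil, transaction_utility):
        """Record one transaction of the batch currently being scanned."""
        self.entries_dp.append((tid, iutil, rutil))
        self.twu += transaction_utility

    def start_new_batch(self):
        """Before a new batch arrives, fold the incremental part into the original part."""
        self.entries_d.extend(self.entries_dp)
        self.entries_dp = []

    @property
    def sum_iutil_d(self):
        return sum(e[1] for e in self.entries_d)

    @property
    def sum_rutil_d(self):
        return sum(e[2] for e in self.entries_d)

    @property
    def sum_iutil_dp(self):
        return sum(e[1] for e in self.entries_dp)

    @property
    def sum_rutil_dp(self):
        return sum(e[2] for e in self.entries_dp)


ul = IQUL(("a", 2))
ul.add("T1", iutil=10, rutil=6, transaction_utility=56)  # scan of D1
ul.start_new_batch()                                     # D2 is about to be inserted
ul.add("T5", iutil=10, rutil=0, transaction_utility=17)  # scan of D2
print(ul.sum_iutil_d, ul.sum_iutil_dp, ul.twu)           # 10 10 73
```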

Fig. 2

IQUL of (a, 2).

4.2 Algorithm description

This section introduces the proposed IHUQI algorithm, which follows procedures similar to those of FHUQI-Miner and VHUQI. IHUQI takes 4 inputs: (1) the transaction dataset D, the incremental transaction dataset DP (internal utility) and the transaction profit dataset D_profit (external utility), (2) the minimum utility threshold θ, (3) the combining method (CM), and (4) the quantitative correlation coefficient (qrc). The output is the set of HUQIs. The specific process is shown in Algorithm 1.

Algorithm 1 IHUQI
Input: A set of databases D = {D₁, D₂, ... , D_k}, θ: The user-defined minimum utility threshold, CM: The combining method (Combine_All or Combine_Min), qrc: The quantitative related coefficient.
Output: The set of high utility quantitative itemsets.
① Scan the database D to calculate TWU of the different q-items.
② P⟶ Each q-item* such that in D
③ Second database scan to build the IQUL of each q-item ∈ P* and build the TQCS structure.
④ Check if the q-itemset is high / candidate / to be explored or to be directly pruned
⑤ If UL(x).SumIutilD + UL(x).SumIutilDP ≥ θ then x ⟶ H
⑥ If θ/qrc ≤ UL(x).SumIutilD + UL(x).SumIutilDP ≤ θ then x ⟶ C
⑦ If UL(x).SumIutilD + UL(x).SumIutilDP + UL(x).SumRutilD + UL(x).SumRutilDP ≥ θ then x ⟶ E
⑧ Combine the q-items to generate high utility range q-itemsets (HR) using CM and C
⑨ QIs ⟶ sort(H ∪E∪HR)
⑩ Mining Procedure (∅, QIs, ULs(QIs), P*, qrc, CM, θ);

The IHUQI algorithm is mainly divided into three steps:

Construct the IQUL of each q-item (lines 1–3). First, the transaction dataset D and the transaction profit dataset are scanned to calculate the initial TWU of each q-item, and the q-items are stored in P*. Then the second dataset scan is performed, in which the q-items in each transaction are reordered according to ≺ and the utility lists and the TQCS structure of the transaction dataset are created.

Find the initial Rq-items (lines 4–9). IHUQI first checks the utility of each q-item x. If u (x) ≥ θ, x is output as a HUQI and put into the set H, which contains the HUQIs. Otherwise, IHUQI performs two judgments: (1) If θ/qrc ≤ u (x) ≤ θ, x is put into the set C, which contains the candidate q-itemsets that can be combined to form high utility Rq-itemsets. (2) If u (x) + UL (x) . SumRutil ≥ θ, x is put into the set E, which contains all the q-itemsets that should be explored because one or more of their extensions may be high utility. If the set C is not empty, the CM is called to generate HR and UL (HR). HR is the set of high utility Rq-itemsets generated by combining the candidates in C, and UL (HR) is the corresponding IQUL of HR. Then IHUQI creates QIs, which is composed of H, E and HR. The items in QIs are sorted according to ≺.

IHUQI calls the mining algorithm (line 10) to mine HUQIs. The mining procedure is described in Algorithm 2.
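The classification in the second step can be sketched as follows. This is an illustrative reading of the text above, in which a q-item whose utility reaches θ goes to H and the C and E tests are applied only otherwise; the function name `classify` and the set-of-labels return value are assumptions. Here `sum_iutil` stands for SumIutilD + SumIutilDP and `sum_rutil` for SumRutilD + SumRutilDP.

```python
def classify(sum_iutil, sum_rutil, theta, qrc):
    """Return the sets ('H', 'C', 'E') a q-item belongs to; empty set means pruned."""
    labels = set()
    if sum_iutil >= theta:
        labels.add("H")            # high utility: output directly
        return labels
    if theta / qrc <= sum_iutil <= theta:
        labels.add("C")            # candidate for range combination
    if sum_iutil + sum_rutil >= theta:
        labels.add("E")            # extensions may still reach theta
    return labels


# Hypothetical utility values with theta = 40 and qrc = 3.
print(classify(40, 0, 40, 3))   # {'H'}
print(classify(21, 10, 40, 3))  # {'C'}
print(classify(5, 3, 40, 3))    # set()
```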

Recursive mining search will input 7 parameters: (1) prefix itemset P, (2) set of Qis, (3) IQUL of items in Qis UL (QIs), (4) set of itemset P*, (5) quantitative correlation coefficient (qrc), (6) Combination method CM, (7) Minimum utility threshold θ.

When the algorithm is called for the first time, the prefix P is ∅ and QIs contains the Eq-items and Rq-items identified in Algorithm 1. The mining algorithm operates as follows: for each extension Px of P, where x ∈ QIs, the algorithm traverses all the extensions Py of P, where y ∈ P* and x ≺ y, to explore the extension Pxy. This corresponds to lines 1–3 of Algorithm 2.

Algorithm 2 Mining Procedure
Input: P: the prefix q-itemset, QIs: the q-itemsets list, ULs(QIs): utility lists of the q-itemsets, P*: the list of q-itemsets in D, qrc: the quantitative related coefficient, CM: the combining method, θ: the pre-defined minimum utility threshold.
Output: The set of high utility quantitative itemsets;
① Foreach Px such that x ∈ QIs do
② QIs ← ∅; P* ← ∅
③ Foreach Py such that y ∈ P* and x ≺ y do
④ If Px is an Eq-itemset then
⑤ Check TQCS (x, y, c);
⑥ If (c == Null or c < θ/qrc) then
⑦ Go to next Py;
⑧ End
⑨ End
⑩ Else
⑪ $c \leftarrow \sum_{q = l}^{u} TQCS ((i, q), y)$ , where x = (i, l, u);
⑫ If c < θ/qrc then
⑬ Go to next Py;
⑭ End
⑮ End
⑯ UL(Pxy) = Constructjoin(x, y, P);
⑰ If UL(Pxy) != Null and TWU (Pxy) ≥ θ/qrc then
⑱ P* = P* ∪ Pxy; If UL(Pxy).SumIutilD + UL(Pxy).SumIutilDP ≥ θ then
⑲ H = H ∪ Pxy; Output Pxy;
⑳ End
㉑ Else
㉒ If UL(Pxy).SumIutilD + UL(Pxy).SumIutilDP + UL(Pxy).SumRutilD + UL(Pxy).SumRutilDP ≥ θ then
㉓ E = E ∪ Pxy;
㉔ End
㉕ If θ/qrc ≤ UL(Pxy).SumIutilD + UL(Pxy).SumIutilDP ≤ θ then
㉖ C = C ∪ Pxy;
㉗ End
㉘ End
㉙ End
㉚ End
㉛ Combine the q-items to generate high utility range q-itemsets (HR) using CM and C
㉜ QIs ← (H ∪ E ∪ HR);
㉝ Mining_Procedure (Px, QIs, ULs(QIs), P*, qrc, CM, θ);
㉞ End

For each extension Pxy, the algorithm performs a pruning check based on the TQCS structure to decide whether to extend the q-itemset Pxy or to prune it directly without spending time creating an IQUL for it. There are two situations:

Both q-items x and y are Eq-items. The algorithm searches for the tuple (x, y, c) in the TQCS structure. If c = ∅ or c < θ/qrc, the algorithm moves directly to the next extension Py without constructing UL (Pxy), because the q-itemset Pxy and all its extensions are LUQI. This corresponds to lines 4–9 of Algorithm 2.

If x is an Rq-item, the algorithm looks up the tuples of each Eq-item contained in x in the TQCS structure and sums their values. If the sum obtained is less than θ/qrc, the algorithm prunes the combination instead of constructing its IQUL. This corresponds to lines 10–13 of Algorithm 2.

For these two cases, if TWU (xy) ≥ θ/qrc, then the algorithm performs the concatenation operation through UL (P), UL (Px) and UL (Py) to build UL (Pxy).
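The two pruning cases can be sketched as one function. This is a hedged illustration rather than the authors' code: the function name `should_prune`, the dictionary keyed by ordered pairs of Eq-items, and the small `tqcs` sample (three entries taken from Table 4) are assumptions.

```python
def should_prune(x, y, tqcs, theta, qrc):
    """Decide from TQCS alone whether the extension Pxy can be skipped.

    x is an Eq-item (item, q) or an Rq-item (item, l, u); y is an Eq-item.
    tqcs maps pairs of Eq-items to their co-occurrence TWU.
    """
    item, low, high = x[0], x[1], x[-1]
    if low == high:
        # Case 1: x is an Eq-item; a missing tuple or a low value prunes Pxy.
        c = tqcs.get(((item, low), y))
        return c is None or c < theta / qrc
    # Case 2: x is an Rq-item; sum the values of every Eq-item it contains.
    c = sum(tqcs.get(((item, q), y), 0) for q in range(low, high + 1))
    return c < theta / qrc


tqcs = {
    (("b", 4), ("f", 4)): 56,
    (("b", 4), ("e", 6)): 56,
    (("b", 3), ("e", 1)): 34,
}

print(should_prune(("b", 4), ("f", 4), tqcs, theta=40, qrc=3))     # False
print(should_prune(("b", 4), ("b", 3), tqcs, theta=40, qrc=3))     # True
print(should_prune(("b", 3, 4), ("e", 1), tqcs, theta=40, qrc=3))  # False
```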

Based on UL (Pxy), Pxy is pruned if it is unpromising, and the algorithm moves directly to another extension Py. Otherwise, the extension Pxy is put into a new promising list P*, and the algorithm performs judgments similar to those in Algorithm 1 to check whether it belongs to H (lines 17–20), E (lines 22–24) or C (lines 25–27).

After traversing all the extensions Py, the combining method CM is applied to extract the high utility Rq-itemsets (HR), which corresponds to line 31 of Algorithm 2. Then the new set QIs is formed by the union of H, E and HR. Finally, the mining procedure is called recursively with the new prefix Px, the new QIs and the new P*.

4.3 Case analysis

In this section, an example will be given to illustrate the proposed IHUQI algorithm. The original database D₁ is in Table 1, and the incremental data is D₂, D₃. Using Combine_All method, the qrc is 3, and the minimum utility threshold is 40.

First, scan the dataset to calculate different TWU of q—item, and the TWU of q—item is shown in Table 3. Then perform the second dataset scan to construct the IQUL and TQCS structure, and the TQCS structure of q—item is shown in Table 4.

Table 3
TWU of q-item

item	TWU
(a,2)	56
(b,4)	97
(e,6)	56
(f,4)	56
(c,2)	50
(e,3)	9
(a,1)	34
(b,3)	34
(c,1)	34
(d,2)	34
(e,1)	34
(e,7)	41
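The TWU values of Table 3 can be reproduced from the original dataset D₁ with the sketch below. The dictionary layout and the function names are illustrative assumptions; the data comes from Table 1 and Table 2.

```python
from collections import defaultdict

# Unit profits from Table 2 and the original dataset D1 from Table 1.
EXTERNAL_UTILITY = {"a": 5, "b": 7, "c": 3, "d": 2, "e": 1, "f": 3}
D1 = {
    "T1": {"a": 2, "b": 4, "e": 6, "f": 4},
    "T2": {"c": 2, "e": 3},
    "T3": {"a": 1, "b": 3, "c": 1, "d": 2, "e": 1},
    "T4": {"b": 4, "c": 2, "e": 7},
}


def transaction_utility(transaction):
    return sum(q * EXTERNAL_UTILITY[i] for i, q in transaction.items())


def compute_twu(transactions):
    """TWU of a q-item: sum of TU over the transactions that contain it."""
    twu = defaultdict(int)
    for transaction in transactions.values():
        tu = transaction_utility(transaction)
        for q_item in transaction.items():  # each q-item is (item, quantity)
            twu[q_item] += tu
    return dict(twu)


twu = compute_twu(D1)
print(twu[("b", 4)], twu[("c", 2)], twu[("e", 3)])  # 97 50 9
```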

Table 4

TQCS structure

A	B	C	A	B	C
(a,2)	(e,6)	56	(d,2)	(c,1)	34
				(e,1)	34
(a,1)	(c,1)	34	(e,6)	∅	∅
	(d,2)	34
	(e,1)	34
(b,4)	(a,2)	56	(d,3)	∅	∅
	(c,2)	41
	(e,6)	56
	(e,7)	41
	(f,4)	56
(b,3)	(a,1)	34	(e,1)	∅	∅
	(c,1)	34
	(d,2)	34
	(e,1)	34
(c,2)	(e,3)	9	(e,7)	(c,2)	41
(c,1)	(e,1)	34	(f,4)	(a,2)	56
				(e,6)	56
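The entries of Table 4 can likewise be derived from D₁. The sketch below assumes that a TQCS entry stores, for each pair of q-items that co-occur in at least one transaction, the sum of the utilities of those transactions; keying the dictionary by an unordered `frozenset` pair is an illustrative choice, whereas the table lists each pair once according to ≺.

```python
from collections import defaultdict
from itertools import combinations

# Unit profits from Table 2 and the original dataset D1 from Table 1.
EXTERNAL_UTILITY = {"a": 5, "b": 7, "c": 3, "d": 2, "e": 1, "f": 3}
D1 = {
    "T1": {"a": 2, "b": 4, "e": 6, "f": 4},
    "T2": {"c": 2, "e": 3},
    "T3": {"a": 1, "b": 3, "c": 1, "d": 2, "e": 1},
    "T4": {"b": 4, "c": 2, "e": 7},
}


def build_tqcs(transactions):
    """Map each co-occurring pair of q-items to the summed transaction utility."""
    tqcs = defaultdict(int)
    for transaction in transactions.values():
        tu = sum(q * EXTERNAL_UTILITY[i] for i, q in transaction.items())
        for x, y in combinations(sorted(transaction.items()), 2):
            tqcs[frozenset((x, y))] += tu
    return dict(tqcs)


tqcs = build_tqcs(D1)
print(tqcs[frozenset({("b", 4), ("a", 2)})])     # 56
print(tqcs[frozenset({("b", 4), ("c", 2)})])     # 41
print(frozenset({("b", 4), ("b", 3)}) in tqcs)   # False: the pair never co-occurs
```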

After that, the q-items in the dataset are sorted in descending order of their utility, giving P* = {(b, 4), (b, 3), (f, 4), (a, 2), (e, 7), (c, 2), (e, 6), (a, 1), (d, 2), (c, 1), (e, 3), (e, 1)}. According to the utility stored in its IQUL, each q-item in P* is then put into the HUQI set (H), the candidate q-itemsets (C) or the explored set (E). After traversing all q-items in P*, since all items in P* are LUQI, the set H is empty, C = {(b, 3)} and E = {(b, 4)}. IHUQI then uses the Combine_All method to find Rq-itemsets from the set C, calculate their utility and output HR. Because the set C contains only one q-item, no Rq-itemset is produced, that is, HR = ∅. After the combination process, the sets H, HR and E constitute the set QIs, which is QIs = {(b, 4)} in this dataset.

Finally, the algorithm calls the mining procedure for each q-item in the set QIs, from the first to the last. In this process, the algorithm first uses EQCPS [23] and RQCPS [23] for pruning to avoid generating unnecessary itemsets, and then performs join operations to generate larger itemsets. In the dataset D₁, Px = (b, 4) is extended with all Py, y ∈ P*. (b, 4) is first extended with (b, 3). It can be seen from Table 4 that there is no entry (A, B, C) ∈ TQCS with A = (b, 4), B = (b, 3) and C ≥ θ/qrc. Therefore, the itemset [(b, 4), (b, 3)] is ignored and no IQUL is built for it. The algorithm proceeds to the next Py, y ∈ P*, which is Py = (f, 4). Table 4 shows that TWU [(b, 4), (f, 4)] = 56 ≥ θ/qrc. Therefore, the q-items (b, 4) and (f, 4) are joined to form [(b, 4), (f, 4)] and its IQUL. The new q-itemset is put into P*, and its utility is checked to determine whether it belongs to the HUQI set (H), the set for combination (C) or the explored set (E). Since u ([(b, 4), (f, 4)]) = 40, the q-itemset [(b, 4), (f, 4)] is output to the set H.

The algorithm continues to expand recursively. After traversing all extensions, H = {(f, 4)}, C = {(a, 2), (e, 7), (c, 2), (e, 6)} and E = {(f, 4), (a, 2), (e, 7)}. HR is formed by combining the q-items in C, and the final result is QIs = {(f, 4), (a, 2), (e, 7), (e, 6, 7)}. In addition, a new promising list P* = {(f, 4), (a, 2), (e, 6, 7), (e, 7), (c, 2), (e, 6)} is formed. Then the algorithm is called recursively to explore larger q-itemsets until all HUQIs are found.

When incremental data is inserted, the incremental dataset is scanned to calculate the TWU of each q-item and to construct its IQUL and TQCS structure. For q-items that exist in the original dataset D₁, the utility is updated in the original IQUL. For q-items that do not exist in the original dataset D₁, a new IQUL is created. Then the CM is called to generate the Rq-items. For the data in Table 1, inserting D₂ generates the Rq-item (b, 2, 3) and its corresponding IQUL. The same operations are performed when the dataset D₃ is inserted, which generates the Rq-items (a, 2, 3) and (b, 2, 3) and their corresponding IQULs. The detailed process is shown in Fig. 3. Finally, the mining procedure is called to mine all HUQIs.

Fig. 3

Construction of IQUL and combination process.

4.4 Time analysis

In this subsection, the time complexity of the proposed method IHUQI is discussed from two aspects: IQUL construction or update, and pattern mining. Let M_o and M_i be the numbers of transactions in the original and the incremental databases, respectively, both composed of N_q q-items. The algorithm first scans the original data or the incremental data to calculate the TWU of the q-items. The time complexity is O (M_oN_q) or O (M_iN_q), respectively. Then a second scan builds the utility lists and the TQCS structure. Before the IQUL is constructed, the transactions need to be sorted according to TWU. The time complexity of this process is O (M_oN_qlog (M_oN_q)) or O (M_iN_qlog (M_iN_q)). The time needed to construct TQCS is $O (M_{o} N_{q}^{2})$ or $O (M_{i} N_{q}^{2})$ . After that, the algorithm performs a combination operation on the candidate 1-q-itemsets. The time complexity of the combination operation is mainly determined by the number of candidate q-itemsets. In the worst case, all 1-q-itemsets are candidates, and the time complexity is $O (N_{q}^{2})$ .

The mining part is composed of join operations and combination operations. In the worst case of the mining process, no patterns are pruned, and two IQULs need to be joined for every expansion into a larger itemset. There are N_q items that need to be joined, so the number of expansion patterns is $O (2^{N_{q}} - N_{q})$ . The time complexity of one join operation is $O (M_{o}^{2})$ or $O ((M_{o} + M_{i})^{2})$ . Because pruning strategies are used in the mining process, the number of expansion patterns is much smaller than $O (2^{N_{q}} - N_{q})$ ; let k1 denote the number of expansion patterns after pruning. The worst-case complexity of one combination operation is $O (N_{q}^{2})$ ; let k2 denote the number of combination operations in the mining process. The total time complexity of the mining process is then $O (k1 M_{o}^{2} + k2 N_{q}^{2})$ or $O (k1 (M_{o} + M_{i})^{2} + k2 N_{q}^{2})$ . In summary, the time complexity of the proposed algorithm is $O (M_{o} N_{q} + M_{o} N_{q} \log (M_{o} N_{q}) + M_{o} N_{q}^{2} + N_{q}^{2} + k1 M_{o}^{2} + k2 N_{q}^{2})$ or $O (M_{i} N_{q} + M_{i} N_{q} \log (M_{i} N_{q}) + M_{i} N_{q}^{2} + N_{q}^{2} + k1 (M_{o} + M_{i})^{2} + k2 N_{q}^{2})$ , which simplifies to $O (M_{o} N_{q}^{2} + M_{o} N_{q} \log (M_{o} N_{q}) + k1 M_{o}^{2} + k2 N_{q}^{2})$ or $O (M_{i} N_{q}^{2} + M_{i} N_{q} \log (M_{i} N_{q}) + k1 (M_{o} + M_{i})^{2} + k2 N_{q}^{2})$ .

5 Experimental results

In this section, the proposed algorithm IHUQI is evaluated against VHUQI, a HUQI mining algorithm for static data. All algorithms in the evaluation are implemented in Java.

The experiments run on a 3.00 GHz CPU with 256 GB of memory under Win10 Professional Edition. Five real datasets of different types, Connect, Retail, Pumsb, BMS and BMS2, are used to evaluate the performance of the algorithm. All datasets are obtained from SPMF [16]. Their characteristics are shown in Table 5, which lists the number of transactions, the number of different items, the average transaction length, and the type of each dataset. The datasets include dense and sparse datasets that are commonly used as benchmarks in the HUIM literature. The experiments divide each dataset into five batches and insert them incrementally, using the Combine_All combination method; qrc is set to 4 for the Connect dataset and to 3 for the remaining datasets.

Table 5
Dataset characteristics used in the experiment

Dataset	Transaction count	Item count	Average transaction length	Dataset type
Connect	67,557	129	43	Dense
Retail	88,162	16,470	10.30	Sparse
Pumsb	49,046	2,113	74	Dense
BMS	59,601	497	2.42	Sparse
BMS2	77,512	3,340	4.62	Sparse

5.1 Verification of Incremental Mining HUQI Results

This section evaluates the effectiveness of IHUQI. Two datasets, the sparse dataset Retail and the dense dataset Connect, are selected to illustrate the incremental mining results. HUQI are mined incrementally from the two datasets with minimum utility thresholds of 0.05% and 2.9%, respectively, and the data is divided into 5 batches, 10 batches, and 15 batches. The mining results are shown in Fig. 4 and Fig. 5. Fig. 4 shows that the number of extracted HUQI increases as batches are added.

Fig. 4

HUQI on Connect.

Fig. 5

HUQI on Retail.

With a minimum utility threshold of 2.9% on the Connect dataset, no HUQI satisfies the threshold in the first few batches; as further batches are inserted, the number of HUQIs that satisfy it keeps increasing. On the Retail dataset, HUQIs that meet the minimum utility threshold are already found in the first batch, and the number of mined HUQIs keeps increasing as batches are added. On both datasets, the final number of mined HUQIs is the same regardless of the number of batches, which shows that IHUQI can effectively update HUQIs incrementally.
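To make the role of the percentage threshold concrete, the sketch below assumes, as is common in the HUIM literature, that the threshold is taken relative to the total utility of all transactions inserted so far, so the absolute minimum utility changes with every batch. The batch utilities are made-up numbers, `ThresholdSketch` and its methods are hypothetical names, and the sketch does not attempt to reproduce the counts in Fig. 4 or Fig. 5.

```java
// Hypothetical illustration of a relative minimum utility threshold.
final class ThresholdSketch {

    // Absolute minimum utility for a threshold given in percent
    // of the total utility accumulated so far (assumed definition).
    static double absoluteThreshold(double percent, long totalUtility) {
        return percent / 100.0 * totalUtility;
    }

    // An itemset counts as high utility when its utility reaches the threshold.
    static boolean isHighUtility(long itemsetUtility, double percent, long totalUtility) {
        return itemsetUtility >= absoluteThreshold(percent, totalUtility);
    }

    public static void main(String[] args) {
        long[] batchUtilities = {1000, 1500, 500}; // made-up total utility per batch
        long total = 0;
        for (int i = 0; i < batchUtilities.length; i++) {
            total += batchUtilities[i];
            System.out.printf("after batch %d: total utility = %d, min utility = %.2f%n",
                    i + 1, total, absoluteThreshold(2.9, total));
        }
    }
}
```

Because both the threshold and the utility of each itemset change when a batch is inserted, the set of HUQIs has to be re-evaluated after every insertion.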

Figure 6 shows how the running time of VHUQI and IHUQI on the BMS2 and Retail datasets changes as the size of the dataset increases.

Fig. 6

Running with different dataset size.

For the BMS2 dataset, the minimum utility threshold is set to 0.05%. The running times of both VHUQI and IHUQI increase with the size of the dataset, because the HUQIs that meet the threshold must be extracted from a larger number of transactions. IHUQI scales better as the database grows: the running time gap between IHUQI and VHUQI widens as the dataset size increases. For the Retail dataset, the minimum utility threshold is set to 0.01%. Here too, the running time of both algorithms increases with the size of the dataset, and the difference in running time between the two algorithms gradually increases. Unlike VHUQI, which is designed for static data, IHUQI can process incremental data and mine HUQIs effectively.

5.2 Running time comparison

Figures 7–11 compare the running time of VHUQI and IHUQI as the minimum utility threshold increases. The running time gradually decreases as the threshold increases, because fewer itemsets meet the threshold and fewer IQULs need to be constructed, which in turn further reduces the time consumed in the mining process.

IHUQI is faster than VHUQI on the BMS and BMS2 datasets. The reason is that the proposed TQCS structure and the pruning strategy effectively improve the mining efficiency. On the Retail dataset, IHUQI is still faster than VHUQI when the minimum utility threshold is small. When the threshold is large and few HUQIs result, VHUQI is faster than IHUQI, but the time difference is small. The reason is that although Retail is a sparse dataset, it contains a large number of q-items. In addition, the incremental mining process of IHUQI generates data related to the original dataset that must be recombined, which takes a certain amount of time.

On the Connect dataset, IHUQI is faster than VHUQI when the minimum utility threshold is small. On the Pumsb dataset, VHUQI is faster than IHUQI, although the running times do not differ much. Dense datasets contain long transactions and also contain a large number of HUQIs when the minimum utility threshold is small.

In general, IHUQI runs faster than VHUQI on sparse datasets, while on dense datasets the running times of the two algorithms are similar. The reason is that transactions in a sparse dataset have little similarity to each other. As a result, the HUQIs are not similar to each other either, and many combinations are pruned by the pruning strategy, which reduces the running time.
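The text does not describe how running time and memory were measured. A common way to take such measurements in Java, shown here only as a hedged sketch, uses `System.nanoTime()` and the `Runtime` memory counters from the standard library; the class name `Measure` and its methods are hypothetical.

```java
// Hypothetical measurement helper; not the instrumentation used in the paper.
final class Measure {

    // Wall-clock time of one run of the task, in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Heap memory currently in use by the JVM, in megabytes.
    static double usedMemoryMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++) {
                sum += i; // placeholder workload standing in for a mining run
            }
            if (sum < 0) {
                System.out.println("unreachable");
            }
        });
        System.out.println("elapsed ms: " + elapsed);
        System.out.println("used MB: " + usedMemoryMb());
    }
}
```

Sampling `usedMemoryMb()` at several points during a run and keeping the maximum gives a rough peak value; the result depends on garbage collection, so repeated runs are advisable.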

Fig. 7

Running time on BMS.

Fig. 8

Running time on BMS2.

Fig. 9

Running time on Retail.

Fig. 10

Running time on Connect.

Fig. 11

Running time on Pumsb.

Fig. 12

Memory consumption on BMS.

Fig. 13

Memory consumption on BMS2.

5.3 Memory consumption comparison

Figures 12–16 compare the memory consumption of IHUQI and VHUQI as the minimum utility threshold increases. On the BMS and Retail datasets, IHUQI consumes less memory than VHUQI when the minimum utility threshold is relatively small. On the dense dataset Connect, IHUQI also consumes less memory than VHUQI. On the Pumsb dataset, however, IHUQI consumes more memory than VHUQI. The reason is that, compared with VHUQI, IHUQI additionally stores TQCS structures, which increases memory usage. In addition, the number of HUQIs in a dense dataset is relatively large while the number of distinct items is relatively small, which results in a large number of candidate itemsets that the pruning strategy cannot effectively reduce.

Fig. 14

Memory consumption on Retail.

Fig. 15

Memory consumption on Connect.

Fig. 16

Memory consumption on Pumsb.

Compared with mining HUQIs on static data, mining HUQIs on incremental data must also take newly added transactions into account, which makes it more complicated. The proposed IHUQI algorithm can mine HUQIs incrementally and achieves better results on sparse datasets.

6 Conclusion

This paper proposes IHUQI, an incrementally updating HUQI mining algorithm that uses IQUL to store utility information. When new data is inserted, the IQUL of each existing item is updated, and a new IQUL is constructed for each item that appears for the first time. Extensive experiments show that the proposed algorithm can effectively mine HUQIs on incremental data, that the running time gap between IHUQI and VHUQI widens as the size of the dataset increases, and that IHUQI performs better on sparse datasets.
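The update-or-create behaviour summarised above can be sketched with a much simplified utility list. This is an illustration under stated assumptions, not the IQUL defined in the paper: it keeps a single utility and remaining-utility value per entry rather than separate fields for original and incremental data, it assumes the q-items of each transaction arrive already sorted in the processing order, and all names and utility values are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical stand-in for an incremental utility list store.
final class UtilityListSketch {

    // One entry per transaction that contains the q-item.
    static final class Entry {
        final int tid;
        final long utility;
        final long remaining;

        Entry(int tid, long utility, long remaining) {
            this.tid = tid;
            this.utility = utility;
            this.remaining = remaining;
        }
    }

    // Utility list of one q-item with running sums over its entries.
    static final class QList {
        final List<Entry> entries = new ArrayList<>();
        long sumUtility;
        long sumRemaining;

        void add(Entry e) {
            entries.add(e);
            sumUtility += e.utility;
            sumRemaining += e.remaining;
        }
    }

    final Map<String, QList> lists = new LinkedHashMap<>();

    // Inserts one transaction: existing lists are extended, new q-items get a new list.
    void insert(int tid, String[] qItems, long[] utilities) {
        long remaining = 0;
        for (long u : utilities) {
            remaining += u;
        }
        for (int i = 0; i < qItems.length; i++) {
            remaining -= utilities[i]; // utility of the q-items after position i
            lists.computeIfAbsent(qItems[i], k -> new QList())
                 .add(new Entry(tid, utilities[i], remaining));
        }
    }

    public static void main(String[] args) {
        UtilityListSketch store = new UtilityListSketch();
        // Original data (made-up utilities).
        store.insert(1, new String[] {"(a,2)", "(b,4)"}, new long[] {10, 12});
        // Incremental batch: (b,4) is updated, (c,1) is new.
        store.insert(2, new String[] {"(b,4)", "(c,1)"}, new long[] {12, 3});
        for (Map.Entry<String, QList> e : store.lists.entrySet()) {
            System.out.println(e.getKey() + " utility=" + e.getValue().sumUtility
                    + " remaining=" + e.getValue().sumRemaining);
        }
    }
}
```

The point of the design is that inserting a batch touches only the lists of the q-items that occur in the new transactions, so previously built lists do not have to be reconstructed from the original data.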

Acknowledgments

This work was supported by the National Natural Science Foundation of China (62062004, 61862001) and the Ningxia Natural Science Foundation Project (2020AAC03216).

References

1. Agrawal and Srikant, Fast algorithms for mining association rules in large databases, Proceedings of International Conference on Very Large Databases (VLDB’94) (1994), 487–499.

2. Fournier V.P., Wu C.W., Zida, et al., FHM: Faster high-utility itemset mining using estimated utility co-occurrence pruning, International Symposium on Methodologies for Intelligent Systems, Springer, Cham (2014), 83–92.

3. Dam T.L., Li, Fournier V.P., et al., CLS-Miner: efficient and effective closed high-utility itemset mining, Front Comput Sci 13 (2019), 357–381.

4. Sethi K.K. and Ramesh, A fast high average-utility itemset mining with efficient tighter upper bounds and novel list structure, Super Comput 76 (2020), 10288–10318.

5. Singh, Shakya H.K., Singh, et al., Mining of high utility itemsets with negative utility, Expert Systems 35(6) (2018), 12296–12319.

6. Singh, Singh S.S., Kumar, et al., TKEH: an efficient algorithm for mining top-k high utility itemsets, Appl Intell 49 (2019), 1078–1097.

7. Yen S.J. and Lee Y.S., Mining high utility quantitative association rules, International Conference on Data Warehousing and Knowledge Discovery, Springer, Berlin, Heidelberg (2007), 283–292.

8. C.H., Wu C.W. and Tseng V.S., Efficient vertical mining of high utility quantitative itemsets, IEEE International Conference on Granular Computing (GrC), IEEE (2014), 155–160.

9. C.H., Wu C.W., Huang J.T., et al., An efficient algorithm for mining high utility quantitative itemsets, International Conference on Data Mining Workshops (ICDMW), IEEE (2019), 1005–1012.

10. Nouioua, Fournier V.P., Wu C.W., et al., FHUQI-Miner: Fast high utility quantitative itemset mining, Applied Intelligence (2021), 1–25.

11. Yun, Ryang, Lee, et al., An efficient algorithm for mining high utility patterns from incremental databases with one database scan, Knowledge-Based Systems 1 (2017), 88–206.

12. Yun, Nam, Lee, et al., Efficient approach for incremental high utility pattern mining with indexed list structure, Future Generation Computer Systems 95 (2019), 221–239.

13. Kim and Yun, Efficient algorithm for mining high average utility itemsets in incremental transaction databases, Applied Intelligence 47(1) (2017), 114–131.

14. Dam T.L., Ramampiaro, Nørvåg, et al., Towards efficiently mining closed high utility itemsets from incremental databases, Knowledge-Based Systems 165 (2019), 13–29.

15. Wang J.Z. and Huang J.L., On incremental high utility sequential pattern mining, ACM Transactions on Intelligent Systems and Technology (TIST) 9(5) (2018), 1–26.

16. Fournier V.P., Gomariz, Gueniche, et al., SPMF: A Java open-source pattern mining library, Journal of Machine Learning Research 15(1) (2014), 3389–3393.
