A Survey of incremental high-utility pattern mining based on storage structure

Abstract

Traditional association rule mining has been widely studied, but this is not applicable to practical applications that must consider factors such as the unit profit of the item and the purchase quantity. High-utility itemset mining (HUIM) aims to find high-utility patterns by considering the number of items purchased and the unit profit. However, most high-utility itemset mining algorithms are designed for static databases. In real-world applications (such as market analysis and business decisions), databases are usually updated by inserting new data dynamically. Some researchers have proposed algorithms for finding high-utility itemsets in dynamically updated databases. Different from the batch processing algorithms that always process the databases from scratch, the incremental HUIM algorithms update and output high-utility itemsets in an incremental manner, thereby reducing the cost of finding high-utility itemsets. This paper provides the latest research on incremental high-utility itemset mining algorithms, including methods of storing itemsets and utilities based on tree, list, array and hash set storage structures. It also points out several important derivative algorithms and research challenges for incremental high-utility itemset mining.

Keywords

Survey pattern mining incremental mining high-utility patterns frequent itemsets

1 Introduction

High utility itemset mining (HUIM) is an emerging data mining task. It expands frequent itemset mining (FIM) by considering occurrence frequency of items in each transaction and its external factors. The external factors may include utility profit, weight, etc. After adding weight to the itemsets, it can be used to find the itemsets with high utility, that is, the high utility itemsets (HUI). In the real world, a customer can buy multiple commodities in a supermarket, or buy multiple pieces of the same commodity, but the profit for each commodity is not the same. For example, products such as bread and ham are often purchased by consumers, and these products have low profits. In contrast, non-necessities such as diamond rings are not often purchased by customers, but they have higher profits. Therefore, only looking for traditional frequent patterns cannot find those products that make a major contribution to the total profit of the retail business. Therefore, it is of great practical significance to excavate the high utility patterns mining. In recent years, the scale of data warehouse tends to increase with time, such as wireless sensor data, social network data and financial transaction data. It can be seen that mining high utility patterns from incremental or dynamic databases will have more important research significance.

Several algorithms have been proposed to mine the high utility itemsets in the transaction database. In the early days, researchers used a constant value to weight each item to calculate the importance of the transaction. In 2004, Yao et al. clearly proposed two new definitions, describing them as external utility and internal utility [1]. It is applied to the so-called Expected Utility Mining (MEU) model, which not only considers the number of high utility itemsets, but also the unit profit of items. However, the model doesn’t propose how to maintain the downward closure of Apriori, so the algorithm is incomplete. In order to solve the problem of the MEU model’s insufficient strategy for mining all high utility itemsets strategies, Liu et al. proposed a Two-phase algorithm [2]. The algorithm uses the transaction weighted downward closure attribute in the first phase. Only those items or itemsets whose Transaction-Weighted Utilization is greater than the user’s minimum utility threshold will be considered as potentially high utility itemsets. The entire process is carried out in a step-by-step scan. In the second phase, the algorithm scans the database to calculate the exact utility of each candidate itemset.

So far, most HUIM algorithms have been designed to find high utility itemsets in static databases. However, in a dynamic data environment, traditional static high-utility pattern mining technology must always start mining from the head of the whole updated dataset to reflect the additional data inserted later. In order to overcome the above problem, several incremental high utility itemset mining (iHUIM) algorithms [3 –9] have been developed to deal with dynamic databases related to transaction insertion. In 2008, Yeh et al. proposed two incremental algorithms for HUIM in a dynamic database, called IUM and FIUM [10]. Both IUM and FIUM use Apriori-based methods. They generated a set of promising candidates, and then checked these candidates to identify the required high utility itemsets. The disadvantage of this algorithm is that in some cases, the original database is scanned to maintain and update the high utility itemsets. Another key issue of the horizontal algorithm is the explosive growth of the combination of the number of candidates retained in memory during the mining process. In order to solve these problems, a tree-based pattern growth method is proposed. These algorithms connect the header index table and the tree node through a link pointer to form a linked list, which stores relevant information of the itemset, and speed up the search speed. At the same time, some algorithms also use the linked list to construct the prefix itemsets of the tree, combining the itemsets support in the linked list and the external utility of the item to mine the variable and high utility itemsets. Therefore, a high-utility pattern mining method based on the pre-large concept [11] and the TWU model was proposed, and a more efficient algorithm for transaction insertion was developed, called PRE-HUI-INS [12]. After that, a tree-based and incremental high utility itemset mining (iCHUM) algorithm has also been designed [13]. Yun et al. also introduced a tree-based method called HUPID-Growth [14], which is used to mine high utility itemsets in incremental databases. Experimental results show that HUPID-Growth is superior to algorithm IHUP. However, both of the above incremental methods consume a lot of memory and take a long time to execute, and when the minimum utility threshold is set low or the database is frequently updated, their performance may be greatly reduced. Therefore, a high utility itemset mining algorithm that uses a list structure to store utility information is proposed, and an incremental pattern mining method based on list storage is developed for dynamic data, such as HUI-list-INS [15] and EIHI [16]. HUI-list-INS is HUIM’s first list-based maintenance algorithm with transaction insertion. Furthermore, a more efficient iHUIM algorithm with transaction insertion, named EIHI [16], is proposed for mining high utility itemsets in dynamic databases. Recently, a new list-based incremental method named LIHUP [17] has been developed for mining high utility itemsets without generating candidate itemsets in a dynamic environment. Contrary to HUI-list-INS [15] and EIHI [16], LIHUP constructs and updates its global data structure by performing a single database scan. The algorithm LIHUP [17] only performs once scan to construct an optimized data structure, which can then be used to derive all the required itemsets. Recently, the algorithm of discovered derivative high-utility patterns in incremental databases has also made many relatively new progress, such as IncCHUI [67], APHAUP [60].

Although a variety of iHUIM algorithms have been proposed, the existing survey of high utility pattern mining mainly focuses on static database. So far, only Gan et al. have proposed a research review of iHUIM in 2018 [18]. However, this review only introduces the incremental full set of high-utility pattern mining algorithms from the perspective of basic mining technologies such as Apriori-based and pattern-growth based.

In view of the above problems, this paper reviews and summarizes the incremental high utility pattern mining algorithms from the perspective of storage structure. The main contributions of this article are:

This paper is the first survey to classify and analyze incremental high-utility pattern mining algorithms in terms of the storage structure including list-baesd structure, array-based structure, set-based structure, and hybrid structure, et al. It also systematically summarizes the characteristics of each storage structure, as well as the impact on the performance of the algorithm.

This paper gives a comprehensive overview of the latest various derivative incremental high-utility pattern mining algorithms in recent years, and from the perspective of storage structure, summarizes the development status of the storage structure technology of various derivative algorithms.

This review has a good guiding significance for researchers in this field in selecting storage structures of methods. It also provides some ideas and directions for the next step of the improvement of algorithms.

2 Related works

With the continuous development of HUIM, there have been many reviews to summarize the achievements, especially in the field of static high-utility pattern mining. In recent years, the survey of high-utility pattern mining methods on dynamic database also began to appear.

Survey [73] studied several high utility mining algorithms of HUIM, and studied their working mechanism and mining efficiency. This paper introduces the differences between the one-phase algorithm and the two-phase algorithm, points out the shortcomings of the two-phase algorithm, and summarizes the advantages of some typical one-phase algorithms. This paper provides ideas for the future development of static high utility mining algorithms.

Literature [69] reviews various existing methods and approaches to extract high utility itemsets from precise and uncertain static datasets. In the literature, the algorithms are classified into two categories: those running on accurate datasets and those running on uncertain datasets. This paper proposes some extensions of efficient itemset mining methods, such as high average itemset mining, Top-K itemset mining and so on. However, these algorithms are still algorithms on static datasets and lack of technical summary of algorithms on dynamic datasets.

Literature [70] briefly introduces the key technologies of HUIM in detail. It classifies and introduces algorithms according to Apriori-based, projection-based, etc. In addition, it also outlines the derivative high utility pattern mining methods, and summarizes high utility pattern mining methods over the data stream based on the above classification.

Survey [71] comprehensively reviews utility-oriented pattern mining, describing various problems related to mining-based utility patterns and methods to solve these problems. This paper proposes the most common and advanced classification methods for mining different types of high utility patterns, and provides a comprehensive review of advanced topics in existing high utility pattern mining techniques, and discusses their advantages and disadvantages. Finally, it introduces some famous UPM open source software packages.

Survey [72] introduces the key properties of high utility itemset mining and how it improves frequent itemset mining algorithm. Then, this survey investigates the popular technologies of high utility itemset mining in static database, and introduces the main extensions of high utility itemset mining, such as on shelf high utility itemset mining, periodic high utility itemset mining. However, this survey introduces few algorithms on dynamic database, and does not explain the main technology of the algorithm on the dynamic database.

Survey [75] reviews the correlated high utility pattern mining (COHUPM), which is extended from high utility pattern mining. Its purpose is to extract interesting patterns for real scenes by utilizing both utility and correlation measures. This survey analyzes and compares COHUPM methods from pruning strategies, key technologies and other aspects.

Paper [74] is the first survey of incremental high utility pattern mining algorithms. He introduces some basic technologies of incremental high utility pattern mining, such as Apriori and pattern-growth methods. In this paper, several classic complete high utility itemset mining algorithms are analyzed and explained in detail.

Based on the analysis of the above review, the existing literature mainly has the following problems:

Most of the previous reviews have used some of the most commonly used classification methods to classify algorithms, such as Apriori-based, projection-based and other methods. However, the data structure has a great influence on the performance of the algorithm, and the existing reviews lack the discussion and research on the storage structure of the algorithm.

The existing review of incremental high-utility pattern mining mainly focus on the summary of incremental high utility pattern of full set mining methods, but lack of a deeper and broader analysis and summary of the current situation of compact and other derivative high utility pattern mining.

The existing review of incremental high-utility pattern mining does not give recommendations for the selection of algorithms or storage structures for different usage scenarios, so it lacks practical guiding significance.

This article will help solve the above problems and promote further research in this field to a certain extent.

3 Basic concepts

This section introduces important basic concepts and formally defines the issue of incremental high-utility pattern mining (iCHUM). In fact, the transaction database is dynamically updated with new data (e.g., items, transactions) about existing or new customers. For example, consider the database in Fig. 1, which consists of 10 transactions and 6 items, each of which has a purchase quantity (also known as internal utility). In the six transactions shown in Fig. 1, the set of transactions T₁ to T₄ is the original database D, and the set of transactions T₅ to T₆ is inserted into new database $d_{1}^{+}$ . Table 1 shows the unit profit table (also known as external utility) that represents the unit profit of each item. In this example, the minimum utility threshold (minUtil) is set to 35%. The important definition of high-utility pattern mining is as follows.

Fig. 1

Transaction database example.

Table 1

External utility

Item	a	b	c	d	e
External utility	2	1	3	1	4

Let I be a finite set of terms, where I ={ i₁, i₁, … i_n }, Then an itemset X is a finite set of items belonging to I. The transaction database D consists of a set of transactions, in which each transaction contains a set of items i ∈ I and is associated with the unique transaction id called TID. As shown in Fig. 1, the transaction database contains four transactions, T ={ T₁, T₂, … T₄ } and five items I ={ a, b, c, d, e }. The internal utility of an item x in a transaction T_j ∈ D is defined as the number of x in a transaction T_j, denoted as IU (x, T_j), and the external utility of an item x is denoted as EU (x). As shown in Fig. 1, the internal utility of item a in transaction T₁ is 1; As shown in Table 1, the external utility of item a, b, c, d and e are 2, 1, 3, 1 and 4, respectively.

Definition 1: utility of an item in a transaction. the utility of an item x in a transaction T_j is denoted as U (x, T_j), and the definition is shown in Equation (1). $U (x, T_{j}) = IU (x, T_{j}) \times EU (x)$ (1)

Definition 2: utility of an itemset in a transaction. the utility of an itemset X in a transaction T_j is denoted as U (X, T_j), and the definition is shown in Equation (2). $U (X, T_{j}) = \sum_{x \in X} U (x, T_{j})$ (2)

Definition 3: itemset utility. the utility of an itemset X in database D is denoted as U (X), which is defined as Equation (3). $U (X) = \sum_{X \subseteq T_{j} \land T_{j} \in D} U (X, T_{j})$ (3)

Definition 4: transaction utility. the transaction utility (abbreviated as TU) of a transaction T_j is denoted as TU (T_j), which is defined as Equation (4), where m is the number of items in transaction T_j. $TU (T_{j}) = \sum_{i = 1}^{m} U (x_{i}, T_{j}) x_{i} ɛ T_{j}$ (4)

Example 1. the transaction utility of T₁ is TU (T₁) = U ({ a } , T₁) + U ({ c } , T₁) + U ({ d } , T₁) = 6 +3 + 2 =11.

Definition 5: itemset support. the support of an itemset X is defined as the number of transactions containing X in dataset D, denoted as support(X). As shown in Fig. 1, the support of itemset {a, c} is 2, which appears in transaction T₃ and T₄ respectively.

Definition 6: The Transaction-Weighted Utilization (abbreviated as TWU) of an itemset X is the sum of all transaction utilities including X, denoted as TWU(X). Its definition is shown in Equation (5). $TWU (X) = \sum_{X \subseteq T_{j} \land T_{j} \in D} TU (T_{j})$ (5)

Definition 7: high utility itemset. If an itemset U (X) ⩾ minUtil, then X is a high utility itemset; otherwise, X is a low utility itemset.

Example 2. Let minUtil = 6, the high utility itemsets in Fig. 1 are {{b} :6, {c} :8, {ab} :10, {bc} :14, {abc} :18}, where the number next to each itemset is its utility.

Problem statement Let there be an original database D. A database D_n is an update of database D if $D_{n} = D \cup d_{1}^{+}$ , where n is a non-empty set of inserted transactions. Let minUtil, D and H be a user-specified minimum utility threshold, a database, and the set of high utility itemsets found in D, respectively. The problem of incremental high utility itemset mining is to find H′, which is the set of high utility itemsets in the updated database D_n, given minUtil, D, and H.

4 Incremental high-utility pattern mining algorithms

In recent years, several iHUIM methods have been developed to deal with dynamic database with transaction insertion. Great attention has been paid to the design of these methods to ensure their scalability and promote knowledge discovery in dynamic data environment. At present, the proposed incremental high utility pattern is mainly mining full set of patterns, and a few algorithms are used to design mining compact patterns. This section mainly studies and analyzes the storage structure classification of itemsets and utility information in the mining process. There are a lot of researches on high utility pattern mining algorithms for incremental database in the world. The analysis of main data storage structure characteristics is shown in Table 2. In Table 3, According to the performance of representative algorithms with different structures on a variety of datasets, we give further suggestions on the usage scenarios of data structures. The related algorithms are summarized in detail in Table 4.

Table 2
Characteristics of main storage structures used in algorithms

Storage structure type Need validation phase Pattern mining speed Memory consumption

Tree Need Slower High

List No need Faster Low

Array Set No need Need Faster Slower Low High

Storage structure type	Need validation phase	Pattern mining speed	Memory consumption
Tree	Need	Slower	High
List	No need	Faster	Low
Array Set	No need Need	Faster Slower	Low High

Table 3

Detailed comparison of different storage structures with application scenarios

Storage structure	Dataset	Representative algorithm	Advantage	Disadvantage	Usage scenarios
Hash table, Indexed list	Accidents, Mushroom, Chainstore, Pumsb, Connect, Retail, BMS, T10, T20	Id²HUP+	1. There is no connection operation similar to the traditional vertical utility list structure construction, so the algorithm mining speed is quite fast. 2. The structure can calculate the transaction set of each enumeration pattern through pseudo-projection, and merge the same transactions. It does not need to be repeatedly constructed like the vertical utility list structure, thus saving a lot of memory.	1. When the threshold is low and average item count per transaction is large, the algorithm runs slower.	The algorithm using this structure can run in an environment with low memory and high real-time requirements. This type of structure has a more obvious speed advantage on large or medium-sized dense datasets where average item count per transaction is small. For small sparse datasets such as Retail, this structure has no obvious time advantage compared to the vertical utility list.
Array	Mushroom, Chess, Connect, BMS	IHUP(ch)	1. The structure adopts pseudo-projection technology, and the mining speed is fast. 2. The transactions with the same prefix can be compressed and merged. The array structure of each item is independent of each other, and the algorithm only needs to take out the relevant array and read it into the memory for processing each time.	1. The algorithm needs to create an array for each item in order to perform pseudo-projection, and the memory consumption is greater than the previous composite structure. 2. The structure does not have an effective transaction index mechanism, the speed of establishing sub-projection is slow.	This structure is suitable for small or medium-sized datasets with a small total number of items. However, on datasets with a large number of items such as BMS, the memory consumption is larger than that of the traditional vertical utility list structure, and the operation of such datasets may be restricted by the system memory environment.
Indexed list	Accidents, Chainstore, Connect, Retail, Mushroom, Kosarak	IIHUM	1. This structure builds a conditional indexed list with prefix through the index, which greatly improves the connection speed between lists. 2. The algorithm does not need to rearrange the actual position of global lists after the incremental data arrives.	1. It cannot avoid frequent construction of the condition prefix list, so the memory and time required when the dataset is small are still more than the composite structure adopted by Id²HUP+.	This structure is suitable for use in environments with sufficient memory, large sparse datasets and high threshold settings.
List	Foodmart, Mushroom, Re tail, Chess, Connect, Accidents, T10N40000L4000 T10I4D1000K	LIHUP, EIHI, HUI-list-INS	1. This structure does not generate candidate itemsets. Compared with the tree structure, it has faster speed and lower memory consumption.	1. The algorithm needs to construct a utility list for each intermediate itemset.	This structure is suitable for use in environments with sufficient memory, large or medium sparse datasets and high threshold settings.
Tree	Chainstore, T10I6D100, Accidents, CT, Hepatitis, Pumsb	IHUP, iCHUM, HUPID-Growth	1. Compared with the hash set structure, the structure is more compact.	1. Tree-based algorithms need to frequently remove and update tree nodes. 2. In the mining process, algorithms recursively create subtree structures and extract actual patterns according to paths, which is very time-consuming.	This structure is suitable for running on small datasets with infrequent data insertion.
Hash set	T10I4N4KD200K, data.ntrans_10.nitems_0.1, data.ntrans_10.nitems_0.02,	IUM, FIUM, FUP-HUI-INS	1. The structure itself is simple. The algorithms using this structure horizontally expand itemsets in a step-by-step manner, which is convenient for users to understand.	1. All algorithms using this structure will generate a large number of candidate itemsets, so they require a lot of time and memory.	This structure is suitable for use on small datasets with large enough memory and low real-time requirements.

Table 4

Summary of High utility incremental pattern mining algorithms for different storage structures

Algorithm name	Whether to generate candidates	Data structure	Advantage	Disadvantage
iCHUM [22]	No	Tree	Only items with high Transaction-Weighted Utilization in both the original database and the newly inserted dataset are inserted into the tree, so that the tree structure does not need to be completely rebuilt.	Whenever the data is updated, it is necessary to update the information of all items in the iCHUM-tree structure, regardless of whether the item was previously high utility or not.
HUPID-Growth [23]	Yes	Tree, List	A reorganization method with a novel data structure is used to construct a tree through a single database scan. A strategy was also developed to reduce the number of candidate patterns.	Because it uses a new utility overestimation strategy, it needs to construct some additional data structures.
HUI-list-INS [15]	No	List	Calculation based on the structure of the utility-list without generating candidate itemsets. The application of the estimated utility co-occurrence structure also speeds up the incremental mining process.	It requires two database scans to calculate the TWU values of items and sort them in ascending order in the transaction.
LIHUP [17]	No	List	Only one database scan can rebuild the list, so that the mining process can be executed efficiently.	When the minimum utility threshold is set low and the database is updated frequently, the efficiency of the algorithm is restricted due to the high cost of the utility-list structure joining process.
EIHI [16]	No	List, Tree	Storing all itemsets in a structure called trie can quickly update their utility and shorten mining time.	It performs two database scans according to the defined sort order to construct the data structure. At the same time, the efficiency of the algorithm is reduced because of the join operation of the list.
IIHUM [28]	No	Indexed list	Recursively generate a conditional index utility-list data structure from the indexed global utility-list structure. Reorganizing the global utility-list based on the TWU information only needs to scan the database once, so that the proposed algorithm can mine high utility patterns more effectively.	Since TWU information is not used to overestimate utility, When the minUtil is set low and the database is updated frequently, the cost of updating the index utility-list increases.
Id²HUP+ [68]	No	Hash table, Indexed list	Relying on the transaction hash table and the indexed transaction list of linked summary hash table, the efficient pseudo-projection calculation and update of the dynamic database are realized, and the memory usage is low. It saves a large number of join operations in the traditional list structure, and the existence of the transaction hash table enables the same transaction quickly merge.	The database needs to be scanned twice when the data structure is first constructed.
IHUP (ch) [35]	No	Array	Using the array structure to project, fundamentally avoids the join operation of the list structure. At the same time, this structure is superior to storage structures such as trees and linked lists in terms of pattern search speed.	The lack of efficient sub-projection utility pruning strategy makes the projection construction too frequent when the amount of data is large.
WEPS [37]	No	Tree, Array	The branch paths of the extracted tree are stored in a temporary array and sorted quickly in the array, the tree update speed is greatly accelerated.	When the dataset is dense, the memory required to construct the tree still needs to be reduced.
WMFP-SW [36]	No	Tree, Array	A condition tree can be generated with only one scan, and the mining performance can be improved by applying it to the mining process.	All weighted frequent patterns are extracted, so the generated patterns are larger than other algorithms, and the required running time becomes longer.
FUP-HUI-INS [38]	Yes	Hash set	Using the concept of FUP, the previous database will not be rescanned every time, avoiding the generation of some unnecessary candidate patterns.	It depends on the concept of FUP, but suffers from the combinatorial explosion of search space.

4.1 Tree-based storage methods

Most of the previous Aprior-based algorithms used a level-wise approach to explore high utility itemsets (for example, Two-phase). In order to make up for the shortcomings of HUIM algorithm based on Apriori, based on the method of TWDC attribute and pattern growth, several algorithms that use trees to store transactions are proposed to effectively mine high utility itemsets. In order to improve the mining performance and avoid repeated scanning of the original database, Tseng et al. proposed a compact tree structure called UP-Tree [19] and a global header table to maintain information about transactions and high utility itemsets. The tree maintains the important information of the transaction database related to the pattern of items. Only two scans of the database are required to effectively generate a high utility itemset from UP-Tree. In UP-Growth, the minimum item utility table is used to reduce the overestimated utility. In UP-Growth+ [20], UP-Growth+ adopts two new utility overestimation strategies (DNN and DNU), considering the minimum node utility in each path to make the estimated pruning value closer to the pruning the actual utility of the item in the database.

Traditional incremental algorithms need to iterate layer by layer, scan the database multiple times to generate a large-scale candidate itemset, which is expensive and can only deal with the increase of the database, and cannot effectively deal with the deletion and other changes of the transaction datasets. Therefore, Na proposes a new utility mining pattern tree (PreHU-tree) [21] to mine high-utility itemsets, which first calculate the Transaction-Weighted Utilization change of each itemset in the transaction, and then divide them into high-frequency utility items, sub-frequency utility items and low-frequency utility items to construct PreHU-tree according to the item frequency of the original database. Finally, all high-utility itemsets is discovered.

Early tree-based incremental high utility mining algorithms performed unnecessary insertion or reordering operations for items with lower utility. In order to reduce the items that need to be maintained in the global tree when new data arrives, Zheng et al. [22] proposed an efficient incremental pattern mining algorithm called iCHUM. The algorithm iCHUM uses items with high Transaction-Weighted Utilization to construct its tree structure, namely iCHUM-Tree. When the new database is attached to the original database, it recalls those items with high TWU in the updated database and low TWU in the original database, and then inserts the high TWU items of the original database and the new database to update iCHUM-Tree, and no need to completely rebuild the tree structure. The information of the high utility itemset is kept in the iCHUM tree so that the candidate itemset can be generated through the mining process. The header table is a part of iCHUM-Tree, and each entry is composed of item name, TWU value and links to nodes with the same name in iCHUM-Tree. In the mining process, each entry is the entrance to traverse related nodes. All the same nodes in iCHUM-Tree are found through the linked list pointer. Experimental results show that iCHUM has advantages over IHUP on sparse and large datasets.

Algorithm HUPID-Growth [23] optimizes the tree structure, scan database once to build a global tree structure called HUPID-Tree (High utility pattern in incremental database tree), except The algorithm iCHUM uses the same header table structure, that is, it contains an initialized link pointer to link all the same items to form a linked list. The algorithm also uses a new data structure called TIList to reconstruct the global HUPID structure to Process incremental data. In this reorganization process, each path P is extracted from the tail node of the TIList entry to the root node. The items in P are sorted in descending order of the current TWU of the tree. At the same time, the calculation of each item in P Path utility (reduced overestimation utility) to greatly reduce the search space and candidate patterns, which makes HUPID run time and number of candidate itemsets completely better than IHUP.

The existing HUP mining algorithm based on sliding window does not have the “build once” attribute for interactive mining. Based on this, Kim et al. [24] recently proposed an algorithm called DSHUP, the algorithm uses a special global tree data structure. The global tree stores the nodes of each item that make up each transaction. Each node stores a batch estimated utility array, which stores the reduced estimated utility values of all batches, and these utility values are used for pruning in the path extraction process. When a new batch is added, the oldest value in the node’s array is removed by DSHUP, then the new reduced estimated utility value is added to the array. Nevertheless, the algorithm using the global tree structure still needs to generate candidate patterns in order to extract the actual high-utility patterns in the verification phase.

4.2 List-based storage methods

Most existing algorithms first generate candidate itemsets by overestimating their utility, and then calculate the exact utility of these candidates. These algorithms cause the problem of generating a large number of candidates, but after calculating their exact utility, it is found that most of the candidates do not have very high utility. In order to solve the limitations of traditional HUIM, Liu et al. [25] proposed a novel utility-list structure. This list storage structure not only stores the utility information about the itemset, but also stores heuristic information about whether the itemset should be pruned.

Liu proposed the algorithm HUI-Miner [25] based on the utility-list structure, which reduces the time and memory consumption in the mining process by avoiding the generation of a large number of candidate itemsets and utility calculations. At the same time, a new pruning strategy is proposed, which uses the remaining utility stored in the utility-list to prune the deep traverse search space. However, HUI-Miner explores the search space of itemsets by generating itemsets, and must perform expensive join operations to evaluate the utility of each itemsets. In order to reduce the number of connections performed, FHM and HUP-Miner have been proposed one after another. Fournier-Viger et al. [26] proposed a novel pruning strategy called EUCP (Estimated Utility Co-occurrence Pruning), which can prune the itemset Without performing redundant connection operations. Extensive experimental research shows that this strategy can reduce the number of connection operations of the previous algorithm HUI-Miner by up to 95%, and it is six times faster than the former. Krishnamoorthy et al. designed a new list data structure called the partition utility-list, which borrowed from the basic idea of the Tid list representation used by Zaki (2000). It can also be regarded as an extension of the utility-list structure, which also reduces the cost of join operations between two utility-list. The proposed method also utilizes three key pruning strategies to limit the search space during tree exploration. More specifically, the pruning strategies used in the recursive mining process include U-Prune (Liu & Qu, 2012), PUPrune and LA- Prune [27].

Recently, an incremental method for maintaining and updating discovered high utility itemsets is proposed to handle transaction insertion. This method uses the algorithm HUI-Miner to design the incremental algorithm of HUM, and is named HUI-list-INS [15]. This algorithm will pre-build the utility-list structure before inserting the transaction into the original database. It not only retains high Transaction-Weighted Utilization itemsets from the updated database, but also retains the small Transaction-Weighted Utilization itemset from the original database, thereby avoiding rescanning the database as a new item when the transaction is inserted. Since the utility-list structure is a condensed structure used to retain the relevant information in the original database, Therefore, it only needs less memory to save the relevant information of the proposed algorithm, and the exact utility of the larger itemset is obtained by adding the utility-list of the smaller itemset (no need to scan the database). In addition, the algorithm trims the search space by using the remaining utility stored in the utility-list. The estimated utility co-occurrence structure [12] is also applied in HUI-list-INS to speed up the incremental mining process.

Fournier-Viger et al. designed algorithm EIHI [16] (Efficient Incremental High-utility Itemset Miner) to solve the fact that the incremental algorithm is still very expensive in execution time and the One-phase algorithm HUI-List-INS still has room for improvement. It designs a structure called HUI-trie to store the previously mined high utility itemsets and their utility information, which saves utility calculation time. A fact is that if a certain itemset X appears in D but does not appear in N, where D′ = D ∪ N. The utility of X in D′ is the same as that of X in D. Therefore, if X is the high utility itemset in D, then it is also the high utility itemset in D′. Similarly, if X is low utility itemset in D, then it is also an low utility itemset in D′. Therefore, EIHI uses some important attributes (not used in HUI-List-INS) to trim the search space during the update of the utility-list and speed up the search.

HUI-list-INS and EIHI have a shortcoming when constructing their global list data structure, that is, they need to scan the original data or other data twice to calculate the TWU ascending order, which is the optimized sort order. In order to solve this shortcoming, the LIHUP method proposed by Yun et al. [17] can not only construct and update its global data structure according to the best sorting order, but also reconstruct the previously constructed data structure with only one scan, without other database scanning, so that the mining process can be executed more efficiently. Since the proposed method only scans the given data once, the optimized sort order is determined after the construction phase. The utility-list created to store the utility information of the candidate pattern has the remaining utility information to effectively reduce the search space. Its value is initialized to zero in the construction step, and is calculated according to the optimized sort order in the reorganization step. This process passes go through all the entries in the list. That is to say, the main difference from the existing scheme’s list-based progressive high utility pattern mining technology is that the algorithm does not obtain the best sort order based on two database scans, but it establishes an initial global data structure. The data structure is reconstructed in the best sort order without further scanning. Compared with IHUP, HUPID-Growth and other algorithms that only scan the database once to build a global structure, its memory stability and speed are better in both dense and sparse datasets, which shows that there are no candidates for generating utility-list based on various methods. The above reasons prevent it from incurring large memory and time overhead.

In order to reduce the number of database scans, Yun et al. [28] proposed a new algorithm for mining high utility patterns from incremental databases without generating any candidates, which is called incremental high-utility pattern mining based on indexed lists. (IIHUM). The data in a given incremental database is refined and stored in a global data structure, called “Incremental Index Utility List (IIU-List)". The global IIU-List has been allocated to every database in the current database. Each item, each IIU-List is composed of many tuples, used to store transaction information with incremental data processing. For each tuple (a tuple is related to a transaction of the current item), NextItem points to the next item next to the current item in corresponding transaction; ActUtil is the actual utility of the current item in the transaction; NextIdx indicates the corresponding tuple of the item pointed to by NextItem in its IIU-List. SumUtil is the sum of all ActUtil in the current IIU-List, which is equal to the actual utility of the current item. RUtil represents the maximum utility that can be added to SumUtil when the current item is expanded to the maximum. This method first scans the given original database once to construct the global incremental index utility-list. After that, the constructed list structure is reorganized according to the ascending order of the Transaction-Weighted Utilization, and its index information is correctly set again. If new transaction data is entered, the proposed method will reflect them to the previously constructed data structure without renewing scan the previously processed data. In this process, calculate the TWU value of the items that make up the global structure again, and reorganize the updated data structure according to the changed TWU ascending order. After reorganizing the data structure, the user provides a threshold percentage value to request high-utility patterns for the current incremental database. Then, the algorithm recursively generates the condition indexed list data structure from the global data structure. After all the recursive work is completed, the user can directly get all the high utility patterns without any candidate checking. Experiments show that, in general, the memory usage and running time of the proposed algorithm are much lower than all the latest incremental algorithms based on the traditional utility list structure.

Jaysawal et al. [29] proposed a data structure called IUDataListSW for storing the utility and upper-bounds of items in the current sliding window along with the position in a transaction to get the projected database fastly. IUDataListSW helps output high-utility items in the current sliding window, and also helps to identify promising items for further expansion. In addition, with the help of the IUDataListSW structure, the algorithm SOHUPDS proposed in this paper can effectively obtain the initial projected database of all promising items for further expansion.

Jaysawa et al. proposed a novel algorithm called DMHUPS [30] and combined it with a data structure called IUData List to effectively mine high utility patterns on sparse and dense datasets. The IUData list stores the information of the itemset of length 1 and its position in the transaction to effectively obtain the initial plan database. In addition, the algorithm DMHUPS simultaneously calculates utility and stricter extended upper bounds for multiple promising candidates. Therefore, DMHUPS finds multiple high utility patterns at the same time and effectively trims the search space.

Combining the above algorithms, the incremental utility list with index is derived from the traditional incremental utility-list [28]. The related algorithm reduces the need to scan the original or updated database twice to build a global list to only one scan [17, 28]. In particular, the proposed indexed list makes it unnecessary to perform comparison and join operations during the construction of itemset lists of different lengths, because the index indicates the tuple position of items in the same transaction in each list, which eliminates a major drawback of traditional incremental utility-list.

4.3 Array-based storage methods

Although One-phase incremental high utility pattern mining algorithms HUI-LIST-INS and EIHI can mine all high utility patterns without accessing the original database again after the transaction update, However, these two algorithms are based on the structure of utility-list, because of the uncertainty of search space, it is still necessary to compare the utility-list of new transaction with the utility-list in the original database to mine high utility patterns, and it is not convenient to merge transactions. Especially when the amount of data is too large to store in memory, the utility-list needs frequent memory swap leads to system memory thrashing, which greatly reduces the performance of the algorithm.

In order to fundamentally solve the above problems, some array-based pattern storage structures have been proposed. For example, Zida et al. introduced a new utility counting technique based on projection arrays, called Fast Utility Counting (FAC) [31]. In the utility bin array U, each entry is called the utility bin, and the utility bin U[i] of the item i is stored in the i-th position of the array U. To calculate the upper limit on the linear time and space of all expansions of the itemset. It uses the proposed algorithm EFIM to introduce two effective techniques to reduce the cost of database scanning. These techniques are called “High-Utility Database Projection (HDP)” and “Efficient Transaction Merging (HTM)". With the exploration of larger itemsets, both of these techniques reduce the size of the database, thereby greatly reducing the cost of database scanning. In order to avoid a large number of structure comparisons and joins when generating transaction data structures, Wang et al. in 2016 proposed a new projection-based high-utility pattern mining algorithm HUPMP [32]. This algorithm also uses the form of projection array to store the information of each item and mines it. Compared with the previous algorithm, it uses a more compact reduction strategy, and its computational complexity is also greatly reduced. The latest algorithms use TWU and residual utility strategies to trim the search space, but these strategies are not efficient. In order to improve performance, Singh [33] and others adopted and modified the SU-Prune strategy proposed by EFIM [31]. In addition, it also uses the array-based utility counting technique to calculate the transaction utility and Transaction-Weighted Utilization of items. The experimental results show that the proposed idea is effective and produces good results.

In the field of high utility sequential pattern mining, the current methods mostly use utility matrix to store the sequence database in memory. Unfortunately, the utility-list consumes a lot of storage. To solve this problem, Le et al. introduced a pure array structure, which can reduce memory consumption compared with the utility matrix. Experimental results show that AHUS [34] can effectively discover all high utility itemsets. In terms of running time, memory usage and scalability of all experimental datasets, the two proposed algorithms are better than the latest algorithm HUS-Span.

Luo et al. proposed the algorithm HUP [35], and IHUP is also based on the HUP-Array structure. It consists of two parts: the processing part of the original database and the processing part of the new transaction. Using this structure, when a new transaction is added, only the HUP-Array of items that appear in the new transaction needs to be processed, and the HUP-Array of items that do not appear in the new transaction does not need to be considered, then the main projection structure is constructed separately for the new transaction, and then it is merged with the main projection of the original database. In the process of mining, there is no need to mine the entire merged main projection, only the items that appear in the new transaction. At the same time, the HUP-Array group of each item is independent. In the case of a large amount of data, it can be distributed on multiple computers for mining. Based on the HUP-Array structure, it becomes easier to update and mine new transactions. Experimental results show that with the change of transaction insertion ratio, the proposed algorithm is about two orders of magnitude faster than the list storage structure algorithms such as HUI-list-INS [15] and EIHI [16].

Lee et al. based on the sliding window model (WMFP-SW) [36] to mine the weighted maximum frequent pattern of the data stream to obtain the weighted maximum frequent pattern reflecting the latest information of the data stream. It proposes that the conditional WMFP-SW tree generated from the global tree does not need to have information about weighted infrequent items, because in the process of extracting WMFP only the current weighted frequent items are needed. Therefore, invalid items should be deleted when creating the condition tree.

Yun et al. [37] adopted the new tree structure WEPS-Tree and sliding window model suitable for weighted erasable pattern mining, and a new array structure containing the basic information of the mining process. The algorithm is completing the insertion or update of WEPS-Tree. After the operation, the algorithm extracts each unsorted path from the corresponding tail node to the root node, and stores the extracted path in a temporary array. After that, sort the items in the array in descending order of the support of the items in the tree. The sorted items of the array are re-inserted into the tree in order from the root. In this algorithm, the algorithm updates the information of the child node and the parent node link, and sets the node with the last item of the array as the tail node. This structure greatly improves the efficiency of mining operations.

It is not difficult to find that most of incremental algorithms based on array storage store the utility of each projected transaction and the upper limit information of the utility expansion of the expandable item, and use these utility arrays for fast utility calculations. That is, the utility of calculating and accessing items in O(1) time. For example, EFIM only creates three utility arrays, which are used to calculate TWU, Subtree utility, and Local utility. Secondly, it can be reused multiple times by reinitializing the same utility array with zero values before each use. This can avoid creating multiple arrays, thereby reducing memory usage. However, because the above array-based incremental high utility pattern mining algorithm are mostly carried out by projection, whenever an itemset is extended and projected, it is necessary to scan and traverse possible transactions to determine whether a transaction should be added to the projection database, which is for those with average length longer datasets are disastrous, because it is unknown whether the items that need to be newly expanded exist in the current transaction, and where they are in the utility array.

Considering the characteristics of incremental databases and projection arrays, when a new transaction arrives, the TWU of an item changes. The previously constructed projection databases may not include a large number of low TWU items, which makes the original utility array unsuitable for direct use or update. The calculation algorithm has made improvements to the static array structure. For example, IHUP [35] introduced earlier, in order to avoid rescanning all transactions to construct the projection array, choose to keep the low TWU items in the projection array, construct the HUP-Array for each item separately, and only need to mine the HUP-Array of items that appear in the new transaction.

4.4 Other methods

In incremental high utility pattern mining, the high utility pattern of using different length sets to store horizontal expansion is also a commonly used storage structure scheme. At the same time, the above-mentioned data structures have more or less efficiency problems, and it is still necessary to explore other more efficient data structures. In some special pattern mining, some scholars proposed a hash table-based pattern information storage structure, and achieved good results.

Some existing incremental algorithms use a Two-phase model, which generates a large number of candidates that cause scalability problems, or uses a vertical data structure, which causes a large number of join operations, which leads to efficiency problems. To solve these incremental algorithms, In 2019, Liu et al. [68] proposed a new incremental algorithm Id²HUP+ to directly discover the high-utility pattern. Id²HUP+ adapts to the one-stage paradigm by improving correlation-based pruning and cap-based pruning, and proposes a novel data structure niCAUL for the rapid update of dynamic databases. Take the original database D and the new transaction set $d_{1}^{+}$ in Section 3 as an example, suppose that the updated database $D_{2} = D \cup d_{1}^{+}$ . The structure is composed of the five parts shown in Fig. 2. Hash(X, D) organizes all the transaction lists in UL(X, D) so that transactions that contain the same itemset can be identified in a constant time, and merge it into the list of a transaction, which summarizes the internal utility of each component item. The algorithm first builds TSniCAUL({},D₂) step by step according to TSniCAUL({}, D) when UL({}, $d_{1}^{+}$ ) and Hash({}, $d_{1}^{+}$ ) are empty. By scanning the set N of new transactions twice, inserting new entries into the Summary table to record the new items, constructing UL({}, $d_{1}^{+}$ ) based on the new transactions, and inserting the corresponding hash entries into Hash({}, $d_{1}^{+}$ ). All the items in UL({}, $d_{1}^{+}$ ) are linked to Summary({}, D₂) through a linked list called “newlink”. In the deep mining process, relying on the structure of niCAUL, the algorithm quickly calculates the high-utility itemset by creating pseudo projections for each prefix in a certain order. Due to the chain relationship between the summary table and UL in niCAUL, when generating pseudo projections, there is no large number of connection operations for traditional lists, and compared to the array-projection method, this algorithm uses the index of the extension item on the current pseudo-projected niCAUL structure to quickly construct a new projection without having to scan the original database. The experimental results show that the speed of Id²HUP+ is LIHUP is 223 times faster than EIHI, 473 times faster than EIHI, and access candidates are 90 times less than LIHUP and 35 times less than EIHI. Therefore, this kind of hybrid structure is very helpful for fast maintenance and update of dynamic databases and merging of the same transactions. To a certain extent, it makes up for the shortcomings of EIHI, HUI-list-INS, LIHUP and other traditional list structures.

Fig. 2

The representation of D₂ in niCAUL.

Lin et al. [38] proposed a concept based on the fast-update (FUP) method to mine high utility patterns, which was originally designed for association rule mining. The proposed method first divides the itemset into four parts according to whether it belongs to the high Transaction-Weighted Utilization itemset in the original database and the newly inserted transaction, as shown in Fig. 3. Then, each part is executed by its own algorithm. It uses a horizontal mode to mine incremental high utility patterns, and uses a set to store potential candidates of different lengths generated during the enumeration process. And when new data is inserted, it will judge the corresponding situation with the itemset stored in the new dataset, so as to update the candidate itemset in the set. The execution process can be roughly divided into two steps. In the first step, the Transaction-Weighted Utilization of itemset with length K needs to be calculated as the overestimated utility. If it is greater than or equal to the updated minimum utility threshold of the transaction, it is put into a updated set, this set is used to store candidate high Transaction-Weighted Utilization itemsets. The second step is to calculate the actual utility of the high Transaction-Weighted Utilization itemset. If the threshold condition is met, it will be stored in the second type of set, which only stores the updated real high utility itemset.

Fig. 3

Four cases of incremental algorithm updating data based on the concept of utility FUP.

Hong [39] and others also proposed to mine high average utility in incremental database based on four situations in the concept of FUP. The proposed incremental average-utility mining algorithm is divided into two stages. In the first stage, the upper limit of average utility is used to overestimate the utility of itemset. In the second stage, the actual average utility of overestimated average utility itemset is calculated. The proposed incremental high average-utility mining algorithm mines in a level-wise mode in the initial stage, and uses a set to store the set of potential candidates of different lengths generated during the enumeration process. And when new data is inserted, it will judge the corresponding situation with the itemset stored in the new dataset, so as to update the candidate itemset in the set. Because it relies on the concept of FUP, it suffers from the combinatorial explosion of search space.

Jie [40] and others designed a new DS_CWFP structure to dynamically maintain the transaction information in the current sliding window. The structure consists of three parts: CWFP-Tree, item header table (IHT) and CW-HT (closed weighted hash table). CW-HT hash table is a two-level hash mapping structure used to check closed weighted frequent patterns. An algorithm is proposed that uses a divide-and-conquer strategy to mine closed weighted frequent patterns in the tree from bottom to top using a depth-first recursive method. If the weight support of a prefix pattern P is greater than the minimum threshold, the subset detection of P is performed through the closed pattern information stored in the CW-HT hash table. If the pattern P is a subset of the closed patterns that have been found and the support is equal, then start searching for the next item in the header. Otherwise, update the information of pattern P to the closed weighted hash table. Finally, the algorithm uses CW-HT two-level hash table and CWFP-Tree to find all closed weighted frequent patterns. Experiments show that DCWFP-Miner is better than Close+ in time and memory consumption.

Among the other methods described in this section, although the utility-based FUP strategy can reduce the number of database scans, the algorithm still needs to perform multiple time-consuming database scans to find all the HUIs in the incrementally updated database. At the same time, the hash set directly stores the candidate itemsets generated level by level, which will keep a large number of candidates in the memory, so it consumes a lot of memory, especially for dense datasets. Through the comparative analysis of each structure method in this section, the tree-based method is more efficient than the set-based method. HUPID-Growth is one of the fastest algorithms. HUPID-Growth [23] only need one scan to build their global tree. However, because the above tree-based algorithms need to traverse all the tree nodes during the mining process and must recursively create a subtree structure, it is very time-consuming to apply the above tree pattern growth method. In addition, most of these algorithms are based on the TWU model, which is a loose upper limit is used for utility of the itemset, so too many candidates may be generated, especially IHUP, which is prone to memory overflow when the threshold is set low. In these tree-based pattern mining methods, they will inevitably occur candidate patterns. Limited to the nature of the tree structure, when the number of dataset items is large or the number of high TWU items is small, the efficiency of the algorithm is not satisfactory. Algorithms that uses the traditional utility list structure can quickly complete the update when a new transaction arrives. During the algorithm operation, no candidates are generated, and the database is not scanned repeatedly. Through the “build once and then update multiple times” property, this kind of structure is significantly better than related algorithms based on storage structures such as sets and trees. One-phase list-based algorithms such as IIHUM [28] by adding indexes between the utility-list tuples, making it more efficient in incremental dataset scanning and condition list construction. The array structure proposed in IHUP algorithm eliminates the complicated join operations of the traditional vertical list structure, and can greatly reduce time and memory consumption. However, as the incremental data continues to arrive, the structure requires multiple scan transactions to construct the sub-projection array, resulting in the overhead will be very large, so projection arrays are not suitable for incremental mining scenarios with large amounts of data, but for small and medium datasets or sparse datasets, it still has a great advantage over traditional list structures. The algorithm Id2HUP+ [68] with a hybrid structure is lower to the algorithm using the traditional list structure in terms of memory and time consumption. It is not difficult to find that the first proposed storage scheme of transactional list and hash table using index linked list proposed by this algorithm is one of the latest storage methods, the problem that the traditional list requires complex join operations and cannot be merged is improved. It has a great inspiration for solving the shortcomings of traditional projection arrays that rely on repeated scanning of the original database to establish sub-projections. In general, the various data structures can be expressed more efficiently in different ways such as vertical and projection, so that the mining efficiency is better than the traditional structure proposed in the past.

5 Derivative incremental high-utility pattern mining algorithms

Although most incremental pattern mining algorithms can re-mine the changed database, and calculate the high utility pattern based on this. However, with the continuous increase of transactions, a large number of high-utility itemsets are usually mined out. While they contain many redundant itemsets, they also have a large number of itemsets with specific meanings. How to search for specific itemsets that meet user needs from a large number of result set is also a difficult task. Therefore, some derivative incremental high-utility pattern mining algorithms have been proposed. The expression of the incremental high-utility pattern mining results usually includes the incremental compressed high-utility pattern, the incremental high average-utility pattern, and the incremental high utility sequential pattern, etc. Table 6 summarizes the typical algorithms of each derivative pattern. The model relationship is shown in Table 5, and the related concepts are as follows:

Table 5
Characteristics of derivative incremental high-utility patterns

Pattern type Store content Compression Advantage Disadvantage

Maximal high-utility pattern All the patterns whose utility is greater than the minimum utility threshold and has no superset Lossy compression Intuitively find patterns with larger utility. May be missing useful sub-patterns.

Closed high-utility pattern There is no pattern whose superset is equal to its own support and whose utility is greater than the minimum utility threshold. Lossless compression There is no redundant itemset, and no useful subsets are missed. It consumes more memory than mining maximal high utility patterns.

Top-K high-utility pattern K patterns with the highest utility. Lossy compression Can mine a specified number of high utility patterns. There may be redundant patterns.

High average-utility pattern Patterns whose average utility is greater than the minimum utility threshold. No compression Discovered meaningful patterns with less length and number of items included. If the threshold is set single, it is easy to cause the problem of rare term solution.

High utility sequential pattern Those sequence patterns whose utility is not less than the minimum utility threshold specified by the user. No compression Discover valuable sequence patterns. Compared with traditional utility pattern mining, it takes longer time.

Temporal high-utility pattern All patterns whose utility is not less than the minimum utility threshold specified by the user in a certain period. No compression The high utility patterns in a specific period can be mined out in chronological order. Most temporal high utility pattern mining ignores the problem of whether items still exist in a certain period

Pattern type	Store content	Compression	Advantage	Disadvantage
Maximal high-utility pattern	All the patterns whose utility is greater than the minimum utility threshold and has no superset	Lossy compression	Intuitively find patterns with larger utility.	May be missing useful sub-patterns.
Closed high-utility pattern	There is no pattern whose superset is equal to its own support and whose utility is greater than the minimum utility threshold.	Lossless compression	There is no redundant itemset, and no useful subsets are missed.	It consumes more memory than mining maximal high utility patterns.
Top-K high-utility pattern	K patterns with the highest utility.	Lossy compression	Can mine a specified number of high utility patterns.	There may be redundant patterns.
High average-utility pattern	Patterns whose average utility is greater than the minimum utility threshold.	No compression	Discovered meaningful patterns with less length and number of items included.	If the threshold is set single, it is easy to cause the problem of rare term solution.
High utility sequential pattern	Those sequence patterns whose utility is not less than the minimum utility threshold specified by the user.	No compression	Discover valuable sequence patterns.	Compared with traditional utility pattern mining, it takes longer time.
Temporal high-utility pattern	All patterns whose utility is not less than the minimum utility threshold specified by the user in a certain period.	No compression	The high utility patterns in a specific period can be mined out in chronological order.	Most temporal high utility pattern mining ignores the problem of whether items still exist in a certain period

Table 6

Characteristics of typical derivative incremental high-utility pattern mining algorithms

Algorithm name	Pattern type	Phase	Data structure	Description	Comparison algorithms
TOPK-HUP-INS [41]	The Top-K pattern	One-phase	List	For the new transaction, the “1-pattern” utility-list is constructed, and the order of the items in the new transaction is based on the global ITable, so the original transaction does not need to be reconstructed after the new data is added. In the process of constructing utility-list and mining Top-K high-utility patterns, four different strategies are used to improve the performance of the algorithm.	Comparison algorithm: EIHI, HUI-list-INS Advantages: Thanks to a variety of threshold raising strategies, topkUti can iterate quickly to reach the minimum target threshold, which has a shorter running time than HUI-list-INS.
IncCHUI [67]	The Closed pattern	One-phase	List	It use the incremental utility-list structure to efficiently mine CHUI from the incremental database. IncCHUI only scans the original database or the updated part once to construct a list of individual items, and uses a hash table called CHT to store the CHUI discovered so far.	Comparison algorithm: CHUI-Miner, CLS-Miner, EFIM-Closed Advantages: IncCHUI has excellent scalability, running time is better than comparison algorithms, and the maximum memory of the entire mining process is very small.
IMHAUI [52]	The average-utility pattern	Two-phase	Tree	IHAUI-Tree compactly maintains the information of the incremental database through the node sharing effect. The actual average utility of the candidate itemset needs to be calculated through the candidate verification process, which requires additional database scanning.	Comparison algorithm: ITPAU, HUPID Advantages: Whenever a newly generated transaction is reflected in IHAUI-Tree, a reorganization technology will be applied to IHAUI-Tree to maximize the node sharing effect, so that the algorithm saves memory space. Because its performance is better than the Apriori-like method ITPAU, etc. Effective, so it is widely used in other tree-based pattern mining algorithms.
FIMHAUI [53]	The average-utility pattern	Two-phase	Tree	It use the pattern growth method to find the set of HAUUBI from its IHAUI tree, and verify these candidate itemsets during the verification phase.	Comparison algorithm: IMHAUI Advantages: On dense datasets, FIMHAUI is completely better than IMHAUI.
IncUSP-Miner⁺ [56]	The sequential pattern	Two-phase	Tree	It use the candidate pattern tree to gradually mine HUSP. This tree is an extension of the lexicographic tree and stores high TSU	Comparison algorithm: Uspan, USP-Miner Advantages: The memory and time consumption are lower than the comparison algorithm.
				sequences to improve the efficiency of incremental HUSP mining.
IUM, FIUM [10]	The temporal pattern	Two-phase	Hash set	When new data is added to the original database, it can effectively identify all high-utility patterns that users are interested in during a specific period. The algorithm can not only find the temporal high-utility pattern in a specific time period, but also the high-utility pattern of the entire database.	Comparison algorithm: FIUM, IUM Advantages: Since the FIUM algorithm generates fewer candidate itemsets than the IUM algorithm, the FIUM algorithm takes less time than the IUM-algorithm.
FUPT-HOUIN [62]	The temporal pattern	Two-phase	Tree	The algorithm is divided into two sub-tasks: the first is to construct the utility tree by scanning the original database to find the high on-shelf utility pattern with negative values, and the other is to update the utility tree and find the high on-shelf utility pattern with negative values from the dynamically updated database.	Comparison algorithm: HOUN Advantages: The utility tree does not need to scan the database every time when it is updated. Compared with the existing algorithm HOUN, its execution requires less time.

Definition 8: high utility sequential patterns. If the utility of a sequence in the data is not less than the minUtil specified by the user, the sequence α is called the high utility sequence pattern.

Definition 9: high average-utility patterns. If the average utility of an itemset (itemset utility/itemset length) is not less than the minUtil, then the itemset is a high average-utility pattern.

Definition 10: temporal high utility patterns. If an itemset X is in a certain time period and the utility in the period is not less than the minUtil specified by the user, then the itemset is a temporal high utility pattern.

5.1 A. Incremental compressed high-utility patterns

Incremental high-utility pattern mining can be divided into full set of high-utility pattern and compressed high-utility pattern mining. Compressed high-utility pattern mainly includes maximum high-utility pattern, closed high-utility pattern, and Top-K high-utility pattern.

Among them, the closed high utility pattern belongs to the lossless compression pattern, which can retain all useful pattern information. While the maximum high-utility pattern removes some useful subsets in the closed high utility pattern, there is no superset in this pattern. Top-K high-utility pattern is to keep the Top-K highest utility pattern information from the closed high-utility or full set of high utility pattern mining process, which also belongs to the lossy compression pattern. This section introduces the definition of incremental compression high-utility pattern and its algorithm. The algorithm is shown in Fig. 4.

Fig. 4

Classification diagram of incremental compressed high-utility pattern mining algorithms.

In the process of mining compression high-utility patterns, two types of constraints are mainly used for mining, namely compactness and high-utility. The compactness constraint depends on the support value of the item or itemset, and the constraint method is the same as the frequent compression pattern. The researcher gave the relevant definition of the compression high-utility pattern, as follows:

Definition 11: closed high utility patterns. Given the minUtil, when U(P) of K-itemset P>minUtil, and no superset Q of P satisfies support (P) = support (Q).

Definition 12: Maximum high-utility patterns. Given itemsets P and Q satisfy U (P) ⩾ minUtil, U (Q) ⩾ minUtil. If there is no Q that satisfies P ⊂ Q, then P is the maximum high-utility pattern.

Definition 13: Top-K high utility patterns. Sort all the mined itemsets in the order of utility from high to low, and let minUtil be the utility of the K-th itemset. Then for pattern P, if the condition U (P) ⩾ minUtil is satisfied, then pattern P is called the Top-K high utility pattern.

Although existing high-utility pattern mining algorithms can find all itemsets that meet a given minimum utility threshold, it is often difficult for users to set an appropriate minimum utility threshold. A small minimum utility threshold may produce a large number of itemsets, while a higher minimum utility threshold may produce several itemsets. It is difficult and time-consuming to specify the suitable minimum utility threshold. In order to solve these problems, the Top-K high utility itemset mining is defined, where K is the number of high-utility itemsets to be found. Moreover, a large number of researches on high utility mining are only for static databases, but in practical applications, new transactions are often added.

In order to efficiently process the dynamically increased transaction data and mine the number of high-utility patterns required by users, Wu et al. [41] proposed the algorithm TOPK-HUP-INS for mining incremental Top-K high-utility patterns in dynamic databases. This algorithm processes the original database before the arrival of new transactions, that is, creates a “1-pattern” utility-list. It is a process of recursively building the utility-list to find the pattern with the highest utility in the utility-list. For newly added data, the minimum utility threshold obtained before the arrival of the new transaction is retained, and is continuously updated during the mining process after the arrival of the new transaction. The order of items in the new transaction is still based on ITable. If there are no items in the new transaction in ITable, the newly added items are sorted in descending order of TWU and added to the original items in ITable. Reconstruct the items in the new transaction in the above order, that is, re-establish the “1-pattern” utility-list, merge it with the original “1-pattern” utility-list, and finally start the mining process. The algorithm adopts four different strategies to delete more low-utility patterns through threshold raising methods such as utility descending order. At the same time, it uses EUCS to reduce utility-list construction to reduce mining time and memory consumption. Lu et al. proposed the algorithm TOPK-SW [42]. Record the batch data of the current window and the transaction information of the item in the HUI-Tree, and store the path information of each item node in the leaf node, and use the method of pattern growth to mine this part of the information. Effectively calculate the utility without rescanning the dataset. Zihaya [43] and others proposed the algorithm T-HUDS, which designed a utility model called SFU to estimate the prefix utility, which is very effective for pruning the search space. At the same time, it uses HUDS to store information about transactions over the data stream, uses the pattern growth method to obtain the prefix utility of items, and proposes several threshold enhancement strategies to improve the update efficiency of the Top-K cache area and accelerate the Top-K pattern of mining. Based on the variable sliding window model, Dawar [44] proposed an effective algorithm for mining Top-K high utility itemsets from data streams. The algorithm designs a data structure called iList, which makes the algorithm very efficient in batch insertion and deletion. The utility-list information that exists in the previous window batch will be retained in order to reduce the calculation cost and time. The list structure can also ensure that no candidate itemsets are generated during the mining process. In addition, the threshold raising strategy and the residual utility pruning strategy also reduce the search space of the algorithm.

In the current research on incremental frequent itemset mining, mining the maximum frequent itemsets is more efficient than mining basic frequent itemsets and frequent closed itemsets, because the number of maximum frequent itemsets is smaller and the compression strength is higher than that of closed frequent patterns. However, because the maximum high-utility pattern is lossy compression, only the longest itemset can be recorded, and its useful high-utility subset is not saved, so the high-utility pattern cannot be recorded completely.

The algorithm WMFP-SW proposed by Lee et al. [45] expresses the importance of items by adding weight constraints, and designs the WMFP-SW-tree structure for pattern expansion, extraction of WMFP information, and subset checking operations. The WMFP-SW-array strategy is proposed to reduce the traversal time of the conditional algorithm and improve the mining efficiency. Wang Shaopeng [46] and others proposed the concept of full weighted maximum frequent itemsets FWMF and the algorithm FWMFI-SW. The Naive algorithm adds compression filter conditions to the algorithm WMFP-SW, and FWMFI-SW reduces the redundant operations of Naive. This algorithm introduces the reconstruction discriminant function RJF of WMFP-SW-Tree, and adds a conditional optimization strategy based on frequent constraints to meet the definition of the maximum high utility pattern. These two algorithms have less time overhead than algorithm WMFP-SW. Yun [37] and others proposed the algorithm MWS, which uses the tree structure MWS-tree to manage and maintain the information of continuously added transactions and related weight conditions. The MWS algorithm uses the global MWS-tree header table to traverse the tree in a bottom-up manner to perform mining operations, using the maximum weight (MW) as one of the pruning conditions. For each item selected in the header table, the algorithm passes the divide-and-conquer method finds WMFP, and checks whether WMFP is effective. This algorithm consumes less memory and time than the similar high utility pattern mining algorithm IWFP, and the effect is obvious. In addition, the GUIDE [46] framework also uses an effective compact tree structure, which can effectively mine the maximum high utility pattern.

If there is no superset in the same transaction database and it has the same support, its called a Closed High Utility Itemset (CHUI). The process of mining patterns containing such itemsets is called closed high-utility pattern mining. This compression method greatly reduces the number of itemsets without losing the original high-utility pattern, and effectively improves the analysis efficiency of the high-utility pattern. Especially when the system resources are limited, it is usually impractical to generate the entire set of high utility patterns. Moreover, it is difficult for users to utilize a large amount of high-utility information. Therefore, the closed high utility pattern has high research value.

Dam [67] and others proposed an incremental utility-list structure, which can be built and updated only through one database scan. In addition, it uses a novel global list update algorithm IncCHUI, and combines with the temporary array RuA to dynamically update the residual utility of the original transaction after the new thing is inserted, so as to effectively ensure the correctness of the pruning strategy after transaction insertion, and quickly eliminate the low utility candidates in the mining process. A method for updating or inserting a new closed itemset found based on the hash table CHT is also proposed. The method first checks whether the itemset P is already in the CHT. If it is, it means that the itemset P has been previously found when the algorithm is applied to the original database. In this case, we will update its utility and support. Otherwise, because it is a new CHUI, P is inserted into the table CHT.

Aiming at closed high-utility pattern mining over data streams, Ge et al. [47] proposed the algorithm DS_CRWF for mining closed weighted patterns over the data stream, and proposed the CRWF_tree structure, which dynamically records the candidate closed high-utility patterns in the sliding window as the window slides. Using the method of first mining high-utility patterns and then using frequent closure constraints, the two constraints of closure and weighting are satisfied, and all closed high-utility patterns can be obtained through one scan. Zeng [48] and others proposed the algorithm WCSPMPD-Stream, which designed time weight decay and no-update decay methods to control the decay rate, and increased frequent constraints to satisfy closed high-utility pattern mining. A new data structure WFPS-Tree is proposed to store and update the mining results. Experiments show that the algorithm WCSPMPD-Stream overcomes the memory limitation, and its execution time is less than the previous algorithm under the same conditions.

5.2 Incremental high average-utility patterns

High Average Utility Pattern Mining (HAUPM) is used to solve the problem that high utility pattern mining (HUPM) mining results usually include long patterns composed of many low utility items. These patterns are redundant information in some business areas. The average utility measure reveals better utility effect of combining multiple items than the original utility measurement.

Hong [49] and others proposed a two-phase average utility mining algorithm ITPAU, which can gradually maintain a high average utility itemset as the database grows. Based on the concept of the algorithm FUP, the algorithm combines the information previously mined from the original database with the new mining results from the newly inserted transaction, reducing the required database scanning to speed up the mining process. The experimental results also show the effectiveness of the algorithm. Even so, because it follows Apriori method and uses a simple hashset structure to directly store a large number of candidate itemsets, it still has limitations.

Later, Yun et al. proposed an algorithm called “Incremental Mining of High Average Utility Itemsets (IMHAUI)” [52] to discover high average-utility itemsets (HAUI) from incremental databases in a more efficient way. IMHAUI uses its own compact tree data structure is called IHAUI-Tree. After inserting each incremental part of the database into the tree, it uses a reorganization phase on IHAUI-Tree to make it as compact as possible. IMHAUI uses a pattern growth method from its IHAUI-Tree finds the collection of a high average-utility upper-bound itemset(HAUUBI). It extracts a conditional pattern base for each prefix itemset, and recursively builds its local tree based on this conditional pattern base. However, this recursive working principle is negatively reflected in runtime efficiency.

In order to speed up the time required to discover high average-utility upper-bound itemsets (HAUUBI), YILDIRIM et al. made some modifications to IHAUI-Tree, and then proposed the structure of mIHAUI-Tree [3]. Based on this structure, a method called FIMHAUI was proposed using database projection and merging technology. In the original IHAUI-Tree, each node N has an average-utility upper-bound (AUUB) value (N.AUUB), which is the sum of the maximum utility of the overlapping transactions on the path from the root node to N. However, in the proposed algorithm, the tree is used to obtain the projection database of promising itemsets, not to generate the conditional pattern library and local tree, so only the leaf nodes of the path need to maintain the corresponding AUUB value in the tree. The database used by the projection has a great impact on reducing the cost of database scanning, because when the set of items to be explored becomes larger, the transaction becomes smaller. Experimental results show that this algorithm is slightly better than the earlier proposed IMHAUI [52] on dense datasets such as Chess and Mushroom dataset. Because this algorithm still needs to construct and reconstruct the tree structure in the incremental data environment, and in the process of traversing the tree that candidate itemsets is inevitably generated, so it still has the significant disadvantage of the tree storage structure and is not suitable for frequently updated large or dense datasets.

Lin et al. [50] proposed a pattern mining algorithm called IHAUPM to deal with incremental database transaction insertion. The well-known fast-update (FUP) concept in FIM has been modified to be adopted into the designed algorithm, so as to effectively update the discovered high average-utility itemset. The algorithm also maintains the correctness and completeness of the updated HAUI to ensure that all necessary information is fully discovered. It uses an effective HAUP-tree structure to maintain the necessary information in order to continue mining in the future. Based on the above structure, proposed algorithm is developed to divide the original database and new transactions into four situations for maintenance. The itemsets in each case can be maintained and processed correctly. Compared with some algorithms running in batch mode, sometimes it does not need to rescan the original database multiple times to update the discovered high average-utility itemsets. The number of candidate itemsets generated by the designed algorithm is much less than that of PAI, HAUI-tree and HAUP-growth. The reason is that these algorithms have no effective pruning strategy to reduce the useless candidates of the entire updated dataset, and the designed IHAUPM algorithm only deals with in the incremental part, the size of the incremental dataset is much smaller than the updated dataset. Based on the concept of FUP, Zhang [51] and others recently proposed an incremental HAUIM algorithm FUP-HAUIMI for transaction insertion to maintain information about the pattern when updating the database. First, construct the average utility-list (AUL) structure by scanning the original database. Then, FUP-HAUIMI selects the upper-limit itemset with high average utility and classifies it according to different situations. In each case, a specific update process is used to maintain and update the itemset. When traversing the enumeration tree representing the search space in a depth-first manner, the algorithm performs a join operation to incrementally update the AUL structure.

5.3 Incremental high utility sequential patterns

High utility sequential pattern (HUSP) mining is more complicated than traditional high utility pattern mining. It needs to find all high utility subsequences that are not lower than the user-set utility threshold.

Wang et al. [56] proposed the algorithm IncUSP-Miner⁺ to gradually mine HUSPs. The approach consists of two phases, the initial phase and the incremental phase. In the initial phase, proposed algorithm USP-Miner incorporating TSU and PEU to mine HUSPs on the origin database D and keep the HUSPs in the list HUSP. To facilitate better performance in the incremental phase, we propose the concept of the candidate pattern tree and in addition to HUSPs, USP-Miner will construct a candidate pattern tree T as well.

Wu [55] and others defined the problem of mining high utility sequential patterns (HUSPs) over high-speed data streams. A compact data structure named HUSP-Tree is proposed to maintain the basic information of the mined HUSP using online methods, introduce the efficient single-channel algorithm of HUSP-Stream, and use a new utility estimation model to more effectively trim the search space. Then generate HUSP from the HUSP-Tree structure. Compared with the similar algorithm Uspan [25], the running time of HUSP-Stream is significantly reduced, and it is scalable.

In addition, Zihayat [57] proposed that MAHUSP algorithm is based on two memory adaptive mechanisms, which can allocate memory adaptively at the premise of sacrificing HUSP quality. Tang et al. [54] proposed a HUSP mining algorithm based on UT-tree called HUSP-UT. The algorithm can obtain the actual utility value of the sequence from the UT-tree instead of the estimated value. It does not generate candidates during the mining process, effectively reducing storage consumption. When the algorithm creates a subtree based on the pointers of related nodes, the utility value of each item in these sequences can be obtained. This also ensures that the utility of the items in the header table and the new SWU value can be quickly calculated, which greatly reduces the running time of the algorithm. Experimental results show that the performance of the algorithm HUSP-UT is better than the latest algorithm HUSP-Stream over the data stream.

5.4 Incremental temporal high utility patterns

Temporal high utility pattern mining has attracted a lot of attention due to its practicality. Temporal or time series data widely exist in economics, finance, communications and other fields, such as weather forecasting. The time series transaction database is divided into several partitions according to the time period. In a particular time or season, the frequency of certain items or itemsets may be very high. Therefore, mining time-related information is interesting and useful.

Traditional association rule algorithms cannot find temporal association rules in a specific period. Temporal association rule mining does not consider the utility of each item. Therefore, temporal high-utility pattern mining is a research extended from temporal association rule mining and utility mining. Wang et al. [58] proposed the HUTPMiner algorithm to mine temporal high utility patterns with the help of the proposed matrix structure. The matrix structure is used to represent each sequence in the database, such as s, to avoid scanning s back and forth to improve mining efficiency. The purpose of temporal high utility pattern mining is to find all temporal patterns with high utility, even with low frequency. The traditional algorithm LMSpan [59] can only extract frequent temporal patterns, this feature is the main difference between the algorithm HUTPMiner and the algorithm LMSpan. These are the temporal high utility pattern mining algorithms on static data, and then researchers have proposed some temporal high utility pattern mining algorithms in an incremental environment.

Yeh et al. [10] researched and proposed novel incremental utility mining methods, IUM-Algorithm and FIUM-Algorithm, which can discover temporal high-utility patterns. When adding new transaction data to the original transaction database, the algorithm partitions the transaction data that arrives at different times, and conducts pattern mining for each partition in turn based on the Two-phase model, At the same time, according to the distribution of candidate itemsets in different partitions, the minimum utility threshold is dynamically updated. It can effectively identify all temporal high-utility patterns that users are interested in in a specific period. And use RU^k to store all candidate high Transaction-Weighted Utilization patterns of partition K, and use LU^[I,J] to store the temporal high-utility pattern of each time period in the range of time I to J, not only can be found temporal high-utility patterns in a specific time period, and the high utility pattern of the entire transaction database can be found. The process of the algorithm FIUM is similar to the algorithm IUM. FIUM also uses the transaction weighted downward closure property to trim candidates. The difference between the two algorithms lies in the way in which candidates are generated. In the step of generating p-pattern candidates, the algorithm IUM connects any two candidates of ${RU}_{P - 1}^{k}$ , while the algorithm FIUM is to connect ${RU}_{1}^{k}$ and ${RU}_{P - 1}^{k}$ . Assuming that the number of elements in ${RU}_{P - 1}^{k}$ is n, the time complexity of generating candidate patterns for FIUM is reduced by n² times compared with IUM, which has a significant efficiency improvement.

Nevertheless, most temporal high-utility pattern mining ignores the on-shelf time periods of products in stores. In order to solve this problem, Lan and Tseng [61] proposed a new research topic, called “high on-shelf utility pattern ming", the model takes into account the on-shelf periods of items with the quantity and profit. Subsequently, Radkar et al. [62] proposed an effective algorithm to quickly generate a fast-update utility model tree (FUPT-HOUIN) of on-shelf items with negative utility, high on-shelf utility patterns with positive and negative values in the current period are searched from the dynamically updated temporal incremental database. The proposed algorithm will scan the original database to construct a utility tree. Then, Transaction-Weighted Utilization of all items is calculated. It calculates the Transaction-Weighted Utilization difference before and after transition modification for 1-itemsets. According to the difference value, it judges how an itemset should be handled according to the four cases predefined by FUP, and decides whether to insert, remove or rescan the previous dataset. Therefore, the database will not be rescanned for every modification, and the existence of the utility tree avoids unnecessary generation of candidate itemsets, thus shortening the execution time. The experimental results show that compared with the existing similar algorithm, the algorithm requires less execution time.

5.5 Other derivative high-utility patterns

There are many other types of incremental high utility patterns, such as interactive high utility erasable patterns, hybrid high-utility patterns, etc.

Definition 14: high utility erasable patterns. If gain value of the itemset(utility of the itemset in the window) is less than the maximum gain threshold MGT of the data stream (minUtil * sum of transaction utility in the window), then this pattern is called a erasable pattern.

In recent years, researchers have also proposed a variant of frequent pattern mining: erasable pattern mining. The current more advanced erasable frequent pattern mining algorithm is MERIT [47]. Aiming at the high utility erasable pattern, Yun et al. [37] proposed the algorithm WEPS, which is suitable for mining weighted erasable pattern in data stream environments. The MWS tree structure is proposed based on the sliding window. First, the global MWS tree is constructed using the transactions added so far, and then the tree is reconstructed according to the item support in descending order. The WMFP tree is constructed by using the conditional MWS tree, and the WMFP-array and pattern growth technology are used to mine WMFP, when a new transaction arrives, iteratively build and mine the MWS tree. At the same time, a MW pruning strategy is designed to ensure the efficiency of the proposed algorithm. The scalability, running time and memory usage of this algorithm are better than those of MERIT and dMERIT+.

In addition to the high-utility erasable pattern, there is also a comprehensive high-utility pattern that combines the characteristics of multiple patterns. Zeng [63] defined a method for calculating time weights. The weights vary linearly with the length of the time interval between each stream segment and the current stream segment. Based on this definition, a algorithm WCSPMPD-Stream involving closed pattern and sequential pattern was proposed. The algorithm calculates the utility of the sequence pattern through the weights of the items in the sequence, and introduces closed pattern constraints to meet the requirements of the closed high-utility sequence pattern. The sliding window is divided into several segments and mined separately to obtain the closed high-utility sequential pattern.

This section reviews several derivative incremental high-utility pattern mining algorithms in recent years. It can be seen that compared with the full set of high-utility pattern mining algorithm introduced in Section 4, it has a stronger pertinence, and to a certain extent, it makes up for the shortcomings of full set of high-utility pattern mining in application scenarios. Although some incremental compressed high utility and high-average utility pattern mining algorithms [41 , 67] have achieved good results using new list structures, however, many of the more advanced storage structures in the full set of incremental pattern mining have not yet been widely used and expanded in the derivative pattern. Most of the current derivative incremental high utility pattern mining algorithms use older storage structures such as hash sets and trees, this status quo especially exists in the field of incremental temporal, sequential and other derivative pattern mining. Through our previous analysis, it is found that the hash set and tree structure have weak advantages for frequently updated and large incremental data. Moreover, the current derivative algorithms are mostly based on Two-phase model, which cannot adapt well to mining scenarios where data such as sequence and time series that often change periodically and require real-time performance. At the same time, this section hopes to give researchers some reference and inspiration by introducing several derivative algorithms based on newer data structures and strategies.

6 Further research direction

In recent years, domestic and foreign researchers have conducted a lot of research on incremental high-utility pattern mining, but the existing methods still have many shortcomings. Based on the above research content, further research ideas have been introduced.

(1) Improve the type of incremental high-utility pattern

There are still shortcomings in the derivative high-utility pattern. For example, in Top-K high-utility pattern mining, although some patterns with higher utility can be quickly found, However, as the dataset is constantly updated, it is inevitable that there will be no redundant subset patterns. We can consider combining Top-K high-utility pattern with closed high-utility pattern or maximum high-utility pattern mining method, and introduce constraints such as superset closure detection to improve the quality of Top-K high utility pattern.

(2) Research on more efficient incremental high-utility pattern mining methods

Most of the latest incremental mining algorithms are based on traditional lists for pattern mining. A large number of join operations are required in the list construction process, which consumes time and system resources. How to apply projection-based pattern mining methods and efficient storage structures such as index utility-lists, projection arrays in the field of incremental data mining is an urgent problem.

(3) Apply incremental high-utility pattern mining research to sequential patterns or data streams

Mining association rules over the data stream requires scanning the data stream in the shortest time to obtain the next required transaction information. Sequential pattern is a data combination with timestamp. The data stream is similar to it, and many incremental pattern mining algorithms can be applied to both. At the same time, compared with other traditional classification and clustering methods over data streams, the association rules generated by the incremental high utility pattern can effectively reduce the influence and interference of noise on the data, and improve the accuracy of the results.

(4) Research on incremental association rule mining in distributed environment or big data scenario

Literature [64] first proposes a Spark-based parallel utility pattern mining algorithm, which improves the single-phase utility pattern mining algorithm HUIMiner and its vertical data structure, and designs a three-phase parallel mining model. This efficient algorithm is helpful for distributed and parallel mining on machine clusters. Literature [65] proposes a novel framework for parallel mining of Top-K high utility itemsets in big data. Besides, a new algorithm called Parallel Top-K High Utility Itemset Mining (PKU) is proposed for parallel mining of Top-K high utility patterns on Spark in-memory platform. Nevertheless, the research on algorithms in the big data environment is still in its infancy, and further development is urgently needed. In recent years, cloud computing technology has matured, and some association rule mining algorithms based on cloud computing have been proposed, such as literature [66]. It is also one of the hotspots to realize large-scale computing incremental association rule mining in the cloud computing environment.

7 Summary

Traditional high-utility pattern mining algorithms are designed to discover high-utility patterns in static databases. Because these algorithms usually face considerable performance problems when dealing with dynamic database, incremental data mining has become an important research topic in recent decades. However, as far as we know, in recent years, various incremental high-utility pattern mining algorithms including derivative algorithms have emerged one after another. However, there are few comprehensive reviews of incremental high-utility pattern mining and its derivative algorithms, and the classification method of these algorithms on the data storage structure has not been proposed. In this article, we discuss and classify the main high-utility pattern mining methods. The problems of high utility itemset mining and incremental high utility itemset mining are proposed. Afterwards, the incremental high utility pattern mining algorithms are summarized, including iHUIM’s tree-based method, array-based method, hashset-based method, and list-based method. The method summarized in this paper is not only suitable for incremental high-utility pattern mining tasks, but also can stimulate other related data mining tasks, such as high-utility pattern mining over data streams. Incremental high-utility pattern mining and other related problems have been studied for more than 20 years, and they are still very active research fields.

Footnotes

Acknowledgments

This work was supported by the National Nature Science Foundation of China (62062004), the Ningxia Natural Science Foundation Project (2020AAC03216), and the Graduate Innovation Project of North Minzu University (YCX20077).

References

Hong

, Hamilton

H.J.

and Butz

C.J.

, A Foundational Approach to Mining Itemset Utilities from Databases, SDM (2004).

Liu

, Liao

W.K.

and Alok

N.C.

, A two-phase algorithm for fast discovery of high utility itemsets, Advances in Knowledge Discovery and Data Mining, the 9thPacific-Asia Conference PAKDD (2005), 18–20.

Yun

, Kim

, Ryang

, et al., Mining recent high average utility patterns based on sliding window from stream data, Journal of Intelligent & Fuzzy Systems 30(6) (2016), 3605–3617.

Yun

, Nam

and Kim

, Efficient transaction deleting approach of pre-large based high utility pattern mining in dynamic databases, Future Generation Computer Systems 103 (2020), 58–78.

Jeya Kumar

, Dhanabalachandran

and Jeya Kumar

, Effective and Efficient Utility Mining Technique for Incremental Dataset, The International Arab Journal of Information Technology 15(1) (2018), 157–166.

M.T.

, Teng

, Lin

C.W.

, et al., Updating high average-utility itemsets with pre-large concept, Journal of Intelligent & Fuzzy Systems 38(5) (2020), 5831–5840.

Tseng

V.S.

, Shie

B.E.

, Wu

C.W.

, et al., High utility itemsets Algorithms for Mining High Utility Itemsets from Transactional Databases, IEEE Transactions on Knowledge and Data Engineering 25(8) (2013), 1772–1786.

Fournier-Viger

, Lin

C.W.

, Gueniche

, et al., Efficient incremental high utility itemset mining, the ASE Big Data & Social Informatics (2015).

Lin

C.W.

, Lan

G.C.

and Hong

T.P.

, An incremental mining algorithm for high utility itemsets, Expert Systems with Applications 39(8) (2012), 7173–7180.

10.

Yeh

J.S.

, Chang

C.Y.

and Wang

Y.T.

, High utility itemsets algorithms for incremental utilitymining, the ACM International Conference on Ubiquitous Information Management and Cmmunication (2008), 212–217.

11.

Lee

, Yun

, Lee

, et al., Efficient incremental high utility pattern mining based on pre-large concept, Engineering Applications of Artificial Intelligence 72 (2018), 111–123.

12.

Lin

C.W.

, Hong

T.P.

, Lan

G.C.

, et al., Incrementally mining high utility patterns based on pre-large concept, Applied Intelligence 40(2) (2014), 343–357.

13.

Krishnamoorthy

, Pruning strategies for mining high utility itemsets, Expert Systems with Applications 42(5) (2015), 2371–2381.

14.

Yun

and Ryang

, Incremental high utility pattern mining with static and dynamic databases, Applied Intelligence 42(2) (2014), 323–352.

15.

Lin

C.W.

, Gan

W.S.

, Hong

T.P.

, et al., An Incremental High-Utility Mining Algorithm with Transaction Insertion, The Scientific World Journal (2015), 1–15.

16.

Fournier-Viger

, Lin

C.W.

, Gueniche

, et al., High utility itemsets incremental high utility itemset mining, the ASE Big Data & Social Informatics 53 (2015).

17.

Yun

, Ryang

, Lee

, et al., An high utility itemsets algorithm for mining high utility patterns from incremental databases with one database scan, Knowledge-Based Systems 124(15) (2017), 188–206.

18.

Gan

, Lin

C.W.

, Fournier-Viger

, et al., A survey of incremental high-utility itemset mining, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(2) (2018).

19.

Tseng

V.S.

, Wu

C.W.

and Shie

B.E.

, UP-Growth: An High utility itemsets Algorithm for High Utility Itemset Mining, the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010), 25–28.

20.

Tseng

V.S.

, Shie

B.E.

, Wu

C.W.

, et al., High utility itemsets Algorithms for Mining High Utility Itemsets from Transactional Databases, IEEE Transactions on Knowledge and Data Engineering 25(8) (2013), 1772–1786.

21.

, Research on Dynamic Mining Algorithm of high utility itemsets.

22.

Zheng

H.T.

and Li

, iCHUM: An high utility itemsets algorithm for high utility mining in incremental databases, the International Conference on Knowledge Science, Engineering and Management (2015), 212–223.

23.

Yun

and Ryang

, Incremental high utility pattern mining with static and dynamic databases, Applied Intelligence 42(2) (2015), 323–352.

24.

Kim

, Yun

, Baek

, et al., Damped sliding based utility oriented pattern mining over stream data, Knowledge-Based Systems 213 (2021).

25.

Liu

and Qu

J.F.

, Mining high utility itemsets without candidate generation, the 21st ACM International Conference on Information and Knowledge Management (2012), 55–64.

26.

Fournier-Viger

, Wu

C.W.

, Zida

, et al., FHM: Faster High-Utility Itemset Mining Using Estimated Utility Co-occurrence Pruning, 21st International Symposium on Methodologies for Intelligent Systems (ISMIS 2014) (2014).

27.

Krishnamoorthy

, Pruning strategies for mining high utility itemsets, Expert Systems with Applications 42(5) (2015), 2371–2381.

28.

Yun

, Nam

, Lee

, et al., High utility itemsets approach for incremental high utility pattern mining with indexed list structure, Future Generation Computer Systems 95 (2019), 221–239.

29.

Jaysawal

B.P.

and Huang

J.-W.

, SOHUPDS: A Single-pass One-phase Algorithm for Mining High Utility Patterns over a Data Stream, In: Proceeding OF THE 35TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING (SAC’20) 2020, 490–497.

30.

Prasad

J.B.

and Huang

J.W.

, DMHUPS: Discovering Multiple High Utility Patterns Simultaneously, Knowledge and Information Systems (2018).

31.

Zida

, Fournier-Viger

, Lin

C.W.

, et al., EFIM: a highly high utility itemsets algorithm for high utility itemset mining, Advances in Artificial Intelligence and Soft Computing (2015), 530–546.

32.

Wang

J.H.

, Luo

X.Z.

and Wu

, An Algorithm of High Utility Itemsets Based on Projection, Journal of Chinese Mini-Micro Computer Systems 37(6) (2016), 1212–1216.

33.

Singh

, Kumar

, Singh

S.S.

, et al., EHNL: An High utility itemsets Algorithm for Mining High Utility Itemsets with Negative Utility and Length Constraints, Information Sciences (2019).

34.

, Huynh

, Dinh

D.T.

, et al., Apure array structure and parallel strategy for high-utility sequential pattern mining, Expert Systems with Applications (2018).

35.

Luo

X.Z.

, Research and application of high utility pattern mining algorithm.

36.

Lee

, Yun

and Ryu

K.H.

, Sliding window based weighted maximal frequent pattern mining over data streams, Expert Systems with Applications 41(2) (2014), 694–708.

37.

Yun

and Lee

, Sliding window based weighted erasable stream pattern mining for stream data applications, Future Generation Computer Systems 59 (2016), 1–20.

38.

Lin

C.W.

, Lan

G.C.

and Hong

T.P.

, An incremental mining algorithm for high utility itemsets, Expert Systems with Applications 39(8) (2012), 7173–7180.

39.

Hong

T.P.

, Lee

C.H.

and Wang

S.L.

, An Incremental Mining Algorithm for High Average-Utility Itemsets, International Symposium on Pervasive Systems (2010).

40.

Wang

and Zeng

, DCWFP-Miner: Mining Closed Weighted Frequent Patterns over Data Streams, the 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery (2012).

41.

, Wang

L.P.

, Luo

X.Z.

, et al., Incremental Top-k high utility pattern mining algorithm in dynamic database (2017).

42.

, Liu

and Wang

, An Algorithm of Top-K High Utility Itemsets Mining over Data Stream, Journal of Software 9(9) (2014).

43.

Zihayat

and An

, Mining top-k high utility patterns over data streams, Information Sciences 285(1) (2014), 138–161.

44.

Dawar

, Sharma

and Goyal

, Mining Top-K high-utility itemsets from a data stream under sliding window model, Applied Intelligence 47(4) (2017), 1–16.

45.

Yun

and Lee

, Sliding window based weighted erasable stream pattern mining for stream data applications, Future Generation Computer Systems 59 (2016), 1–20.

46.

Wang

S.P.

and Wen

Y.Y.

, Mining Full Weighted Maximal Frequent Itemsets Based on Sliding Window over Data Stream, Journal of Northeastern University 37(7) (2016), 931–936.

47.

C.C.

and Fu

X.F.

, Mining Closed Weighed Frequent Patterns from a Sliding Window over Data Stream, Advanced Materials Research (2013), 756–759.

48.

Zeng

, Han

, Li

, et al., WCSPMPD-stream: Mining weighted closed sequential patterns with pattern decay over data streams, Journal of Computational Information Systems 10(1) (2014), 435–442.

49.

Hong

T.P.

, Lee

C.H.

and Wang

S.L.

, An Incremental Mining Algorithm for High Average-Utility Itemsets, International Symposium on Pervasive Systems (2010).

50.

Lin

C.W.

, Ren

, Fournier-Viger

, et al., Efficiently updating the discovered high average-utility itemsets with transaction insertion, Engineering Applications of Artificial Intelligence 72 (2018), 136–149.

51.

Zhang

, Lin

C.W.

, et al., Maintenance of Discovered High Average-Utility Itemsets in Dynamic Databases, Applied Sciences (2018).

52.

Yun

, Kim

, Yoon

, et al., Damped Window based High Average Utility Pattern Mining over data streams, Knowledge-Based Systems 144 (2017).

53.

Irfan

and Mete

, FIMHAUI: Fast Incremental Mining of High Average-Utility Itemsets. In: Proceedings of the 2018 INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND DATA PROCESSING (IDAP), (2018).

54.

Tang

H.J.

, Liu

Y.G.

and Wang

, A new algorithm of mining high utility sequential pattern in streaming data, International Journal of Computational Intelligence Systems 12(1) (2018), 342–350.

55.

Zihayat

, Wu

C.W.

, An

, et al., Mining High Utility Sequential Patterns from Evolving Data Streams, ASE Bigdata & Socialinformatics 52 (2015).

56.

J.Z.

and Hong

J.L.

, On Incremental High Utility Sequential Pattern Mining, ACM Transactions on Intelligent Systems and Technology 9(5) (2018), 1–26.

57.

Zihayat

, Chen

and An

, Memory-adaptive high utility sequential pattern mining over data streams, Machine Learning 106(6) (2017), 1–38.

58.

Wang

J.Z.

, Chen

Y.C.

, Shih

W.Y.

, et al., Mining High-utility Temporal Paterns on Time Interval based Data, ACM Transactions on Intelligent Systems and Technology 11(4) (2020).

59.

Huang

J.W.

, Jaysawal

B.P.

, Chen

K.Y.

, et al., Mining frequent and top-K High Utility Time Interval-based Events with Duration patterns, Knowledge and Information Systems 61(3) (2019), 1331–1359.

60.

J.M.T.

, Teng

, Lin

C.W.

and Cheng

C.F.

, Incrementally Updatingthe Discovered High Average-Utility Patterns With the Pre-LargeConcept, IEEE Access 8 (2020), 66788–66798.

61.

Lan

G.C.

, Hong

T.P.

and Tseng

V.S.

, Discovery of high utility itemsets from on-shelf time periods of products, Expert Systems with Applications 38(5) (2011), 5851–5857.

62.

Radkar

M.N.

and Pawar

M.S.S.

, Mining high on-shelf utility itemsets with negative values from dynamic updated database, the International Journal of advanced studies in Computer Science and Engineering IJASCSE 4(6) (2015).

63.

Zeng

, Han

, Li

, et al., WCSPMPD-stream: Mining weighted closed sequential patterns with pattern decay over data streams, Journal of Computational Information Systems 10(1) (2014), 435–442.

64.

Liu

, Zhao

, Yang

, et al., Efficient Parallel Algorithm for Mining High Utility Patterns Based on Spark, 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC) (2019).

65.

Lin

C.H.

, Wu

C.W.

, Huang

J.T.

, et al., Parallel Mining of Top-k High Utility Itemsets in Spark In-Memory Computing Architecture, Pacific-Asia Conference on Knowledge Discovery and Data Mining (2019).

66.

Zeng

, A parallel mining algorithm based on Hadoop architecture, Modern Electronics Technique 41(1) (2018).

67.

Dam

T.L.

, Ramampiaro

, Duong

Q.H.

, et al., Towards efficiently mining closed high utility itemsets from incremental databases, Knowledge-Based Systems 165 (2019), 13–29.

68.

Liu

J.J.

, Ju

, Zhang

, et al., Incremental Mining of High Utility Patterns in One Phase by Absence and Legacy-based Pruning, IEEE Access 99 (2019).

69.

Rahmati

and Sohrabi

M.K.

, A Systematic Survey on High Utility Itemset Mining, International Journal of Information Technology & Decision Making (IJITDM) 18 (2019).

70.

Zhang

, Han

, Sun

, et al., A Survey of Key Technologies for High Utility Patterns Mining, IEEE Access (2020).

71.

Gan

, Lin

C.W.

, Fournier-Viger

, et al., A Survey of Utility-Oriented Pattern Mining, IEEE Transactions on Knowledge and Data Engineering (2019).

72.

Viger

P.F.

, Lin

, Truong

T.C.

, et al., A Survey of High Utility Itemset Mining, (2019).

73.

Patel

and Madhushree

, A Survey on Discovering High Utility Itemset Mining from Transactional Database, (2015).

74.

Gan

W.S.

, Lin

C.W.

, Fournier-Viger

, et al., A survey of incremental high-utility itemset mining, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(2) (2018).

75.

Almoqbily

R.S.

, Rauf

and Quradaa

F.H.

, A Survey of Correlated High Utility Pattern Mining, IEEE Access 9 (2021), 42786–42800.

A Survey of incremental high-utility pattern mining based on storage structure

Abstract

Keywords

1 Introduction

2 Related works

3 Basic concepts

Table 2 Characteristics of main storage structures used in algorithms Storage structure type Need validation phase Pattern mining speed Memory consumption Tree Need Slower High List No need Faster Low Array Set No need Need Faster Slower Low High

4.2 List-based storage methods

4.3 Array-based storage methods

4.4 Other methods

5.3 Incremental high utility sequential patterns

5.4 Incremental temporal high utility patterns

5.5 Other derivative high-utility patterns

6 Further research direction

7 Summary

Footnotes

Acknowledgments

References

Table 2
Characteristics of main storage structures used in algorithms

Storage structure type Need validation phase Pattern mining speed Memory consumption

Tree Need Slower High

List No need Faster Low

Array Set No need Need Faster Slower Low High