Abstract
Traditional association rule mining has been widely studied, but this is not applicable to practical applications that must consider factors such as the unit profit of the item and the purchase quantity. High-utility itemset mining (HUIM) aims to find high-utility patterns by considering the number of items purchased and the unit profit. However, most high-utility itemset mining algorithms are designed for static databases. In real-world applications (such as market analysis and business decisions), databases are usually updated by inserting new data dynamically. Some researchers have proposed algorithms for finding high-utility itemsets in dynamically updated databases. Different from the batch processing algorithms that always process the databases from scratch, the incremental HUIM algorithms update and output high-utility itemsets in an incremental manner, thereby reducing the cost of finding high-utility itemsets. This paper provides the latest research on incremental high-utility itemset mining algorithms, including methods of storing itemsets and utilities based on tree, list, array and hash set storage structures. It also points out several important derivative algorithms and research challenges for incremental high-utility itemset mining.
Introduction
High utility itemset mining (HUIM) is an emerging data mining task. It expands frequent itemset mining (FIM) by considering occurrence frequency of items in each transaction and its external factors. The external factors may include utility profit, weight, etc. After adding weight to the itemsets, it can be used to find the itemsets with high utility, that is, the high utility itemsets (HUI). In the real world, a customer can buy multiple commodities in a supermarket, or buy multiple pieces of the same commodity, but the profit for each commodity is not the same. For example, products such as bread and ham are often purchased by consumers, and these products have low profits. In contrast, non-necessities such as diamond rings are not often purchased by customers, but they have higher profits. Therefore, only looking for traditional frequent patterns cannot find those products that make a major contribution to the total profit of the retail business. Therefore, it is of great practical significance to excavate the high utility patterns mining. In recent years, the scale of data warehouse tends to increase with time, such as wireless sensor data, social network data and financial transaction data. It can be seen that mining high utility patterns from incremental or dynamic databases will have more important research significance.
Several algorithms have been proposed to mine the high utility itemsets in the transaction database. In the early days, researchers used a constant value to weight each item to calculate the importance of the transaction. In 2004, Yao et al. clearly proposed two new definitions, describing them as external utility and internal utility [1]. It is applied to the so-called Expected Utility Mining (MEU) model, which not only considers the number of high utility itemsets, but also the unit profit of items. However, the model doesn’t propose how to maintain the downward closure of Apriori, so the algorithm is incomplete. In order to solve the problem of the MEU model’s insufficient strategy for mining all high utility itemsets strategies, Liu et al. proposed a Two-phase algorithm [2]. The algorithm uses the transaction weighted downward closure attribute in the first phase. Only those items or itemsets whose Transaction-Weighted Utilization is greater than the user’s minimum utility threshold will be considered as potentially high utility itemsets. The entire process is carried out in a step-by-step scan. In the second phase, the algorithm scans the database to calculate the exact utility of each candidate itemset.
So far, most HUIM algorithms have been designed to find high utility itemsets in static databases. However, in a dynamic data environment, traditional static high-utility pattern mining technology must always start mining from the head of the whole updated dataset to reflect the additional data inserted later. In order to overcome the above problem, several incremental high utility itemset mining (iHUIM) algorithms [3–9] have been developed to deal with dynamic databases related to transaction insertion. In 2008, Yeh et al. proposed two incremental algorithms for HUIM in a dynamic database, called IUM and FIUM [10]. Both IUM and FIUM use Apriori-based methods. They generated a set of promising candidates, and then checked these candidates to identify the required high utility itemsets. The disadvantage of this algorithm is that in some cases, the original database is scanned to maintain and update the high utility itemsets. Another key issue of the horizontal algorithm is the explosive growth of the combination of the number of candidates retained in memory during the mining process. In order to solve these problems, a tree-based pattern growth method is proposed. These algorithms connect the header index table and the tree node through a link pointer to form a linked list, which stores relevant information of the itemset, and speed up the search speed. At the same time, some algorithms also use the linked list to construct the prefix itemsets of the tree, combining the itemsets support in the linked list and the external utility of the item to mine the variable and high utility itemsets. Therefore, a high-utility pattern mining method based on the pre-large concept [11] and the TWU model was proposed, and a more efficient algorithm for transaction insertion was developed, called PRE-HUI-INS [12]. After that, a tree-based and incremental high utility itemset mining (iCHUM) algorithm has also been designed [13]. Yun et al. also introduced a tree-based method called HUPID-Growth [14], which is used to mine high utility itemsets in incremental databases. Experimental results show that HUPID-Growth is superior to algorithm IHUP. However, both of the above incremental methods consume a lot of memory and take a long time to execute, and when the minimum utility threshold is set low or the database is frequently updated, their performance may be greatly reduced. Therefore, a high utility itemset mining algorithm that uses a list structure to store utility information is proposed, and an incremental pattern mining method based on list storage is developed for dynamic data, such as HUI-list-INS [15] and EIHI [16]. HUI-list-INS is HUIM’s first list-based maintenance algorithm with transaction insertion. Furthermore, a more efficient iHUIM algorithm with transaction insertion, named EIHI [16], is proposed for mining high utility itemsets in dynamic databases. Recently, a new list-based incremental method named LIHUP [17] has been developed for mining high utility itemsets without generating candidate itemsets in a dynamic environment. Contrary to HUI-list-INS [15] and EIHI [16], LIHUP constructs and updates its global data structure by performing a single database scan. The algorithm LIHUP [17] only performs once scan to construct an optimized data structure, which can then be used to derive all the required itemsets. Recently, the algorithm of discovered derivative high-utility patterns in incremental databases has also made many relatively new progress, such as IncCHUI [67], APHAUP [60].
Although a variety of iHUIM algorithms have been proposed, the existing survey of high utility pattern mining mainly focuses on static database. So far, only Gan et al. have proposed a research review of iHUIM in 2018 [18]. However, this review only introduces the incremental full set of high-utility pattern mining algorithms from the perspective of basic mining technologies such as Apriori-based and pattern-growth based.
In view of the above problems, this paper reviews and summarizes the incremental high utility pattern mining algorithms from the perspective of storage structure. The main contributions of this article are: This paper is the first survey to classify and analyze incremental high-utility pattern mining algorithms in terms of the storage structure including list-baesd structure, array-based structure, set-based structure, and hybrid structure, et al. It also systematically summarizes the characteristics of each storage structure, as well as the impact on the performance of the algorithm. This paper gives a comprehensive overview of the latest various derivative incremental high-utility pattern mining algorithms in recent years, and from the perspective of storage structure, summarizes the development status of the storage structure technology of various derivative algorithms. This review has a good guiding significance for researchers in this field in selecting storage structures of methods. It also provides some ideas and directions for the next step of the improvement of algorithms.
Related works
With the continuous development of HUIM, there have been many reviews to summarize the achievements, especially in the field of static high-utility pattern mining. In recent years, the survey of high-utility pattern mining methods on dynamic database also began to appear.
Survey [73] studied several high utility mining algorithms of HUIM, and studied their working mechanism and mining efficiency. This paper introduces the differences between the one-phase algorithm and the two-phase algorithm, points out the shortcomings of the two-phase algorithm, and summarizes the advantages of some typical one-phase algorithms. This paper provides ideas for the future development of static high utility mining algorithms.
Literature [69] reviews various existing methods and approaches to extract high utility itemsets from precise and uncertain static datasets. In the literature, the algorithms are classified into two categories: those running on accurate datasets and those running on uncertain datasets. This paper proposes some extensions of efficient itemset mining methods, such as high average itemset mining, Top-K itemset mining and so on. However, these algorithms are still algorithms on static datasets and lack of technical summary of algorithms on dynamic datasets.
Literature [70] briefly introduces the key technologies of HUIM in detail. It classifies and introduces algorithms according to Apriori-based, projection-based, etc. In addition, it also outlines the derivative high utility pattern mining methods, and summarizes high utility pattern mining methods over the data stream based on the above classification.
Survey [71] comprehensively reviews utility-oriented pattern mining, describing various problems related to mining-based utility patterns and methods to solve these problems. This paper proposes the most common and advanced classification methods for mining different types of high utility patterns, and provides a comprehensive review of advanced topics in existing high utility pattern mining techniques, and discusses their advantages and disadvantages. Finally, it introduces some famous UPM open source software packages.
Survey [72] introduces the key properties of high utility itemset mining and how it improves frequent itemset mining algorithm. Then, this survey investigates the popular technologies of high utility itemset mining in static database, and introduces the main extensions of high utility itemset mining, such as on shelf high utility itemset mining, periodic high utility itemset mining. However, this survey introduces few algorithms on dynamic database, and does not explain the main technology of the algorithm on the dynamic database.
Survey [75] reviews the correlated high utility pattern mining (COHUPM), which is extended from high utility pattern mining. Its purpose is to extract interesting patterns for real scenes by utilizing both utility and correlation measures. This survey analyzes and compares COHUPM methods from pruning strategies, key technologies and other aspects.
Paper [74] is the first survey of incremental high utility pattern mining algorithms. He introduces some basic technologies of incremental high utility pattern mining, such as Apriori and pattern-growth methods. In this paper, several classic complete high utility itemset mining algorithms are analyzed and explained in detail.
Based on the analysis of the above review, the existing literature mainly has the following problems: Most of the previous reviews have used some of the most commonly used classification methods to classify algorithms, such as Apriori-based, projection-based and other methods. However, the data structure has a great influence on the performance of the algorithm, and the existing reviews lack the discussion and research on the storage structure of the algorithm. The existing review of incremental high-utility pattern mining mainly focus on the summary of incremental high utility pattern of full set mining methods, but lack of a deeper and broader analysis and summary of the current situation of compact and other derivative high utility pattern mining. The existing review of incremental high-utility pattern mining does not give recommendations for the selection of algorithms or storage structures for different usage scenarios, so it lacks practical guiding significance.
This article will help solve the above problems and promote further research in this field to a certain extent.
Basic concepts
This section introduces important basic concepts and formally defines the issue of incremental high-utility pattern mining (iCHUM). In fact, the transaction database is dynamically updated with new data (e.g., items, transactions) about existing or new customers. For example, consider the database in Fig. 1, which consists of 10 transactions and 6 items, each of which has a purchase quantity (also known as internal utility). In the six transactions shown in Fig. 1, the set of transactions T1 to T4 is the original database D, and the set of transactions T5 to T6 is inserted into new database

Transaction database example.
External utility
Let I be a finite set of terms, where I ={ i1, i1, … i n }, Then an itemset X is a finite set of items belonging to I. The transaction database D consists of a set of transactions, in which each transaction contains a set of items i ∈ I and is associated with the unique transaction id called TID. As shown in Fig. 1, the transaction database contains four transactions, T ={ T1, T2, … T4 } and five items I ={ a, b, c, d, e }. The internal utility of an item x in a transaction T j ∈ D is defined as the number of x in a transaction T j , denoted as IU (x, T j ), and the external utility of an item x is denoted as EU (x). As shown in Fig. 1, the internal utility of item a in transaction T1 is 1; As shown in Table 1, the external utility of item a, b, c, d and e are 2, 1, 3, 1 and 4, respectively.
In recent years, several iHUIM methods have been developed to deal with dynamic database with transaction insertion. Great attention has been paid to the design of these methods to ensure their scalability and promote knowledge discovery in dynamic data environment. At present, the proposed incremental high utility pattern is mainly mining full set of patterns, and a few algorithms are used to design mining compact patterns. This section mainly studies and analyzes the storage structure classification of itemsets and utility information in the mining process. There are a lot of researches on high utility pattern mining algorithms for incremental database in the world. The analysis of main data storage structure characteristics is shown in Table 2. In Table 3, According to the performance of representative algorithms with different structures on a variety of datasets, we give further suggestions on the usage scenarios of data structures. The related algorithms are summarized in detail in Table 4.
Characteristics of main storage structures used in algorithms
Characteristics of main storage structures used in algorithms
Detailed comparison of different storage structures with application scenarios
Summary of High utility incremental pattern mining algorithms for different storage structures
Most of the previous Aprior-based algorithms used a level-wise approach to explore high utility itemsets (for example, Two-phase). In order to make up for the shortcomings of HUIM algorithm based on Apriori, based on the method of TWDC attribute and pattern growth, several algorithms that use trees to store transactions are proposed to effectively mine high utility itemsets. In order to improve the mining performance and avoid repeated scanning of the original database, Tseng et al. proposed a compact tree structure called UP-Tree [19] and a global header table to maintain information about transactions and high utility itemsets. The tree maintains the important information of the transaction database related to the pattern of items. Only two scans of the database are required to effectively generate a high utility itemset from UP-Tree. In UP-Growth, the minimum item utility table is used to reduce the overestimated utility. In UP-Growth+ [20], UP-Growth+ adopts two new utility overestimation strategies (DNN and DNU), considering the minimum node utility in each path to make the estimated pruning value closer to the pruning the actual utility of the item in the database.
Traditional incremental algorithms need to iterate layer by layer, scan the database multiple times to generate a large-scale candidate itemset, which is expensive and can only deal with the increase of the database, and cannot effectively deal with the deletion and other changes of the transaction datasets. Therefore, Na proposes a new utility mining pattern tree (PreHU-tree) [21] to mine high-utility itemsets, which first calculate the Transaction-Weighted Utilization change of each itemset in the transaction, and then divide them into high-frequency utility items, sub-frequency utility items and low-frequency utility items to construct PreHU-tree according to the item frequency of the original database. Finally, all high-utility itemsets is discovered.
Early tree-based incremental high utility mining algorithms performed unnecessary insertion or reordering operations for items with lower utility. In order to reduce the items that need to be maintained in the global tree when new data arrives, Zheng et al. [22] proposed an efficient incremental pattern mining algorithm called iCHUM. The algorithm iCHUM uses items with high Transaction-Weighted Utilization to construct its tree structure, namely iCHUM-Tree. When the new database is attached to the original database, it recalls those items with high TWU in the updated database and low TWU in the original database, and then inserts the high TWU items of the original database and the new database to update iCHUM-Tree, and no need to completely rebuild the tree structure. The information of the high utility itemset is kept in the iCHUM tree so that the candidate itemset can be generated through the mining process. The header table is a part of iCHUM-Tree, and each entry is composed of item name, TWU value and links to nodes with the same name in iCHUM-Tree. In the mining process, each entry is the entrance to traverse related nodes. All the same nodes in iCHUM-Tree are found through the linked list pointer. Experimental results show that iCHUM has advantages over IHUP on sparse and large datasets.
Algorithm HUPID-Growth [23] optimizes the tree structure, scan database once to build a global tree structure called HUPID-Tree (High utility pattern in incremental database tree), except The algorithm iCHUM uses the same header table structure, that is, it contains an initialized link pointer to link all the same items to form a linked list. The algorithm also uses a new data structure called TIList to reconstruct the global HUPID structure to Process incremental data. In this reorganization process, each path P is extracted from the tail node of the TIList entry to the root node. The items in P are sorted in descending order of the current TWU of the tree. At the same time, the calculation of each item in P Path utility (reduced overestimation utility) to greatly reduce the search space and candidate patterns, which makes HUPID run time and number of candidate itemsets completely better than IHUP.
The existing HUP mining algorithm based on sliding window does not have the “build once” attribute for interactive mining. Based on this, Kim et al. [24] recently proposed an algorithm called DSHUP, the algorithm uses a special global tree data structure. The global tree stores the nodes of each item that make up each transaction. Each node stores a batch estimated utility array, which stores the reduced estimated utility values of all batches, and these utility values are used for pruning in the path extraction process. When a new batch is added, the oldest value in the node’s array is removed by DSHUP, then the new reduced estimated utility value is added to the array. Nevertheless, the algorithm using the global tree structure still needs to generate candidate patterns in order to extract the actual high-utility patterns in the verification phase.
List-based storage methods
Most existing algorithms first generate candidate itemsets by overestimating their utility, and then calculate the exact utility of these candidates. These algorithms cause the problem of generating a large number of candidates, but after calculating their exact utility, it is found that most of the candidates do not have very high utility. In order to solve the limitations of traditional HUIM, Liu et al. [25] proposed a novel utility-list structure. This list storage structure not only stores the utility information about the itemset, but also stores heuristic information about whether the itemset should be pruned.
Liu proposed the algorithm HUI-Miner [25] based on the utility-list structure, which reduces the time and memory consumption in the mining process by avoiding the generation of a large number of candidate itemsets and utility calculations. At the same time, a new pruning strategy is proposed, which uses the remaining utility stored in the utility-list to prune the deep traverse search space. However, HUI-Miner explores the search space of itemsets by generating itemsets, and must perform expensive join operations to evaluate the utility of each itemsets. In order to reduce the number of connections performed, FHM and HUP-Miner have been proposed one after another. Fournier-Viger et al. [26] proposed a novel pruning strategy called EUCP (Estimated Utility Co-occurrence Pruning), which can prune the itemset Without performing redundant connection operations. Extensive experimental research shows that this strategy can reduce the number of connection operations of the previous algorithm HUI-Miner by up to 95%, and it is six times faster than the former. Krishnamoorthy et al. designed a new list data structure called the partition utility-list, which borrowed from the basic idea of the Tid list representation used by Zaki (2000). It can also be regarded as an extension of the utility-list structure, which also reduces the cost of join operations between two utility-list. The proposed method also utilizes three key pruning strategies to limit the search space during tree exploration. More specifically, the pruning strategies used in the recursive mining process include U-Prune (Liu & Qu, 2012), PUPrune and LA- Prune [27].
Recently, an incremental method for maintaining and updating discovered high utility itemsets is proposed to handle transaction insertion. This method uses the algorithm HUI-Miner to design the incremental algorithm of HUM, and is named HUI-list-INS [15]. This algorithm will pre-build the utility-list structure before inserting the transaction into the original database. It not only retains high Transaction-Weighted Utilization itemsets from the updated database, but also retains the small Transaction-Weighted Utilization itemset from the original database, thereby avoiding rescanning the database as a new item when the transaction is inserted. Since the utility-list structure is a condensed structure used to retain the relevant information in the original database, Therefore, it only needs less memory to save the relevant information of the proposed algorithm, and the exact utility of the larger itemset is obtained by adding the utility-list of the smaller itemset (no need to scan the database). In addition, the algorithm trims the search space by using the remaining utility stored in the utility-list. The estimated utility co-occurrence structure [12] is also applied in HUI-list-INS to speed up the incremental mining process.
Fournier-Viger et al. designed algorithm EIHI [16] (Efficient Incremental High-utility Itemset Miner) to solve the fact that the incremental algorithm is still very expensive in execution time and the One-phase algorithm HUI-List-INS still has room for improvement. It designs a structure called HUI-trie to store the previously mined high utility itemsets and their utility information, which saves utility calculation time. A fact is that if a certain itemset X appears in D but does not appear in N, where D′ = D ∪ N. The utility of X in D′ is the same as that of X in D. Therefore, if X is the high utility itemset in D, then it is also the high utility itemset in D′. Similarly, if X is low utility itemset in D, then it is also an low utility itemset in D′. Therefore, EIHI uses some important attributes (not used in HUI-List-INS) to trim the search space during the update of the utility-list and speed up the search.
HUI-list-INS and EIHI have a shortcoming when constructing their global list data structure, that is, they need to scan the original data or other data twice to calculate the TWU ascending order, which is the optimized sort order. In order to solve this shortcoming, the LIHUP method proposed by Yun et al. [17] can not only construct and update its global data structure according to the best sorting order, but also reconstruct the previously constructed data structure with only one scan, without other database scanning, so that the mining process can be executed more efficiently. Since the proposed method only scans the given data once, the optimized sort order is determined after the construction phase. The utility-list created to store the utility information of the candidate pattern has the remaining utility information to effectively reduce the search space. Its value is initialized to zero in the construction step, and is calculated according to the optimized sort order in the reorganization step. This process passes go through all the entries in the list. That is to say, the main difference from the existing scheme’s list-based progressive high utility pattern mining technology is that the algorithm does not obtain the best sort order based on two database scans, but it establishes an initial global data structure. The data structure is reconstructed in the best sort order without further scanning. Compared with IHUP, HUPID-Growth and other algorithms that only scan the database once to build a global structure, its memory stability and speed are better in both dense and sparse datasets, which shows that there are no candidates for generating utility-list based on various methods. The above reasons prevent it from incurring large memory and time overhead.
In order to reduce the number of database scans, Yun et al. [28] proposed a new algorithm for mining high utility patterns from incremental databases without generating any candidates, which is called incremental high-utility pattern mining based on indexed lists. (IIHUM). The data in a given incremental database is refined and stored in a global data structure, called “Incremental Index Utility List (IIU-List)". The global IIU-List has been allocated to every database in the current database. Each item, each IIU-List is composed of many tuples, used to store transaction information with incremental data processing. For each tuple (a tuple is related to a transaction of the current item), NextItem points to the next item next to the current item in corresponding transaction; ActUtil is the actual utility of the current item in the transaction; NextIdx indicates the corresponding tuple of the item pointed to by NextItem in its IIU-List. SumUtil is the sum of all ActUtil in the current IIU-List, which is equal to the actual utility of the current item. RUtil represents the maximum utility that can be added to SumUtil when the current item is expanded to the maximum. This method first scans the given original database once to construct the global incremental index utility-list. After that, the constructed list structure is reorganized according to the ascending order of the Transaction-Weighted Utilization, and its index information is correctly set again. If new transaction data is entered, the proposed method will reflect them to the previously constructed data structure without renewing scan the previously processed data. In this process, calculate the TWU value of the items that make up the global structure again, and reorganize the updated data structure according to the changed TWU ascending order. After reorganizing the data structure, the user provides a threshold percentage value to request high-utility patterns for the current incremental database. Then, the algorithm recursively generates the condition indexed list data structure from the global data structure. After all the recursive work is completed, the user can directly get all the high utility patterns without any candidate checking. Experiments show that, in general, the memory usage and running time of the proposed algorithm are much lower than all the latest incremental algorithms based on the traditional utility list structure.
Jaysawal et al. [29] proposed a data structure called IUDataListSW for storing the utility and upper-bounds of items in the current sliding window along with the position in a transaction to get the projected database fastly. IUDataListSW helps output high-utility items in the current sliding window, and also helps to identify promising items for further expansion. In addition, with the help of the IUDataListSW structure, the algorithm SOHUPDS proposed in this paper can effectively obtain the initial projected database of all promising items for further expansion.
Jaysawa et al. proposed a novel algorithm called DMHUPS [30] and combined it with a data structure called IUData List to effectively mine high utility patterns on sparse and dense datasets. The IUData list stores the information of the itemset of length 1 and its position in the transaction to effectively obtain the initial plan database. In addition, the algorithm DMHUPS simultaneously calculates utility and stricter extended upper bounds for multiple promising candidates. Therefore, DMHUPS finds multiple high utility patterns at the same time and effectively trims the search space.
Combining the above algorithms, the incremental utility list with index is derived from the traditional incremental utility-list [28]. The related algorithm reduces the need to scan the original or updated database twice to build a global list to only one scan [17, 28]. In particular, the proposed indexed list makes it unnecessary to perform comparison and join operations during the construction of itemset lists of different lengths, because the index indicates the tuple position of items in the same transaction in each list, which eliminates a major drawback of traditional incremental utility-list.
Array-based storage methods
Although One-phase incremental high utility pattern mining algorithms HUI-LIST-INS and EIHI can mine all high utility patterns without accessing the original database again after the transaction update, However, these two algorithms are based on the structure of utility-list, because of the uncertainty of search space, it is still necessary to compare the utility-list of new transaction with the utility-list in the original database to mine high utility patterns, and it is not convenient to merge transactions. Especially when the amount of data is too large to store in memory, the utility-list needs frequent memory swap leads to system memory thrashing, which greatly reduces the performance of the algorithm.
In order to fundamentally solve the above problems, some array-based pattern storage structures have been proposed. For example, Zida et al. introduced a new utility counting technique based on projection arrays, called Fast Utility Counting (FAC) [31]. In the utility bin array U, each entry is called the utility bin, and the utility bin U[i] of the item i is stored in the i-th position of the array U. To calculate the upper limit on the linear time and space of all expansions of the itemset. It uses the proposed algorithm EFIM to introduce two effective techniques to reduce the cost of database scanning. These techniques are called “High-Utility Database Projection (HDP)” and “Efficient Transaction Merging (HTM)". With the exploration of larger itemsets, both of these techniques reduce the size of the database, thereby greatly reducing the cost of database scanning. In order to avoid a large number of structure comparisons and joins when generating transaction data structures, Wang et al. in 2016 proposed a new projection-based high-utility pattern mining algorithm HUPMP [32]. This algorithm also uses the form of projection array to store the information of each item and mines it. Compared with the previous algorithm, it uses a more compact reduction strategy, and its computational complexity is also greatly reduced. The latest algorithms use TWU and residual utility strategies to trim the search space, but these strategies are not efficient. In order to improve performance, Singh [33] and others adopted and modified the SU-Prune strategy proposed by EFIM [31]. In addition, it also uses the array-based utility counting technique to calculate the transaction utility and Transaction-Weighted Utilization of items. The experimental results show that the proposed idea is effective and produces good results.
In the field of high utility sequential pattern mining, the current methods mostly use utility matrix to store the sequence database in memory. Unfortunately, the utility-list consumes a lot of storage. To solve this problem, Le et al. introduced a pure array structure, which can reduce memory consumption compared with the utility matrix. Experimental results show that AHUS [34] can effectively discover all high utility itemsets. In terms of running time, memory usage and scalability of all experimental datasets, the two proposed algorithms are better than the latest algorithm HUS-Span.
Luo et al. proposed the algorithm HUP [35], and IHUP is also based on the HUP-Array structure. It consists of two parts: the processing part of the original database and the processing part of the new transaction. Using this structure, when a new transaction is added, only the HUP-Array of items that appear in the new transaction needs to be processed, and the HUP-Array of items that do not appear in the new transaction does not need to be considered, then the main projection structure is constructed separately for the new transaction, and then it is merged with the main projection of the original database. In the process of mining, there is no need to mine the entire merged main projection, only the items that appear in the new transaction. At the same time, the HUP-Array group of each item is independent. In the case of a large amount of data, it can be distributed on multiple computers for mining. Based on the HUP-Array structure, it becomes easier to update and mine new transactions. Experimental results show that with the change of transaction insertion ratio, the proposed algorithm is about two orders of magnitude faster than the list storage structure algorithms such as HUI-list-INS [15] and EIHI [16].
Lee et al. based on the sliding window model (WMFP-SW) [36] to mine the weighted maximum frequent pattern of the data stream to obtain the weighted maximum frequent pattern reflecting the latest information of the data stream. It proposes that the conditional WMFP-SW tree generated from the global tree does not need to have information about weighted infrequent items, because in the process of extracting WMFP only the current weighted frequent items are needed. Therefore, invalid items should be deleted when creating the condition tree.
Yun et al. [37] adopted the new tree structure WEPS-Tree and sliding window model suitable for weighted erasable pattern mining, and a new array structure containing the basic information of the mining process. The algorithm is completing the insertion or update of WEPS-Tree. After the operation, the algorithm extracts each unsorted path from the corresponding tail node to the root node, and stores the extracted path in a temporary array. After that, sort the items in the array in descending order of the support of the items in the tree. The sorted items of the array are re-inserted into the tree in order from the root. In this algorithm, the algorithm updates the information of the child node and the parent node link, and sets the node with the last item of the array as the tail node. This structure greatly improves the efficiency of mining operations.
It is not difficult to find that most of incremental algorithms based on array storage store the utility of each projected transaction and the upper limit information of the utility expansion of the expandable item, and use these utility arrays for fast utility calculations. That is, the utility of calculating and accessing items in O(1) time. For example, EFIM only creates three utility arrays, which are used to calculate TWU, Subtree utility, and Local utility. Secondly, it can be reused multiple times by reinitializing the same utility array with zero values before each use. This can avoid creating multiple arrays, thereby reducing memory usage. However, because the above array-based incremental high utility pattern mining algorithm are mostly carried out by projection, whenever an itemset is extended and projected, it is necessary to scan and traverse possible transactions to determine whether a transaction should be added to the projection database, which is for those with average length longer datasets are disastrous, because it is unknown whether the items that need to be newly expanded exist in the current transaction, and where they are in the utility array.
Considering the characteristics of incremental databases and projection arrays, when a new transaction arrives, the TWU of an item changes. The previously constructed projection databases may not include a large number of low TWU items, which makes the original utility array unsuitable for direct use or update. The calculation algorithm has made improvements to the static array structure. For example, IHUP [35] introduced earlier, in order to avoid rescanning all transactions to construct the projection array, choose to keep the low TWU items in the projection array, construct the HUP-Array for each item separately, and only need to mine the HUP-Array of items that appear in the new transaction.
Other methods
In incremental high utility pattern mining, the high utility pattern of using different length sets to store horizontal expansion is also a commonly used storage structure scheme. At the same time, the above-mentioned data structures have more or less efficiency problems, and it is still necessary to explore other more efficient data structures. In some special pattern mining, some scholars proposed a hash table-based pattern information storage structure, and achieved good results.
Some existing incremental algorithms use a Two-phase model, which generates a large number of candidates that cause scalability problems, or uses a vertical data structure, which causes a large number of join operations, which leads to efficiency problems. To solve these incremental algorithms, In 2019, Liu et al. [68] proposed a new incremental algorithm Id2HUP+ to directly discover the high-utility pattern. Id2HUP+ adapts to the one-stage paradigm by improving correlation-based pruning and cap-based pruning, and proposes a novel data structure niCAUL for the rapid update of dynamic databases. Take the original database D and the new transaction set

The representation of D2 in niCAUL.
Lin et al. [38] proposed a concept based on the fast-update (FUP) method to mine high utility patterns, which was originally designed for association rule mining. The proposed method first divides the itemset into four parts according to whether it belongs to the high Transaction-Weighted Utilization itemset in the original database and the newly inserted transaction, as shown in Fig. 3. Then, each part is executed by its own algorithm. It uses a horizontal mode to mine incremental high utility patterns, and uses a set to store potential candidates of different lengths generated during the enumeration process. And when new data is inserted, it will judge the corresponding situation with the itemset stored in the new dataset, so as to update the candidate itemset in the set. The execution process can be roughly divided into two steps. In the first step, the Transaction-Weighted Utilization of itemset with length K needs to be calculated as the overestimated utility. If it is greater than or equal to the updated minimum utility threshold of the transaction, it is put into a updated set, this set is used to store candidate high Transaction-Weighted Utilization itemsets. The second step is to calculate the actual utility of the high Transaction-Weighted Utilization itemset. If the threshold condition is met, it will be stored in the second type of set, which only stores the updated real high utility itemset.

Four cases of incremental algorithm updating data based on the concept of utility FUP.
Hong [39] and others also proposed to mine high average utility in incremental database based on four situations in the concept of FUP. The proposed incremental average-utility mining algorithm is divided into two stages. In the first stage, the upper limit of average utility is used to overestimate the utility of itemset. In the second stage, the actual average utility of overestimated average utility itemset is calculated. The proposed incremental high average-utility mining algorithm mines in a level-wise mode in the initial stage, and uses a set to store the set of potential candidates of different lengths generated during the enumeration process. And when new data is inserted, it will judge the corresponding situation with the itemset stored in the new dataset, so as to update the candidate itemset in the set. Because it relies on the concept of FUP, it suffers from the combinatorial explosion of search space.
Jie [40] and others designed a new DS_CWFP structure to dynamically maintain the transaction information in the current sliding window. The structure consists of three parts: CWFP-Tree, item header table (IHT) and CW-HT (closed weighted hash table). CW-HT hash table is a two-level hash mapping structure used to check closed weighted frequent patterns. An algorithm is proposed that uses a divide-and-conquer strategy to mine closed weighted frequent patterns in the tree from bottom to top using a depth-first recursive method. If the weight support of a prefix pattern P is greater than the minimum threshold, the subset detection of P is performed through the closed pattern information stored in the CW-HT hash table. If the pattern P is a subset of the closed patterns that have been found and the support is equal, then start searching for the next item in the header. Otherwise, update the information of pattern P to the closed weighted hash table. Finally, the algorithm uses CW-HT two-level hash table and CWFP-Tree to find all closed weighted frequent patterns. Experiments show that DCWFP-Miner is better than Close+ in time and memory consumption.
Among the other methods described in this section, although the utility-based FUP strategy can reduce the number of database scans, the algorithm still needs to perform multiple time-consuming database scans to find all the HUIs in the incrementally updated database. At the same time, the hash set directly stores the candidate itemsets generated level by level, which will keep a large number of candidates in the memory, so it consumes a lot of memory, especially for dense datasets. Through the comparative analysis of each structure method in this section, the tree-based method is more efficient than the set-based method. HUPID-Growth is one of the fastest algorithms. HUPID-Growth [23] only need one scan to build their global tree. However, because the above tree-based algorithms need to traverse all the tree nodes during the mining process and must recursively create a subtree structure, it is very time-consuming to apply the above tree pattern growth method. In addition, most of these algorithms are based on the TWU model, which is a loose upper limit is used for utility of the itemset, so too many candidates may be generated, especially IHUP, which is prone to memory overflow when the threshold is set low. In these tree-based pattern mining methods, they will inevitably occur candidate patterns. Limited to the nature of the tree structure, when the number of dataset items is large or the number of high TWU items is small, the efficiency of the algorithm is not satisfactory. Algorithms that uses the traditional utility list structure can quickly complete the update when a new transaction arrives. During the algorithm operation, no candidates are generated, and the database is not scanned repeatedly. Through the “build once and then update multiple times” property, this kind of structure is significantly better than related algorithms based on storage structures such as sets and trees. One-phase list-based algorithms such as IIHUM [28] by adding indexes between the utility-list tuples, making it more efficient in incremental dataset scanning and condition list construction. The array structure proposed in IHUP algorithm eliminates the complicated join operations of the traditional vertical list structure, and can greatly reduce time and memory consumption. However, as the incremental data continues to arrive, the structure requires multiple scan transactions to construct the sub-projection array, resulting in the overhead will be very large, so projection arrays are not suitable for incremental mining scenarios with large amounts of data, but for small and medium datasets or sparse datasets, it still has a great advantage over traditional list structures. The algorithm Id2HUP+ [68] with a hybrid structure is lower to the algorithm using the traditional list structure in terms of memory and time consumption. It is not difficult to find that the first proposed storage scheme of transactional list and hash table using index linked list proposed by this algorithm is one of the latest storage methods, the problem that the traditional list requires complex join operations and cannot be merged is improved. It has a great inspiration for solving the shortcomings of traditional projection arrays that rely on repeated scanning of the original database to establish sub-projections. In general, the various data structures can be expressed more efficiently in different ways such as vertical and projection, so that the mining efficiency is better than the traditional structure proposed in the past.
Although most incremental pattern mining algorithms can re-mine the changed database, and calculate the high utility pattern based on this. However, with the continuous increase of transactions, a large number of high-utility itemsets are usually mined out. While they contain many redundant itemsets, they also have a large number of itemsets with specific meanings. How to search for specific itemsets that meet user needs from a large number of result set is also a difficult task. Therefore, some derivative incremental high-utility pattern mining algorithms have been proposed. The expression of the incremental high-utility pattern mining results usually includes the incremental compressed high-utility pattern, the incremental high average-utility pattern, and the incremental high utility sequential pattern, etc. Table 6 summarizes the typical algorithms of each derivative pattern. The model relationship is shown in Table 5, and the related concepts are as follows:
Characteristics of derivative incremental high-utility patterns
Characteristics of derivative incremental high-utility patterns
Characteristics of typical derivative incremental high-utility pattern mining algorithms
Incremental high-utility pattern mining can be divided into full set of high-utility pattern and compressed high-utility pattern mining. Compressed high-utility pattern mainly includes maximum high-utility pattern, closed high-utility pattern, and Top-K high-utility pattern.
Among them, the closed high utility pattern belongs to the lossless compression pattern, which can retain all useful pattern information. While the maximum high-utility pattern removes some useful subsets in the closed high utility pattern, there is no superset in this pattern. Top-K high-utility pattern is to keep the Top-K highest utility pattern information from the closed high-utility or full set of high utility pattern mining process, which also belongs to the lossy compression pattern. This section introduces the definition of incremental compression high-utility pattern and its algorithm. The algorithm is shown in Fig. 4.

Classification diagram of incremental compressed high-utility pattern mining algorithms.
In the process of mining compression high-utility patterns, two types of constraints are mainly used for mining, namely compactness and high-utility. The compactness constraint depends on the support value of the item or itemset, and the constraint method is the same as the frequent compression pattern. The researcher gave the relevant definition of the compression high-utility pattern, as follows:
Although existing high-utility pattern mining algorithms can find all itemsets that meet a given minimum utility threshold, it is often difficult for users to set an appropriate minimum utility threshold. A small minimum utility threshold may produce a large number of itemsets, while a higher minimum utility threshold may produce several itemsets. It is difficult and time-consuming to specify the suitable minimum utility threshold. In order to solve these problems, the Top-K high utility itemset mining is defined, where K is the number of high-utility itemsets to be found. Moreover, a large number of researches on high utility mining are only for static databases, but in practical applications, new transactions are often added.
In order to efficiently process the dynamically increased transaction data and mine the number of high-utility patterns required by users, Wu et al. [41] proposed the algorithm TOPK-HUP-INS for mining incremental Top-K high-utility patterns in dynamic databases. This algorithm processes the original database before the arrival of new transactions, that is, creates a “1-pattern” utility-list. It is a process of recursively building the utility-list to find the pattern with the highest utility in the utility-list. For newly added data, the minimum utility threshold obtained before the arrival of the new transaction is retained, and is continuously updated during the mining process after the arrival of the new transaction. The order of items in the new transaction is still based on ITable. If there are no items in the new transaction in ITable, the newly added items are sorted in descending order of TWU and added to the original items in ITable. Reconstruct the items in the new transaction in the above order, that is, re-establish the “1-pattern” utility-list, merge it with the original “1-pattern” utility-list, and finally start the mining process. The algorithm adopts four different strategies to delete more low-utility patterns through threshold raising methods such as utility descending order. At the same time, it uses EUCS to reduce utility-list construction to reduce mining time and memory consumption. Lu et al. proposed the algorithm TOPK-SW [42]. Record the batch data of the current window and the transaction information of the item in the HUI-Tree, and store the path information of each item node in the leaf node, and use the method of pattern growth to mine this part of the information. Effectively calculate the utility without rescanning the dataset. Zihaya [43] and others proposed the algorithm T-HUDS, which designed a utility model called SFU to estimate the prefix utility, which is very effective for pruning the search space. At the same time, it uses HUDS to store information about transactions over the data stream, uses the pattern growth method to obtain the prefix utility of items, and proposes several threshold enhancement strategies to improve the update efficiency of the Top-K cache area and accelerate the Top-K pattern of mining. Based on the variable sliding window model, Dawar [44] proposed an effective algorithm for mining Top-K high utility itemsets from data streams. The algorithm designs a data structure called iList, which makes the algorithm very efficient in batch insertion and deletion. The utility-list information that exists in the previous window batch will be retained in order to reduce the calculation cost and time. The list structure can also ensure that no candidate itemsets are generated during the mining process. In addition, the threshold raising strategy and the residual utility pruning strategy also reduce the search space of the algorithm.
In the current research on incremental frequent itemset mining, mining the maximum frequent itemsets is more efficient than mining basic frequent itemsets and frequent closed itemsets, because the number of maximum frequent itemsets is smaller and the compression strength is higher than that of closed frequent patterns. However, because the maximum high-utility pattern is lossy compression, only the longest itemset can be recorded, and its useful high-utility subset is not saved, so the high-utility pattern cannot be recorded completely.
The algorithm WMFP-SW proposed by Lee et al. [45] expresses the importance of items by adding weight constraints, and designs the WMFP-SW-tree structure for pattern expansion, extraction of WMFP information, and subset checking operations. The WMFP-SW-array strategy is proposed to reduce the traversal time of the conditional algorithm and improve the mining efficiency. Wang Shaopeng [46] and others proposed the concept of full weighted maximum frequent itemsets FWMF and the algorithm FWMFI-SW. The Naive algorithm adds compression filter conditions to the algorithm WMFP-SW, and FWMFI-SW reduces the redundant operations of Naive. This algorithm introduces the reconstruction discriminant function RJF of WMFP-SW-Tree, and adds a conditional optimization strategy based on frequent constraints to meet the definition of the maximum high utility pattern. These two algorithms have less time overhead than algorithm WMFP-SW. Yun [37] and others proposed the algorithm MWS, which uses the tree structure MWS-tree to manage and maintain the information of continuously added transactions and related weight conditions. The MWS algorithm uses the global MWS-tree header table to traverse the tree in a bottom-up manner to perform mining operations, using the maximum weight (MW) as one of the pruning conditions. For each item selected in the header table, the algorithm passes the divide-and-conquer method finds WMFP, and checks whether WMFP is effective. This algorithm consumes less memory and time than the similar high utility pattern mining algorithm IWFP, and the effect is obvious. In addition, the GUIDE [46] framework also uses an effective compact tree structure, which can effectively mine the maximum high utility pattern.
If there is no superset in the same transaction database and it has the same support, its called a Closed High Utility Itemset (CHUI). The process of mining patterns containing such itemsets is called closed high-utility pattern mining. This compression method greatly reduces the number of itemsets without losing the original high-utility pattern, and effectively improves the analysis efficiency of the high-utility pattern. Especially when the system resources are limited, it is usually impractical to generate the entire set of high utility patterns. Moreover, it is difficult for users to utilize a large amount of high-utility information. Therefore, the closed high utility pattern has high research value.
Dam [67] and others proposed an incremental utility-list structure, which can be built and updated only through one database scan. In addition, it uses a novel global list update algorithm IncCHUI, and combines with the temporary array RuA to dynamically update the residual utility of the original transaction after the new thing is inserted, so as to effectively ensure the correctness of the pruning strategy after transaction insertion, and quickly eliminate the low utility candidates in the mining process. A method for updating or inserting a new closed itemset found based on the hash table CHT is also proposed. The method first checks whether the itemset P is already in the CHT. If it is, it means that the itemset P has been previously found when the algorithm is applied to the original database. In this case, we will update its utility and support. Otherwise, because it is a new CHUI, P is inserted into the table CHT.
Aiming at closed high-utility pattern mining over data streams, Ge et al. [47] proposed the algorithm DS_CRWF for mining closed weighted patterns over the data stream, and proposed the CRWF_tree structure, which dynamically records the candidate closed high-utility patterns in the sliding window as the window slides. Using the method of first mining high-utility patterns and then using frequent closure constraints, the two constraints of closure and weighting are satisfied, and all closed high-utility patterns can be obtained through one scan. Zeng [48] and others proposed the algorithm WCSPMPD-Stream, which designed time weight decay and no-update decay methods to control the decay rate, and increased frequent constraints to satisfy closed high-utility pattern mining. A new data structure WFPS-Tree is proposed to store and update the mining results. Experiments show that the algorithm WCSPMPD-Stream overcomes the memory limitation, and its execution time is less than the previous algorithm under the same conditions.
High Average Utility Pattern Mining (HAUPM) is used to solve the problem that high utility pattern mining (HUPM) mining results usually include long patterns composed of many low utility items. These patterns are redundant information in some business areas. The average utility measure reveals better utility effect of combining multiple items than the original utility measurement.
Hong [49] and others proposed a two-phase average utility mining algorithm ITPAU, which can gradually maintain a high average utility itemset as the database grows. Based on the concept of the algorithm FUP, the algorithm combines the information previously mined from the original database with the new mining results from the newly inserted transaction, reducing the required database scanning to speed up the mining process. The experimental results also show the effectiveness of the algorithm. Even so, because it follows Apriori method and uses a simple hashset structure to directly store a large number of candidate itemsets, it still has limitations.
Later, Yun et al. proposed an algorithm called “Incremental Mining of High Average Utility Itemsets (IMHAUI)” [52] to discover high average-utility itemsets (HAUI) from incremental databases in a more efficient way. IMHAUI uses its own compact tree data structure is called IHAUI-Tree. After inserting each incremental part of the database into the tree, it uses a reorganization phase on IHAUI-Tree to make it as compact as possible. IMHAUI uses a pattern growth method from its IHAUI-Tree finds the collection of a high average-utility upper-bound itemset(HAUUBI). It extracts a conditional pattern base for each prefix itemset, and recursively builds its local tree based on this conditional pattern base. However, this recursive working principle is negatively reflected in runtime efficiency.
In order to speed up the time required to discover high average-utility upper-bound itemsets (HAUUBI), YILDIRIM et al. made some modifications to IHAUI-Tree, and then proposed the structure of mIHAUI-Tree [3]. Based on this structure, a method called FIMHAUI was proposed using database projection and merging technology. In the original IHAUI-Tree, each node N has an average-utility upper-bound (AUUB) value (N.AUUB), which is the sum of the maximum utility of the overlapping transactions on the path from the root node to N. However, in the proposed algorithm, the tree is used to obtain the projection database of promising itemsets, not to generate the conditional pattern library and local tree, so only the leaf nodes of the path need to maintain the corresponding AUUB value in the tree. The database used by the projection has a great impact on reducing the cost of database scanning, because when the set of items to be explored becomes larger, the transaction becomes smaller. Experimental results show that this algorithm is slightly better than the earlier proposed IMHAUI [52] on dense datasets such as Chess and Mushroom dataset. Because this algorithm still needs to construct and reconstruct the tree structure in the incremental data environment, and in the process of traversing the tree that candidate itemsets is inevitably generated, so it still has the significant disadvantage of the tree storage structure and is not suitable for frequently updated large or dense datasets.
Lin et al. [50] proposed a pattern mining algorithm called IHAUPM to deal with incremental database transaction insertion. The well-known fast-update (FUP) concept in FIM has been modified to be adopted into the designed algorithm, so as to effectively update the discovered high average-utility itemset. The algorithm also maintains the correctness and completeness of the updated HAUI to ensure that all necessary information is fully discovered. It uses an effective HAUP-tree structure to maintain the necessary information in order to continue mining in the future. Based on the above structure, proposed algorithm is developed to divide the original database and new transactions into four situations for maintenance. The itemsets in each case can be maintained and processed correctly. Compared with some algorithms running in batch mode, sometimes it does not need to rescan the original database multiple times to update the discovered high average-utility itemsets. The number of candidate itemsets generated by the designed algorithm is much less than that of PAI, HAUI-tree and HAUP-growth. The reason is that these algorithms have no effective pruning strategy to reduce the useless candidates of the entire updated dataset, and the designed IHAUPM algorithm only deals with in the incremental part, the size of the incremental dataset is much smaller than the updated dataset. Based on the concept of FUP, Zhang [51] and others recently proposed an incremental HAUIM algorithm FUP-HAUIMI for transaction insertion to maintain information about the pattern when updating the database. First, construct the average utility-list (AUL) structure by scanning the original database. Then, FUP-HAUIMI selects the upper-limit itemset with high average utility and classifies it according to different situations. In each case, a specific update process is used to maintain and update the itemset. When traversing the enumeration tree representing the search space in a depth-first manner, the algorithm performs a join operation to incrementally update the AUL structure.
Incremental high utility sequential patterns
High utility sequential pattern (HUSP) mining is more complicated than traditional high utility pattern mining. It needs to find all high utility subsequences that are not lower than the user-set utility threshold.
Wang et al. [56] proposed the algorithm IncUSP-Miner+ to gradually mine HUSPs. The approach consists of two phases, the initial phase and the incremental phase. In the initial phase, proposed algorithm USP-Miner incorporating TSU and PEU to mine HUSPs on the origin database D and keep the HUSPs in the list HUSP. To facilitate better performance in the incremental phase, we propose the concept of the candidate pattern tree and in addition to HUSPs, USP-Miner will construct a candidate pattern tree T as well.
Wu [55] and others defined the problem of mining high utility sequential patterns (HUSPs) over high-speed data streams. A compact data structure named HUSP-Tree is proposed to maintain the basic information of the mined HUSP using online methods, introduce the efficient single-channel algorithm of HUSP-Stream, and use a new utility estimation model to more effectively trim the search space. Then generate HUSP from the HUSP-Tree structure. Compared with the similar algorithm Uspan [25], the running time of HUSP-Stream is significantly reduced, and it is scalable.
In addition, Zihayat [57] proposed that MAHUSP algorithm is based on two memory adaptive mechanisms, which can allocate memory adaptively at the premise of sacrificing HUSP quality. Tang et al. [54] proposed a HUSP mining algorithm based on UT-tree called HUSP-UT. The algorithm can obtain the actual utility value of the sequence from the UT-tree instead of the estimated value. It does not generate candidates during the mining process, effectively reducing storage consumption. When the algorithm creates a subtree based on the pointers of related nodes, the utility value of each item in these sequences can be obtained. This also ensures that the utility of the items in the header table and the new SWU value can be quickly calculated, which greatly reduces the running time of the algorithm. Experimental results show that the performance of the algorithm HUSP-UT is better than the latest algorithm HUSP-Stream over the data stream.
Incremental temporal high utility patterns
Temporal high utility pattern mining has attracted a lot of attention due to its practicality. Temporal or time series data widely exist in economics, finance, communications and other fields, such as weather forecasting. The time series transaction database is divided into several partitions according to the time period. In a particular time or season, the frequency of certain items or itemsets may be very high. Therefore, mining time-related information is interesting and useful.
Traditional association rule algorithms cannot find temporal association rules in a specific period. Temporal association rule mining does not consider the utility of each item. Therefore, temporal high-utility pattern mining is a research extended from temporal association rule mining and utility mining. Wang et al. [58] proposed the HUTPMiner algorithm to mine temporal high utility patterns with the help of the proposed matrix structure. The matrix structure is used to represent each sequence in the database, such as s, to avoid scanning s back and forth to improve mining efficiency. The purpose of temporal high utility pattern mining is to find all temporal patterns with high utility, even with low frequency. The traditional algorithm LMSpan [59] can only extract frequent temporal patterns, this feature is the main difference between the algorithm HUTPMiner and the algorithm LMSpan. These are the temporal high utility pattern mining algorithms on static data, and then researchers have proposed some temporal high utility pattern mining algorithms in an incremental environment.
Yeh et al. [10] researched and proposed novel incremental utility mining methods, IUM-Algorithm and FIUM-Algorithm, which can discover temporal high-utility patterns. When adding new transaction data to the original transaction database, the algorithm partitions the transaction data that arrives at different times, and conducts pattern mining for each partition in turn based on the Two-phase model, At the same time, according to the distribution of candidate itemsets in different partitions, the minimum utility threshold is dynamically updated. It can effectively identify all temporal high-utility patterns that users are interested in in a specific period. And use RU
k
to store all candidate high Transaction-Weighted Utilization patterns of partition K, and use LU[I,J] to store the temporal high-utility pattern of each time period in the range of time I to J, not only can be found temporal high-utility patterns in a specific time period, and the high utility pattern of the entire transaction database can be found. The process of the algorithm FIUM is similar to the algorithm IUM. FIUM also uses the transaction weighted downward closure property to trim candidates. The difference between the two algorithms lies in the way in which candidates are generated. In the step of generating p-pattern candidates, the algorithm IUM connects any two candidates of
Nevertheless, most temporal high-utility pattern mining ignores the on-shelf time periods of products in stores. In order to solve this problem, Lan and Tseng [61] proposed a new research topic, called “high on-shelf utility pattern ming", the model takes into account the on-shelf periods of items with the quantity and profit. Subsequently, Radkar et al. [62] proposed an effective algorithm to quickly generate a fast-update utility model tree (FUPT-HOUIN) of on-shelf items with negative utility, high on-shelf utility patterns with positive and negative values in the current period are searched from the dynamically updated temporal incremental database. The proposed algorithm will scan the original database to construct a utility tree. Then, Transaction-Weighted Utilization of all items is calculated. It calculates the Transaction-Weighted Utilization difference before and after transition modification for 1-itemsets. According to the difference value, it judges how an itemset should be handled according to the four cases predefined by FUP, and decides whether to insert, remove or rescan the previous dataset. Therefore, the database will not be rescanned for every modification, and the existence of the utility tree avoids unnecessary generation of candidate itemsets, thus shortening the execution time. The experimental results show that compared with the existing similar algorithm, the algorithm requires less execution time.
Other derivative high-utility patterns
There are many other types of incremental high utility patterns, such as interactive high utility erasable patterns, hybrid high-utility patterns, etc.
In recent years, researchers have also proposed a variant of frequent pattern mining: erasable pattern mining. The current more advanced erasable frequent pattern mining algorithm is MERIT [47]. Aiming at the high utility erasable pattern, Yun et al. [37] proposed the algorithm WEPS, which is suitable for mining weighted erasable pattern in data stream environments. The MWS tree structure is proposed based on the sliding window. First, the global MWS tree is constructed using the transactions added so far, and then the tree is reconstructed according to the item support in descending order. The WMFP tree is constructed by using the conditional MWS tree, and the WMFP-array and pattern growth technology are used to mine WMFP, when a new transaction arrives, iteratively build and mine the MWS tree. At the same time, a MW pruning strategy is designed to ensure the efficiency of the proposed algorithm. The scalability, running time and memory usage of this algorithm are better than those of MERIT and dMERIT+.
In addition to the high-utility erasable pattern, there is also a comprehensive high-utility pattern that combines the characteristics of multiple patterns. Zeng [63] defined a method for calculating time weights. The weights vary linearly with the length of the time interval between each stream segment and the current stream segment. Based on this definition, a algorithm WCSPMPD-Stream involving closed pattern and sequential pattern was proposed. The algorithm calculates the utility of the sequence pattern through the weights of the items in the sequence, and introduces closed pattern constraints to meet the requirements of the closed high-utility sequence pattern. The sliding window is divided into several segments and mined separately to obtain the closed high-utility sequential pattern.
This section reviews several derivative incremental high-utility pattern mining algorithms in recent years. It can be seen that compared with the full set of high-utility pattern mining algorithm introduced in Section 4, it has a stronger pertinence, and to a certain extent, it makes up for the shortcomings of full set of high-utility pattern mining in application scenarios. Although some incremental compressed high utility and high-average utility pattern mining algorithms [41, 67] have achieved good results using new list structures, however, many of the more advanced storage structures in the full set of incremental pattern mining have not yet been widely used and expanded in the derivative pattern. Most of the current derivative incremental high utility pattern mining algorithms use older storage structures such as hash sets and trees, this status quo especially exists in the field of incremental temporal, sequential and other derivative pattern mining. Through our previous analysis, it is found that the hash set and tree structure have weak advantages for frequently updated and large incremental data. Moreover, the current derivative algorithms are mostly based on Two-phase model, which cannot adapt well to mining scenarios where data such as sequence and time series that often change periodically and require real-time performance. At the same time, this section hopes to give researchers some reference and inspiration by introducing several derivative algorithms based on newer data structures and strategies.
Further research direction
In recent years, domestic and foreign researchers have conducted a lot of research on incremental high-utility pattern mining, but the existing methods still have many shortcomings. Based on the above research content, further research ideas have been introduced.
(1) Improve the type of incremental high-utility pattern
There are still shortcomings in the derivative high-utility pattern. For example, in Top-K high-utility pattern mining, although some patterns with higher utility can be quickly found, However, as the dataset is constantly updated, it is inevitable that there will be no redundant subset patterns. We can consider combining Top-K high-utility pattern with closed high-utility pattern or maximum high-utility pattern mining method, and introduce constraints such as superset closure detection to improve the quality of Top-K high utility pattern.
(2) Research on more efficient incremental high-utility pattern mining methods
Most of the latest incremental mining algorithms are based on traditional lists for pattern mining. A large number of join operations are required in the list construction process, which consumes time and system resources. How to apply projection-based pattern mining methods and efficient storage structures such as index utility-lists, projection arrays in the field of incremental data mining is an urgent problem.
(3) Apply incremental high-utility pattern mining research to sequential patterns or data streams
Mining association rules over the data stream requires scanning the data stream in the shortest time to obtain the next required transaction information. Sequential pattern is a data combination with timestamp. The data stream is similar to it, and many incremental pattern mining algorithms can be applied to both. At the same time, compared with other traditional classification and clustering methods over data streams, the association rules generated by the incremental high utility pattern can effectively reduce the influence and interference of noise on the data, and improve the accuracy of the results.
(4) Research on incremental association rule mining in distributed environment or big data scenario
Literature [64] first proposes a Spark-based parallel utility pattern mining algorithm, which improves the single-phase utility pattern mining algorithm HUIMiner and its vertical data structure, and designs a three-phase parallel mining model. This efficient algorithm is helpful for distributed and parallel mining on machine clusters. Literature [65] proposes a novel framework for parallel mining of Top-K high utility itemsets in big data. Besides, a new algorithm called Parallel Top-K High Utility Itemset Mining (PKU) is proposed for parallel mining of Top-K high utility patterns on Spark in-memory platform. Nevertheless, the research on algorithms in the big data environment is still in its infancy, and further development is urgently needed. In recent years, cloud computing technology has matured, and some association rule mining algorithms based on cloud computing have been proposed, such as literature [66]. It is also one of the hotspots to realize large-scale computing incremental association rule mining in the cloud computing environment.
Summary
Traditional high-utility pattern mining algorithms are designed to discover high-utility patterns in static databases. Because these algorithms usually face considerable performance problems when dealing with dynamic database, incremental data mining has become an important research topic in recent decades. However, as far as we know, in recent years, various incremental high-utility pattern mining algorithms including derivative algorithms have emerged one after another. However, there are few comprehensive reviews of incremental high-utility pattern mining and its derivative algorithms, and the classification method of these algorithms on the data storage structure has not been proposed. In this article, we discuss and classify the main high-utility pattern mining methods. The problems of high utility itemset mining and incremental high utility itemset mining are proposed. Afterwards, the incremental high utility pattern mining algorithms are summarized, including iHUIM’s tree-based method, array-based method, hashset-based method, and list-based method. The method summarized in this paper is not only suitable for incremental high-utility pattern mining tasks, but also can stimulate other related data mining tasks, such as high-utility pattern mining over data streams. Incremental high-utility pattern mining and other related problems have been studied for more than 20 years, and they are still very active research fields.
Footnotes
Acknowledgments
This work was supported by the National Nature Science Foundation of China (62062004), the Ningxia Natural Science Foundation Project (2020AAC03216), and the Graduate Innovation Project of North Minzu University (YCX20077).
