Abstract
Periodic high-utility sequential patterns (PHUSPs) mining is one of the research hotspots in data mining, which aims to discover patterns that not only have high utility but also regularly appear in sequence datasets. Traditional PHUSP mining mainly focuses on mining patterns from a single sequence, which often results in some interesting patterns being discarded due to strict constraints, and most of the discovered patterns are unstable and difficult to use for decision-making. In response to this issue, a novel algorithm called TKSPUS (top-k stable periodic high-utility sequential pattern mining) is proposed to discover stable top-k periodic high-utility sequential patterns that co-occur in multi-sequences. TKSPUS extends the traditional periodic high-utility sequential patterns mining, and designs two new metrics, namely utility stability coefficient (usc) and periodic stability coefficient (sr), to determine the periodic stability and utility stability of patterns in multi-sequences respectively. Additionally, the TKSPUS algorithm adopts the projection mechanism to mine stable periodic high-utility patterns over multi-sequence, while a new data structure called pusc and two corresponding pruning strategies are also introduced to boost the mining process. Experiments show that compared with the other four related algorithms, the TKSPUS algorithm has better performance in memory consumption and execution time, and the stability of the mining results is improved by 47% on average compared with the traditional periodic high-utility patterns mining algorithm.
Introduction
As an important research issue in recent years, high-utility sequential pattern mining (HUSPM) can reveal the high-utility knowledge in databases by considering both inner quantity and external profit, which is widely used in e-commerce recommendation, click stream analysis, and route planning [1]. However, compared to traditional sequential pattern mining (SPM), HUSPM is more challenging, as the utility of sequences does not have downward closure properties [2]. In addition, HUSPM also faces difficulty specifying an appropriate minimum utility threshold for users, especially when unfamiliar with the characteristics of the database. A too-small threshold means that many insignificant HUSPs will be discovered, while a too-large threshold may result in only capturing a few HUSPs that cannot provide sufficient information. Extracting the appropriate number of patterns by fine-tuning the threshold is very time-consuming. In response to this issue, top-k HUSPM was proposed [3], which only requires the user to directly specify the number of desired patterns without considering the setting of the minimum utility threshold. Thus, a large number of trial and error costs for determining an appropriate minimum utility threshold are avoided.
Sequential data often contains periodic patterns, and exploring these patterns can help with better prediction and trend analysis. However, the constraints used in existing periodic pattern mining algorithms are too strict, which results in some patterns that may be periodic patterns being discarded. Therefore, the concept of stable periodic pattern has been proposed [4]. Periodic high-utility sequential patterns (PHUSPs) mining not only can discover patterns with high utility but also can effectively capture patterns that regularly appear in sequence databases [5]. However, existing PHUSPs mining approaches are usually only applicable to single sequence data and ignore the stability of patterns. Mining periodic high-utility patterns that co-occur in multi-sequences is more suitable for practical scenarios [6]. For example, conducting customer marketing not only requires discovering the periodic behavior of a single customer, but finding the co-occurring periodic behavior of multiple customers is more conducive to decision-making. Figure 1 contains the transaction sequence of five customers in the retail store. The single sequence based periodic high-utility algorithms only concern the periodicity of the high-utility patterns between sequences, as shown in Fig. 1(a). When considering all customers, it can be found that bread and milk regularly appear in the database, the intra-sequence periodicity of the high-utility patterns should not be ignored, as shown in Fig. 1(b). Therefore, exploring common high-utility periodic behavior information of customers is more helpful in designing effective sales. Furthermore, decision-makers prefer to discover stable and efficient high-utility patterns that can more accurately predict user behavior and determine more appropriate marketing strategies, rather than the patterns that change significantly over time generated by traditional approaches. Therefore, an efficient mining algorithm of top-k stable periodic high-utility sequential patterns (TKSPUS) that co-occur in multi-sequences is proposed, which considers the periodic stability of patterns in each sequence and the entire database, as well as their utility stability, to ensure the stability of mining results. The main contributions are summarized below.
A mining task of top-k stable periodic high-utility sequential patterns (TKSPUS) that co-occur in multi-sequences is defined, and the corresponding properties are further explored. To evaluate the stability of the patterns, the concepts of periodic stability and utility stability are respectively introduced. Furthermore, the lability coefficient la and periodic stability coefficient Sr are presented to measure the periodic stability of patterns in single and multi-sequences, while the utility stability coefficient usc is presented to evaluate the utility stability of patterns. An efficient algorithm TKSPUS is proposed to effectively discover top-k stable periodic high-utility sequential patterns that co-occur in multi-sequences. TKSPUS adopts a new data structure called pusc to avoid duplicate database scans. Simultaneously, two new pruning strategies were designed to effectively decrease search space. Extensive experiments are conducted on real datasets to verify that the proposed TKSPUS algorithm extracts patterns with effectiveness and novelty.

Illustration of (a) inter-sequence periodicity and (b) intra-sequence periodicity of bread, milk.
The remainder of this paper is organized as follows. Section 2 discusses the related work. Section 3 formulates the related definitions. Section 4 presents the design issues of TKSPUS. Section 5 evaluates the performance of TKSPUS on real-world datasets. Finally, Section 6 concludes the paper with future research directions.
High-utility sequential pattern mining (HUSPM)
Sequential pattern mining (SPM) aims to discover all the frequent subsequences in a sequence database, which has received extensive attention in many scenarios, such as DNA analysis and stock market analysis. Enormous and long sequential patterns will emerge, and it is crucial to continuously explore more effective and scalable mining methods. The available sequential pattern mining methods are mainly divided into two categories. The first category is the Apriori-based algorithm represented by GSP [7], which mainly includes algorithms such as AprioriAll, GSP, PSP, and SPADE [8]. And, SPADE algorithm adopts the vertical data format, while the horizontal data format is used by the others. The second category employs pattern growth-based strategies, mainly including the FreeSpan algorithm [9] and the Prefix-Span algorithm [10]. The FreeSpan [9] algorithm utilizes frequent itemsets recursively to project the sequence database into smaller projected databases and generate subsequence fragments in each projected database. However, a large number of projected databases will arise. When a sequential pattern appears in each sequence in the database, the corresponding projection databases will not be scaled back, and even a combination explosion will occur. The PrefixSpan [10] has improved the FreeSpan algorithm by using pseudo projection technology, which uses prefix projection to mine sequential patterns without generating candidate sets. Subsequently, some pruning strategies were designed to avoid constructing projection storage structures for uninteresting sequences. For example, DISC-all [11] employs a novel pruning strategy called DISC to remove uninteresting patterns in advance based on sequences with the same length. However, the strategy cannot handle all cases. Therefore, some novel methods dedicated to coping with more complex situations have begun to be explored [2], such as constraint-based sequential pattern mining, closed and maximum sequential pattern mining, approximate sequential patterns mining, etc. Sequential pattern mining algorithms have been continuously developed and improved along with practical applications, and new challenges have also been encountered. However, its research is still in its initial stage and needs to be continuously explored in terms of algorithm performance, scalability, and suitability for large datasets.
Frequency is one of the most important metrics in SPM algorithms, but the frequency is not equivalent to importance. Therefore, high-utility SPM (HUSPM) [12] was proposed, as well as many optimized strategies were designed to shrink search space. Ahmed et al. [12] firstly incorporated the concept of utility into SPM, and proposed two mining algorithms, namely, UtilityLevel (UL) and UtilitySpan (US). Yin et al. [13] proposed a HUSPM algorithm USpan, which relies on a dictionary q-sequence tree. Furthermore, two extension mechanisms (called i-Extension and s-Extension) and two pruning strategies (width pruning and depth pruning) are presented to accelerate HUSPs mining. However, USpan may cause the loss of some HUSPs. HUS-UT [14] has adopted an efficient data structure called utility table to facilitate utility calculations, and a parallel version called HUS-Par is also given. In addition, the UL list data structure and corresponding algorithm are proposed by ProUM [15] and HUSP-ULL [16]. In recent years, researchers have developed more interesting concepts in the HUSPM field. Ishita et al. [17] introduced the concept of regular high-utility sequential patterns and designed different data structures to mine these patterns from incremental databases and sliding window-based data streams. Huang et al. [18] proposed a new algorithm to discover all sequential rules with high utility and high confidence to predict or recommend some scenarios. Alam et al. [19] solved the limitation of the frequency-based framework in expressing user ’s interest and proposed a complete algorithm named UGMINE for high-utility subgraph mining. In short, different application scenarios focus on various pattern features.
Traditional HUSPM algorithms are designed to discover all high-utility sequential patterns (HUSPs), resulting in an excessive number of generated patterns, and even most patterns may be uninteresting or redundant [15]. Therefore, constraint-based HUSPs mining algorithms, such as top-k HUSPs mining, have been addressed. The TUS [3] algorithm is the first top-k HUSPM algorithm, which only needs to set the value of k to identify the top-k high-utility sequential pattern without presetting the minimum utility threshold. Similar to the USpan algorithm [13], the SPU upper bound is used. To quickly increase the minimum utility threshold and reduce the search space as much as possible, some top-k HUSPs may be missed in some cases. Wang et al. [20] further developed the TKHUS-Span algorithm, which adopts three optimized search strategies based on the HUS-Span algorithm, namely BFS-based strategy, DFS-based strategy, and hybrid strategy. Lin et al. [21] designed an efficient chain structure to store more relevant information to improve mining performance. Additionally, there have been proposed several HUSPM algorithms that consider additional constraints, such as closure property [22, 23], negative utility [24], or periodicity [5]. For the traditional HUSPM algorithm, the mined high-utility itemsets could be extremely large based on user-defined minimum utility thresholds, resulting in excessive time and space cost consumption. Therefore, closed high-utility itemsets mining and corresponding improved algorithms have been proposed [22]. To further reduce the temporal and spatial performance overhead, Han et al. [25] proposed a CHUInd algorithm for mining high-utility closed itemsets in dynamic databases containing negative items. Qi et al. [26] first introduced the concepts of periodicity and recency into closed high-utility pattern mining and proposed a CPR-Miner algorithm to discover closed periodic recent high-utility patterns.
Periodic sequential pattern mining (PSPM)
Periodic frequent patterns (PFPs) mining can discover patterns that appear regularly in databases [27]. Tanbeer et al. [28] first proposed an algorithm called PF-Tree, which uses an effective tree-based data structure to generate a complete set of periodic frequent patterns that meet the user-specified periodic threshold. Kiran et al. [29] proposed a maximum periodic frequent pattern and introduced the maxPFP-Growth algorithm to solve patterns combination explosion. To restrain the generation of redundant patterns, Likhitha et al. [30] proposed a new model of closed periodic frequent patterns. Closed periodic frequent patterns represent a compact lossless subset that uniquely preserves the complete information of all periodic frequent patterns in the database. Huang et al. [31] proposed a novel algorithm MIPPS based on a suffix tree to mine different types of periodic patterns simultaneously.
The constraints of traditional periodic pattern mining are too strict, resulting in the discarding of some patterns that may be periodic, so the stability of periodic patterns needs to be considered. The TSPIN algorithm [4] is designed to efficiently discover the complete set of top-k stable periodic patterns, but is limited to identifying periodic patterns in a single discrete sequence. However, for some practical applications, it is more desirable to analyze and identify periodic patterns in multi-sequences [32] [33]. But, the discovery of stable periodic patterns in multiple sequences and the importance of these patterns have not yet been addressed.
Periodic high-utility sequential pattern mining (PHUSPM)
So far, there has been little work on PHUSPM. Fournier-Viger et al. [34] proposed the PHM algorithm to mine periodic high-utility itemsets. Huynh et al. [35] proposed an algorithm called PHUSPM, which can discover complete PHUSP sets. However, due to the unavailability of any pruning strategy to reduce the search space of PHUSPM, excessive spatiotemporal consumption is caused. Subsequently, Dinh et al. [5] adopted a periodic high-utility sequential pattern mining structure called PUSP and corresponding mining algorithm (PUSOM) to explore PHUSPs in sequential databases. The PUSOM algorithm can discover the complete PHUSP sets in the multi-sequence datasets, but its practicality is affected due to ignoring other constraints.
It can be seen that research on PHUSPM for multi-sequences is extremely rare. Compared with single-sequence mining, it deserves further investigation due to the ability to find high-utility patterns that appear regularly within each sequence and across different sequences. For example, when analyzing market baskets, it is possible to discover high-profit shopping behaviors that vary frequently over time, thereby further improving sales and marketing strategies.
Related definition
In this section, we first introduce some basic definitions from previous studies, and then present the definitions and principles involved in our method. Finally, the problem statement of top-k SPHUPM is formally described. To help readers better understand the paper, we summarize the symbols used in our paper in Table 1.
List of symbols
List of symbols
In this study, pattern
Sequence dataset
Sequence dataset
External utility corresponding to the items
(
(
(
(
(
In Table 2, given the pattern
(

I-Extension and S-Extension.
Figure 2 shows a partial lexicographic tree example of I-Extension and S-Extension in Table 2. Pattern
To measure the stability of pattern utility over time, three measurement coefficients were designed, namely average window utility apu, average utility au, and utility stability coefficient usc.
(
(
(
(
Given a pattern
For traditional periodic patterns, when
(
The calculation formula for la is as follows:
In addition, to measure the periodic stability of patterns for multi-sequences, a new periodic stability coefficient Sr is designed. A sequence database is an ordered collection composed of multi-sequences. For a given pattern
(
For given pattern
The
(
Based on the above concepts, a novel top-k stable periodic high-utility sequential pattern mining algorithm (abbr. TKSPUS) is proposed, which does not require users to set the minimum utility threshold and can effectively discover stable patterns.
pusc structure
The main data structure used by the TKSPUS algorithm is an index chain structure called pusc. Each pattern can construct a corresponding pusc structure. The pusc structure consists of three parts:
Sequence id ( Projection sequence ( Utility list ( tid: the transaction id where pattern utility: the utility of pattern restutility: remaining utility of pattern PEU: the PEU value of pattern
The pusc structure of
pusc structure of pattern
To further accelerate the mining process, four pruning strategies are introduced. Referring to the literature [20], the Prefix extension utility strategy and reduced sequence utility strategy are used to crop out the pattern with low utility. Two new pruning strategies are presented in Subsection 4.2.3 to avoid extension of the items that do not meet utility and periodic stability.
Prefix extension utility strategy
PEU
The PEU value of pattern
The RSU upper bound [20] of pattern
Two new pruning strategies are presented based on the utility stability and periodic stability of the patterns, thereby reducing the search space of the pattern.
TKSPUS algorithm
Based on the above data structure and pruning strategy, the proposed TKSPUS algorithm is described as follows. The two main steps of the TKSPUS algorithm are described in Algorithm 1 and Algorithm 2, which are the main process and the recursive mining process respectively.
In Algorithm 1, TKSPUS first scans the input database
As shown in Algorithm 2, the Mining(
Scan the projection dataset of pattern Scan each extension item Similar to step 2, but the difference is that the items in the s-list extended by S-Extension are scanned, and another discriminant function ISSPP-S is applied to determine the periodic stability of pattern
Two discrimination functions were designed for two different pattern Extension mechanisms, namely ISSPP-I and ISSPP-S, as shown in Algorithm 3 and Algorithm 4. ISSPP-I first determines whether the pattern is a stable periodic pattern in a single sequence, and then determines in multi-sequences, while ISSPP-S directly determines whether the pattern is a stable periodic pattern in the dataset.
The US pruning strategy corresponds to the ISUSC method, which is described in Algorithm 6. First, the list apulist is initialized (line 1) to store the average periodic utility apu, and the size of a fixed window is set (line 2). Scan the projection sequence set of pattern
Let
Experimental evaluation
To evaluate the performance of the proposed TKSPUS, we select two periodic high-utility patterns mining algorithms PHM [34] and PUSOM [5], as well as two advanced high-utility pattern mining algorithms TKUCE+ [36] and THUI [37]. In Section 5.2, the effects of parameters on TKSPUS algorithm were evaluated. The runtime, memory consumption, and scalability of TKSPUS algorithm were evaluated by comparing it with four algorithms, the experimental results are shown in Sections 5.3, 5.4, and 5.6, respectively. Section 5.5 compares the stability of the patterns discovered by the five algorithms, that is, the stability of the algorithms.
Experimental setting and dataset
The experiments were conducted on a computer node equipped with 64-bit Microsoft Windows 11 OS, 1.90 GHz CPU, and 16 GB of memory. To verify the applicability of the proposed algorithm on multi-sequence datasets, six datasets were selected in the experiment, including five real databases [38] and one synthetic database. The detailed characteristics of these datasets are shown in Table 5.
Sign. A dataset of sign language utterance containing approximately 800 sequences. Each utterance in the dataset is associated with a video segment that has been meticulously transcribed. Kosarak10k. A subset of the original Kosarak clickstream dataset from the Hungarian online news portal. Bible. A sequence dataset converted by the Bible, and each word represents an item. Mushroom. This dataset is prepared based on UCI mushroom dataset. Yoochoose-buys. This dataset is obtained from the RecSys2015 Challenge and contains all transactions that customers purchase electronic goods. Syn10k. A synthetic dataset was generated using IBM generator [39].
Dataset characteristics
In this experiment, we changed the
It can be found that the increase of the
In later experiments, these parameters in the TKSPUS algorithm are set to a fixed size
The impact of parameters on sign dataset
The impact of parameters on sign dataset
The impact of parameters on Yoochoose-buys dataset
The precision of the five algorithms on the Kosarak10k dataset
This group of experiments evaluates the execution time of the algorithms by setting different numbers of expected items (i.e.
The experimental results in Fig. 3 show that the running time of TKUCE+ algorithm on six datasets is higher than that of the other four algorithms. This is because TKUCE+ involves complex computational steps or iterative processes, resulting in an increase in overall execution time. In most cases, the proposed algorithm TKSPUS is superior to other comparison algorithms in terms of execution time. This result demonstrates that the pusc structure introduced by TKSPUS can effectively improve the mining efficiency, and TKSPUS can adapt well to large-scale datasets containing a large number of sequences (Yoochoose-buys). In some cases, the TKSPUS algorithm is inferior to PUSOM and PHM. This is mainly because the utility threshold calculated by TKSPUS algorithm is often large, resulting in fewer patterns discovered by PUSOM and PHM algorithms and shorter runtime. However, in general, if users do not use the top-k mining method, it is necessary to repeatedly explore to find the appropriate threshold. Therefore, although PUSOM and PHM exhibit better, such performance cannot be guaranteed in many practical applications. In summary, our TKSPUS algorithm performs well in runtime on dense and sparse datasets.

Execution time of different algorithms on six datasets.
In this group of experiments, we compare memory consumption in five algorithms. Similarly, different

Memory consumption of different algorithms on six datasets.
Stability is an important metric for our proposed algorithm. In this group of experiments, we measure the consistency and stability of the algorithm in identifying related patterns by quantifying the precision of positive predictions. To evaluate precision, the dataset is divided into 60% for training and 40% for testing, and the training results are used to evaluate the precision of the periodic high-utility patterns in the testing set. The experimental results are illustrated in Tables 8 and 9. The precision is defined as follows. The higher the precision value, the stronger the stability of the pattern mined by the algorithm.
In which, TruePositive refers to the patterns with stable periodic high-utility generated in both training and testing datasets. FalsePositive refers to the patterns with periodic high-utility generated from the training set, but are not discovered in the test set.
The precision of the five algorithms on the sign dataset

Scalability comparison of different algorithms.
Similarly, in this section of the experiment, since PUSOM and PHM algorithms are not top-k mining, it is still necessary to first apply the TKSPUS algorithm to calculate the utility value as the minimum utility threshold for PUSOM. From the experimental results, it can be seen that traditional high-utility mining algorithms TKUCE+ and THUI perform very unstable, and the stability of periodic high-utility mining PHM and PUSOM algorithms is also greatly affected by the characteristics of the dataset. However, our TKSPUS algorithm has a precision of over 50% and exhibits good stability when mining periodic high-utility patterns in multi-sequences. The advantage of multi-sequence mining is that the frequent regularity of patterns within a single sequence and among sequences can be comprehensively considered. The two new pruning strategies adopted by TKSPUS algorithm can effectively filter out unstable patterns, thereby improving the stability and accuracy of mining results. Therefore, the TKSPUS algorithm can produce consistent and reliable mining results regardless of which dataset it is applied to.
List of abbreviations
The impact of database size on the overall performance of the algorithm is investigated. We fixed the
Conclusion and future work
Periodic high-utility patterns mining is an emerging research field, which aims to discover periodic patterns with high-utility. However, there are still many limitations in traditional algorithms, such as most of the mining results are unstable, constraints are too strict, and focusing on pattern mining for a single sequence. To address these issues, this study proposes a mining stable top-k periodic high-utility algorithm from multi-sequences, named TKSPUS. The utility stability constraint and the period stability constraint are added to the top-k periodic high-utility patterns mining to make the mining results more meaningful. The algorithm adopts the projection mechanism and a new data structure, and pruning strategies are introduced, which effectively improves the efficiency of the algorithm. Finally, extensive experiments are carried out on multiple datasets to prove the effectiveness of TKSPUS algorithm. The results show that the patterns generated by traditional high-utility mining algorithms and periodic high-utility mining algorithms are often unstable, especially for datasets with uncertain features. However, experiment results on datasets with different features show that our TKSPUS algorithm performs excellently in terms of execution time, memory consumption, pattern stability, and so on. As a future research direction, we will focus on sequence data pre-processing and extend the TKSPUS algorithm to accommodate the dynamic growth of data in the era of big data, which will enable it to be applied to dynamic and massive time series databases.
Footnotes
Acknowledgments
This work is supported by the National Natural Science Foundation of China(Grant No. 62272336), Projects of Science and Technology Cooperation and Exchange of Shanxi Province (Grant Nos. 202204041101037, 202204041101033).
