Stable top-k periodic high-utility patterns mining over multi-sequence

Abstract

Periodic high-utility sequential pattern (PHUSP) mining is an active research topic in data mining that aims to discover patterns that not only have high utility but also appear regularly in sequence datasets. Traditional PHUSP mining mainly focuses on mining patterns from a single sequence, which often results in some interesting patterns being discarded due to strict constraints, and most of the discovered patterns are unstable and difficult to use for decision-making. In response to this issue, a novel algorithm called TKSPUS (top-k stable periodic high-utility sequential pattern mining) is proposed to discover stable top-k periodic high-utility sequential patterns that co-occur in multi-sequences. TKSPUS extends traditional periodic high-utility sequential pattern mining and designs two new metrics, namely the utility stability coefficient (usc) and the periodic stability coefficient (Sr), to determine the utility stability and periodic stability of patterns in multi-sequences, respectively. Additionally, the TKSPUS algorithm adopts the projection mechanism to mine stable periodic high-utility patterns over multi-sequences, while a new data structure called pusc and two corresponding pruning strategies are introduced to speed up the mining process. Experiments show that, compared with four other related algorithms, TKSPUS performs better in memory consumption and execution time, and the stability of the mining results is improved by 47% on average compared with the traditional periodic high-utility pattern mining algorithm.

Keywords

High-utility pattern mining; periodic pattern; pattern stability; multi-sequences

1. Introduction

As an important research issue in recent years, high-utility sequential pattern mining (HUSPM) can reveal the high-utility knowledge in databases by considering both inner quantity and external profit, which is widely used in e-commerce recommendation, click stream analysis, and route planning [1]. However, compared to traditional sequential pattern mining (SPM), HUSPM is more challenging, as the utility of sequences does not have downward closure properties [2]. In addition, HUSPM also faces difficulty specifying an appropriate minimum utility threshold for users, especially when unfamiliar with the characteristics of the database. A too-small threshold means that many insignificant HUSPs will be discovered, while a too-large threshold may result in only capturing a few HUSPs that cannot provide sufficient information. Extracting the appropriate number of patterns by fine-tuning the threshold is very time-consuming. In response to this issue, top-k HUSPM was proposed [3], which only requires the user to directly specify the number of desired patterns without considering the setting of the minimum utility threshold. Thus, a large number of trial and error costs for determining an appropriate minimum utility threshold are avoided.

Sequential data often contains periodic patterns, and exploring these patterns supports better prediction and trend analysis. However, the constraints used in existing periodic pattern mining algorithms are too strict, so some patterns that may be periodic are discarded. Therefore, the concept of stable periodic patterns has been proposed [4]. Periodic high-utility sequential pattern (PHUSP) mining can not only discover patterns with high utility but also effectively capture patterns that appear regularly in sequence databases [5]. However, existing PHUSP mining approaches are usually applicable only to single-sequence data and ignore the stability of patterns. Mining periodic high-utility patterns that co-occur in multi-sequences is more suitable for practical scenarios [6]. For example, customer marketing requires discovering the periodic behavior of a single customer, but finding the co-occurring periodic behavior of multiple customers is more conducive to decision-making. Figure 1 contains the transaction sequences of five customers in a retail store. Single-sequence-based periodic high-utility algorithms consider only the periodicity of high-utility patterns between sequences, as shown in Fig. 1(a). When all customers are considered, bread and milk appear regularly in the database, so the intra-sequence periodicity of high-utility patterns, shown in Fig. 1(b), should not be ignored. Therefore, exploring the common high-utility periodic behavior of customers is more helpful in designing effective sales strategies. Furthermore, decision-makers prefer stable high-utility patterns, which predict user behavior more accurately and lead to more appropriate marketing strategies, over the patterns generated by traditional approaches, which change significantly over time.

Therefore, an efficient algorithm for mining top-k stable periodic high-utility sequential patterns (TKSPUS) that co-occur in multi-sequences is proposed. It considers the periodic stability of patterns in each sequence and in the entire database, as well as their utility stability, to ensure the stability of the mining results. The main contributions are summarized below.

– A mining task of top-k stable periodic high-utility sequential patterns (TKSPUS) that co-occur in multi-sequences is defined, and the corresponding properties are further explored.
– To evaluate the stability of the patterns, the concepts of periodic stability and utility stability are introduced. The lability coefficient la and the periodic stability coefficient Sr are presented to measure the periodic stability of patterns in single and multi-sequences, respectively, while the utility stability coefficient usc is presented to evaluate the utility stability of patterns.
– An efficient algorithm, TKSPUS, is proposed to discover top-k stable periodic high-utility sequential patterns that co-occur in multi-sequences. TKSPUS adopts a new data structure called pusc to avoid duplicate database scans. In addition, two new pruning strategies are designed to reduce the search space.
– Extensive experiments are conducted on real datasets to verify the effectiveness of the proposed TKSPUS algorithm and the novelty of the patterns it extracts.

Figure 1.
Illustration of (a) inter-sequence periodicity and (b) intra-sequence periodicity of bread, milk.

The remainder of this paper is organized as follows. Section 2 discusses the related work. Section 3 formulates the related definitions. Section 4 presents the design issues of TKSPUS. Section 5 evaluates the performance of TKSPUS on real-world datasets. Finally, Section 6 concludes the paper with future research directions.

2. Related work

2.1. High-utility sequential pattern mining (HUSPM)

Sequential pattern mining (SPM) aims to discover all the frequent subsequences in a sequence database and has received extensive attention in many scenarios, such as DNA analysis and stock market analysis. Because numerous and long sequential patterns can emerge, it is crucial to keep exploring more effective and scalable mining methods. The available sequential pattern mining methods are mainly divided into two categories. The first category consists of the Apriori-based algorithms represented by GSP [7], mainly AprioriAll, GSP, PSP, and SPADE [8]. SPADE adopts the vertical data format, while the others use the horizontal data format. The second category employs pattern-growth strategies, mainly the FreeSpan algorithm [9] and the PrefixSpan algorithm [10]. FreeSpan [9] uses frequent itemsets recursively to project the sequence database into smaller projected databases and generates subsequence fragments in each projected database. However, a large number of projected databases will arise. When a sequential pattern appears in every sequence in the database, the corresponding projected databases do not shrink, and a combinatorial explosion may even occur. PrefixSpan [10] improves on FreeSpan by using pseudo-projection, which relies on prefix projection to mine sequential patterns without generating candidate sets. Subsequently, some pruning strategies were designed to avoid constructing projection storage structures for uninteresting sequences. For example, DISC-all [11] employs a pruning strategy called DISC to remove uninteresting patterns in advance based on sequences of the same length. However, the strategy cannot handle all cases.
Therefore, some novel methods dedicated to coping with more complex situations have begun to be explored [2], such as constraint-based sequential pattern mining, closed and maximum sequential pattern mining, approximate sequential patterns mining, etc. Sequential pattern mining algorithms have been continuously developed and improved along with practical applications, and new challenges have also been encountered. However, its research is still in its initial stage and needs to be continuously explored in terms of algorithm performance, scalability, and suitability for large datasets.

Frequency is one of the most important metrics in SPM algorithms, but frequency is not equivalent to importance. Therefore, high-utility SPM (HUSPM) [12] was proposed, and many optimization strategies were designed to shrink the search space. Ahmed et al. [12] first incorporated the concept of utility into SPM and proposed two mining algorithms, namely UtilityLevel (UL) and UtilitySpan (US). Yin et al. [13] proposed a HUSPM algorithm, USpan, which relies on a lexicographic q-sequence tree. Furthermore, two extension mechanisms (called i-Extension and s-Extension) and two pruning strategies (width pruning and depth pruning) are presented to accelerate HUSP mining. However, USpan may miss some HUSPs. HUS-UT [14] adopts an efficient data structure called the utility table to facilitate utility calculations, and a parallel version called HUS-Par is also given. In addition, the UL list data structure and corresponding algorithms are proposed in ProUM [15] and HUSP-ULL [16]. In recent years, researchers have developed more interesting concepts in the HUSPM field. Ishita et al. [17] introduced the concept of regular high-utility sequential patterns and designed different data structures to mine these patterns from incremental databases and sliding window-based data streams. Huang et al. [18] proposed a new algorithm to discover all sequential rules with high utility and high confidence for prediction or recommendation. Alam et al. [19] addressed the limitation of the frequency-based framework in expressing the user's interest and proposed a complete algorithm named UGMINE for high-utility subgraph mining. In short, different application scenarios focus on different pattern features.

Traditional HUSPM algorithms are designed to discover all high-utility sequential patterns (HUSPs), resulting in an excessive number of generated patterns, and even most patterns may be uninteresting or redundant [15]. Therefore, constraint-based HUSPs mining algorithms, such as top-k HUSPs mining, have been addressed. The TUS [3] algorithm is the first top-k HUSPM algorithm, which only needs to set the value of k to identify the top-k high-utility sequential pattern without presetting the minimum utility threshold. Similar to the USpan algorithm [13], the SPU upper bound is used. To quickly increase the minimum utility threshold and reduce the search space as much as possible, some top-k HUSPs may be missed in some cases. Wang et al. [20] further developed the TKHUS-Span algorithm, which adopts three optimized search strategies based on the HUS-Span algorithm, namely BFS-based strategy, DFS-based strategy, and hybrid strategy. Lin et al. [21] designed an efficient chain structure to store more relevant information to improve mining performance. Additionally, there have been proposed several HUSPM algorithms that consider additional constraints, such as closure property [22, 23], negative utility [24], or periodicity [5]. For the traditional HUSPM algorithm, the mined high-utility itemsets could be extremely large based on user-defined minimum utility thresholds, resulting in excessive time and space cost consumption. Therefore, closed high-utility itemsets mining and corresponding improved algorithms have been proposed [22]. To further reduce the temporal and spatial performance overhead, Han et al. [25] proposed a CHUInd algorithm for mining high-utility closed itemsets in dynamic databases containing negative items. Qi et al. [26] first introduced the concepts of periodicity and recency into closed high-utility pattern mining and proposed a CPR-Miner algorithm to discover closed periodic recent high-utility patterns.

2.2. Periodic sequential pattern mining (PSPM)

Periodic frequent patterns (PFPs) mining can discover patterns that appear regularly in databases [27]. Tanbeer et al. [28] first proposed an algorithm called PF-Tree, which uses an effective tree-based data structure to generate a complete set of periodic frequent patterns that meet the user-specified periodic threshold. Kiran et al. [29] proposed a maximum periodic frequent pattern and introduced the maxPFP-Growth algorithm to solve patterns combination explosion. To restrain the generation of redundant patterns, Likhitha et al. [30] proposed a new model of closed periodic frequent patterns. Closed periodic frequent patterns represent a compact lossless subset that uniquely preserves the complete information of all periodic frequent patterns in the database. Huang et al. [31] proposed a novel algorithm MIPPS based on a suffix tree to mine different types of periodic patterns simultaneously.

The constraints of traditional periodic pattern mining are too strict, so some patterns that may be periodic are discarded; the stability of periodic patterns therefore needs to be considered. The TSPIN algorithm [4] is designed to efficiently discover the complete set of top-k stable periodic patterns, but it is limited to identifying periodic patterns in a single discrete sequence. For some practical applications, however, it is more desirable to analyze and identify periodic patterns in multi-sequences [32, 33]. The discovery of stable periodic patterns in multiple sequences and the importance of these patterns have not yet been addressed.

2.3. Periodic high-utility sequential pattern mining (PHUSPM)

So far, there has been little work on PHUSPM. Fournier-Viger et al. [34] proposed the PHM algorithm to mine periodic high-utility itemsets. Huynh et al. [35] proposed an algorithm called PHUSPM, which can discover complete PHUSP sets. However, due to the unavailability of any pruning strategy to reduce the search space of PHUSPM, excessive spatiotemporal consumption is caused. Subsequently, Dinh et al. [5] adopted a periodic high-utility sequential pattern mining structure called PUSP and corresponding mining algorithm (PUSOM) to explore PHUSPs in sequential databases. The PUSOM algorithm can discover the complete PHUSP sets in the multi-sequence datasets, but its practicality is affected due to ignoring other constraints.

It can be seen that research on PHUSPM for multi-sequences is extremely rare. Compared with single-sequence mining, it deserves further investigation due to the ability to find high-utility patterns that appear regularly within each sequence and across different sequences. For example, when analyzing market baskets, it is possible to discover high-profit shopping behaviors that vary frequently over time, thereby further improving sales and marketing strategies.

3. Related definition

In this section, we first introduce some basic definitions from previous studies, and then present the definitions and principles involved in our method. Finally, the problem statement of top-k SPHUPM is formally described. To help readers better understand the paper, we summarize the symbols used in our paper in Table 1.

Table 1
List of symbols

Symbols	Description
$q_{k}$	The internal utility of item $i_{k}$
$q(i, T)$	The internal utility value of item $i$ in transaction $T$
$p(i_{k})$	The external utility of item $i_{k}$
$U(i, T)$	The utility of item $i$ in transaction $T$
$U(p, S)$	The utility of pattern $p$ in sequence $S$
$U(p, D)$	The utility of pattern $p$ in sequence dataset $D$
$U_{max}(p, D)$	The maximum utility value of pattern $p$ in sequence dataset $D$
$Tu$	Transaction utility
$g(p, S)$	The transaction list for pattern $p$ in sequence $S$
$\mathit{sup}(p)$	Number of occurrences of $p$ in sequence $S$
$\mathit{pes}(p, S)$	The period of pattern $p$ in sequence $S$
$\mathit{tid}(T_{i})$	The transaction id at the time of the $i$th occurrence
$\mathit{apu}(p, T_{i, j})$	The average window utility value of pattern $p$
$\mathit{au}(p)$	The average utility of pattern $p$
$|T|$	The number of transactions in the sequence dataset $D$
$\mathit{usc}(p)$	The utility stability coefficient usc of pattern $p$
la	The lability coefficient
$\mathit{Sumspp}(p)$	The number of sequences in which pattern $p$ satisfies periodic stability
$\mathit{Sr}(p)$	The periodic stability coefficient of pattern $p$
$|D|$	The number of sequences in the sequence dataset $D$
$⟨ p \oplus i ⟩$	The I-Extension of $p$ with item $i$
$⟨ p \otimes i ⟩$	The S-Extension of $p$ with item $i$
$\mathit{maxper}(p)$	The maximum period value of pattern $p$ in sequence $S$, that is, $\max(\mathit{pes}(p, S))$
$\mathit{minper}(p)$	The minimum period value of pattern $p$ in sequence $S$, that is, $\min(\mathit{pes}(p, S))$
$\mathit{avgper}(p)$	The average period value of pattern $p$ in sequence $S$, that is, $\mathrm{avg}(\mathit{pes}(p, S))$
minUSC	The minimum utility stability threshold given by the user
maxLa	The maximum lability threshold given by the user
maxPer	The maximum period threshold given by the user
minAvg	The minimum average period threshold given by the user
maxAvg	The maximum average period threshold given by the user
minutil	The minimum utility threshold given by the user
minSr	The minimum periodic stability threshold given by the user

3.1. Basic definitions

$I = \{i_{1}, i_{2}, \dots, i_{n}\}$ ($n ⩾ 1$) is a finite set containing $n$ items. A transaction can be expressed as $T = [(i_{1}, q_{1}), (i_{2}, q_{2}), \dots, (i_{m}, q_{m})]$. Each item in the transaction database is recorded as $(i_{k}, q_{k})$, where $i_{k} \in I$ ($1 ⩽ k ⩽ n$) and $q_{k}$ represents the internal utility. In general, the items in a transaction are sorted lexicographically. Table 2 gives a quantitative transaction sequence database $D = \{S_{1}, S_{2}, \dots, S_{n}\}$, in which each sequence $S_{i}$ ($1 ⩽ i ⩽ n$) contains one or more transactions $T$. In addition, each item $i_{k}$ is associated with a value representing its importance or profit, which is called the external utility and denoted by $p(i_{k})$. Table 3 shows the corresponding external utility for each item.

In this study, pattern $p$ involves two representations: pattern $p = ⟨ (i_{1}, i_{2}, \dots, i_{n}) ⟩$ or $p = ⟨ (i_{1}), (i_{2}), \dots, (i_{n}) ⟩$ . In the former representation, the items are present in the same transaction, while in the latter, the items occur in different transactions or sequences.

Table 2
Sequence dataset

Sid	Tid	Transaction	Tu
$1$	1	( $a$ ,1)( $b$ ,1)( $e$ ,1)	10
	2	( $a$ ,1)( $b$ ,2)( $e$ ,3)	20
	3	( $a$ ,2)( $d$ ,1)	5
	4	( $a$ ,1)( $e$ ,3)	8
	5	( $a$ ,3)( $b$ ,1)( $c$ ,2)	20
$2$	1	( $c$ ,2)	8
	2	( $a$ ,1)( $b$ ,2)( $c$ ,1)( $e$ ,1)	20
	3	( $c$ ,1)( $d$ ,3)	7
	4	( $a$ ,2)( $b$ ,1)( $c$ ,2)( $e$ ,2)	22
	5	( $a$ ,1)( $b$ ,2)( $d$ ,1)	15
$3$	1	( $b$ ,2)( $c$ ,1)	16
	2	( $a$ ,1)( $b$ ,2)	14
	3	( $a$ ,2)( $c$ ,1)( $d$ ,2)	10
	4	( $a$ ,2)( $c$ ,3)	16
	5	( $a$ ,1)( $b$ ,3)	20
$4$	1	( $a$ ,1)( $b$ ,2)( $d$ ,1)( $e$ ,3)	21
	2	( $a$ ,2)( $b$ ,1)( $e$ ,8)	26
	3	( $a$ ,1)( $b$ ,2)( $c$ ,1)	18
	4	( $a$ ,1)( $b$ ,1)( $d$ ,2)( $e$ ,1)	12
	5	( $a$ ,2)( $b$ ,1)	10

Table 3

External utility corresponding to the items

Item	$a$	$b$	$c$	$d$	$e$
Utility	2	6	4	1	2

Definition 1.

(Item utility [5]). For an item $i$ in transaction $T$ , its utility is defined as the product of internal utility and external utility.

$\begin{aligned} U(i, T) = q(i, T) \cdot p(i) \end{aligned}$

(1)

Definition 2.

(Pattern utility [5]). The utility of pattern p in sequence S and dataset D is denoted as:

$\begin{aligned} U(p, S) &= \sum_{i \in p \land p \in S} U(i, S) \\ U(p, D) &= \sum_{i \in p \land p \in D} U(i, D) \end{aligned}$

(2)

Definition 3.

(Transaction utility [5]). For a transaction $T$ , its utility is the sum of the utilities of the items included in the transaction $T$ , denoted as:

$\begin{aligned} Tu = \sum_{i \in T} U(i, T) \end{aligned}$

(3)
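
Definitions 1 and 3 can be checked against Tables 2 and 3 with a few lines of Python. The following is only an illustrative sketch: the names `PROFIT`, `item_utility`, and `transaction_utility` are hypothetical helpers, not part of the proposed algorithm, and the data are copied from sequence $S_{1}$ of Table 2.

```python
# External utilities copied from Table 3 (hypothetical constant name).
PROFIT = {"a": 2, "b": 6, "c": 4, "d": 1, "e": 2}

# Sequence S1 of Table 2: each transaction is a list of (item, quantity).
S1 = [
    [("a", 1), ("b", 1), ("e", 1)],
    [("a", 1), ("b", 2), ("e", 3)],
    [("a", 2), ("d", 1)],
    [("a", 1), ("e", 3)],
    [("a", 3), ("b", 1), ("c", 2)],
]


def item_utility(item, quantity):
    """Eq. (1): internal utility (quantity) times external utility."""
    return quantity * PROFIT[item]


def transaction_utility(transaction):
    """Eq. (3): sum of the item utilities of one transaction."""
    return sum(item_utility(item, qty) for item, qty in transaction)


tu_s1 = [transaction_utility(t) for t in S1]
print(tu_s1)  # [10, 20, 5, 8, 20], the Tu column of Table 2 for S1
```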

Definition 4.

(High-utility pattern [5]). Given pattern $p$ , if $U_{max} (p, D) ⩾ m i n u t i l$ , the pattern $p$ is referred to as a high-utility pattern.

Definition 5.

(Period of pattern [5]). Assume that the transaction list for pattern $p$ in sequence $S$ is $g(p, S) = \{\mathit{tid}(T_{0}), \mathit{tid}(T_{1}), \dots, \mathit{tid}(T_{\mathit{sup}(p)}), \mathit{tid}(T_{\mathit{sup}(p) + 1})\}$, where $\mathit{tid}(T_{i})$ refers to the transaction number at position $i$ and $\mathit{sup}(p)$ is the support of $p$ in sequence $S$, with $\mathit{tid}(T_{0}) = 0$ and $\mathit{tid}(T_{\mathit{sup}(p) + 1}) = |S|$. The number of transactions between any consecutive $T_{i}$ and $T_{i + 1}$ in the transaction list is represented as $\mathit{pes}(p, S, i)$ and is defined as $\mathit{pes}(p, S, i) = \mathit{tid}(T_{i + 1}) - \mathit{tid}(T_{i})$. The period of pattern $p$ in sequence $S$ is given by Eq. (4).

$\begin{aligned} \mathit{pes}(p, S) = \{\mathit{pes}(p, S, 0), \mathit{pes}(p, S, 1), \dots, \mathit{pes}(p, S, \mathit{sup}(p))\} \end{aligned}$

(4)

Example 1.

In Table 2, given the pattern $p = ⟨ (a b) ⟩$, then $g(p, S_{1}) = \{0, 1, 2, 5, 5\}$. According to Eq. (4), $\mathit{pes}(p, S_{1}, 0) = \mathit{tid}(T_{1}) - \mathit{tid}(T_{0}) = 1$; $\mathit{pes}(p, S_{1}, 1) = \mathit{tid}(T_{2}) - \mathit{tid}(T_{1}) = 1$; $\mathit{pes}(p, S_{1}, 2) = \mathit{tid}(T_{3}) - \mathit{tid}(T_{2}) = 3$; $\mathit{pes}(p, S_{1}, 3) = \mathit{tid}(T_{4}) - \mathit{tid}(T_{3}) = 0$. Therefore, the period of pattern $p$ in sequence $S_{1}$ is $\mathit{pes}(p, S_{1}) = \{1, 1, 3, 0\}$.
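
The computation in Example 1 can be sketched in Python as follows. The function names `occurrences` and `periods` are hypothetical, and the sequence is reduced to the item sets of $S_{1}$ in Table 2, since quantities do not affect the period.

```python
def occurrences(itemset, sequence):
    """1-based ids of the transactions that contain every item of `itemset`."""
    return [tid for tid, items in enumerate(sequence, start=1)
            if set(itemset) <= items]


def periods(tids, seq_len):
    """Eq. (4): gaps in the list g(p, S), padded with tid 0 and |S|."""
    g = [0] + sorted(tids) + [seq_len]
    return [g[i + 1] - g[i] for i in range(len(g) - 1)]


# Item sets of the five transactions of S1 in Table 2.
S1 = [{"a", "b", "e"}, {"a", "b", "e"}, {"a", "d"}, {"a", "e"}, {"a", "b", "c"}]

tids = occurrences({"a", "b"}, S1)
pes = periods(tids, len(S1))
print(tids, pes)  # [1, 2, 5] [1, 1, 3, 0]
```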

Definition 6.

(Pattern extension [20]). The extension of patterns can be divided into two types: I-Extension and S-Extension. Given a pattern $p$ , I-Extension is to attach the item $i$ in the same transaction to $p$ to form a new itemset, which is denoted as $⟨ p \oplus i ⟩$ . S-Extension of pattern $p$ is to append the item $i$ in other transactions behind the transaction where pattern $p$ is located to form a new itemset, denoted as $⟨ p \otimes i ⟩$ . Generally, performing an I-Extension or S-Extension increases the length of pattern $p$ by one, but an I-Extension does not change the size of pattern $p$ , while an S-Extension increases the size by one.

Figure 2.

I-Extension and S-Extension.

Example 2.

Figure 2 shows part of the lexicographic tree of I-Extensions and S-Extensions for Table 2. Patterns $⟨ (a b e) ⟩$ and $⟨ (a b) (a) ⟩$ are generated by performing an I-Extension and an S-Extension of pattern $⟨ (a b) ⟩$, respectively.
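
As an illustration of Definition 6, the two extensions can be sketched on a pattern stored as a list of itemsets. This representation and the function names `i_extension`, `s_extension`, `length`, and `size` are assumptions made for the sketch only.

```python
def i_extension(pattern, item):
    """Add `item` to the last itemset (same transaction): <p (+) i>."""
    return pattern[:-1] + [pattern[-1] + [item]]


def s_extension(pattern, item):
    """Append `item` as a new itemset (later transaction): <p (x) i>."""
    return pattern + [[item]]


def length(pattern):
    """Number of items in the pattern."""
    return sum(len(itemset) for itemset in pattern)


def size(pattern):
    """Number of itemsets in the pattern."""
    return len(pattern)


p = [["a", "b"]]              # the pattern <(a b)>
p_i = i_extension(p, "e")     # <(a b e)>
p_s = s_extension(p, "a")     # <(a b)(a)>
print(p_i, length(p_i), size(p_i))  # [['a', 'b', 'e']] 3 1
print(p_s, length(p_s), size(p_s))  # [['a', 'b'], ['a']] 3 2
```

Both extensions raise the length from 2 to 3, while only the S-Extension raises the size, as stated in Definition 6.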

3.2. New contribution definitions

To measure the stability of pattern utility over time, three measurement coefficients were designed, namely average window utility apu, average utility au, and utility stability coefficient usc.

Definition 7.
(Average window utility apu). Given a transaction window $T_{i, j}$, the window size is denoted as $|\mathit{window}|$, where $|\mathit{window}| = j - i + 1$. The average window utility of pattern $p$, denoted $\mathit{apu}(p, T_{i, j})$, is defined as follows:
$\begin{aligned} \mathit{apu}(p, T_{i, j}) = \frac{\sum_{k = i}^{j} U(p, T_{k})}{|\mathit{window}|} \end{aligned}$
(5)
Definition 8.
(Average utility au). The average utility $a u (p)$ is defined as:
$\begin{aligned} a u (p) = \frac{U (p, D)}{| T |} \end{aligned}$
(6)
Definition 9.
(Utility stability coefficient usc). The utility stability coefficient usc of pattern $p$ is defined as the ratio of the minimum average window utility value of pattern $p$ to its average utility au.
$\begin{aligned} u s c (p) = \frac{m i n (a p u (p, T_{i, j}))}{a u (p)} \end{aligned}$
(7)
Definition 10.
(Utility stability). If $u s c (p) ⩾ m i n U S C$ , then the pattern $p$ is considered to satisfy utility stability.
Example 3.
Given a pattern $p = ⟨ (a, b, e) ⟩$ and a window size of 3, for the sequence $S_{1}$ in Table 2 it can be obtained that $U(p, T_{1}) = 10$, $U(p, T_{2}) = 20$, $U(p, T_{3}) = 0$, $U(p, T_{4}) = 0$, $U(p, T_{5}) = 0$. Then $\mathit{apu}(p, T_{1, 3}) = \frac{30}{3} = 10$, $\mathit{apu}(p, T_{2, 4}) = \frac{20}{3} \approx 6.67$, $\mathit{apu}(p, T_{3, 5}) = 0$. Similarly, the apu values of pattern $p$ in the other sequences are calculated, and the minimum apu value of $3.33$ is obtained. The total utility of pattern $p$ in the dataset is $U(p, D) = 116$, so $\mathit{au}(p) = \frac{116}{20} = 5.8$, and the value of $\mathit{usc}(p)$ is about 0.57.
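
The numbers in Example 3 can be reproduced with the following Python sketch, which uses the data of Tables 2 and 3. All names are hypothetical. One point is an assumption rather than something stated in the definitions: windows whose average utility is zero are skipped when taking the minimum, because that reading yields the value 3.33 reported in the example.

```python
PROFIT = {"a": 2, "b": 6, "c": 4, "d": 1, "e": 2}  # Table 3

# Table 2: four sequences of five transactions, each as {item: quantity}.
D = [
    [{"a": 1, "b": 1, "e": 1}, {"a": 1, "b": 2, "e": 3}, {"a": 2, "d": 1},
     {"a": 1, "e": 3}, {"a": 3, "b": 1, "c": 2}],
    [{"c": 2}, {"a": 1, "b": 2, "c": 1, "e": 1}, {"c": 1, "d": 3},
     {"a": 2, "b": 1, "c": 2, "e": 2}, {"a": 1, "b": 2, "d": 1}],
    [{"b": 2, "c": 1}, {"a": 1, "b": 2}, {"a": 2, "c": 1, "d": 2},
     {"a": 2, "c": 3}, {"a": 1, "b": 3}],
    [{"a": 1, "b": 2, "d": 1, "e": 3}, {"a": 2, "b": 1, "e": 8},
     {"a": 1, "b": 2, "c": 1}, {"a": 1, "b": 1, "d": 2, "e": 1},
     {"a": 2, "b": 1}],
]


def itemset_utility(itemset, transaction):
    """Utility of an itemset in one transaction, 0 if an item is missing."""
    if not all(item in transaction for item in itemset):
        return 0
    return sum(transaction[item] * PROFIT[item] for item in itemset)


def window_averages(utilities, window):
    """Eq. (5) for every window of `window` consecutive transactions."""
    return [sum(utilities[s:s + window]) / window
            for s in range(len(utilities) - window + 1)]


p = ("a", "b", "e")
per_seq = [[itemset_utility(p, t) for t in seq] for seq in D]
total = sum(sum(u) for u in per_seq)                 # U(p, D)
au = total / sum(len(seq) for seq in D)              # Eq. (6)
all_apu = [v for u in per_seq for v in window_averages(u, 3)]
min_apu = min(v for v in all_apu if v > 0)           # assumption: skip zeros
usc = min_apu / au                                   # Eq. (7)
print(per_seq[0], total, au, round(min_apu, 2), round(usc, 2))
```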

For traditional periodic patterns, when $\mathit{pes}(p) ⩾ \mathit{maxPer}$, pattern $p$ is not considered a periodic pattern, so some patterns that may be periodic are missed. To make the mining results more accurate, the lability coefficient la is introduced [4]: $la(i) = \mathit{pes}(p, S, i) - \mathit{maxPer}$, which reflects the fluctuation of the period of pattern $p$ relative to maxPer. The model proposed in [4] can completely discover patterns with stable periodic behavior, but it cannot handle multi-sequence databases. Therefore, we redefine periodic stability and the corresponding judgment methods from the perspectives of single and multiple sequences to adapt to multi-sequence databases.
Definition 11.
(Periodic stability of patterns in a single sequence). If the period of pattern $p$ (i.e. $p e s (p)$ ) is always less than the maxPer threshold, then pattern $p$ must have stable periodicity. Else the maximum lability value $m a x l a (p)$ needs to be further calculated. When $m a x l a (p) ⩽ m a x L a$ , it is considered that the pattern $p$ has a stable periodicity in a single sequence.

The la values are calculated as follows. If $\mathit{pes}(p, S, i) ⩾ \mathit{maxPer}$ ($0 ⩽ i ⩽ \mathit{sup}(p)$), then $la(i) = \mathit{pes}(p, S, i) - \mathit{maxPer}$, and the next value $la(i + 1) = la(i) + \mathit{pes}(p, S, i + 1) - \mathit{maxPer}$ is appended to the list $la[\,]$.

In addition, to measure the periodic stability of patterns in multi-sequences, a new periodic stability coefficient Sr is designed. A sequence database is an ordered collection composed of multiple sequences. For a given pattern $p$, the number of sequences that satisfy periodic stability is denoted as Sumspp, see Eq. (8), and the periodic stability coefficient $\mathit{Sr}$ of pattern $p$ is obtained using Eq. (9). The larger the periodic stability coefficient Sr, the higher the frequency of stable periodic occurrences of pattern $p$ in the sequence database.
$\begin{aligned} \mathit{Sumspp}(p) = |\{ S \mid {} & \mathit{maxper}(p) ⩽ \mathit{maxPer} \land \mathit{minper}(p) ⩾ \mathit{minPer} \land \mathit{minAvg} ⩽ \mathit{avgper}(p) ⩽ \mathit{maxAvg} \\ & \land \mathit{maxla}(p) ⩽ \mathit{maxLa} \land S \in D \}| \end{aligned}$
(8)

$\begin{aligned} S r (p) & = \frac{S u m s p p (p)}{| D |} \end{aligned}$
(9)
Definition 12.
(Periodic stability of patterns in multi-sequences). If pattern $p$ satisfies $\mathit{maxla}(p) ⩽ \mathit{maxLa}$ and $\mathit{Sr}(p) ⩾ \mathit{minSr}$, then $p$ is considered to be periodically stable in multi-sequences.
Example 4.
For a given pattern $p = ⟨ (a b) ⟩$, set $\mathit{maxPer} = 2$, $\mathit{maxLa} = 2$, $\mathit{minSr} = 0.5$. According to the sequence dataset in Table 2 and Definition 5, $\mathit{pes}(p, S_{1}) = \{1, 1, 3, 0\}$, where $\mathit{pes}(p, S_{1}, 2) = 3 > \mathit{maxPer}$, so $la(2) = \mathit{pes}(p, S_{1}, 2) - \mathit{maxPer} = 1$ and $la(3) = la(2) + \mathit{pes}(p, S_{1}, 3) - \mathit{maxPer} = -1$. We get $\mathit{maxla}(p) = 1 < \mathit{maxLa}$, so the pattern $p$ is considered to be a stable periodic pattern in sequence $S_{1}$.

For $S_{2}$, $\mathit{pes}(p, S_{2}) = \{2, 2, 1, 0\}$, in which no pes value exceeds maxPer. Therefore, the pattern $p$ is also a stable periodic pattern in sequence $S_{2}$. Similarly, the pattern $p$ is a stable periodic pattern in sequences $S_{3}$ and $S_{4}$. Thus, $\mathit{Sumspp}(p) = 4$ and $\mathit{Sr}(p) = \frac{4}{4} = 1$.
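
Example 4 can be traced with the Python sketch below. It is a hedged reading of Definition 11 and the worked example, not the authors' implementation: lability starts accumulating at the first period that exceeds maxPer, as in the example, and the minPer and average-period bounds of Eq. (8) are left out because the example sets no values for them. The function names and the `TIDS` table (occurrences of $⟨ (a b) ⟩$ read from Table 2) are hypothetical.

```python
def periods(tids, seq_len):
    """Eq. (4): gaps in the list g(p, S), padded with tid 0 and |S|."""
    g = [0] + sorted(tids) + [seq_len]
    return [g[i + 1] - g[i] for i in range(len(g) - 1)]


def max_lability(pes, max_per):
    """Largest accumulated lability, following the steps of Example 4."""
    la, started, best = 0, False, 0
    for per in pes:
        if started:
            la += per - max_per        # la(i+1) = la(i) + pes - maxPer
            best = max(best, la)
        elif per > max_per:
            started = True             # first period above maxPer
            la = per - max_per
            best = la
    return best


def is_stable(pes, max_per, max_la):
    """Definition 11: all periods within maxPer, or maxla within maxLa."""
    return max(pes) <= max_per or max_lability(pes, max_per) <= max_la


# Transactions of each sequence in Table 2 that contain both a and b.
TIDS = {"S1": [1, 2, 5], "S2": [2, 4, 5], "S3": [2, 5], "S4": [1, 2, 3, 4, 5]}
SEQ_LEN, MAX_PER, MAX_LA, MIN_SR = 5, 2, 2, 0.5

stable = {sid: is_stable(periods(t, SEQ_LEN), MAX_PER, MAX_LA)
          for sid, t in TIDS.items()}
sumspp = sum(stable.values())          # Eq. (8), reduced as described above
sr = sumspp / len(TIDS)                # Eq. (9)
print(sumspp, sr, sr >= MIN_SR)        # 4 1.0 True
```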
Definition 13.
(Stable periodic high-utility patterns). If pattern $p$ satisfies the following three conditions, it is called a stable periodic high-utility pattern (SPHUP): (i) $U(p) ⩾ \mathit{minutil}$, (ii) pattern $p$ satisfies the utility stability introduced in Definition 10, (iii) pattern $p$ satisfies the periodic stability described in Definitions 11 and 12.

Problem statement. The problem of top-k SPHUPM is to identify the complete set of top-k SPHUPs from multi-sequence databases.

4. TKSPUS algorithm

Based on the above concepts, a novel top-k stable periodic high-utility sequential pattern mining algorithm (abbr. TKSPUS) is proposed, which does not require users to set the minimum utility threshold and can effectively discover stable patterns.

4.1. pusc structure

The main data structure used by the TKSPUS algorithm is an index chain structure called pusc. Each pattern can construct a corresponding pusc structure. The pusc structure consists of three parts:

(1) Sequence id (sid);
(2) Projection sequence (PS): The projection mechanism is adopted to reduce the scanning overhead by scanning the original database only once and building the projection dataset recursively.
(3) Utility list (UL), which contains the following four elements:
– tid: the transaction id where pattern $p$ appears.
– utility: the utility of pattern $p$.
– restutility: the remaining utility of pattern $p$.
– PEU: the PEU value of pattern $p$, which is the sum of its utility value and remaining utility.

The pusc structure of $p = ⟨ (b) ⟩$ is shown in Table 4.

Table 4
pusc structure of pattern $p = ⟨ (b) ⟩$

Sid	Projection sequence	Utility list
1	[$t_{1}$, ⟨(e,1)⟩] [$t_{2}$, ⟨(a,1)(b,2)(e,3)⟩] [$t_{3}$, ⟨(a,2)(d,1)⟩] [$t_{4}$, ⟨(a,1)(e,3)⟩] [$t_{5}$, ⟨(a,3)(b,1)(c,2)⟩]	[1,6,55,61] $\to$ [2,12,39,51] $\to$ [5,6,8,14]
2	[$t_{2}$, ⟨(c,1)(e,1)⟩] [$t_{3}$, ⟨(c,1)(d,3)⟩] [$t_{4}$, ⟨(a,2)(b,1)(c,2)(e,2)⟩] [$t_{5}$, ⟨(a,1)(b,2)(d,1)⟩]	[2,12,50,62] $\to$ [4,6,27,33] $\to$ [5,12,1,13]
3	[$t_{1}$, ⟨(c,1)⟩] [$t_{2}$, ⟨(a,1)(b,2)⟩] [$t_{3}$, ⟨(a,2)(c,1)(d,2)⟩] [$t_{4}$, ⟨(a,2)(c,3)⟩] [$t_{5}$, ⟨(a,1)(b,3)⟩]	[1,12,64,76] $\to$ [2,12,36,48] $\to$ [5,18,0,18]
4	[$t_{1}$, ⟨(d,1)(e,3)⟩] [$t_{2}$, ⟨(a,2)(b,1)(e,8)⟩] [$t_{3}$, ⟨(a,1)(b,2)(c,1)⟩] [$t_{4}$, ⟨(a,1)(b,1)(d,2)(e,1)⟩] [$t_{5}$, ⟨(a,2)(b,1)⟩]	[1,12,73,85] $\to$ [2,6,56,62] $\to$ [3,12,26,38] $\to$ [4,6,14,20] $\to$ [5,6,0,6]
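As a rough illustration of how one row of the pusc structure could be held in memory, the following Python sketch models the three parts listed in Section 4.1. The class names `UtilityEntry` and `PuscRow` and the field layout are assumptions made for this sketch, not the authors' implementation. The values are those of sequence 1 in Table 4.

```python
from dataclasses import dataclass, field


@dataclass
class UtilityEntry:
    """One node of the utility list: [tid, utility, restutility, PEU]."""
    tid: int
    utility: int
    rest_utility: int

    @property
    def peu(self):
        # PEU is described as the sum of utility and remaining utility.
        return self.utility + self.rest_utility


@dataclass
class PuscRow:
    """pusc entry of one sequence: sid, projection sequence, utility list."""
    sid: int
    projection: list  # [(tid, [(item, quantity), ...]), ...]
    utility_list: list = field(default_factory=list)


# Sequence 1 of Table 4 for the pattern p = <(b)>.
row1 = PuscRow(
    sid=1,
    projection=[
        (1, [("e", 1)]),
        (2, [("a", 1), ("b", 2), ("e", 3)]),
        (3, [("a", 2), ("d", 1)]),
        (4, [("a", 1), ("e", 3)]),
        (5, [("a", 3), ("b", 1), ("c", 2)]),
    ],
    utility_list=[
        UtilityEntry(tid=1, utility=6, rest_utility=55),
        UtilityEntry(tid=2, utility=12, rest_utility=39),
        UtilityEntry(tid=5, utility=6, rest_utility=8),
    ],
)

peus = [entry.peu for entry in row1.utility_list]
print(peus)
```

The computed PEU values (61, 51, 14) agree with the fourth field of each utility-list node of sequence 1 in Table 4.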

4.2. Pruning strategy

To further accelerate the mining process, four pruning strategies are introduced. Following [20], the prefix extension utility strategy and the reduced sequence utility strategy are used to prune low-utility patterns. Two new pruning strategies are presented in Subsection 4.2.3 to avoid extending items that do not meet the utility stability and periodic stability requirements.

4.2.1. Prefix extension utility strategy

PEU upper bound. Suppose that pattern $p$ occurs in sequence $S$ at position $P = ⟨i_{1}, i_{2}, \dots, i_{m}⟩$. The PEU value of pattern $p$ at position $P$ in sequence $S$ is defined as [20]:

\begin{aligned} PEU(p, P, S) = \begin{cases} U(p, P, S) + U_{rest}(p, i_{m}, S) & U_{rest}(p, i_{m}, S) > 0 \\ 0 & \text{otherwise} \end{cases} \end{aligned}

(10)

The PEU value of pattern $p$ in sequence $S$ is defined as $PEU(p, S) = \max\{PEU(p, P_{i}, S)\}$, where $P_{i}$ ranges over the positions where $p$ appears in sequence $S$. Then, the PEU value of pattern $p$ in dataset $D$ is calculated as follows.

\begin{aligned} PEU(p) = \sum_{p \in S \wedge S \in D} PEU(p, S) \end{aligned}

(11)

Prefix extension utility strategy (PE pruning strategy). Assume that $p$ is a candidate pattern and minutil is the current minimum utility threshold. If $PEU(p) < minutil$, then the TKSPUS algorithm does not need to check the extended descendants of pattern $p$. This pruning is safe because the PEU upper bound has the downward closure property: the utility of any extension of $p$ cannot exceed $PEU(p)$. Therefore, if $PEU(p) < minutil$, all extensions of pattern $p$ can be discarded.
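The PE check can be illustrated with the utility lists of $p = ⟨(b)⟩$ from Table 4. The sketch below is a simplified reading of Eqs. (10) and (11): it assumes that each utility-list node corresponds to one position of the pattern and reuses the PEU field stored in the node. The function names and the two minutil values are hypothetical and chosen only for illustration.

```python
# Utility lists of p = <(b)> from Table 4, keyed by sid.
# Each node is (tid, utility, restutility, PEU).
utility_lists = {
    1: [(1, 6, 55, 61), (2, 12, 39, 51), (5, 6, 8, 14)],
    2: [(2, 12, 50, 62), (4, 6, 27, 33), (5, 12, 1, 13)],
    3: [(1, 12, 64, 76), (2, 12, 36, 48), (5, 18, 0, 18)],
    4: [(1, 12, 73, 85), (2, 6, 56, 62), (3, 12, 26, 38),
        (4, 6, 14, 20), (5, 6, 0, 6)],
}


def peu_in_sequence(nodes):
    """PEU(p, S): maximum PEU over the positions of p in one sequence."""
    return max(peu for _tid, _utility, _rest, peu in nodes)


def peu_of_pattern(lists_by_sid):
    """PEU(p): sum of PEU(p, S) over the sequences that contain p."""
    return sum(peu_in_sequence(nodes) for nodes in lists_by_sid.values())


def can_prune_by_peu(lists_by_sid, minutil):
    """PE strategy: skip all extensions of p when PEU(p) < minutil."""
    return peu_of_pattern(lists_by_sid) < minutil


total_peu = peu_of_pattern(utility_lists)
print(total_peu)                             # 61 + 62 + 76 + 85
print(can_prune_by_peu(utility_lists, 300))  # hypothetical threshold
print(can_prune_by_peu(utility_lists, 200))  # hypothetical threshold
```

Under these assumptions $PEU(⟨(b)⟩) = 284$, so the pattern would be pruned for a threshold of 300 but extended further for a threshold of 200.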

4.2.2. Reduced sequence utility strategy

RSU upper bound. Suppose $a$ is a pattern that can be extended to pattern $p$ through an I-Extension or S-Extension. Then, the RSU upper bound of pattern $p$ in the sequence $S$ is defined as $R S U (p, S)$ .

\begin{aligned} RSU(p, S) = \begin{cases} PEU(a, S) & p \in S \wedge a \in S \\ 0 & \text{otherwise} \end{cases} \end{aligned}

(12)

The RSU upper bound [20] of pattern $p$ in dataset $D$ is defined as:

\begin{aligned} RSU(p) = \sum_{p \in S \wedge S \in D} RSU(p, S) \end{aligned}

(13)

Reduced sequence utility strategy (RS pruning strategy). Let $p$ and minutil be the candidate pattern and the current minimum utility threshold, respectively. Pattern $p$ can be extended to $p^{'}$ by the items contained in the set $I = \{i_{1}, i_{2}, i_{3}, \dots, i_{n}\}$. When $RSU(p^{'}) < minutil$, the extension of pattern $p^{'}$ is terminated.
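A minimal Python sketch of the RS check follows, reading Eqs. (12) and (13) as "sum the prefix's PEU over the sequences that also contain the extension". The per-sequence PEU values of the prefix are the maxima of the utility lists in Table 4. The set of sequences containing the extension and the threshold values are hypothetical, since they depend on data not shown in this section.

```python
def rsu(prefix_peu_by_sid, sids_containing_extension):
    """RSU of an extension: sum of the prefix's PEU(a, S) over the
    sequences S that contain the extended pattern."""
    return sum(
        peu
        for sid, peu in prefix_peu_by_sid.items()
        if sid in sids_containing_extension
    )


def can_prune_by_rsu(prefix_peu_by_sid, sids_containing_extension, minutil):
    """RS strategy: stop extending when the RSU bound is below minutil."""
    return rsu(prefix_peu_by_sid, sids_containing_extension) < minutil


# PEU(a, S) of the prefix a = <(b)> per sequence (maxima from Table 4).
prefix_peu_by_sid = {1: 61, 2: 62, 3: 76, 4: 85}

# Hypothetical: assume some extension of <(b)> occurs only in sequences 1 and 4.
extension_sids = {1, 4}

bound = rsu(prefix_peu_by_sid, extension_sids)
print(bound)
print(can_prune_by_rsu(prefix_peu_by_sid, extension_sids, minutil=150))
```

Because RSU only needs the PEU values already stored for the prefix, the bound is available before the pusc structure of the extension is built.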

4.2.3. Utility and periodic stability pruning strategy

Two new pruning strategies are presented based on the utility stability and periodic stability of the patterns, thereby reducing the search space of the pattern.

Utility stability pruning strategy (US pruning strategy). To find stable high-utility patterns, the utility stability coefficient $usc(p)$ proposed in Definition 9 is used. If the pattern $p$ satisfies $usc(p) > minUsc$, pattern $p$ is considered to be utility-stable and is called a SHUP.

Periodic stability pruning strategy (PS pruning strategy). To find patterns with stable periodicity in multi-sequences, the lability coefficient $la$ and the periodic stability coefficient $Sr$ proposed in Definition 11 and Eq. (9) are used to calculate the $maxla(p)$ and $Sr(p)$ values of pattern $p$. If $maxla(p) ⩽ maxLa$ and $Sr(p) ⩾ minSr$, pattern $p$ is considered to be a stable periodic pattern, denoted as SPP. Otherwise, the search of pattern $p$ and its supersets stops.

4.3. TKSPUS algorithm

Based on the above data structure and pruning strategy, the proposed TKSPUS algorithm is described as follows. The two main steps of the TKSPUS algorithm are described in Algorithm 1 and Algorithm 2, which are the main process and the recursive mining process respectively.

In Algorithm 1, TKSPUS first scans the input database $D$ to calculate the utility of all 1-patterns (line 1) and sorts the patterns in descending order of utility. The $k$-th value is taken as the initial minutil (line 2), and the TKList, a sorted list with a fixed size of $k$, is initialized (line 3). Then the pusc structures of all 1-patterns are constructed (line 4). If the utility of a 1-pattern is not less than minutil, it is stored in the TKList (lines 5–7). According to the PE strategy, if the PEU value of pattern $p$ is not less than minutil, Algorithm 2 is invoked (lines 8–10). Finally, the TKList, which contains the stable periodic high-utility patterns, is returned (line 11).
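The way a fixed-size TKList raises minutil during mining can be sketched with a min-heap. This is a generic top-k bookkeeping sketch, not the authors' data structure: the class name `TKList`, the method `offer`, and the example utilities are hypothetical, and the stability checks applied before insertion are omitted.

```python
import heapq


class TKList:
    """Keeps the k highest-utility patterns seen so far."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap of (utility, pattern)

    @property
    def minutil(self):
        # Until k patterns are stored, every pattern is still a candidate.
        return self._heap[0][0] if len(self._heap) == self.k else 0

    def offer(self, pattern, utility):
        """Insert a pattern; evict the current minimum if the list is full."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (utility, pattern))
        elif utility > self._heap[0][0]:
            heapq.heapreplace(self._heap, (utility, pattern))

    def patterns(self):
        """Stored patterns in descending order of utility."""
        return sorted(self._heap, reverse=True)


# Hypothetical utilities of five 1-patterns, with k = 3.
tk = TKList(k=3)
for item, utility in {"a": 40, "b": 114, "c": 25, "d": 18, "e": 60}.items():
    tk.offer(item, utility)
initial_minutil = tk.minutil  # k-th largest 1-pattern utility

# A longer pattern with higher utility pushes out the weakest entry.
tk.offer("a b", 90)
raised_minutil = tk.minutil

print(initial_minutil, raised_minutil)
print(tk.patterns())
```

After the 1-patterns are offered, minutil equals the $k$-th largest utility (40 here); inserting a better pattern raises it to 60, which in turn makes the PE and RS bounds prune more candidates.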

As shown in Algorithm 2, the Mining( $p$ , pusc) process mainly involves three steps, which take each prefix pattern $p$ as an input.

Step 1: Scan the projection dataset of pattern $p$ to find items that can be extended. Items extended by I-Extension are stored in the i-list, and items extended by S-Extension are stored in the s-list (see line 1 in Algorithm 2). Meanwhile, the RSU value of each extension $p^{'}$ is obtained.
Step 2: Scan each extension item $p^{'}$ in the i-list and apply the RS strategy to delete useless items. Then construct the pusc structure of $p^{'}$, as described in Algorithm 5. Subsequently, two discriminant functions, ISSPP-I and ISUSC, are invoked to determine whether the pattern $p^{'}$ is periodically stable and utility-stable. If $p^{'}$ meets the conditions, the TKList and the minutil value are updated. According to the PE strategy, if the PEU value of $p^{'}$ is greater than minutil, Mining($p^{'}$, pusc) is invoked recursively (see lines 2 to 21 in Algorithm 2).
Step 3: Similar to Step 2, except that the items in the s-list extended by S-Extension are scanned, and another discriminant function, ISSPP-S, is applied to determine the periodic stability of pattern $p^{'}$ (see lines 22 to 40 in Algorithm 2).

Two discriminant functions, ISSPP-I and ISSPP-S, were designed for the two pattern extension mechanisms, as shown in Algorithm 3 and Algorithm 4. ISSPP-I first determines whether the pattern is a stable periodic pattern in a single sequence and then across the multi-sequences, while ISSPP-S directly determines whether the pattern is a stable periodic pattern in the dataset.

The US pruning strategy corresponds to the ISUSC method, which is described in Algorithm 6. First, the list apulist is initialized (line 1) to store the average periodic utility apu, and the size of a fixed window is set (line 2). The projection sequence set of pattern $p$ is then scanned, and the utility sum of the occurrences of $p$ inside the window is calculated as the window moves. If the sum is not 0, apu equals the sum divided by the window size (lines 3–5). Each apu value is stored in apulist (line 6), and the minimum value in apulist is labeled minapu (line 7). The projection sequence set of pattern $p$ is scanned again to calculate the overall utility sum of pattern $p$, which is recorded as au (line 8). The utility stability coefficient usc of pattern $p$ is calculated by dividing minapu by au (line 9). According to the US strategy, if $usc(p)$ is not less than the minimum stability threshold, then $p$ is considered to be utility-stable.
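The ISUSC steps described above can be sketched as follows. Several details are assumptions, because Algorithm 6 and Definition 9 are not reproduced here: the window is assumed to slide one transaction at a time over a single sequence, transactions without the pattern contribute a utility of 0, and au is taken literally as the overall utility sum. The function name and the window size are hypothetical. The utilities are those of $p = ⟨(b)⟩$ in sequence 1 of Table 4 (tids 1, 2, and 5).

```python
def utility_stability_coefficient(utilities, window):
    """usc = minapu / au for one sequence.

    utilities[i] is the utility of the pattern in transaction i + 1
    (0 where the pattern does not occur).
    Assumption: the window slides by one transaction at a time.
    """
    apulist = []
    for start in range(len(utilities) - window + 1):
        total = sum(utilities[start:start + window])
        if total != 0:                      # windows without the pattern are skipped
            apulist.append(total / window)  # average periodic utility (apu)
    if not apulist:
        return 0.0
    minapu = min(apulist)
    au = sum(utilities)                     # overall utility sum, as described in the text
    return minapu / au


# p = <(b)> in sequence 1 of Table 4: utility 6 at t1, 12 at t2, 6 at t5.
utilities_s1 = [6, 12, 0, 0, 6]
usc_s1 = utility_stability_coefficient(utilities_s1, window=2)
print(usc_s1)
```

With a window of two transactions the non-empty windows give apu values 9, 6, and 3, so minapu is 3, au is 24, and the coefficient is 0.125 under these assumptions.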

4.4. Complexity analysis

Let $m$, $n$, and $i$ represent the number of sequences, the number of transactions, and the average number of items, respectively. TKSPUS first scans the dataset to calculate the utility and PEU values of all individual items, with an average time complexity of $O(m \times i)$. Subsequently, a depth-first search is performed to recursively find all itemsets after I-Extension and S-Extension. The maximum sequence size for the i-list and s-list is $m$. For each extended itemset, TKSPUS creates a new pusc structure by traversing the pusc structure of its prefix itemset. This process is completed in linear time, with $O(n)$ time complexity. The size of the pusc structure of the extended itemset is bounded by the size of that of its prefix itemset. In the worst case, the time complexity of extending a candidate is $O(m \times n)$. Therefore, the time complexity of the TKSPUS algorithm is $O(m \times i + (n + m \times n))$. The main factors affecting the time complexity of TKSPUS are thus the size of the database and the number of itemsets. In addition, appropriate settings of other parameters, such as minSr, minUsc, and $k$, further improve the performance of TKSPUS because they make the pruning strategies more effective.

5. Experimental evaluation

To evaluate the performance of the proposed TKSPUS, we select four comparison algorithms: two periodic high-utility pattern mining algorithms, PHM [34] and PUSOM [5], and two advanced high-utility pattern mining algorithms, TKUCE+ [36] and THUI [37]. Section 5.2 evaluates the effects of the parameters on TKSPUS. The runtime, memory consumption, and scalability of TKSPUS are compared with those of the four algorithms in Sections 5.3, 5.4, and 5.6, respectively. Section 5.5 compares the stability of the patterns discovered by the five algorithms.

5.1. Experimental setting and dataset

The experiments were conducted on a computer node equipped with 64-bit Microsoft Windows 11 OS, 1.90 GHz CPU, and 16 GB of memory. To verify the applicability of the proposed algorithm on multi-sequence datasets, six datasets were selected in the experiment, including five real databases [38] and one synthetic database. The detailed characteristics of these datasets are shown in Table 5.

– Sign. A dataset of sign language utterances containing approximately 800 sequences. Each utterance in the dataset is associated with a video segment that has been meticulously transcribed.
– Kosarak10k. A subset of the original Kosarak clickstream dataset from a Hungarian online news portal.
– Bible. A sequence dataset converted from the Bible, in which each word represents an item.
– Mushroom. A dataset prepared from the UCI mushroom dataset.
– Yoochoose-buys. A dataset obtained from the RecSys 2015 Challenge that contains all transactions in which customers purchased electronic goods.
– Syn10k. A synthetic dataset generated using the IBM generator [39].

Table 5
Dataset characteristics

Dataset	Seqs	Items	Avg. seq. length	Items per trans	Type
Sign	800	267	51.9	1	Dense
Kosarak10k	10,000	10,094	8.14	1	Sparse
Bible	36,369	13,905	21.6	1	Dense
Mushroom	8,416	119	23	1	Dense
Yoochoose-buys	234,300	16,004	1.13	1.97	Sparse
Syn10k	9,976	7,029	6.2	4.3	Sparse

5.2. The effect of the parameter

In this experiment, we changed the $k$, maxLa, minSr, and minUsc parameters to evaluate their combined effects. Two datasets (Sign and Yoochoose-buys) are selected, and the effects of the parameters on the running time and memory consumption of our TKSPUS algorithm are recorded in Tables 6 and 7, where [M, N, L] denotes the parameter setting $maxLa = M$, $minSr = N$, $minUsc = L$.

It can be found that the increase of the $k$ value usually increases the running time and memory occupation. As $k$ increases, more patterns will be considered and generated, resulting in an increase in the search space of the algorithm and the required memory usage. At the same time, the experimental results indicate that our algorithm consumes less memory and time when setting lower maxLa, higher minSr, and higher minUsc values. This trend is reasonable because our PS and US pruning strategies can trim more sequential patterns in these cases, thereby improving the performance of the TKSPUS algorithm.

In later experiments, these parameters of the TKSPUS algorithm are set to fixed values ($minUsc = 200$; $minSr = 10$; $maxLa = 10$). For the remaining thresholds, including minAvg, maxAvg, minPer, and maxPer, a suitable empirical value is selected for each dataset to ensure that TKSPUS, PHM, and PUSOM find a certain number of patterns.

Table 6
The impact of parameters on the Sign dataset

(a) Fixed maxLa
	Time (s)				Memory (MB)
$K$	[10,0,10]	[10,50,10]	[10,0,100]	[10,50,100]	[10,0,10]	[10,50,10]	[10,0,100]	[10,50,100]
50	0.625	0.582	0.533	0.459	120.907	116.429	110.935	91.011
100	1.732	1.614	1.527	1.364	160.141	148.353	149.276	139.337
150	5.233	5.081	4.753	4.623	300.254	288.869	296.264	278.958
200	9.443	9.087	9.262	8.725	302.761	296.309	299.243	285.925
250	10.344	9.916	10.112	9.628	309.279	302.753	306.341	291.046
(b) Fixed minSr
	Time (s)				Memory (MB)
$K$	[0,10,10]	[50,10,10]	[0,10,100]	[50,10,100]	[0,10,10]	[50,10,10]	[0,10,100]	[50,10,100]
50	0.584	0.633	0.518	0.603	118.382	144.536	119.877	107.294
100	1.524	1.581	1.447	1.493	148.379	148.552	148.637	148.186
150	5.033	5.915	4.880	4.958	281.077	289.349	279.872	283.195
200	8.826	9.284	8.593	8.855	290.262	309.738	287.439	295.963
250	9.643	9.882	9.415	9.758	299.641	303.618	294.237	301.366
(c) Fixed minUsc
	Time (s)				Memory (MB)
$K$	[0,0,100]	[50,0,100]	[0,50,100]	[50,50,100]	[0,0,100]	[50,0,100]	[0,50,100]	[50,50,100]
50	0.591	0.654	0.572	0.628	123.187	145.936	119.893	125.244
100	1.582	1.618	1.452	1.579	148.193	148.533	148.216	148.419
150	4.792	5.057	4.594	4.921	281.183	292.969	277.144	285.738
200	9.291	8.950	8.616	8.841	279.784	309.829	281.143	291.437
250	9.723	10.509	9.488	10.274	287.139	310.861	285.793	298.155

Table 7

The impact of parameters on Yoochoose-buys dataset

(a) Fixed maxLa
	Time (s)				Memory (MB)
$K$	[10,0,10]	[10,50,10]	[10,0,100]	[10,50,100]	[10,0,10]	[10,50,10]	[10,0,100]	[10,50,100]
50	5.496	5.154	4.992	4.897	405.812	401.184	400.387	336.412
100	8.368	7.975	8.017	7.809	437.395	427.216	430.167	412.766
150	10.873	10.554	10.379	10.061	459.013	432.217	441.186	422.961
200	14.505	14.249	14.322	13.952	506.306	480.105	486.602	463.056
250	21.938	21.424	21.294	20.698	547.213	521.171	534.866	498.935
(b) Fixed minSr
	Time (s)				Memory(MB)
$K$	[0,10,10]	[50,10,10]	[0,10,100]	[50,10,100]	[0,10,10]	[50,10,10]	[0,10,100]	[50,10,100]
50	5.241	6.184	4.937	6.079	402.623	403.245	396.872	399.247
100	8.190	8.717	7.846	8.289	429.167	432.290	421.739	426.815
150	10.502	10.686	10.224	10.499	443.539	457.232	438.543	450.138
200	14.093	14.527	13.907	14.287	494.599	522.746	480.137	491.512
250	21.679	21.916	21.245	21.751	534.280	540.893	531.084	533.946
(c) Fixed minUsc
	Time (s)				Memory(MB)
$K$	[0,0,100]	[50,0,100]	[0,50,100]	[50,50,100]	[0,0,100]	[50,0,100]	[0,50,100]	[50,50,100]
50	4.976	5.583	4.501	5.197	396.697	407.980	371.902	395.449
100	7.593	7.983	7.137	7.204	423.062	430.186	397.534	419.223
150	10.316	10.531	9.783	10.026	439.859	443.929	425.221	432.637
200	14.135	14.476	13.907	14.399	482.625	500.590	475.440	480.546
250	21.387	21.966	20.598	21.782	533.436	545.813	524.753	528.961

Table 8

The precision of the five algorithms on the Kosarak10k dataset

K	5	10	15	20	25	30
TKSPUS	100%	90%	87%	90%	92%	90%
PUSOM	100%	100%	75%	67%	67%	50%
PHM	81%	81%	50%	54%	63%	60%
THUI	100%	90%	100%	95%	92%	83%
TKUCE+	20%	0%	0%	0%	0%	0%

5.3. Execution time

This group of experiments evaluates the execution time of the algorithms under different numbers of expected patterns (i.e. $k$) for each dataset. However, the comparison algorithms PUSOM and PHM are not top-k mining algorithms. Therefore, we use TKSPUS to determine the final utility threshold for the given $k$ value and take it as the minimum utility threshold input for PUSOM and PHM.

The experimental results in Fig. 3 show that the running time of the TKUCE+ algorithm on the six datasets is higher than that of the other four algorithms. This is because TKUCE+ involves complex computational steps or iterative processes, which increase the overall execution time. In most cases, the proposed TKSPUS algorithm is superior to the other comparison algorithms in terms of execution time. This result demonstrates that the pusc structure introduced by TKSPUS can effectively improve the mining efficiency, and that TKSPUS adapts well to large-scale datasets containing a large number of sequences (Yoochoose-buys). In some cases, TKSPUS is slower than PUSOM and PHM. This is mainly because the utility threshold calculated by TKSPUS is often large, so PUSOM and PHM discover fewer patterns and run for a shorter time. In general, however, users who do not use a top-k method must repeatedly try different thresholds to find an appropriate one. Therefore, although PUSOM and PHM run faster in these cases, such performance cannot be guaranteed in many practical applications. In summary, TKSPUS performs well in runtime on both dense and sparse datasets.

Figure 3.

Execution time of different algorithms on six datasets.

5.4. Memory consumption

In this group of experiments, we compare the memory consumption of the five algorithms. Different $k$ values were again set, and the experimental results are shown in Fig. 4. On the five datasets other than Yoochoose-buys, TKSPUS performs well in memory consumption compared with TKUCE+ and THUI. However, in some cases, the memory consumption of TKSPUS is slightly higher than that of PUSOM and PHM. This is mainly because the performance advantages of PUSOM and PHM depend heavily on the utility threshold; a lower threshold may lead to a sharp decline in their performance. In addition, on the Yoochoose-buys dataset, which involves a large number of sequences, TKSPUS generates more candidate sets during operation, resulting in large memory consumption, but it still outperforms the comparison algorithm TKUCE+.

Figure 4.

Memory consumption of different algorithms on six datasets.

5.5. Stability

Stability is an important metric for our proposed algorithm. In this group of experiments, we measure the consistency and stability of the algorithm in identifying related patterns by quantifying the precision of positive predictions. To evaluate precision, the dataset is divided into 60% for training and 40% for testing, and the patterns mined from the training set are checked against the periodic high-utility patterns mined from the testing set. The experimental results are shown in Tables 8 and 9. The higher the precision value, the stronger the stability of the patterns mined by the algorithm. The precision is defined as follows.

\begin{aligned} precision = \frac{TruePositive}{TruePositive + FalsePositive} \end{aligned}

(14)

Here, TruePositive is the number of stable periodic high-utility patterns found in both the training and testing datasets, and FalsePositive is the number of periodic high-utility patterns found in the training set but not discovered in the test set.
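Eq. (14) amounts to a set comparison between the patterns mined from the two splits. The following Python sketch shows the computation; the function name and the two pattern sets are hypothetical and serve only to illustrate the arithmetic.

```python
def precision(train_patterns, test_patterns):
    """Share of training-set patterns that are also found in the test set."""
    true_positive = len(train_patterns & test_patterns)
    false_positive = len(train_patterns - test_patterns)
    if true_positive + false_positive == 0:
        return 0.0
    return true_positive / (true_positive + false_positive)


# Hypothetical results: patterns written as strings for readability.
train_patterns = {"(a b)", "(b)(c)", "(a)(e)", "(b e)", "(c)(d)"}
test_patterns = {"(a b)", "(b)(c)", "(a)(e)", "(b e)", "(d)(e)"}

score = precision(train_patterns, test_patterns)
print(score)
```

In this made-up example four of the five training patterns reappear in the test split, giving a precision of 0.8.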

Table 9

The precision of the five algorithms on the Sign dataset

K	5	10	20	30	40	50
TKSPUS	100%	100%	90%	60%	53%	58%
PUSOM	3%	3%	3%	3%	2%	2%
PHM	0%	0%	0%	0%	0%	19%
THUI	0%	0%	0%	0%	3%	2%
TKUCE+	0%	0%	0%	0%	0%	0%

Figure 5.

Scalability comparison of different algorithms.

Similarly, in this group of experiments, since PUSOM and PHM are not top-k mining algorithms, the TKSPUS algorithm is first applied to calculate the utility value used as the minimum utility threshold for PUSOM and PHM. The experimental results show that the traditional high-utility mining algorithms TKUCE+ and THUI are very unstable, and that the stability of the periodic high-utility mining algorithms PHM and PUSOM is also greatly affected by the characteristics of the dataset. In contrast, TKSPUS keeps a precision of over 50% and exhibits good stability when mining periodic high-utility patterns in multi-sequences. The advantage of multi-sequence mining is that the regularity of patterns both within a single sequence and across sequences can be considered comprehensively. The two new pruning strategies adopted by TKSPUS effectively filter out unstable patterns, thereby improving the stability and accuracy of the mining results. Therefore, TKSPUS produces consistent and reliable mining results on both datasets tested.

Table 10

List of abbreviations

Abbreviation	Full meaning
SPM	Sequential pattern mining
HUSPs	High utility sequential patterns
HUSPM	High utility sequential pattern mining
PFPs	Periodic frequent patterns
PSPM	Periodic sequential pattern mining
PHUSPs	Periodic high utility sequential patterns
PHUSPM	Periodic high utility sequential pattern mining
TKSPUS	Top-k stable periodic high-utility sequential pattern mining
SPHUPs	Stable periodic high utility patterns
SPHUPM	Stable periodic high utility pattern mining
PEU	Prefix extension utility
RSU	Reduced sequence utility
PE pruning strategy	Prefix extension utility strategy
RS pruning strategy	Reduced sequence utility strategy
US pruning strategy	Utility stability pruning strategy
PS pruning strategy	Periodic stability pruning strategy

5.6. Scalability

The impact of database size on the overall performance of the algorithm is investigated. We fixed the $k$ value at 20, varied the database size in increments of 25%, and compared the performance of the five algorithms in terms of running time and memory consumption. As before, the final utility value generated by TKSPUS with $k = 20$ is used as the minimum utility threshold of the comparison algorithms PUSOM and PHM. Typically, both runtime and memory usage increase as the dataset grows. The experimental results are shown in Fig. 5. TKUCE+ has significantly higher runtime and memory consumption than the other four algorithms. The running time of TKSPUS is similar to that of THUI and PUSOM, but TKSPUS behaves better than THUI and PUSOM in terms of memory consumption. On the Bible dataset, the memory consumption of PHM is lower than that of TKSPUS. Overall, the results show that TKSPUS exhibits good scalability: as the number of sequences in the dataset increases, its execution time and memory consumption show an almost linear growth trend. The reason is that the four pruning strategies used in TKSPUS can prune more nodes than other algorithms.

6. Conclusion and future work

Periodic high-utility pattern mining is an emerging research field that aims to discover periodic patterns with high utility. However, traditional algorithms still have many limitations: most of the mining results are unstable, the constraints are too strict, and the mining focuses on a single sequence. To address these issues, this study proposes TKSPUS, an algorithm for mining stable top-k periodic high-utility patterns from multi-sequences. The utility stability constraint and the periodic stability constraint are added to top-k periodic high-utility pattern mining to make the mining results more meaningful. The algorithm adopts the projection mechanism, a new data structure, and pruning strategies, which effectively improve its efficiency. Finally, extensive experiments are carried out on multiple datasets to demonstrate the effectiveness of TKSPUS. The results show that the patterns generated by traditional high-utility mining algorithms and periodic high-utility mining algorithms are often unstable, especially for datasets with uncertain features. In contrast, the experimental results on datasets with different features show that TKSPUS performs well in terms of execution time, memory consumption, and pattern stability. As a future research direction, we will focus on sequence data pre-processing and extend TKSPUS to accommodate the dynamic growth of data in the era of big data, which will enable it to be applied to dynamic and massive time series databases.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No. 62272336) and the Projects of Science and Technology Cooperation and Exchange of Shanxi Province (Grant Nos. 202204041101037, 202204041101033).

References

Zhang

Gan

P.S.

, TKUS: Mining top-k high utility sequential patterns, Information Sciences (2021).

Swati

, Soni Hemant Kumar, Issues and research challenges in sequential pattern mining, 2020 IEEE International Conference on Advances and Developments in Electrical and Electronics Engineering (ICADEE), 2020.

Yin

Zheng

Cao

Song

Wei

, Efficiently Mining Top-K High Utility Sequential Patterns, 2013 IEEE International Conference on Data Mining (ICDM), IEEE, 2013.

Fournier-Viger

Wang

Yang

et al. Tspin: Mining top-k stable periodic patterns, Applied Intelligence 439 (2021).

Dinh

D.T.

Fournier-Viger

, An efficient algorithm for mining periodic high-utility sequential patterns, Applied Intelligence 48 (2018), 4694–4714.

Fournier-Viger

Lin

C.W.

, Efficient algorithms to identify periodic patterns in multiple sequences.Information Sciences 489 (2019), 205–226.

Agarwal

Srikant

, Mining Sequential Patterns: Generalizations and Performance Improvements, International Conference on Extending Database Technology Springer, Berlin, Heidelberg, 1996.

Zaki

M.J.

, SPADE: An Efficient Algorithm for Mining Frequent Sequences, Machine Learning 42 (2001), 31–60.

Han

Pei

Mortazaviasl

et al., FreeSpan: frequent pattern-projected sequential pattern mining, Proc.int.conf.on Knowledge Discovery and Data Mining Boston Ma, 2000.

10.

Jian

Han

Mortazaviasl

et al., PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, International Conference on Data Engineering, 2001.

11.

Chiu

D.Y.

Y.H.

Chen

A.L.P.

, An efficient algorithm for mining frequent sequences by a new strategy without support counting, International Conference on Data Engineering, IEEE Computer Society, 2004.

12.

Ahmed

C.F.

Tanbeer

S.K.

Jeong

B.S.

, A Novel Approach for Mining High-Utility Sequential Patterns in Sequence Databases, ETRI Journal 32 (2010), 676–686.

13.

Yin

Zheng

Cao

, USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns, SIGKDD Explorations CD/ROM (2012).

14.

Zhang

, An Efficient Parallel High Utility Sequential Pattern Mining Algorithm, 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2019.

15.

Wga

Cwla

et al., ProUM: Projection-based utility mining on sequence data, Information Sciences 513 (2020), 222–240.

16.

Gan

Lin

C.W.

Zhang

et al., Fast Utility Mining on Sequence Data, IEEE Transactions on Cybernetics 2 (2021), 487–500.

17.

Ishita

S.Z.

Ahmed

C.F.

Leung

C.K.

, New approaches for mining regular high utility sequential patterns, Applied Intelligence 52 (2022), 3781–3806.

18.

Huang

Gan

Weng

et al., US-Rule: Discovering Utility-driven Sequential Rules, ACM Transactions on Knowledge Discovery from Data 17 (2023), 1556–4681.

19.

Alam

M.T.

Roy

Ahmed

C.F.

et al., UGMINE: utility-based graph mining, Applied Intelligence 53 (2023), 49–68.

20.

Wang

Huang

Chen

, On efficiently mining high utility sequential patterns, Knowledge and Information Systems 49 (2016), 597–627.

21.

Lin

C.W.

Fournier-Viger

et al., Efficient Chain Structure for High-Utility Sequential Pattern Mining, IEEE Access 8(2020), 40714–40722.

22.

Lin

C.W.

Djenouri

Srivastava

et al., Efficient evolutionary computation model of closed high-utility itemset mining, Applied Intelligence, 52(9) (2022), 10604–10616.

23. Likhitha, Ravikumar, Kiran R.U. et al., Discovering Closed Periodic-Frequent Patterns in Very Large Temporal Databases, IEEE BIG DATA, IEEE (2020), 4700–4709.

24. Kim, Ryu, Lee et al., EHMIN: Efficient approach of list based high-utility pattern mining with negative unit profits, Expert Systems with Applications 209 (2022), 118214.

25. Han, Zhang, Wang et al., Mining closed high utility patterns with negative utility in dynamic databases, Applied Intelligence 53 (2023), 11750–11767.

26. Zhang, Chen et al., Mining periodic trends via closed high utility patterns, Expert Systems with Applications, 2023.

27. Afriyie M.K., Nofong V.M., Wondoh et al., Efficient Mining of Non-Redundant Periodic Frequent Patterns, Vietnam Journal of Computer Science 8(4) (2021), 455–469.

28. Tanbeer S.K., Ahmed C.F., Jeong B.S., Lee Y.K., Discovering Periodic-Frequent Patterns in Transactional Databases, Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (2009).

29. Kiran R.U., Watanobe, Chaudhury, Zettsu, Kitsuregawa, Discovering Maximal Periodic-Frequent Patterns in Very Large Temporal Databases, 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) (2020).

30. Likhitha, Ravikumar, Uday Kiran et al., Discovering Closed Periodic-Frequent Patterns in Very Large Temporal Databases, 2020 IEEE International Conference on Big Data (Big Data) (2020), 4700–4709.

31. Huang, Jaysawal B.P., Wang, Mining full, inner and tail periodic patterns with perfect, imperfect and asynchronous periodicity simultaneously, Data Mining and Knowledge Discovery 35 (2021).

32. Fournier-Viger, Chi T.T. et al., Finding Periodic Patterns in Multiple Sequences, Periodic Pattern Mining, Springer, Singapore, 2021.

33. Fournier-Viger, Yang, Discovering rare correlated periodic patterns in multiple sequences, Data and Knowledge Engineering 126 (2020), 101733.

34. Fournier-Viger, Lin W.C., Duong et al., PHM: Mining Periodic Itemsets, Industrial Conference on Data Mining, 2016.

35. Ut Huynh, Le, Dinh, Huynh, Mining Periodic High Utility Sequential Patterns, Intelligent Information and Database Systems, 2017.

36. Song, Zheng, Huang, Liu, Heuristically mining the top-k itemsets with cross-entropy optimization, Applied Intelligence, 2021.

37. Krishnamoorthy, Mining top-k high utility itemsets with effective threshold raising strategies, Expert Systems with Applications 117 (2018), 148–165.

38. Fournier-Viger, Lin C.W., Gomariz et al., The SPMF Open-Source Data Mining Library Version 2, Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol. 9853, 2016.

39. Srikant, Agrawal, Mining sequential patterns: generalizations and performance improvements, in: Advances in Database Technology - EDBT’96 (1996), 1–17.