NetNDP: Nonoverlapping (delta,gamma)-approximate pattern matching

Abstract

Pattern matching can be used to calculate the support of patterns, and is a key issue in sequential pattern mining (or sequence pattern mining). Nonoverlapping pattern matching means that two occurrences cannot use the same character in the sequence at the same position. Approximate pattern matching allows for some data noise, and is more general than exact pattern matching. At present, nonoverlapping approximate pattern matching is based on Hamming distance, which cannot be used to measure the local approximation between the subsequence and pattern, resulting in large deviations in matching results. To tackle this issue, we present a Nonoverlapping Delta and gamma approximate Pattern matching (NDP) scheme that employs the $(\delta,\gamma)$ -distance to give an approximate pattern matching, where the local and the global distances do not exceed $\delta$ and $\gamma$ , respectively. We first transform the NDP problem into a local approximate Nettree and then construct an efficient algorithm, called the local approximate Nettree for NDP (NetNDP). We propose a new approach called the Minimal Root Distance which allows us to determine whether or not a node has root paths that satisfy the global constraint and to prune invalid nodes and parent-child relationships. NetNDP finds the rightmost absolute leaf of the max root, searches for the rightmost occurrence from the rightmost absolute leaf, and deletes this occurrence. We iterate the above steps until there are no new occurrences. Numerous experiments are used to verify the performance of the proposed algorithm.

Keywords

Pattern matching, approximate pattern matching, nonoverlapping condition, $(\delta,\gamma)$ -distance, Nettree structure

1. Introduction

Pattern matching (or string matching [1, 2, 3, 4]) is a key issue in computer science [5, 6, 7, 8] and plays an important part in many applications, such as pattern mining [9, 10, 11], knowledge discovery [12], bioinformatics analysis [13], fault detection [14], and time series forecasting [15]. In recent years, pattern matching with multiple gap constraints, as an important branch of pattern matching [16, 17], has attracted a great deal of attention in many fields, such as time series analysis [18, 19], sequential pattern mining [21, 22], and text key phrase extraction [23]. A pattern with gap constraints can be written as $P=p_{1}[\min_{1},\max_{1}]p_{2}\cdots p_{j}[\min_{j},\max_{j}]p_{j+1}\cdots p_{m-1}[\min_{m-1},\max_{m-1}]p_{m}$ , where $\min_{j}$ and $\max_{j}$ represent the minimal and maximal numbers of wildcards between $p_{j}$ and $p_{j+1}$ , respectively. $[\min_{j},\max_{j}]$ can be set flexibly [24], and this approach is more practical than the traditional wildcards “?” and “*”.

In a pattern matching problem with gap constraints, the number of occurrences is exponential if there are no constraints [25], and some researchers have therefore focused on pattern matching under the one-off condition [26]. However, the one-off condition is too stringent, which means that some useful information is lost. Ding et al. [27] therefore proposed the concept of nonoverlapping. Nonoverlapping pattern matching was proved to be solvable in polynomial time [28]. To make the gap constraint more suitable for practical applications, nonoverlapping pattern matching with general gaps was studied [29]. Nonoverlapping sequential pattern mining is able to find valuable frequent patterns effectively [30]. To reduce the number of redundant patterns, nonoverlapping closed sequential pattern matching was explored, which improved the mining performance [9].

However, the studies mentioned above address exact pattern matching [28, 31]. Noise is therefore not allowed, which makes it difficult to obtain valuable information from noisy data. Approximate pattern matching [32, 33] allows for noise, and it can therefore handle practical problems more effectively. The Hamming distance [34, 35] is commonly applied in approximate pattern matching. However, the Hamming distance only reflects the number of different characters, and ignores the distances between characters. Therefore, a nonoverlapping approximate pattern matching scheme based on the Hamming distance may cause large deviations when applied to time series [36, 37]. Inspired by the $(\delta,\gamma)$ -distance [38, 39, 40], this paper focuses on Nonoverlapping Delta and gamma approximate Pattern matching (NDP), which employs the $(\delta,\gamma)$ -distance to give an approximate pattern matching, where the local and the global distances do not exceed $\delta$ and $\gamma$ , respectively. An illustrative example is shown as follows.

Figure 1 shows the matching results of symbolized time series of pattern $P=$ b $[0,1]$ d $[0,1]$ b. Figure 1a is consistent with pattern $P$ without gaps, while Fig. 1b and c contain gaps and can match pattern $P$ exactly. Figure 1d–f match pattern $P$ with a Hamming distance of one, but due to the large deviations, they are not very similar to Fig. 1a. For instance, ‘e’ and ‘b’ show a large deviation in Fig. 1d. Figure 1g–i which match pattern $P$ with the $(\delta,\gamma)$ -distance show a close similarity to Fig. 1a. Figure 1g and h match pattern $P$ with $(\delta=1,\gamma=1)$ , while Fig. 1i matches pattern $P$ with $(\delta=1,\gamma=2)$ . We can therefore see that the $\delta$ -distance and $\gamma$ -distance can be used to measure the local approximation and the global approximation, respectively. $(\delta,\gamma)$ -approximate pattern matching ensures overall similarity.

Figure 1.

Pattern $P=$ b $[0,1]$ d $[0,1]$ b and symbolized time series matching results. The time series in (a), (b), (c), (d), (e), (f), (g), (h), and (i) can be symbolized as “bdb”, “badb”, “baddb”, “ebaddb”, “bafdb”, “badde”, “aaddb”, “bacdb”, and “bacdc”, respectively. All these sequences can be matched by pattern $P$ . The trends of (b) and (c) are similar to that of (a), since they are exact matches. Although Fig. 1(d)–(f) match pattern $P$ with a Hamming distance of one, the trends of (d), (e), and (f) are significantly different from that of (a) due to the large deviations. However, Fig. 1(g)–(i) match pattern $P$ with the $(\delta,\gamma)$ -distance, and the trends of (g), (h), and (i) are similar to that of (a).

The contributions of this paper are as follows.

(1)

To avoid the large deviations of the Hamming distance, we present a Nonoverlapping Delta and gamma approximate Pattern matching (NDP) scheme and propose an efficient algorithm called NetNDP (a local approximate Nettree for NDP).

(2)

NetNDP employs the concept of a local approximate Nettree and MRD (Minimal Root Distance) to improve the efficiency.

(3)

We carry out numerous experiments to verify the efficiency of NetNDP, and show that the $(\delta,\gamma)$ -distance outperforms the Hamming distance for matching.

The rest of this paper is organized as follows: Section 2 presents some relevant definitions. Section 3 introduces related work. Section 4 explores the local approximate Nettree structure, and proposes the NetNDP algorithm. Section 5 presents results that validate the performance of NetNDP. We conclude this paper in Section 6.

2. Problem definition

Definition 1.

A sequence $S$ can be written as $S=s_{1}s_{2}\cdots s_{n}$ , where $n$ is the length of $S$ , $s_{i}\in\Sigma(1\leqslant i\leqslant n)$ , and $\Sigma$ is the character set.

Definition 2.

A pattern with gap constraints can be expressed as $P=p_{1}[\min_{1},\max_{1}]p_{2}\cdots p_{j}[\min_{j},\max_{j}]p_{j+1}\cdots p_{m-1}[\min_{m-1},\max_{m-1}]p_{m}$ , where $p_{j}\in\Sigma(1\leqslant j\leqslant m)$ , $m$ is the length of $P$ , and $\min_{j}$ and $\max_{j}$ represent the minimal and maximal numbers of wildcards between $p_{j}$ and $p_{j+1}$ , respectively. $[\min_{j},\max_{j}]$ is a gap constraint.

Definition 3.

Given any two characters $c$ and $d$ in $\Sigma$ , the $\delta$ -distance between $c$ and $d$ is $|c-d|$ , and is denoted by $D_{\delta}(c,d)$ .

Definition 4.

Suppose we have two sequences $S_{1}=x_{1}x_{2}\cdots x_{n}$ and $S_{2}=y_{1}y_{2}\cdots y_{n}$ , where $x_{i}\in\Sigma,y_{i}\in\Sigma(1\leqslant i\leqslant n)$ . The $\gamma$ -distance between $S_{1}$ and $S_{2}$ is $\sum_{i=1}^{n}D_{\delta}(x_{i},y_{i})=\sum_{i=1}^{n}|x_{i}-y_{i}|$ , and is denoted by $D_{\gamma}(S_{1},S_{2})$ .

Example 1.

Suppose $S_{1}=$ aef and $S_{2}=$ cee. The $\delta$ -distances between the corresponding characters of $S_{1}$ and $S_{2}$ are then $D_{\delta}(x_{1},y_{1})=|\mathrm{a}-\mathrm{c}|=2$ , $D_{\delta}(x_{2},y_{2})=|\mathrm{e}-\mathrm{e}|=0$ , and $D_{\delta}(x_{3},y_{3})=|\mathrm{f}-\mathrm{e}|=1$ . Thus, the $\gamma$ -distance between $S_{1}$ and $S_{2}$ is $D_{\gamma}(S_{1},S_{2})=\sum_{i=1}^{3}D_{\delta}(x_{i},y_{i})=\sum_{i=1}^{3}|x_{i}-y_{i}|=2+0+1=3$ .
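
The two distances can be sketched in a few lines of Python. This is only an illustration: it assumes that characters are compared through their code points (so that |a - c| = 2), and the function names `delta_distance` and `gamma_distance` are hypothetical, not taken from the paper.

```python
def delta_distance(c: str, d: str) -> int:
    """Delta-distance between two characters: |c - d| on their code points."""
    return abs(ord(c) - ord(d))


def gamma_distance(s1: str, s2: str) -> int:
    """Gamma-distance: sum of position-wise delta-distances of equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError("gamma-distance is defined for equal-length sequences")
    return sum(delta_distance(x, y) for x, y in zip(s1, s2))


local = [delta_distance(x, y) for x, y in zip("aef", "cee")]
print(local, gamma_distance("aef", "cee"))  # [2, 0, 1] 3
```

The printed values reproduce the local distances 2, 0, 1 and the global distance 3 computed above.
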

Definition 5.

Suppose we have a sequence $S=s_{1}s_{2}\cdots s_{n}$ , a pattern $P=p_{1}[\min_{1},\max_{1}]p_{2}\cdots[\min_{j},\max_{j}]p_{j+1}\cdots[\min_{m-1},\max_{m-1}]p_{m}$ , a local threshold $\delta$ , and a global threshold $\gamma$ . If $<l_{1},l_{2},\cdots,l_{m}>$ satisfies the following conditions, then $<l_{1},l_{2},\cdots,l_{m}>$ is a $(\delta,\gamma)$ -approximate occurrence of $P$ in $S$ .

(1)
$1\leqslant l_{1}<l_{2}<\cdots<l_{m}\leqslant n$ and $\min_{j}\leqslant l_{j+1}-l_{j}-1\leqslant\max_{j}(1\leqslant j\leqslant m-1)$ .
(2)
(local constraint) $\max_{j=1}^{m}D_{\delta}(s_{l_{j}},p_{j})\leqslant\delta$ .
(3)
(global constraint) $D_{\gamma}(s_{l_{1}}s_{l_{2}}\cdots s_{l_{m}},p_{1}p_{2}\cdots p_{m})=\sum_{j=1}^{m}|s_{l_{j}}-p_{j}|\leqslant\gamma$ .
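
The three conditions translate directly into a membership test. The sketch below is a hedged illustration: the function name `is_occurrence`, the representation of the gap constraints as a list of `(min, max)` pairs, and the sample sequence "acaba" with pattern a[0,1]b[0,2]a are choices made here for demonstration.

```python
from typing import Sequence, Tuple


def is_occurrence(seq: str, chars: str, gaps: Sequence[Tuple[int, int]],
                  occ: Sequence[int], delta: int, gamma: int) -> bool:
    """Check conditions (1)-(3) for 1-based positions occ = <l_1, ..., l_m>."""
    m, n = len(chars), len(seq)
    if m == 0 or len(occ) != m or len(gaps) != m - 1:
        return False
    if not all(1 <= l <= n for l in occ):
        return False
    # (1) gap constraints: min_j <= l_{j+1} - l_j - 1 <= max_j
    for (lo, hi), a, b in zip(gaps, occ, occ[1:]):
        if not lo <= b - a - 1 <= hi:
            return False
    dists = [abs(ord(seq[l - 1]) - ord(p)) for l, p in zip(occ, chars)]
    # (2) local constraint and (3) global constraint
    return max(dists) <= delta and sum(dists) <= gamma


gaps = [(0, 1), (0, 2)]
print(is_occurrence("acaba", "aba", gaps, (1, 2, 3), 1, 1))  # True
print(is_occurrence("acaba", "aba", gaps, (2, 4, 5), 1, 1))  # False
print(is_occurrence("acaba", "aba", gaps, (1, 3, 4), 1, 1))  # False
```

The second call fails the local constraint (c versus a differs by two), and the third fails only the global constraint (two local deviations of one each).
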

Definition 6.

Suppose $L=<l_{1},l_{2},\cdots,l_{m}>$ and $L^{\prime}=<l_{1}^{\prime},l_{2}^{\prime},\cdots,l_{m}^{\prime}>$ are any two occurrences of pattern $P$ in sequence $S$ . If $l_{j}\neq l_{j}^{\prime}$ for all $j$ $(1\leqslant j\leqslant m)$ , then $L$ and $L^{\prime}$ are two nonoverlapping occurrences.

Definition 7.

The NDP problem is to find a maximum set of $(\delta,\gamma)$ -approximate occurrences of pattern $P$ in sequence $S$ , where any two occurrences are nonoverlapping.

Example 2.

We have a sequence $S=s_{1}s_{2}s_{3}s_{4}s_{5}=$ acaba, a pattern $P=p_{1}[\min_{1},\max_{1}]p_{2}[\min_{2},\max_{2}]p_{3}=$ a $[0,1]$ b $[0,2]$ a, a local threshold $\delta=1$ , and a global threshold $\gamma=1$ . According to Fig. 2, we know that there are four $(\delta,\gamma)$ -approximate occurrences of $P$ in $S$ under no special condition, which are $<$ 1, 2, 3 $>$ , $<$ 1, 2, 5 $>$ , $<$ 1, 3, 5 $>$ , and $<$ 3, 4, 5 $>$ . The maximum nonoverlapping $(\delta,\gamma)$ -approximate occurrences are $\{<$ 1, 2, 3 $>$ , $<$ 3, 4, 5 $>\}$ .
3. Related work

Pattern matching can be divided into exact [28, 31] and approximate pattern matching [32]. Exact pattern matching requires that the pattern and subsequence are the same, which limits the flexibility of matching, while approximate pattern matching allows differences between the pattern and subsequence, and can discover more useful information. For instance, Hu et al. [32] implemented a string similarity search through a gram filter. Chen et al. [33] discovered all subsequences with $k$ mismatches using an indexing mechanism. The Hamming distance [34, 35] is commonly applied in approximate pattern matching. However, it only reflects the number of different characters, and ignores the distances between characters. Motivated by this, Zhang et al. [38] developed threshold-based approximate pattern matching, in which the maximal distance between characters does not exceed $\delta$ . Fredriksson and Grabowski [40] added a threshold $\gamma$ , meaning that the sum of the distances between the characters does not exceed $\gamma$ , which yields a better matching effect. Compared with traditional wildcard pattern matching, pattern matching with gap constraints is both more flexible and more difficult. An illustrative example is shown as follows.

Example 3.

Suppose we have a sequence $S=s_{1}s_{2}s_{3}s_{4}s_{5}=$ acaba, a pattern $P=p_{1}[\min_{1},\max_{1}]p_{2}[\min_{2},\max_{2}]p_{3}=$ a $[0,1]$ b $[0,2]$ a, a local threshold $\delta=1$ , and a global threshold $\gamma=1$ . Figure 2 shows all $(\delta,\gamma)$ -approximate occurrences of pattern $P$ in sequence $S$ .

Figure 2.

All $(\delta,\gamma)$ -approximate occurrences of $P=a[0,1]b[0,2]a$ in $S$ with $(\delta=1,\gamma=1)$ .

The subsequence $s_{1}s_{2}s_{3}$ is abbreviated as $<$ 1, 2, 3 $>$ , and the other three occurrences are $<$ 1, 2, 5 $>$ , $<$ 1, 3, 5 $>$ , and $<$ 3, 4, 5 $>$ . The reasons are as follows. Let us consider $<$ 3, 4, 5 $>$ first. Since $s_{3}=p_{1}=$ a, we know that $D_{\delta}(s_{3},p_{1})=0$ . Similarly, $D_{\delta}(s_{4},p_{2})=D_{\delta}(s_{5},p_{3})=0$ . Therefore, $D_{\gamma}(s_{3}s_{4}s_{5},p_{1}p_{2}p_{3})=0$ . Hence, $<$ 3, 4, 5 $>$ is an exact match, which is a special case of $(\delta,\gamma)$ -approximate pattern matching. Now, let us take $<$ 1, 2, 3 $>$ as another example. Since we know that $D_{\delta}(s_{2},p_{2})=|\mathrm{c}-\mathrm{b}|=1\leqslant\delta$ , and $D_{\delta}(s_{1},p_{1})=D_{\delta}(s_{3},p_{3})=0$ , we also know that $D_{\gamma}(s_{1}s_{2}s_{3},p_{1}p_{2}p_{3})=1\leqslant\gamma$ . Thus, $<$ 1, 2, 3 $>$ is a legal $(\delta,\gamma)$ -approximate occurrence with $(\delta=1,\gamma=1)$ . Similarly, $<$ 1, 2, 5 $>$ and $<$ 1, 3, 5 $>$ are two legal $(\delta,\gamma)$ -approximate occurrences with $(\delta=1,\gamma=1)$ . However, $<$ 2, 4, 5 $>$ and $<$ 1, 3, 4 $>$ are illegal $(\delta,\gamma)$ -approximate occurrences with $(\delta=1,\gamma=1)$ , since $D_{\delta}(s_{2},p_{1})=2>\delta$ and $D_{\gamma}(s_{1}s_{3}s_{4},p_{1}p_{2}p_{3})=2>\gamma$ , respectively.

From the perspective of describing an occurrence, pattern matching can be loose or strict. Loose pattern matching uses the last position in the sequence to express an occurrence [40], while strict pattern matching uses a set of positions. Under loose pattern matching [40], there are only two $(\delta,\gamma)$ -approximate occurrences in Example 3, $<$ 3 $>$ and $<$ 5 $>$ , whereas the occurrences of Example 3 shown in Fig. 2 are expressed under strict pattern matching. Compared with loose pattern matching, strict pattern matching emphasizes the details of matching.
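
The strict and loose views of this example can be reproduced by brute force. The following sketch is not the paper's algorithm: it is a plain depth-first enumeration written here for checking small inputs, and the name `all_occurrences` and the `(min, max)` gap representation are assumptions of this illustration.

```python
from typing import List, Sequence, Tuple


def all_occurrences(seq: str, chars: str, gaps: Sequence[Tuple[int, int]],
                    delta: int, gamma: int) -> List[Tuple[int, ...]]:
    """Enumerate every (delta, gamma)-approximate occurrence (1-based positions)."""
    n, m = len(seq), len(chars)
    found: List[Tuple[int, ...]] = []

    def extend(prefix: List[int], spent: int) -> None:
        j = len(prefix)  # number of pattern characters matched so far
        if j == m:
            found.append(tuple(prefix))
            return
        if j == 0:
            candidates = range(1, n + 1)
        else:
            lo, hi = gaps[j - 1]
            candidates = range(prefix[-1] + 1 + lo, min(n, prefix[-1] + 1 + hi) + 1)
        for pos in candidates:
            d = abs(ord(seq[pos - 1]) - ord(chars[j]))
            # keep the branch only if both thresholds can still be met
            if d <= delta and spent + d <= gamma:
                extend(prefix + [pos], spent + d)

    extend([], 0)
    return found


occs = all_occurrences("acaba", "aba", [(0, 1), (0, 2)], delta=1, gamma=1)
print(occs)  # [(1, 2, 3), (1, 2, 5), (1, 3, 5), (3, 4, 5)]
print(sorted({occ[-1] for occ in occs}))  # [3, 5]
```

The first line lists the strict occurrences as position tuples; the second collapses them to their end positions, which is all that loose pattern matching reports.
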

There are three kinds of conditions in strict pattern matching: no special condition [35], the one-off condition [26], and the nonoverlapping condition [34]. Example 3 can be seen as the case of no special condition: there are no restrictions on the reuse of characters, so the number of occurrences can grow exponentially. Under the one-off condition, only one occurrence can be selected, which may be any of the four occurrences, since any character of the sequence can only be used once. For instance, if we select $<$ 1, 2, 3 $>$ , then $<$ 3, 4, 5 $>$ cannot be selected since this means that $s_{3}$ would be used twice. However, $<$ 1, 2, 3 $>$ and $<$ 3, 4, 5 $>$ are two occurrences under the nonoverlapping condition. Although $s_{3}$ is used twice, it matches with $p_{3}$ and $p_{1}$ , respectively. We can clearly see that the nonoverlapping condition can simplify the occurrences without excluding useful information, and therefore reduces the limitations of matching.
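
The difference between the one-off and nonoverlapping conditions comes down to what two occurrences are allowed to share. A minimal sketch, with hypothetical helper names and occurrences given as tuples of 1-based positions:

```python
from typing import Sequence


def nonoverlapping(a: Sequence[int], b: Sequence[int]) -> bool:
    """True if l_j != l'_j for every pattern index j."""
    return all(x != y for x, y in zip(a, b))


def one_off_compatible(a: Sequence[int], b: Sequence[int]) -> bool:
    """True if the two occurrences share no sequence position at all."""
    return not set(a) & set(b)


first, second = (1, 2, 3), (3, 4, 5)
print(nonoverlapping(first, second))      # True
print(one_off_compatible(first, second))  # False
print(nonoverlapping(first, (1, 2, 5)))   # False
```

Position 3 appears in both `first` and `second`, but at different pattern indices, so the pair is rejected by the one-off condition and accepted by the nonoverlapping condition.
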

Although nonoverlapping pattern matching has been investigated in [28], it is an exact pattern matching approach which does not allow noise. To the best of our knowledge, the studies in [40, 34, 37] are the closest to ours. Their drawbacks are as follows.

(1)

Although the scheme in [40] focused on approximate pattern matching with the $(\delta,\gamma)$ -distance, it is a kind of loose pattern matching, which adopts the end position in the sequence to represent an occurrence and therefore cannot represent an occurrence clearly. Our method is a kind of strict pattern matching, which uses a set of positions in the sequence to express an occurrence. It can thus represent an occurrence clearly and is a more practical approach.

(2)

Although nonoverlapping approximate pattern matching has been investigated in [34], it adopts the Hamming distance to measure the distance between pattern and occurrence. However, the Hamming distance only reflects the number of different characters, and cannot evaluate the distances between characters. In contrast, our method employs the $(\delta,\gamma)$ -distance to effectively measure the local and global distances. Therefore, our problem is more challenging.

(3)

Although $(\delta,\gamma)$ -distance pattern mining under no special condition was studied in [37], this method finds all approximate occurrences, many of which are redundant. As Example 3 shows, there are four $(\delta,\gamma)$ -approximate occurrences under no special condition [37]. However, Example 3 has only two nonoverlapping $(\delta,\gamma)$ -approximate occurrences, so the nonoverlapping condition effectively reduces redundant occurrences.

Our method can be used in repetitive sequential pattern mining [41, 42] and sequence classification [43, 44]. For example, the repetitive sequential pattern mining task is to discover frequent patterns in sequences whose supports are no less than the given threshold [45, 46]. Calculating the support of a candidate pattern is a pattern matching task. Thus, many sequential pattern mining methods with gap constraints employ pattern matching strategies to discover interesting patterns [47, 48]. Moreover, contrast pattern mining has been investigated to extract features for classification tasks [43, 44]. Hence, our method can also be applied in approximate repetitive sequential pattern mining and time series classification.

4. Local approximate Nettree and algorithms

Section 4.1 introduces the local approximate Nettree. Section 4.2 explains how to transform the NDP problem into a local approximate Nettree and proposes the NetNDP algorithm.

4.1 Local approximate Nettree

Definition 8.

A Nettree [35] is an extended tree structure with multiple roots and multiple parents. In a Nettree, nodes with the same ID can be found at different levels. To describe a node clearly, a node with ID $i$ at the $j^{\rm th}$ level is denoted by $n_{j}^{i}$ .

A standard Nettree is shown in Fig. 3. In Fig. 3, $r_{1},\cdots,r_{m}$ are the roots of the Nettree. $T_{1},T_{2},\cdots,T_{n}$ are the subNettrees. Each subNettree can have many parents.

Figure 3.

A standard Nettree.

Definition 9.

In a Nettree, a leaf at the $m^{\rm th}$ level is called an absolute leaf.

Definition 10.

In a Nettree, a path from $n_{j}^{i}$ to a root is called a root path of $n_{j}^{i}$ . A path from an absolute leaf to a root is called a root-leaf path, or a full path, and is denoted by $<n_{1}^{i_{1}},n_{2}^{i_{2}},\cdots,n_{m}^{i_{m}}>$ .

Lemma 1.

An occurrence of pattern $P$ in sequence $S$ can be represented by a full path.

Proof.

We know that an occurrence can be transformed into a full path in a Nettree [35], i.e. a full path $<n_{1}^{i_{1}},n_{2}^{i_{2}},\ldots,n_{m}^{i_{m}}>$ corresponds to an occurrence $<i_{1},i_{2},\cdots,i_{m}>$ . Thus, a search for the occurrences of pattern $P$ in sequence $S$ can be transformed into a search for the full paths in a Nettree. ∎

Definition 11.

In a Nettree, $n_{j}^{i}$ can reach multiple parents, of which the max parent is the rightmost parent of $n_{j}^{i}$ . In a similar way, we can reach the rightmost absolute leaf.

To effectively solve the problem of nonoverlapping approximate pattern matching with the $(\delta,\gamma)$ -distance, we propose the concept of a local approximate Nettree.

Definition 12.

A local approximate Nettree differs from a Nettree in two respects.

(1)

Each node $n_{j}^{i}$ calculates and stores its $D_{\delta}(s_{i},p_{j})(D_{\delta}(s_{i},p_{j})\leqslant\delta)$ .

(2)

Each node $n_{j}^{i}$ calculates and stores its MRD, denoted by $M_{r}(n_{j}^{i})$ , which represents the shortest $\gamma$ -distance from $n_{j}^{i}$ to its roots.

Theorem 1.

If $n_{j-1}^{r^{1}},n_{j-1}^{r^{2}},\cdots$ , and $n_{j-1}^{r^{t}}$ are parents that satisfy the gap constraint $[\min_{j-1},\max_{j-1}]$ with $n_{j}^{i}$ , then $M_{r}(n_{j}^{i})$ can be calculated as follows.

$\displaystyle M_{r}(n_{j}^{i})=\left\{\begin{array}{ll}D_{\delta}(s_{i},p_{1})=|s_{i}-p_{1}|,&j=1\\ \min(M_{r}(n_{j-1}^{r^{1}}),M_{r}(n_{j-1}^{r^{2}}),\cdots,M_{r}(n_{j-1}^{r^{t}}))+D_{\delta}(s_{i},p_{j}),&2\leqslant j\leqslant m\end{array}\right.$

where $t$ denotes the number of parents of $n_{j}^{i}$ .

Proof.

According to Definition 12, $M_{r}(n_{j}^{i})$ represents the shortest $\gamma$ -distance from $n_{j}^{i}$ to its roots. If $j=1$ , then $n_{j}^{i}$ is a root, and $M_{r}(n_{j}^{i})$ is the $\gamma$ -distance from $n_{1}^{i}$ to itself, i.e. $D_{\delta}(s_{i},p_{1})=|s_{i}-p_{1}|$ . If $2\leqslant j\leqslant m$ , then the $\gamma$ -distance from $n_{j}^{i}$ to its roots is the sum of $D_{\delta}(s_{i},p_{j})$ and the $\gamma$ -distance from its parents to their roots. Since $D_{\delta}(s_{i},p_{j})$ is a fixed value, we need to find the shortest $\gamma$ -distance from the parents to the roots. We can see that the $\gamma$ -distance from the parent with the minimal MRD to the roots is the shortest, and the sum of this $\gamma$ -distance and $D_{\delta}(s_{i},p_{j})$ is $M_{r}(n_{j}^{i})$ . ∎

To illustrate the above concepts, an example is shown as follows.

Example 4.

Suppose we have a sequence $S=s_{1}s_{2}s_{3}s_{4}s_{5}s_{6}s_{7}s_{8}s_{9}=$ baabcbbab, a pattern $P=p_{1}[\min_{1},\max_{1}]p_{2}[\min_{2},\max_{2}]p_{3}[\min_{3},\max_{3}]p_{4}=$ b $[0,1]$ a $[0,2]$ b $[0,2]$ b, a local threshold $\delta=1$ , and a global threshold $\gamma=1$ . The corresponding local approximate Nettree is shown in Fig. 4.

Figure 4.

A local approximate Nettree.

In Fig. 4, nodes with the same ID occur at different levels, such as $n_{1}^{3}$ , $n_{2}^{3}$ , and $n_{3}^{3}$ . $<n_{1}^{4},n_{2}^{6},n_{3}^{7},n_{4}^{9}>$ is a full path, and its corresponding occurrence is $<$ 4, 6, 7, 9 $>$ . $s_{3}$ matches $p_{2}$ exactly, and thus $n_{2}^{3}$ is a white node. $s_{3}$ matches $p_{3}$ approximately, and thus $n_{3}^{3}$ is a grey node. Each node $n_{j}^{i}$ in the local approximate Nettree satisfies $D_{\delta}(s_{i},p_{j})\leqslant 1$ , and each node $n_{1}^{i}$ satisfies $M_{r}(n_{1}^{i})=D_{\delta}(s_{i},p_{1})$ . Since $M_{r}(n_{4}^{9})=0$ , the shortest $\gamma$ -distance of $n_{4}^{9}$ to its roots is zero, and the corresponding path is therefore $<n_{1}^{1},n_{2}^{3},n_{3}^{6},n_{4}^{9}>$ . For path $<n_{1}^{2},n_{2}^{3},n_{3}^{6},n_{4}^{8}>$ , $n_{1}^{2}$ is the rightmost parent of $n_{2}^{3}$ .

4.2 NetNDP

4.2.1 Creating the Nettree

There are two key issues in creating a local approximate Nettree: creating nodes and parent-child relationships. We propose Theorem 2 to reduce the invalid nodes, and explore Theorem 3 to reduce parent-child relationships.

Theorem 2.

If $M_{r}(n_{j}^{i})>\gamma$ , then $n_{j}^{i}$ can be deleted.

Proof.

We know that $D_{\gamma}(s_{l_{1}}s_{l_{2}}\cdots s_{l_{j-1}}s_{i},p_{1}p_{2}\cdots p_{j-1}p_{j})\geqslant D_{\gamma}(s_{l_{1}}s_{l_{2}}\cdots s_{l_{j-1}},p_{1}p_{2}\cdots p_{j-1})$ . Thus, the $\gamma$ -distance is monotonic: it never decreases as a path is extended. Suppose $M_{r}(n_{j}^{i})>\gamma$ , i.e. the shortest $\gamma$ -distance from $n_{j}^{i}$ to its roots is greater than $\gamma$ . Since the $\gamma$ -distance is monotonic, the $\gamma$ -distance of all root-leaf paths via $n_{j}^{i}$ is greater than $\gamma$ . Therefore, there is no path via $n_{j}^{i}$ that satisfies the global constraint. Hence, if $M_{r}(n_{j}^{i})>\gamma$ , $n_{j}^{i}$ can be deleted. ∎

Theorem 3.

If $n_{j-1}^{r^{q}}$ satisfies the gap constraint $[\min_{j-1},\max_{j-1}]$ with $n_{j}^{i}$ , and $M_{r}(n_{j-1}^{r^{q}})+D_{\delta}(s_{i},p_{j})>\gamma$ , then a parent-child relationship cannot be established between $n_{j-1}^{r^{q}}$ and $n_{j}^{i}$ , otherwise the parent-child relationship is created.

Proof.

Suppose $M_{r}(n_{j-1}^{r^{q}})+D_{\delta}(s_{i},p_{j})>\gamma$ , i.e. the sum of $D_{\delta}(s_{i},p_{j})$ and the shortest $\gamma$ -distance from $n_{j-1}^{r^{q}}$ to its roots is greater than $\gamma$ , meaning that the sum of $D_{\delta}(s_{i},p_{j})$ and the $\gamma$ -distance from $n_{j-1}^{r^{q}}$ to any root is greater than $\gamma$ . Since the $\gamma$ -distance is monotonic, the $\gamma$ -distance for all root-leaf paths via $n_{j-1}^{r^{q}}$ and $n_{j}^{i}$ is greater than $\gamma$ . Thus, there is no path via $n_{j-1}^{r^{q}}$ and $n_{j}^{i}$ that satisfies the global constraint. Hence, the parent-child relationship between node $n_{j-1}^{r^{q}}$ and node $n_{j}^{i}$ cannot be established. ∎

Example 5.

To illustrate the principles of Theorems 1, 2, and 3, we apply the same conditions as in Example 4.

We first deal with $s_{1}$ . Since $D_{\delta}(s_{1},p_{1})=|\mathrm{b}-\mathrm{b}|=0\leqslant\delta$ , we create $n_{1}^{1}$ at the first level. According to Theorem 1, we know that $M_{r}(n_{1}^{1})=D_{\delta}(s_{1},p_{1})=0$ .

We then turn to $s_{2}$ . Since $D_{\delta}(s_{2},p_{1})=|\mathrm{a}-\mathrm{b}|=1\leqslant\delta$ , we create $n_{1}^{2}$ at the first level, and $M_{r}(n_{1}^{2})=D_{\delta}(s_{2},p_{1})=1$ . Since $D_{\delta}(s_{2},p_{2})=|\mathrm{a}-\mathrm{a}|=0\leqslant\delta$ , we create $n_{2}^{2}$ at the second level. There is a parent $n_{1}^{1}$ that satisfies the gap constraint [0, 1] with $n_{2}^{2}$ . From Theorem 1, we know that $M_{r}(n_{2}^{2})=M_{r}(n_{1}^{1})+D_{\delta}(s_{2},p_{2})=0\leqslant\gamma$ . From Theorem 3, we establish a parent-child relationship between $n_{1}^{1}$ and $n_{2}^{2}$ .

We now deal with $s_{3}$ . Since $D_{\delta}(s_{3},p_{1})=|\mathrm{a}-\mathrm{b}|=1\leqslant\delta$ , we create $n_{1}^{3}$ at the first level, and $M_{r}(n_{1}^{3})=D_{\delta}(s_{3},p_{1})=1$ . Similarly, we create $n_{2}^{3}$ at the second level, since $D_{\delta}(s_{3},p_{2})=|\mathrm{a}-\mathrm{a}|=0\leqslant\delta$ . Both $n_{1}^{1}$ and $n_{1}^{2}$ satisfy the gap constraint [0, 1] with $n_{2}^{3}$ . Thus, according to Theorem 1, $M_{r}(n_{2}^{3})=\min(M_{r}(n_{1}^{1}),M_{r}(n_{1}^{2}))+D_{\delta}(s_{3},p_{2})=0\leqslant\gamma$ , and according to Theorem 3, $n_{2}^{3}$ establishes a parent-child relationship with both $n_{1}^{1}$ and $n_{1}^{2}$ .

We create the rest of the nodes in a similar way. Since $M_{r}(n_{2}^{4})=\min(M_{r}(n_{1}^{2}),M_{r}(n_{1}^{3}))+D_{\delta}(s_{4},p_{2})=2>\gamma$ and $M_{r}(n_{3}^{8})=\min(M_{r}(n_{2}^{6}),M_{r}(n_{2}^{7}))+D_{\delta}(s_{8},p_{3})=2>\gamma$ , we know from Theorem 2 that $n_{2}^{4}$ and $n_{3}^{8}$ can be deleted. Using this method, the local approximate Nettree can be created.

From Example 5, we see that MRD has the following three advantages.

(i)
It allows us to know whether or not a node has root paths that satisfy the global constraint. If $M_{r}(n_{j}^{i})$ is not greater than $\gamma$ , then $n_{j}^{i}$ has root paths that satisfy the global constraint. For instance, we can see that $M_{r}(n_{4}^{9})=0\leqslant\gamma$ , and that a root path that satisfies the global constraint on $n_{4}^{9}$ is $<n_{1}^{1},n_{2}^{3},n_{3}^{6},n_{4}^{9}>$ .
(ii)
Some invalid nodes can be pruned, such as $n_{2}^{4}$ .
(iii)
We can prune invalid parent-child relationships. For instance, a parent-child relationship cannot be established between $n_{3}^{3}$ and $n_{4}^{5}$ .
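
The arithmetic behind advantage (iii) can be written out from Theorems 1 and 3. The values below are derived here from the data of Example 4 rather than read from Fig. 4, so they are best treated as a consistency check. Node $n_{3}^{3}$ has $D_{\delta}(s_{3},p_{3})=|\mathrm{a}-\mathrm{b}|=1$ and its best parent $n_{2}^{2}$ has an MRD of zero, while $n_{3}^{4}$ matches $p_{3}$ exactly and has the parents $n_{2}^{2}$ and $n_{2}^{3}$ :

$\displaystyle M_{r}(n_{3}^{3})=0+1=1,\qquad M_{r}(n_{3}^{3})+D_{\delta}(s_{5},p_{4})=1+|\mathrm{c}-\mathrm{b}|=2>\gamma$

$\displaystyle M_{r}(n_{3}^{4})=\min(M_{r}(n_{2}^{2}),M_{r}(n_{2}^{3}))+D_{\delta}(s_{4},p_{3})=0+0=0,\qquad M_{r}(n_{3}^{4})+D_{\delta}(s_{5},p_{4})=0+1=1\leqslant\gamma$

Hence $n_{4}^{5}$ survives with $M_{r}(n_{4}^{5})=1$ and is linked to $n_{3}^{4}$ , but not to $n_{3}^{3}$ .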

Algorithm 1, called CreLANtree, is used to create the local approximate Nettree for a sequence $S$ , a pattern $P$ , a local threshold $\delta$ , and a global threshold $\gamma$ .

Algorithm 1: CreLANtree
Input: sequence $S$ , pattern $P$ , local threshold $\delta$ , global threshold $\gamma$
Output: LANtree
for $i=1$ to $n$ step 1 do
    if $D_{\delta}(s_{i},p_{1})\leqslant\delta$ then
        Create $n_{1}^{i}$ and $M_{r}(n_{1}^{i})\leftarrow D_{\delta}(s_{i},p_{1})$ ;
    end if
    for $j=2$ to $m$ step 1 do
        if $D_{\delta}(s_{i},p_{j})\leqslant\delta$ then
            Create $n_{j}^{i}$ ;
            Update $M_{r}(n_{j}^{i})$ according to Theorem 1;
            if $M_{r}(n_{j}^{i})>\gamma$ then
                Delete $n_{j}^{i}$ according to Theorem 2;
            else
                Establish parent-child relationships between its parents and $n_{j}^{i}$ according to Theorem 3;
            end if
        end if
    end for
end for
return LANtree;
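
A possible Python rendering of CreLANtree is sketched below. It is an illustration under stated assumptions, not the authors' implementation: the dictionary-based storage, the name `create_lantree`, the `(min, max)` gap pairs, and the decision to drop a node that has no parent within the gap constraint are all choices made here.

```python
from typing import Dict, List, Sequence, Tuple


def create_lantree(seq: str, chars: str, gaps: Sequence[Tuple[int, int]],
                   delta: int, gamma: int):
    """Build a local approximate Nettree.

    Returns (mrd, parents), both indexed by level j (1-based):
    mrd[j][i] is M_r(n_j^i) and parents[j][i] lists the linked parent positions.
    """
    n, m = len(seq), len(chars)
    mrd: List[Dict[int, int]] = [{} for _ in range(m + 1)]
    parents: List[Dict[int, List[int]]] = [{} for _ in range(m + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ord(seq[i - 1]) - ord(chars[j - 1]))
            if d > delta:
                continue  # local constraint violated: no node n_j^i
            if j == 1:
                mrd[1][i], parents[1][i] = d, []  # a root: MRD is its own distance
                continue
            lo, hi = gaps[j - 2]
            # level-(j-1) nodes r with lo <= i - r - 1 <= hi
            cands = [r for r in range(i - 1 - hi, i - lo) if r in mrd[j - 1]]
            if not cands:
                continue  # assumption: a node without any parent is not kept
            best = min(mrd[j - 1][r] for r in cands) + d  # Theorem 1
            if best > gamma:
                continue  # Theorem 2: prune the node
            mrd[j][i] = best
            # Theorem 3: keep only links that can still satisfy the global constraint
            parents[j][i] = [r for r in cands if mrd[j - 1][r] + d <= gamma]
    return mrd, parents


mrd, parents = create_lantree("baabcbbab", "babb", [(0, 1), (0, 2), (0, 2)], 1, 1)
print(mrd[4][9])                 # 0
print(4 in mrd[2], 8 in mrd[3])  # False False
print(parents[2][3])             # [1, 2]
print(3 in parents[4][5])        # False
```

On the data of Example 4 the sketch agrees with Example 5: $n_{2}^{4}$ and $n_{3}^{8}$ are pruned, $n_{2}^{3}$ is linked to both $n_{1}^{1}$ and $n_{1}^{2}$ , and no link is created between $n_{3}^{3}$ and $n_{4}^{5}$ .
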
4.2.2 Searching for nonoverlapping $(\delta,\gamma)$ -approximate occurrences

Section 4.2.1 explains the principle used to create a local approximate Nettree, where the corresponding occurrences of all full paths satisfy the local constraint. In this section, we will introduce the principle of searching for the nonoverlapping occurrences that satisfy the global constraint.

Lemma 2.

Let $L_{1}$ and $L_{2}$ be two root-leaf paths that do not involve the same node. The corresponding occurrences of $L_{1}$ and $L_{2}$ are then nonoverlapping.

Proof.

Suppose we have $L_{1}=<n_{1}^{a_{1}},n_{2}^{a_{2}},\cdots,n_{m}^{a_{m}}>$ and $L_{2}=<n_{1}^{b_{1}},n_{2}^{b_{2}},\cdots,n_{m}^{b_{m}}>$ . The corresponding occurrences of $L_{1}$ and $L_{2}$ are $<a_{1},a_{2},\cdots,a_{m}>$ and $<b_{1},b_{2},\cdots,b_{m}>$ , respectively. $L_{1}$ and $L_{2}$ do not involve the same node, i.e. for any $j(1\leqslant j\leqslant m),a_{j}\neq b_{j}$ . Thus, according to Definition 6, $<a_{1},a_{2},\cdots,a_{m}>$ and $<b_{1},b_{2},\cdots,b_{m}>$ are two nonoverlapping occurrences. ∎

In a local approximate Nettree, when we search for occurrences from an absolute leaf to its roots, we first assess whether or not the rightmost parent satisfies the condition. If not, we assess the second rightmost parent, and so on until a qualified parent is found. This is known as the rightmost parent strategy.

For nonoverlapping exact pattern matching, NETLAP-Best [28] adopts the rightmost parent strategy: it iteratively searches for the rightmost occurrence of the max absolute leaf, and deletes both the occurrence and the related invalid nodes. However, this method cannot be employed to solve our problem, since its search is too blind and can lose solutions. An illustrative example is shown as follows.

Example 6.

We use the same conditions as in Example 4.

In Fig. 4, suppose the global threshold $\gamma=1$ . We first find occurrence $<$ 4, 6, 7, 9 $>$ with a $\gamma$ -distance of one, which satisfies the global constraint. We delete $<$ 4, 6, 7, 9 $>$ , and then find occurrence $<$ 2, 3, 6, 8 $>$ from $n_{4}^{8}$ . However, the $\gamma$ -distance of $<$ 2, 3, 6, 8 $>$ is two, which is greater than $\gamma$ . We therefore deselect the rightmost parent $n_{1}^{2}$ of $n_{2}^{3}$ , and instead select the second rightmost parent $n_{1}^{1}$ to obtain $<$ 1, 3, 6, 8 $>$ with a $\gamma$ -distance of one. We delete $<$ 1, 3, 6, 8 $>$ , and no other occurrences can be found after that. Using NETLAP-Best, we only obtain two nonoverlapping $(\delta,\gamma)$ -approximate occurrences. However, in Fig. 4, there are three nonoverlapping $(\delta,\gamma)$ -approximate occurrences: $<$ 1, 2, 5, 6 $>$ , $<$ 2, 3, 6, 7 $>$ , and $<$ 4, 6, 7, 9 $>$ . Hence, the principle of NETLAP-Best cannot be applied to NDP.

To avoid the drawbacks of the NETLAP-Best algorithm, NetNDP applies two steps to obtain the rightmost occurrence. In the first step, we recalculate the MRD of each node in the subNettree of the max root, and judge whether the root can reach the absolute leaves under the condition of the global constraint. If so, we obtain the rightmost absolute leaf of the max root. In the second step, we get the rightmost occurrence using the rightmost parent strategy with the rightmost absolute leaf. We iterate the above process until no new occurrences are found.

The ReachLeaf algorithm, which obtains the rightmost absolute leaf from the max root, is shown in Algorithm 2.

Algorithm 2: ReachLeaf
Input: LANtree, root $R$ , local threshold $\delta$ , global threshold $\gamma$
Output: the rightmost absolute leaf ral of $R$
$\textit{lal}\leftarrow\textit{ral}\leftarrow R$ ; // lal and ral indicate the range from the leftmost node to the rightmost node traversed at each level
for $i=1$ to $m-1$ step 1 do
    for $j=\textit{lal}$ to ral step 1 do
        for $k=1$ to $\textit{LANtree}[i][j]$ .children.size() step 1 do
            Recalculate the MRD of $\textit{LANtree}[i+1][k]$ in the subNettree of $R$ according to Theorem 1;
        end for
    end for
    $\textit{lal}\leftarrow$ first child at the $(i+1)^{\rm th}$ level, $\textit{ral}\leftarrow$ last child at the $(i+1)^{\rm th}$ level;
    if $\textit{lal}==$ NULL then
        return $-$ 1;
    end if
end for
return ral;

After obtaining the rightmost absolute leaf, we prove that we can obtain the rightmost occurrence without the need for a backtracking strategy.

Theorem 4.

In the subNettree of a root, if the root can reach $n_{m}^{i}$ under the condition of the global constraint, we can obtain a full path from $n_{m}^{i}$ to the root without a backtracking strategy.

Proof.

If the root can reach $n_{m}^{i}$ under the condition of the global constraint, then $M_{r}(n_{m}^{i})\leqslant\gamma$. Otherwise, i.e. if $M_{r}(n_{m}^{i})>\gamma$, the shortest $\gamma$-distance from $n_{m}^{i}$ to the root would be greater than $\gamma$, so that no path from $n_{m}^{i}$ to the root would satisfy the global constraint and, according to Theorem 2, $n_{m}^{i}$ would have been deleted. Thus, $M_{r}(n_{m}^{i})\leqslant\gamma$. Suppose $D_{\delta}(s_{i},p_{m})=k\leqslant\gamma$. We then need to search for a path from the parents of $n_{m}^{i}$ to the root with a $\gamma$-distance within $\gamma-k$. From Theorem 1, we know that $M_{r}(n_{m}^{i})=\min(M_{r}(n_{m-1}^{r^{1}}),M_{r}(n_{m-1}^{r^{2}}),\cdots,M_{r}(n_{m-1}^{r^{t}}))+k\leqslant\gamma$. Hence, among the parents of $n_{m}^{i}$, from the rightmost to the leftmost, there must be a parent whose MRD is no greater than $\gamma-k$. We select the rightmost such parent, and iterate the process until the first level is reached. In other words, when searching for a path from $n_{m}^{i}$ to the root with a $\gamma$-distance within $d$, we seek the rightmost parent whose MRD is no greater than $d$. Since the first level has only one root in the subNettree, we can obtain a full path from $n_{m}^{i}$ to the root without the need for a backtracking strategy. ∎

Based on Theorem 4, we develop the RightOcc algorithm to obtain the rightmost occurrence, as shown in Algorithm 3.

Algorithm 3: RightOcc
Input: LANtree, rightmost absolute leaf ral, local threshold $\delta$, global threshold $\gamma$
Output: a nonoverlapping $(\delta,\gamma)$-approximate occurrence occ
$\textit{occ}[m]\leftarrow\textit{LANtree}[m][\textit{ral}]$;
for $i=m-1$ to 1 step $-1$ do
    $\textit{occ}[i]\leftarrow$ the rightmost parent of the current node under the condition of the global constraint;
end for
return occ;
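The backtracking-free walk from the rightmost absolute leaf to the root can be sketched as follows. This is an illustrative sketch rather than the authors' code: `levels[i]` is a hypothetical map from node position to (local distance, child positions), and `mrd[i]` is assumed to hold the recalculated MRD of every node at level `i` that is reachable from the root within $\gamma$. At each level the remaining budget is reduced by the local distance of the chosen node, and the rightmost parent whose MRD fits the budget is selected.

```python
def right_occ(levels, mrd, leaf, gamma):
    """Trace the rightmost occurrence ending at leaf using the rightmost parent strategy."""
    m = len(levels)
    occ = [None] * m
    occ[m - 1] = leaf
    budget = gamma - levels[m - 1][leaf][0]        # gamma - k in the proof above
    for i in range(m - 2, -1, -1):
        cur = occ[i + 1]
        # Parents of cur inside the subNettree whose MRD fits the remaining budget.
        candidates = [p for p, d in mrd[i].items()
                      if cur in levels[i][p][1] and d <= budget]
        occ[i] = max(candidates)                   # non-empty whenever mrd[m-1][leaf] <= gamma
        budget -= levels[i][occ[i]][0]
    return occ


# Toy subNettree of root 1 with gamma = 1 (hypothetical data, not from the paper).
levels = [
    {1: (0, [2, 3])},
    {2: (0, [4]), 3: (1, [4])},
    {4: (1, [])},
]
mrd = [{1: 0}, {2: 0, 3: 1}, {4: 1}]
print(right_occ(levels, mrd, leaf=4, gamma=1))
```

In this toy example the plain rightmost parent of node 4 is node 3, but the path through it has a $\gamma$-distance of two; the budget test selects node 2 directly, so no deselection step is needed.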

Example 7 illustrates the principle of NetNDP.

Example 7.

We use the same conditions as in Example 4.

In Fig. 5, $n_{1}^{6}$ is the max root. It cannot reach the absolute leaves, and neither can $n_{1}^{5}$. Next, we assess root $n_{1}^{4}$, and recalculate the MRD for each node in its subNettree. From Fig. 5, we can clearly see that the rightmost absolute leaf of $n_{1}^{4}$ is $n_{4}^{9}$, and $M_{r}(n_{4}^{9})=0\leqslant\gamma$. Hence, according to Theorem 4, we find the $(\delta,\gamma)$-approximate occurrence $<$ 4, 6, 7, 9 $>$ from $n_{4}^{9}$.

Figure 5.

MRD for each node in the subNettree of $n_{1}^{4}$ .

Figure 6.

MRD for each node in the subNettree of $n_{1}^{1}$ .

We delete $<$ 4, 6, 7, 9 $>$ and mark it in red. Now, the rightmost absolute leaf of $n_{1}^{2}$ is $n_{4}^{7}$. Hence, we start searching for an occurrence from $n_{4}^{7}$ using the rightmost parent strategy. When searching for a path from $n_{4}^{7}$ to $n_{1}^{2}$ with a $\gamma$-distance within one, we seek the rightmost parent whose MRD is no greater than one. Since $D_{\delta}(s_{7},p_{4})=0$, the remaining distance budget is still one. The rightmost parent is $n_{3}^{6}$, and $M_{r}(n_{3}^{6})=1$, meaning that there exists a root path with a $\gamma$-distance of one from $n_{3}^{6}$ to $n_{1}^{2}$. We therefore select $n_{3}^{6}$. This process is iterated until we finally find the $(\delta,\gamma)$-approximate occurrence $<$ 2, 3, 6, 7 $>$ according to Theorem 4.

We now delete $<$ 2, 3, 6, 7 $>$ and assess the last root $n_{1}^{1}$ , as shown in Fig. 6. We recalculate the MRD for each node in the subNettree of $n_{1}^{1}$ . We can see that $M_{r}(n_{4}^{6})=0\leqslant\gamma$ , and the rightmost absolute leaf of $n_{1}^{1}$ is $n_{4}^{6}$ after deleting $<$ 2, 3, 6, 7 $>$ . According to Theorem 4, we find the $(\delta,\gamma)$ -approximate occurrence $<$ 1, 2, 5, 6 $>$ from $n_{4}^{6}$ . In summary, there are three nonoverlapping $(\delta,\gamma)$ -approximate occurrences of pattern $P$ in sequence $S$ : $<$ 1, 2, 5, 6 $>$ , $<$ 2, 3, 6, 7 $>$ , and $<$ 4, 6, 7, 9 $>$ .

The main steps of NetNDP are as follows.

Step 1: NetNDP first uses Algorithm 1 to create a local approximate Nettree.

Step 2: NetNDP uses Algorithm 2 to find the max root that can reach the absolute leaves under the condition of the global constraint.

Step 3: If we can obtain the rightmost absolute leaf of the max root, then we employ Algorithm 3 to obtain the rightmost occurrence by applying the rightmost parent strategy from the rightmost absolute leaf.

Step 4: We iterate Steps 2 and 3 until no new occurrences are found.

The NetNDP algorithm is shown in Algorithm 4.

Algorithm 4: NetNDP
Input: sequence $S$, pattern $P$, local threshold $\delta$, global threshold $\gamma$
Output: the maximum nonoverlapping $(\delta,\gamma)$-approximate occurrence set OCC
Use Algorithm 1 to create a local approximate Nettree;
for $r=$ the number of roots to 1 step $-1$ do
    Use Algorithm 2 to get the rightmost absolute leaf ral;
    if $\textit{lal}==$ NULL then continue; // Algorithm 2 returned $-1$, i.e. the root cannot reach an absolute leaf
    Use Algorithm 3 to get the rightmost occurrence occ;
    $\textit{OCC}\leftarrow\textit{OCC}\cup\textit{occ}$;
    Delete occ;
end for
return OCC;
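To make the overall flow concrete, the following is a compact end-to-end sketch under stated assumptions; it is not the authors' implementation. It assumes that the local distance $D_{\delta}$ is the absolute difference between character codes, that a node at position $b$ is a child of a node at position $a$ when the number of characters between them lies in the gap $[\min_{j},\max_{j}]$, and that positions are 1-indexed. The names `net_ndp`, `chars`, and `gaps` are hypothetical, and the pruning of Algorithm 1 is replaced by a direct recomputation of the MRD for each root.

```python
def net_ndp(seq, chars, gaps, delta, gamma):
    """Greedy nonoverlapping (delta, gamma)-approximate matching, roots from right to left."""
    n, m = len(seq), len(chars)

    def d(pos, j):                       # assumed local distance between s_pos and p_j
        return abs(ord(seq[pos - 1]) - ord(chars[j]))

    def is_child(a, b, j):               # b at level j+1 is a child of a at level j
        lo, hi = gaps[j]
        return lo <= b - a - 1 <= hi

    # Level j holds the positions that satisfy the local constraint for chars[j].
    levels = [[i for i in range(1, n + 1) if d(i, j) <= delta] for j in range(m)]
    used = [set() for _ in range(m)]
    result = []
    for root in reversed(levels[0]):     # max root first
        if d(root, 0) > gamma:
            continue
        mrd = [{root: d(root, 0)}]
        for j in range(m - 1):           # recalculate MRD in the subNettree of root
            nxt = {}
            for b in levels[j + 1]:
                if b in used[j + 1]:
                    continue
                best = min((v for a, v in mrd[j].items() if is_child(a, b, j)),
                           default=None)
                if best is not None and best + d(b, j + 1) <= gamma:
                    nxt[b] = best + d(b, j + 1)
            mrd.append(nxt)
        if not mrd[-1]:                  # root cannot reach an absolute leaf
            continue
        occ = [max(mrd[-1])]             # rightmost absolute leaf
        budget = gamma - d(occ[0], m - 1)
        for j in range(m - 2, -1, -1):   # rightmost parent within the remaining budget
            a = max(p for p, v in mrd[j].items()
                    if v <= budget and is_child(p, occ[0], j))
            occ.insert(0, a)
            budget -= d(a, j)
        for j, pos in enumerate(occ):    # delete the occurrence
            used[j].add(pos)
        result.append(tuple(occ))
    return result


# Hypothetical toy input: pattern A[0,1]C on sequence ABCBC.
print(net_ndp("ABCBC", "AC", [(0, 1)], delta=1, gamma=1))
```

On this toy input the sketch returns three occurrences; position 2 appears in two of them, but at different pattern indices, which the nonoverlapping condition allows.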

5. Experimental results and performance analysis

5.1 Experimental environment and datasets

In this paper, we focus on the problem of NDP, and propose the NetNDP algorithm. To validate the performance of NetNDP, we select the following competitive algorithms.

(1)
INSGrow-appro, NETLAP- $(\delta,\gamma)$ , and NETASPNO- $(\delta,\gamma)$ : INSGrow [27], NETLAP-Best [28] and NETASPNO [34] are state-of-the-art algorithms for nonoverlapping pattern matching, but they do not support $(\delta,\gamma)$ -approximate matching. We therefore extend them into INSGrow-appro, NETLAP- $(\delta,\gamma)$ , and NETASPNO- $(\delta,\gamma)$ , respectively.
(2)
NetNDP-nonp: To verify the efficiency of our pruning strategy, we also develop an algorithm, called NetNDP-nonp, which does not prune invalid nodes and parent-child relationships.
(3)
NetDAP [37]: To verify the NDP performance, we also select NetDAP [37] as a competitive algorithm, which performs $(\delta,\gamma)$ -approximate pattern matching without any special condition.

All of the algorithms are run on a computer with an Intel(R) Core(TM) i5-7200U 2.50 GHz CPU, 8.00 GB of RAM, and the Windows 10 operating system. The compiling environment is VC++ 6.0.

To verify the efficiency of NetNDP, we use eight real protein sequences ( $S1\sim S8$ ), which can be downloaded from https://www.uniprot.org/uniparc/. Protein sequences are composed of 20 different letters (amino acids). Proteins with similar patterns often have similar functions. However, protein sequences may mutate. Therefore, if we know a sequence pattern, we can find similar patterns in new protein sequences by approximate pattern matching with gap constraints, so as to further study and confirm the functional structure of the new protein sequences. To verify the matching effect of the $(\delta,\gamma)$ -distance, we select two WormsTwoClass time series ( $S9\sim S10$ ), which can be downloaded from https://www.cs.ucr.edu/~eamonn/time_series_data/. The dataset describes the trajectory of Caenorhabditis elegans. The length of each of these sequences is 900. Although the accuracy of the dataset is very high, it also contains some errors. Given the time series of the first Eigenworm and a known trajectory pattern, we obtain similar time series by approximate pattern matching, which makes it easy to find similar types of worms. We apply SAX [49] (https://cs.gmu.edu/~jessica/sax.htm) to symbolize the time series as character sequences ( $A\sim T$ ). A description of the datasets is given in Table 2.
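The symbolization step can be sketched as follows. This is a minimal illustration of the standard SAX procedure (z-normalization, piecewise aggregate approximation, and equiprobable Gaussian breakpoints) rather than the exact configuration used in the experiments: the number of segments is not stated in the text and is a free parameter here, and the function name `sax` is hypothetical. The default alphabet has 20 letters ( $A\sim T$ ), matching the character range mentioned above.

```python
from bisect import bisect
from statistics import NormalDist, mean, pstdev


def sax(series, segments, alphabet="ABCDEFGHIJKLMNOPQRST"):
    """Convert a numeric time series into a string of len(segments) symbols."""
    if len(series) % segments != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of segments")
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    z = [(x - mu) / sigma for x in series]                     # z-normalization
    width = len(z) // segments
    paa = [mean(z[k * width:(k + 1) * width]) for k in range(segments)]
    a = len(alphabet)
    # Breakpoints that cut the standard normal distribution into a equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    return "".join(alphabet[bisect(breakpoints, v)] for v in paa)


print(sax([0, 0, 0, 0, 10, 10, 10, 10], segments=2, alphabet="ABCD"))
```

Because neighbouring letters correspond to neighbouring value ranges, the difference between two character codes is a meaningful local distance, which is what the $\delta$ constraint relies on.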

Table 3 shows the patterns used to evaluate the experimental performance. Patterns $P1$ and $P2$ are two random patterns. To verify the influence of the length of the pattern on the experimental results, we give patterns $P3$ to $P5$ the same gap constraints, and gradually increase the pattern length. To verify the influence of the gap constraints on the experimental results, we give patterns $P6$ to $P8$ the same pattern length, and gradually increase the gap constraint. To verify the matching effect of the $(\delta,\gamma)$ -distance, we add the patterns $P9$ and $P10$ , and their trends are shown in Fig. 7.

Table 2
Description of the datasets

Sequence	Identification	From	Length
$S1$	UPI000E62E3D8	Schistosoma mansoni	2578
$S2$	UPI000E62F0E8	Ascaris suum	3081
$S3$	UPI000EC610A6	Corallococcus sp. CA051B	3895
$S4$	UPI000E6F28B3	Xiphophorus maculatus	4260
$S5$	UPI0000F516B0	Mus musculus	4632
$S6$	UPI00078AE9A0	Drosophila simulans	4864
$S7$	UPI00090EABF2	Xenopus laevis	5606
$S8$	UPI000E641A62	Ascaris suum	6638
$S9$	Train79	WormsTwoClass	900
$S10$	Train85	WormsTwoClass	900

Table 3
Patterns

Name	Pattern	Length
$P1$	V[1, 5]L[1, 7]S[4, 9]L	4
$P2$	L[1, 7]T[0, 6]S[3, 8]L[2, 7]	5
$P3$	E[0, 9]L[0, 9]S[0, 9]E[0, 9]L	5
$P4$	E[0, 9]L[0, 9]S[0, 9]E[0, 9]L[0, 9]S[0, 9]E	7
$P5$	E[0, 9]L[0, 9]S[0, 9]E[0, 9]L[0, 9]S[0, 9]E[0, 9]L	9
$P6$	Q[1, 7]E[1, 7]L[1, 7]E[1, 7]L[1, 7]N	6
$P7$	Q[1, 8]E[1, 8]L[1, 8]E[1, 8]L[1, 8]N	6
$P8$	Q[1, 10]E[1, 10]L[1, 10]E[1, 10]L[1, 10]N	6
$P9$	P[0, 6]M[0, 6]D[0, 6]L[0, 6]Q	5
$P10$	F[0, 6]B[0, 6]J[0, 6]Q[0, 6]E[0, 6]B	6

Figure 7.
Trends in $P9\sim P10$ .
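For readers who wish to reproduce the experiments, the textual pattern notation can be split into characters and gap constraints as sketched below. The helper names `parse_pattern` and `span_bounds` are hypothetical; the sketch assumes single-letter pattern characters and gaps written as `[min, max]`.

```python
import re


def parse_pattern(text):
    """Split a pattern such as 'V[1, 5]L[1, 7]S[4, 9]L' into characters and gaps."""
    chars = re.findall(r"[A-Za-z]", text)
    gaps = [(int(lo), int(hi))
            for lo, hi in re.findall(r"\[\s*(\d+)\s*,\s*(\d+)\s*\]", text)]
    return chars, gaps


def span_bounds(chars, gaps):
    """Shortest and longest stretch of the sequence that one occurrence can cover."""
    shortest = len(chars) + sum(lo for lo, _ in gaps)
    longest = len(chars) + sum(hi for _, hi in gaps)
    return shortest, longest


chars, gaps = parse_pattern("V[1, 5]L[1, 7]S[4, 9]L")
print(chars, gaps)
print(span_bounds(chars, gaps))
```

For $P1$, an occurrence therefore covers between 10 and 25 consecutive sequence positions.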

5.2 Efficiency

Since large deviations are not allowed in the matching results, the local threshold $\delta$ should not be set very high. Since the length of pattern $P1$ is four, the global threshold $\gamma$ should also not be set very high. We therefore use values of $(\delta=1,\gamma=2)$ , $(\delta=1,\gamma=3)$ , and $(\delta=2,\gamma=3)$ . Figures 8 and 9 show a comparison of the results and running time for $P1\sim P8$ on $S1\sim S8$ with $(\delta=1,\gamma=2)$ .

Figure 8.

Comparison of results for $P1\sim P8$ on $S1\sim S8$ with $(\delta=1,\gamma=2)$ .

Figure 9.

Comparison of running time for $P1\sim P8$ on $S1\sim S8$ with $(\delta=1,\gamma=2)$ .

The results indicate the following observations.

(1)

From Figs 8 and 9, we can see that NetNDP outperforms INSGrow-appro in terms of the number of occurrences found, although INSGrow-appro is faster. Figure 9 shows that INSGrow-appro is the fastest algorithm, since this algorithm is relatively simple. However, from Fig. 8, we know that INSGrow-appro finds the lowest number of occurrences. For example, the running time for INSGrow-appro for $P2$ on $S5$ is 6.2 ms, which is the fastest of the five algorithms. However, INSGrow-appro only obtains 80 occurrences, while NetNDP finds 129. The reason for this is that INSGrow-appro employs the principle of INSGrow [27], which only retains the position in which the first match succeeds, and therefore overlooks many feasible occurrences. Hence, NetNDP has better matching performance.

(2)

NetNDP outperforms NETLAP- $(\delta,\gamma)$ . From Fig. 8, we can see that NETLAP- $(\delta,\gamma)$ finds slightly fewer occurrences. For example, the number of occurrences found by NetNDP for $P4$ on $S4$ is 160, while NETLAP- $(\delta,\gamma)$ only obtains 152. The main reason for this is that NETLAP- $(\delta,\gamma)$ employs the principle of NETLAP-Best [28], which iteratively finds the rightmost occurrence from the max absolute leaf by finding the rightmost parent, and deletes the occurrence and related invalid nodes. In a similar way to NETLAP-Best, NETLAP- $(\delta,\gamma)$ does not prejudge the rightmost absolute leaf of each root, resulting in fewer occurrences. The illustrative example given in Section 4.2.2 shows that NETLAP- $(\delta,\gamma)$ only finds $<$ 1, 3, 6, 8 $>$ and $<$ 4, 6, 7, 9 $>$ , while NetNDP finds three occurrences: $<$ 1, 2, 5, 6 $>$ , $<$ 2, 3, 6, 7 $>$ , and $<$ 4, 6, 7, 9 $>$ . As can be seen from Fig. 9, NetNDP is also faster than NETLAP- $(\delta,\gamma)$ . For example, the running time for NetNDP is 46.9 ms for $P5$ on $S4$ , while the running time for NETLAP- $(\delta,\gamma)$ is 51.6 ms. NETLAP- $(\delta,\gamma)$ and NetNDP have the same time complexity of $O(n\times m^{2}\times W)$ . NETLAP- $(\delta,\gamma)$ is slower because it needs to find and prune the invalid nodes associated with each occurrence. NetNDP does not need to find these invalid nodes, and therefore NetNDP has better performance than NETLAP- $(\delta,\gamma)$ .

(3)

NetNDP outperforms both NETASPNO- $(\delta,\gamma)$ and NetNDP-nonp. From Fig. 8, we can see that these three algorithms obtain the same results, and find more occurrences than the first two algorithms. However, from Fig. 9, we know that NetNDP has the best time efficiency of the three. For example, although all three algorithms find 112 occurrences for $P4$ on $S6$ , the running times for NetNDP, NETASPNO- $(\delta,\gamma)$ and NetNDP-nonp are 43.7, 54.7, and 50 ms, respectively. The results are the same because NETASPNO- $(\delta,\gamma)$ and NetNDP-nonp employ the same strategy as NetNDP to search for the occurrences, i.e. they find the rightmost absolute leaf of the max root, search for the rightmost occurrence from the rightmost absolute leaf, and delete the occurrence. The difference is that NETASPNO- $(\delta,\gamma)$ employs the principle of NETASPNO [34], which calculates the number of root paths with a $\gamma$ -distance of $d(0\leqslant d\leqslant\gamma)$ for each node. NETASPNO judges whether or not a node has root paths that satisfy the distance constraint, and prunes invalid nodes and parent-child relationships via the number of root paths. Thus, the time complexity of NETASPNO- $(\delta,\gamma)$ is $O(n\times m^{2}\times W\times\gamma)$ , meaning that it has the longest running time of the three. NetNDP-nonp and NetNDP have the same time complexity of $O(n\times m^{2}\times W)$ , but NetNDP-nonp does not prune invalid nodes and parent-child relationships. Figure 10 further shows the total numbers of nodes and parent-child relationships for NetNDP-nonp and NetNDP.

Figure 10.

Total numbers of nodes and parent-child relationships for NetNDP-nonp and NetNDP with $(\delta=1,\gamma=2)$ .

From Fig. 10, we can see that NetNDP-nonp has numerous invalid nodes and parent-child relationships, and therefore requires excessive numbers of invalid accesses, thus increasing the running time. Hence, NetNDP has better performance than NETASPNO- $(\delta,\gamma)$ and NetNDP-nonp.

5.3 Influence of different parameters

To further demonstrate the impact of different threshold parameters on the experiment, we use values of $(\delta=1,\gamma=3)$ and $(\delta=2,\gamma=3)$ . The corresponding numbers of occurrences are shown in Figs 11 and 12, and the total running time is shown in Figs 13 and 14.

Figure 11.

Total occurrences for $P1\sim P8$ on $S1\sim S8$ with $(\delta=1,\gamma=3)$ .

Figure 12.

Total occurrences for $P1\sim P8$ on $S1\sim S8$ with $(\delta=2,\gamma=3)$ .

Figure 13.

Total running time for $P1\sim P8$ on $S1\sim S8$ with $(\delta=1,\gamma=3)$ .

Figure 14.

Total running time for $P1\sim P8$ on $S1\sim S8$ with $(\delta=2,\gamma=3)$ .

(1)

From Figs 11 and 12, we can see that NetNDP finds the same number of occurrences as NETASPNO- $(\delta,\gamma)$ and NetNDP-nonp, and more occurrences than INSGrow-appro and NETLAP- $(\delta,\gamma)$ . Figures 13 and 14 show that, except for INSGrow-appro, NetNDP is the fastest. These observations are consistent with the results for $(\delta=1,\gamma=2)$ .

(2)

As can be seen from Figs 9, 13, and 14, the running time for NetNDP is positively correlated with $n$ , $m$ , and $W$ , a result that is consistent with our theoretical analysis of a time complexity $O(n\times m^{2}\times W)$ .

From Table 2, we can see that the length of $S1$ is the shortest and the length of $S8$ is the longest. Figure 9 shows that the running time for $P1\sim P8$ is the shortest on $S1$ and the longest on $S8$ ; in other words, the larger the value of $n$ , the longer the running time. For instance, the running time for $P4$ on $S1$ and $S8$ is 25 and 65.6 ms, respectively, while the running time for $P4$ on the other sequences is greater than 25 ms and less than 65.6 ms. Hence, the running time for NetNDP is positively correlated with the length of the sequence.

Table 3 shows that $P3\sim P5$ have the same gap constraints and that the length increases gradually. From Figs 13 and 14, we know that the total running time for $P5$ is always greater than for $P3$ on $S1\sim S8$ . For example, the total running time for $P3$ and $P5$ on $S1\sim S8$ is 286 and 399.6 ms $(\delta=1,\gamma=3)$ , respectively. These experimental results demonstrate that the running time increases with the length of the pattern.

From Table 3, we see that $P6\sim P8$ are the same except for the gap constraints and that their maximum gap increases gradually. Figures 13 and 14 show that the total running time for $P8$ is always greater than that for $P6$ on $S1\sim S8$ , indicating that the longer the maximum gap, the longer the running time.

In summary, the running time of NetNDP is consistent with a time complexity of $O(n\times m^{2}\times W)$ , and NetNDP outperforms other competitive algorithms.

5.4 Matching effect

To illustrate that the $(\delta,\gamma)$ -distance outperforms the Hamming distance in terms of the matching effect, we obtain occurrences with the $(\delta,\gamma)$ -distance using NetNDP $(\delta=1,\gamma=3)$ , and occurrences with a Hamming distance using NETASPNO [34] $(h=3)$ . Figures 15 and 16 show the matching results for $P9$ on $S9$ and $P10$ on $S10$ , respectively.

Figure 15.

Matching results for $P9$ on $S9$ with $(\delta=1,\gamma=3,h=3)$ .

Figure 16.

Matching results for $P10$ on $S10$ with $(\delta=1,\gamma=3,h=3)$ .

As can be seen from Fig. 15, the trends of all occurrences found with the $(\delta,\gamma)$ -distance are the same as that of $P9$ , while the trends of some occurrences found with the Hamming distance are different from that of $P9$ . The reason for this is that the Hamming distance cannot reflect the distance between characters, i.e. it cannot measure the local approximation, resulting in large deviations in matching results. For instance, with the Hamming distance, $\textit{occ}1$ has a large deviation from $P9$ in position 32, which leads to a dissimilarity between $\textit{occ}1$ and $P9$ . However, when the $(\delta,\gamma)$ -distance is used, the distance between the corresponding characters of $\textit{occ}1$ and $P9$ is no greater than one due to the local constraint, and $\textit{occ}1$ is therefore similar to $P9$ . Figure 16 shows a similar phenomenon. Based on this analysis, we know that approximate pattern matching with the $(\delta,\gamma)$ -distance is more effective than with the Hamming distance.
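The difference between the two measures can be seen on a small example. The sketch below assumes that the local distance between two characters is the absolute difference of their character codes; the two candidate subsequences are hypothetical and are compared against the characters of $P9$ (PMDLQ) with $\delta=1$ , $\gamma=3$ and $h=3$ .

```python
def hamming(sub, pat):
    """Number of positions at which the characters differ."""
    return sum(a != b for a, b in zip(sub, pat))


def delta_gamma(sub, pat):
    """Return (largest local distance, sum of local distances)."""
    local = [abs(ord(a) - ord(b)) for a, b in zip(sub, pat)]
    return max(local), sum(local)


def matches(sub, pat, delta, gamma):
    worst, total = delta_gamma(sub, pat)
    return worst <= delta and total <= gamma


pattern = "PMDLQ"
for candidate in ("PMKLQ", "QMELP"):          # hypothetical subsequences
    print(candidate,
          "Hamming:", hamming(candidate, pattern),
          "(delta, gamma):", delta_gamma(candidate, pattern),
          "accepted:", matches(candidate, pattern, delta=1, gamma=3))
```

The first candidate differs from the pattern in a single character and is therefore accepted under $h=3$ , yet that character is seven letters away from the pattern character and is rejected by the local constraint. The second candidate differs in three characters but each by only one letter, so it satisfies both constraints.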

6. Conclusion

The Hamming distance cannot be used to measure the local approximation between the subsequence and pattern, resulting in large deviations in matching results. To overcome this weakness of the Hamming distance, we explore the use of nonoverlapping approximate pattern matching with the $(\delta,\gamma)$ -distance, where the $\delta$ -distance and $\gamma$ -distance are used to measure the local and the global approximations, respectively. We develop the concept of a local approximate Nettree, and construct an efficient algorithm called NetNDP based on a local approximate Nettree. To improve the time efficiency, NetNDP employs MRD to prune invalid nodes and parent-child relationships, and to assess whether or not the root paths satisfy the global constraint. NetNDP finds the rightmost absolute leaf of the max root, searches for the rightmost occurrence from the rightmost absolute leaf, and deletes the occurrence. These processes are iterated until there are no new occurrences in the local approximate Nettree. The experimental results give rise to the following conclusions. Among the algorithms that find the same number of occurrences, NetNDP runs the fastest, since it avoids creating invalid nodes and parent-child relationships. More importantly, approximate pattern matching with the $(\delta,\gamma)$ -distance has better matching performance than the Hamming distance, since the trends of all occurrences found with the $(\delta,\gamma)$ -distance are the same as that of the pattern, while the trends of some occurrences found with the Hamming distance are different from that of the pattern.

Acknowledgments

This work was partly supported by the National Natural Science Foundation of China (61976240, 91746209), the National Key Research and Development Program of China (2016YFB1000901), and the Natural Science Foundation of Hebei Province, China (No. F2020202013).

References

1. Al-Ssulami A.M., Hybrid string matching algorithm with a pivot, Journal of Information Science 41(1) (2015), 82–88.
2. Fernau, Manea, Mercaş and Schmid M.L., Pattern matching with variables: Efficient algorithms and complexity results, ACM Transactions on Computation Theory (TOCT) 12(1) (2020), 1–37.
3. Qiang, Qian, Yuan and Wu, Short text topic modeling techniques, applications, and performance: A survey, IEEE Transactions on Knowledge and Data Engineering, 2020. doi: 10.1109/TKDE.2020.2992485.
4. Jiang and Wu, Strict approximate pattern matching with general gaps, Applied Intelligence 42(3) (2015), 566–580.
5. Liu and Wu, Multi-fuzzy-constrained graph pattern matching with big graph data, Intelligent Data Analysis 24(4) (2020), 941–958.
6. Nie, Jiang, Ren, Sun and Li, Query expansion based on crowd knowledge for code search, IEEE Transactions on Services Computing 9(5) (2016), 771–783.
7. Yuan and Li, A survey of traffic prediction: From spatio-temporal data to intelligent transportation, Data Science and Engineering 6(2) (2021), 63–85.
8. Wang, Chai, Yang, Wang and Chai, Efficient subgraph matching on large RDF graphs using mapReduce, Data Science and Engineering 4(1) (2019), 24–43.
9. Zhu, Guo and Wu, NetNCSP: Nonoverlapping closed sequential pattern mining, Knowledge-Based Systems 196 (2020), 105812.
10. Min, Zhang, Zhai and Shen, Frequent pattern discovery with tripartition alphabets, Information Sciences 507 (2020), 715–732.
11. Song, Liu and Huang, Generalized maximal utility for mining high average-utility itemsets, Knowledge and Information Systems 63 (2021), 2947–2967.
12. and Wu, On big wisdom, Knowledge and Information Systems 58(1) (2019), 1–8.
13. Upama P.B., Khan J.T., Zemim, Yasmin and Sakib, A new approach in pattern matching: Codon detection in DNA and RNA using hash function (CDDRHF), in: Proceedings of the 18th International Conference on Computer and Information Technology, Dhaka, Bangladesh, 2015, pp. 172–177.
14. Lee, Cho, Kim and Kang, Fault group pattern matching with efficient early termination for high-speed redundancy analysis, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(7) (2018), 1473–1482.
15. Nguyen T.S., Pattern matching-based prediction using affine combination of two measures: Two are better than one, International Journal of Business Intelligence and Data Mining 12(3) (2017), 236–256.
16. Liu, Yan, Guo and Wu, Efficient algorithm for solving strict pattern matching under nonoverlapping condition, Journal of Software 32(11) (2021), 3331–3350.
17. Liu, Wang, Liu, Zhao and Wu, Efficient pattern matching with periodical wildcards in uncertain sequences, Intelligent Data Analysis 22(4) (2018), 829–842.
18. Luo, Guo, Fournier-Viger, Zhu and Wu, NTP-Miner: Nonoverlapping three-way sequential pattern mining, ACM Transactions on Knowledge Discovery from Data. doi: 10.1145/3480245.
19. Min, Zhang, Zhai and Shen, Frequent pattern discovery with tri-partition alphabets, Information Sciences 507 (2020), 715–732.
20. Chai, Yang, Liu and Wu, Top-k sequence pattern mining with non-overlapping condition, Filomat 32(5) (2018), 1703–1710.
21. Wang, Yao, Fournier-Viger and Wu, Self-adaptive nonoverlapping sequential pattern mining, Applied Intelligence, 2021. doi: 10.1007/s10489-021-02763-y.
22. Fournier-Viger, Yang, Kiran R.U., Ventura and Luna J.M., Mining local periodic patterns in a discrete sequence, Information Sciences 544 (2021), 519–548.
23. Xie and Zhu, Efficient sequential pattern mining with wildcards for keyphrase extraction, Knowledge-Based Systems 115 (2017), 27–39.
24. Liu, Xie and Wu, Suffix array for multi-pattern matching with variable length wildcards, Intelligent Data Analysis 25(2) (2021), 283–303.
25. Huang, Guo and Hu, Algorithms for approximate pattern matching with wildcards and length constraints, Journal of Computer Applications 33(3) (2013), 800–805.
26. Lei, Guo and Wu, HAOP-Miner: Self-adaptive high-average utility one-off sequential pattern mining, Expert Systems with Applications 184 (2021), 115449.
27. Ding, Han and Khoo, Efficient mining of closed repetitive gapped subsequences from a sequence database, in: Proceedings of the 2009 IEEE 25th International Conference on Data Engineering, Shanghai, China, 2009, pp. 1024–1035.
28. Shen, Jiang and Wu, Strict pattern matching under non-overlapping condition, Science China Information Sciences 60(1) (2017), 012101.
29. Shi, Shan, Yan and Wu, NetNPG: Nonoverlapping pattern matching with general gap constraints, Applied Intelligence 50 (2020), 1832–1845.
30. Tong, Zhu and Wu, NOSEP: Nonoverlapping sequence pattern mining with gap constraints, IEEE Transactions on Cybernetics 48(10) (2018), 2809–2822.
31. Chen, Huang and Lee R.C.T., Bit-parallel algorithms for exact circular string matching, The Computer Journal 57(5) (2014), 731–743.
32. Zheng, Wang and Zhou, GFilter: A general gram filter for string similarity search, IEEE Transactions on Knowledge and Data Engineering 27(4) (2015), 1005–1018.
33. Chen and Wu, On the string matching with k mismatches, Theoretical Computer Science 726 (2018), 5–29.
34. Guo, Liu and Wu, NETASPNO: Approximate strict pattern matching under nonoverlapping condition, IEEE Access 6(1) (2018), 24350–24361.
35. Tang, Jiang and Wu, Approximate pattern matching with gap constraints, Journal of Information Science 42(5) (2016), 639–658.
36. Liu, Guo and Wu, NetDPO: (delta, gamma)-approximate pattern matching with gap constraints under one-off condition, Applied Intelligence, 2021. doi: 10.1007/s10489-021-03000-2.
37. Fan, Guo and Wu, NetDAP: (delta, gamma)-Approximate pattern matching with length constraints, Applied Intelligence 50(11) (2020), 4094–4116.
38. Zhang and Atallah M.J., On approximate pattern matching with thresholds, Information Processing Letters 123 (2017), 21–26.
39. Fredriksson and Grabowski, Efficient algorithms for (delta, gamma, alpha)-matching, Stringology, 2006, 29–40.
40. Clifford and Iliopoulos, Faster algorithms for delta, gamma-matching and related problems, in: Annual Symposium on Combinatorial Pattern Matching, Springer, Berlin, Heidelberg, 2005, pp. 68–78.
41. Dong, Gong and Cao, e-RNSP: An efficient method for mining repetition negative sequential patterns, IEEE Transactions on Cybernetics 50(5) (2020), 2084–2096.
42. Wang, Guo, Zhang and Wu, OWSP-Miner: Self-adaptive one-off weak-gap strong pattern mining, ACM Transactions on Management Information Systems. doi: 10.1145/3476247.
43. Wang, Zhu and Wu, Top-k self-adaptive nonoverlapping contrast sequential pattern mining, IEEE Transactions on Cybernetics, 2021. doi: 10.1109/TCYB.2021.3082114.
44. Wang, Liu and Li, Mining distinguishing subsequence patterns with nonoverlapping condition, Cluster Computing 22(3) (2019), 5905–5917.
45. Truong, Duong and Fournier-Viger, EHAUSM: An efficient algorithm for high average utility sequence mining, Information Sciences 515 (2020), 302–323.
46. Fournier-Viger, Lin J.C.W., Chi T.T., Chun-Wei L.J. and Kiran R.U., Mining cost-effective patterns in event logs, Knowledge-Based Systems 191 (2020), 105241.
47. Zhang, Guo, Liu and Wu, NetNMSP: Nonoverlapping maximal sequential pattern mining, Applied Intelligence, 2021. doi: 10.1007/s10489-021-02912-3.
48. Geng, Guo, Fournier-Viger, Zhu and Wu, HANP-Miner: High average utility nonoverlapping sequential pattern mining, Knowledge-Based Systems 229 (2021), 107361.
49. Lin, Keogh, Wei and Lonardi, Experiencing SAX: A novel symbolic representation of time series, Data Mining and Knowledge Discovery 15(2) (2007), 107–144.