Efficient pattern matching with periodical wildcards in uncertain sequences

Abstract

Data uncertainty is inherent in many real-world applications such as sensor data monitoring and mobile tracking. Mining sequential patterns from uncertain/inaccurate data, such as sensor readings and GPS trajectories, is important to discover hidden knowledge in such applications. This paper addresses the problem of pattern matching with periodical wildcards for uncertain sequences. We present a dynamic programming approach, called CoDP, to compute the exact probability that a pattern $q$ is a subsequence of an uncertain sequence $s$ , and this approach can be further applied to substring matching for uncertain sequences. The efficiency and effectiveness of our algorithm have been verified through extensive experiments on both real and synthetic data.

Keywords

Pattern matching, substring matching, wildcards, uncertain sequences

1. Introduction

The problem of pattern matching with wildcards in certain (deterministic) sequences has attracted extensive attention in the research community due to its wide spectrum of real-life applications, such as bioinformatics [1, 2] and text mining [3]. In DNA sequences, for example, TATA often occurs 30–50 wildcards after CAATCT. In text mining [3], it is natural to consider sequential patterns with wildcards, where patterns are composed of words and can capture semantic relations between them.

Data uncertainty is inherent in many real-world applications such as sensor data monitoring [4], RFID localization [5] and location-based services [6]. It can be caused by various factors including data collection equipment errors, precision limitations, data sampling errors, and transmission errors.

As a result, attention has been drawn to uncertain data mining in recent research. A comprehensive survey of the techniques on uncertain data mining can be found in [7]. However, most of this work focused on uncertainty for itemset data, rather than the uncertainty for sequence data.

In this paper, we tackle the problem of pattern matching with wildcards for uncertain sequences. Our main contribution is an efficient algorithm to answer the following pattern matching query: Given a certain sequence $q$ with wildcard gaps and an uncertain sequence $s$ , what is the probability that $q$ is a subsequence of $s$ ? Table 1 shows an example of an uncertain sequence $s$ . Each element of the set {A, C, G, T} has a probability of existence at every position of $s$ , and the probabilities in each column of the matrix sum to one. Given a pattern $q$ with wildcard gaps, for example A[1,3]C[1,3]G, our goal is to calculate the probability that ACG is a subsequence of $s$ such that each gap between consecutive elements of $q$ matches at least one and at most three wildcards.

Table 1
An uncertain sequence of length 8

	s[1]	s[2]	s[3]	s[4]	s[5]	s[6]	s[7]	s[8]
A	0.12	0.8	0.1	0.2	0.3	0.7	0.04	0.32
C	0.52	0.04	0.2	0.4	0.2	0.06	0.16	0.4
G	0.16	0.04	0.3	0.2	0.4	0.04	0.3	0.08
T	0.2	0.12	0.4	0.2	0.1	0.2	0.5	0.2

Matching a subsequence query pattern with periodical wildcard gaps in an uncertain sequence poses several challenges. (1) To compute the probability that $q$ is a subsequence of an uncertain sequence $s$ , a naive technique is to enumerate all the possible worlds. However, the number of possible worlds increases exponentially with the length of $s$ . For example, the sequence in Table 1 has $4^{8}=65536$ possible worlds. (2) Existing algorithms for pattern matching in uncertain sequences allow unlimited gaps, and existing algorithms for substring matching in uncertain sequences allow no gaps within a match. Pattern matching with periodical wildcards, where gaps are bounded, is more complex than either of these two cases. Specifically, we make the following contributions:

To our knowledge, this is the first work that attempts to solve the problem of pattern matching with periodical wildcards in uncertain sequences, the techniques of which are successfully used in a DNA application.

We present a dynamic programming approach, called CoDP, to compute the probability that $q$ is a subsequence of $s$ , and the time complexity decreases significantly compared with the naive approach.

Our algorithm can also calculate the substring matching probability in uncertain sequences with an arbitrary alphabet, which is verified by extensive experiments.

The rest of the paper is organized as follows: We first discuss related work in Section 2. Then we introduce the preliminaries of our work in Section 3. We give the formal definition of our model and describe our proposed approach CoDP in Section 4. Then we sketch an extension that leverages our solution for computing the subsequence matching probability in Section 5. In order to evaluate the effectiveness of our model, we conduct experiments in Section 6 and the experimental results show our model outperforms the state-of-the-art baselines. Finally, we conclude the paper and describe the future work in Section 7.

2. Related work

The problem of sequential pattern matching with wildcards has been well studied in the literature in the context of deterministic data, and many algorithms have been proposed to solve this problem. The work proposed in [8] focused on pattern matching with wildcards, gap-length constraints and the one-off condition. The authors proposed a graph structure WON-Net to obtain all candidate matching solutions, and designed the WOW algorithm with the weighted centralization measure based on nodes’ centrality-degrees. Xie et al. [9] proposed an efficient algorithm SPMW for sequential pattern mining with wildcards. And they proposed a new data structure level instance graph to represent all instances of a pattern with the gap constraint. Wu et al. [10] proposed two novel algorithms, MAPB (Mining sequentiAl Pattern using incomplete Nettree with Breadth first search) and MAPD (Mining sequentiAl Pattern using incomplete Nettree with Depth first search), to mine sequential patterns with periodic wildcard gaps. A more recent study [11] addressed a more general strict approximate pattern matching with Hamming distance, named SAP, and proposed an effective online algorithm named SETA, based on the subnettree structure (a Nettree is an extension of a tree with multi-parents and multi-roots).

In the context of uncertain databases, frequent itemset mining and sequential pattern mining are two of the most important pattern mining problems studied. For the problem of frequent itemset mining, earlier work often used expected support to measure pattern frequency [12, 13]. However, expected support based approaches do not consider probability distributions, that is, they are not able to give any probabilistic guarantees. In fact, an itemset whose expected support is greater than a threshold may have a low probability of having a support greater than this threshold. As a result, recent research focuses more on using probability measurements [14, 15, 16, 17]. Wang et al. [16] proposed the definition, probabilistic prevalent colocations, trying to find all the colocations that are likely to be prevalent in a randomly generated possible world. First, they proposed pruning strategies for candidates to reduce the amount of computation of the probabilistic participation index values. Next, they designed an improved dynamic programming algorithm for identifying candidates. Lee and Yun [17] proposed a List-based uncertain frequent pattern mining algorithm (LUNA), which is an efficient and exact method for mining uncertain frequent patterns based on novel data structures and mining techniques that can guarantee the correctness of the mining results without any false positives.

Unfortunately, there is limited work on the problem of sequential pattern mining on uncertain data. Zhao et al. [18] established two uncertain sequence data models abstracted from many real-life applications involving uncertain sequence data, and developed two algorithms, collectively called U-PrefixSpan, for probabilistically frequent sequential patterns mining. Work by Li et al. [19] focused on the core problem of computing substring matching probability in uncertain sequences and proposed a dynamic programming algorithm for this task. A study in [20] proposed a semantics called (k, $\tau$ )-matching queries and devised techniques for indexing and verification. All the work allowed either unlimited gaps or no gaps. To the best of our knowledge, currently there is no existing work about pattern matching with periodical wildcards in uncertain sequences.

3. Preliminaries

Let $s=\{s[1],s[2],\ldots,s[n]\}$ be a sequence that contains $n$ characters chosen from a finite alphabet $\zeta$ . Let $q=q[1]\varepsilon_{1}(N_{1},M_{1})q[2]\ldots\varepsilon_{m-1}(N_{m-1},M_{m-1})q[m]$ be a pattern with wildcards, where a wildcard can match any character of alphabet $\zeta$ . The pattern contains $m$ ( $m\leqslant n$ ) characters chosen from the same alphabet $\zeta$ . The gap $\varepsilon_{i}(N_{i},M_{i})$ ( $N_{i}\leqslant M_{i}$ , $1\leqslant i\leqslant m-1$ ) represents the local length constraint $[N_{i},M_{i}]$ of wildcards between $q[i]$ and $q[i+1]$ . For example, $\varepsilon_{1}(9,12)$ matches any 9, 10, 11, or 12 characters between $q[1]$ and $q[2]$ .

Definition 1. The pattern $q$ is called a subsequence of $s$ iff there exists an occurrence of $q$ in $s$ . An occurrence $\textit{occ}=(k_{1},k_{2},\ldots,k_{m})$ is a sequence of positions of $q$ in $s$ that respects the gap constraints: $q[i]$ appears at position $k_{i}$ of $s$ , $1\leqslant i\leqslant m$ , $1\leqslant k_{i}\leqslant n$ , and the Gap-Length Constraints (local length constraints) hold:

$\displaystyle\begin{cases}s[k_{i}]=q[i], & 1\leqslant i\leqslant m\\ k_{i-1}<k_{i}, & 2\leqslant i\leqslant m\\ N_{i-1}+1\leqslant k_{i}-k_{i-1}\leqslant M_{i-1}+1, & 2\leqslant i\leqslant m\end{cases}$ (1)
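The constraints in Eq. (1) can be illustrated on a deterministic sequence with a short Python sketch. It assumes a periodical pattern, i.e. a single pair of bounds shared by all gaps; the function names `satisfies_gap_constraints` and `find_occurrence` and the parameters `n_min` and `m_max` are hypothetical and are not part of the algorithms in this paper.

```python
from typing import List, Optional, Sequence, Tuple


def satisfies_gap_constraints(s: str, q: str, occ: Sequence[int],
                              n_min: int, m_max: int) -> bool:
    """Check Eq. (1) for a 1-based occurrence occ = (k_1, ..., k_m)."""
    if len(occ) != len(q):
        return False
    for i, k in enumerate(occ):
        # s[k_i] must equal q[i] (positions are 1-based in the paper).
        if not (1 <= k <= len(s)) or s[k - 1] != q[i]:
            return False
        if i > 0:
            # N + 1 <= k_i - k_{i-1} <= M + 1
            if not (n_min + 1 <= k - occ[i - 1] <= m_max + 1):
                return False
    return True


def find_occurrence(s: str, q: str, n_min: int,
                    m_max: int) -> Optional[Tuple[int, ...]]:
    """Return one occurrence of q in s by depth-first search, or None."""
    def extend(occ: List[int]) -> Optional[Tuple[int, ...]]:
        if len(occ) == len(q):
            return tuple(occ)
        if not occ:
            candidates = range(1, len(s) + 1)
        else:
            lo = occ[-1] + n_min + 1
            hi = min(occ[-1] + m_max + 1, len(s))
            candidates = range(lo, hi + 1)
        for k in candidates:
            if s[k - 1] == q[len(occ)]:
                found = extend(occ + [k])
                if found is not None:
                    return found
        return None

    return extend([])


print(find_occurrence("ATCGG", "ACG", 1, 3))
print(satisfies_gap_constraints("ATCGG", "ACG", (1, 3, 4), 1, 3))
```

The second call returns `False` because positions 3 and 4 leave no wildcard between C and G, which violates the lower bound of 1.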

The pattern $q$ may match in the sequence $s$ in several places. If $q$ is a subsequence of $s$ , it should match in $s$ in at least one place. In this paper, we tackle the problem of pattern matching in uncertain sequences. The significant difference between uncertain sequences and certain ones is that we do not definitely know whether a character appears in $s$ or not, that is to say, the occurrence of the character of $q$ is probabilistic.

Definition 2. An uncertain sequence $s=\{s[1],s[2],\ldots,s[n]\}$ is a sequence that contains n characters chosen from a finite alphabet $\zeta$ . Each character $x$ ( $x\in\zeta$ ) in $s[j](1\leqslant j\leqslant n)$ is associated with a probability $P(s[j]=x)$ , which indicates the likelihood that character $x$ is present in $s[j]$ . The sequence $s$ can be expressed by a probability matrix with the size of $|\zeta|\times n$ , where $\sum_{x\in\zeta}P(s[j]=x)=1$ , $1\leqslant j\leqslant n$ .

To interpret sequence uncertainty, the Possible World Semantics (or PWS in short) is often used [21]. Conceptually, a sequence is viewed as a set of deterministic instances (called possible worlds), each of which contains a set of characters. For example, a possible world for Table 2a consists of the characters {AACA}, existing with a probability of 0.1 $\times$ 0.9 $\times$ 0.95 $\times$ 1 $=$ 0.0855. Any query evaluation algorithm for a probabilistic sequence has to be correct under PWS. That is, the results produced by the algorithm should be the same as if the query is evaluated on every possible world [21].

Although PWS is intuitive and useful, evaluating queries under this notion is costly, because a probabilistic sequence has an exponential number of possible worlds. For example, the uncertain sequence in Table 2a has length 4 over an alphabet of size 4, so it has at most $4^{4}=256$ possible worlds. Excluding the possible worlds whose probability is zero, the uncertain sequence in Table 2a has $4\times 2\times 2\times 1=16$ possible worlds, which are enumerated in Table 2b. Performing query evaluation or data mining under PWS can thus be technically challenging.
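The enumeration of nonzero-probability possible worlds can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary `TABLE_2A` transcribes Table 2a, and the helper name `possible_worlds` is an assumption made for illustration.

```python
from itertools import product
from math import prod

# Table 2a: P(s[j] = x) for each character x and position j = 1..4.
TABLE_2A = {
    "A": [0.1, 0.9, 0.0, 1.0],
    "C": [0.3, 0.0, 0.95, 0.0],
    "G": [0.1, 0.0, 0.0, 0.0],
    "T": [0.5, 0.1, 0.05, 0.0],
}


def possible_worlds(table):
    """Yield (world, probability) for every world with nonzero probability."""
    n = len(next(iter(table.values())))
    # Keep only the characters that can actually occur at each position.
    choices = [
        [(c, probs[j]) for c, probs in table.items() if probs[j] > 0]
        for j in range(n)
    ]
    for combo in product(*choices):
        world = "".join(c for c, _ in combo)
        yield world, prod(p for _, p in combo)


worlds = dict(possible_worlds(TABLE_2A))
print(len(worlds))                  # number of nonzero-probability worlds
print(round(worlds["AACA"], 4))     # probability of the world AACA
```

Positions are assumed to be independent, as in the product used for the world AACA above.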

Definition 3. The probability that a pattern with periodical wildcards $q$ is a subsequence of an uncertain sequence $s$ is called subsequence matching probability, and it is denoted as $P(q\subseteq s)$ .

Our task is to calculate the subsequence matching probability, and we will describe our algorithm in Section 4.

Table 2

An uncertain sequence and its possible worlds

(a): An uncertain sequence of length 4
	S[1]	S[2]	S[3]	S[4]
A	0.1	0.9	0	1
C	0.3	0	0.95	0
G	0.1	0	0	0
T	0.5	0.1	0.05	0

(b): Possible worlds of the uncertain sequence
W	Characters in w	Prob.	W	Characters in w	Prob.
w1	AACA	0.0855	w9	GACA	0.0855
w2	AATA	0.0045	w10	GATA	0.0045
w3	ATCA	0.0095	w11	GTCA	0.0095
w4	ATTA	0.0005	w12	GTTA	0.0005
w5	CACA	0.2565	w13	TACA	0.4275
w6	CATA	0.0135	w14	TATA	0.0225
w7	CTCA	0.0285	w15	TTCA	0.0475
w8	CTTA	0.0015	w16	TTTA	0.0025

4. Calculating subsequence matching probabilities

We now illustrate how to calculate subsequence matching probabilities under possible world semantics. Let $W$ be the set of possible worlds and let $s(w)$ be the instantiated sequence in possible world $w\in W$ . The subsequence matching probability can be computed by the following formula.

$\displaystyle P(q\subseteq s)=\sum\limits_{w\in W:q\subseteq s(w)}P(w)$ (2)

Example 1. Given the uncertain sequence $s$ in Table 2 and a pattern with wildcards $q=$ C[0,2]A, the probability that $q$ is a subsequence of $s$ is:

$\displaystyle P(q\subseteq s)=P(w_{1})+P(w_{3})+P(w_{5})+P(w_{6})+P(w_{7})+P(w_{8})+P(w_{9})+P(w_{11})+P(w_{13})+P(w_{15})=0.965$
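Eq. (2) can be evaluated by brute force on small inputs, which is useful as a reference when checking a faster algorithm. The sketch below is the naive possible-world enumeration, not CoDP; it assumes a periodical pattern with one shared gap `[n_min, m_max]`, and all names (`TABLE_2A`, `is_gapped_subsequence`, `brute_force_probability`) are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"

# Table 2a: P(s[j] = x) for each character x and position j = 1..4.
TABLE_2A = {
    "A": [0.1, 0.9, 0.0, 1.0],
    "C": [0.3, 0.0, 0.95, 0.0],
    "G": [0.1, 0.0, 0.0, 0.0],
    "T": [0.5, 0.1, 0.05, 0.0],
}


def is_gapped_subsequence(world, q, n_min, m_max):
    """True if q occurs in the deterministic world under the gap bounds."""
    # ends holds every position at which the pattern prefix can end.
    ends = {k for k, c in enumerate(world) if c == q[0]}
    for ch in q[1:]:
        ends = {
            k for k, c in enumerate(world)
            if c == ch and any(n_min + 1 <= k - p <= m_max + 1 for p in ends)
        }
    return bool(ends)


def brute_force_probability(table, q, n_min, m_max):
    """Sum P(w) over all possible worlds w in which q occurs (Eq. (2))."""
    n = len(next(iter(table.values())))
    total = 0.0
    for world in product(ALPHABET, repeat=n):
        p = 1.0
        for j, c in enumerate(world):
            p *= table[c][j]
        if p > 0 and is_gapped_subsequence(world, q, n_min, m_max):
            total += p
    return total


print(round(brute_force_probability(TABLE_2A, "CA", 0, 2), 4))
```

For Example 1 this sum is 0.965 (0.96 when truncated to two decimals). The running time grows as $|\zeta|^{n}$ , so the sketch is only practical for very short sequences.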

Because the number of possible worlds will increase exponentially, it is costly to calculate sequence matching probabilities under possible world semantics. We will propose a dynamic programming algorithm for this task.

4.1 Our dynamic programming approach CoDP

Definition 4. For a pattern $q$ with wildcards and an uncertain sequence $s$ , we define the prefixes $q(i)=q[1]\varepsilon_{1}(N_{1},M_{1})q[2]\ldots\varepsilon_{i-1}(N_{i-1},M_{i-1})q[i]$ and $s(j)=\{s[1],s[2],\ldots,s[j]\}$ , where $i\in[1,m]$ and $j\in[1,n]$ .

Definition 5. $P_{q(i),j}$ is defined as the probability that $q(i)$ is the subsequence of $s(j)$ , where $i\in[1,m]$ and $j\in[1,n]$ .

The treatment of periodical wildcards is a central feature of our approach. In our algorithm, the pattern has no repeated characters, that is, $q[i]\neq q[j]$ for all $i\neq j$ , $1\leqslant i\leqslant m$ , $1\leqslant j\leqslant m$ . In addition, the gap between every two consecutive characters of the pattern is the same, i.e., $\varepsilon_{i}(N_{i},M_{i})=\varepsilon_{j}(N_{j},M_{j})$ for all $i\neq j$ , $1\leqslant i\leqslant m-1$ , $1\leqslant j\leqslant m-1$ ; we therefore write $N$ and $M$ for the common lower and upper bounds. To develop a dynamic programming method that can handle the gap constraint, we introduce the notion of a tail gap [19].

Definition 6. Given a pattern $q(i)=q[1]\varepsilon_{1}(N_{1},M_{1})q[2]\ldots\varepsilon_{i-1}(N_{i-1},M_{i-1})q[i]$ and an uncertain sequence $s(j)=\{s[1],s[2],\ldots,s[j]\}$ , where $i\in[1,m]$ and $j\in[1,n]$ , and given an occurrence $\textit{occ}_{i}=(k_{1},k_{2},\ldots,k_{i})$ , the tail gap, denoted as $\vee_{i,j}$ , is defined as $\vee_{i,j}=j-k_{i}$ .

Example 2. For $\textit{occ}_{3}=(1,3,5)$ and $s(6)=\{s[1],s[2],\ldots s[6]\}$ , $\vee_{3,6}=6-5=1$ .

Definition 7. Given a positive integer y, an uncertain sequence $s(j)=\{s[1],s[2],\ldots s[j]\}$ and an occurrence $\textit{occ}_{i}=(k_{1},k_{2},\ldots,k_{i})$ , where $s(j)\subseteq s$ , we say $\textit{occ}_{i}\subseteq s(j)$ fulfills the tail gap constraint iff $\textit{occ}_{i}$ satisfies $\vee_{i,j}\leqslant y$ .

Definition 8. $P_{q(i),j}^{\vee y,x}$ is defined as the probability that $q(i)$ is a subsequence of $s(j)$ with tail gap constraint $\vee y$ and Gap-Length Constraints.

The main idea of our approach CoDP is to split the problem of computing $P_{q(i),j}$ at the first $j$ positions into subproblems of computing the matching probabilities at the first $j-1$ positions. Consider a pattern with wildcards $q(i)=q[1]\varepsilon_{1}(N_{1},M_{1})q[2]\ldots\varepsilon_{i-1}(N_{i-1},M_{i-1})q[i]$ and an uncertain sequence $s(j)=\{s[1],s[2],\ldots,s[j]\}$ , where $i\in[1,m]$ and $j\in[1,n]$ . If we assume that $s[j]=q[i]$ , then $P_{q(i),j}$ is equal to the probability $P_{q(i-1),j-1}$ . Otherwise, $P_{q(i),j}$ is equal to the probability $P_{q(i),j-1}$ . In our algorithm, gaps are allowed in a match. Thus, techniques for handling the gap constraint are required, which leads to Lemma 1.

Lemma 1. If $s[j]=q[i]$ , $P_{q(i),j}$ with Gap-Length Constraints is equal to the probability $P_{q({i-1}),j-1-N}$ that satisfies the tail gap constraint $\vee y=M-N$ . Otherwise, $P_{q(i),j}$ is equal to the probability $P_{q(i),j-1}$ that satisfies Gap-Length Constraints.

The occurrence of $q$ in $s$ should satisfy the gap-length constraints. In Lemma 1, if $s[j]=q[i]$ , we look for $q[i-1]$ within the first $j-1-N$ positions of $s(j)$ to meet the minimum gap constraint. Because the gap-length constraints include both a minimum and a maximum gap, the position that matches $q[i-1]$ also needs to satisfy the tail gap constraint.

By splitting the problem in this way, we can use the recursion in Eq. (3), which tells us what these probabilities are, to compute $P_{q(i),j}$ by means of the paradigm of dynamic programming.

$\displaystyle P_{q(i),j}^{M}=P(s[j]\neq q[i])*P_{q(i),j-1}^{M}+P(s[j]=q[i])*P_{q(i-1),j-1-N}^{\vee y}$ (3)

where

$\displaystyle P_{q(0),j}^{\vee y}=1\quad\forall 1\leqslant j\leqslant n,\quad P_{q(i),j}^{\vee y}=P_{q(i),j}^{x}=0\quad\forall i+(i-1)\ast N>j$

However, Eq. (3) does not yield the exact probability that the pattern $q$ is a subsequence of the uncertain sequence $s$ , because some special cases must be taken into account. In Table 2, for the pattern with wildcards $q=C[0,1]A$ , because $P(s[4]=q[2])=1$ , Eq. (3) only matches $q[1]$ at the second and third positions of $s$ . But $q$ also matches the first two positions of the instances in $w5$ and $w6$ ; that is, we should also match $q$ within the first three positions even though $P(s[4]=q[2])=1$ . Otherwise, the calculated probability will be smaller than the exact probability. Eq. (3) is therefore modified as follows.

$\displaystyle P_{q(i),j}^{M}=P(s[j]\neq q[i])*P_{q(i),j-1}^{M}+P(s[j]=q[i])*\left(P_{q(i-1),j-1-N}^{\vee y}+P_{q(i),j-1}^{M}\right)$ (4)

But there is still something incorrect in Eq. (4). In Table 2, for the pattern $q=C[{0,1}]A$ , the probability computed by Eq. (4) will be larger than the exact probability, because $w5$ will be counted twice. In order to get the exact probability, we will subtract the probability that $q({i-1})$ exists at the first $j-1-N$ positions with the tail gap constraint, and $q(i)$ exists at the first $j-1$ positions of $s(j)$ at the same time. Lemma 2 shows how to use the dynamic programming scheme to compute $P_{q(m),n}^{M}$ .

Lemma 2. Entry ( $\vee y=M-N$ , $i=m$ , $j=n$ ):

$\displaystyle P_{q(i),j}^{M}=P(s[j]\neq q[i])*P_{q(i),j-1}^{M}+P(s[j]=q[i])*\left(P_{q(i-1),j-1-N}^{\vee y}+P_{q(i),j-1}^{M}-\textit{common}\right)$ (5)

For $1<i<m$ :

$\displaystyle P_{q(i),j}^{\vee y}=P(s[j]\neq q[i])*P_{q(i),j-1}^{\vee y-1}+P(s[j]=q[i])*\left(P_{q(i-1),j-1-N}^{\vee y}+P_{q(i),j-1}^{\vee y-1}-\textit{common}\right)$ (6)

For $i=1$ :

$\displaystyle P_{q(i),j}^{\vee y}=P(s[j]\neq q[i])*P_{q(i),j-1}^{\vee y-1}+P(s[j]=q[i])*P_{q(i-1),j-1}^{\vee y}$ (7)

$\displaystyle P_{q(0),j}^{\vee y}=1\quad\forall 1\leqslant j\leqslant n,\quad P_{q(i),j}^{\vee y}=P_{q(i),j}^{x}=0\quad\forall i+(i-1)\ast N>j$ (8)

Lemma 2 is explained as follows. Equation (5) is the entry of our dynamic programming approach CoDP with $\vee y=M-N$ , $i=m$ and $j=n$ . Then $P_{q(i-1),j-1-N}^{\vee y}$ of Eq. (5) is calculated by Eq. (6) if $1<i<m$ or by Eq. (7) if $i=1$ . These equations are evaluated recursively, and the recursion termination conditions are given in Eq. (8). In Eqs (5) and (6), the label common represents a function that calculates the probability that $q(i-1)$ exists at the first $j-1-N$ positions with the tail gap constraint and, at the same time, $q(i)$ exists at the first $j-1$ positions of $s(j)$ . To get the exact probability, bit sequences are adopted in the function; they are discussed briefly in Section 4.2.

4.2 Bit sequence representation of occurrences

In function common, for each occurrence, a bit sequence is constructed. If the character $q[i]$ appears at the $j^{th}$ position of sequence $s$ , the $j^{th}$ bit is set to be $q[i]$ .

For example, in Table 2, for the pattern $q=C[0,1]A$ , the probability that $q$ is a subsequence of the uncertain sequence $s$ can be calculated according to Eq. (5) as follows, where the first term vanishes because $P(s[4]=q[2])=1$ .

$\displaystyle P_{q(2),4}^{1}=P(s[4]\neq q[2])*P_{q(2),3}^{1}+P(s[4]=q[2])*\left(P_{q(1),3}^{\vee 1}+P_{q(2),3}^{1}-\textit{common}\right)=P_{q(1),3}^{\vee 1}+P_{q(2),3}^{1}-\textit{common}$ (9)

To satisfy the tail gap constraint, $q(1)$ will exist at the third position of $s$ , and the bit sequence $\{00C\}$ is constructed. The positions that $q(2)$ exists in $s(3)$ are the first and the second, and the bit sequence is $\{CA0\}$ . To compute the probability, we will use bitwise OR operations on the bit sequences and get a new one $\{CAC\}$ . The new bit sequence means there are possible worlds that $q(1)$ and $q(2)$ exist at the same time, and the probability is 0.3 $\times$ 0.9 $\times$ 0.95 $=$ 0.2565. The bit sequences, which represent the positions that $q(2)$ exists in $s(4)$ , are $\{00CA,CA00\}$ . We can calculate the exact probability by using bit sequences. Figure 1 shows an example of how to use our dynamic programming approach CoDP to calculate $P_{q(3),6}^{2}$ .
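As an illustrative check on Eq. (9), its three terms can be evaluated directly from Table 2a for $q=C[0,1]A$ . The arithmetic below is added here for clarity and assumes independence between positions; the occurrences of $q(2)$ in $s(3)$ other than positions (1,2) contribute nothing because $P(s[2]=C)=0$ and $P(s[3]=A)=0$ .

$\displaystyle P_{q(1),3}^{\vee 1}=1-(1-P(s[2]=C))(1-P(s[3]=C))=1-1\times 0.05=0.95$

$\displaystyle P_{q(2),3}^{1}=P(s[1]=C)\times P(s[2]=A)=0.3\times 0.9=0.27$

$\displaystyle P_{q(2),4}^{1}=0.95+0.27-0.2565=0.9635$

The same value is obtained from Table 2b by adding $P(w6)=0.0135$ to the total probability 0.95 of the eight worlds that have C at the third position.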

Table 3

An uncertain sequence

	S[1]	S[2]	S[3]	S[4]	S[5]	S[6]
A	0	0	0.1	0	0.2	0
C	0.3	0.4	0.2	0.1	0.4	0
G	0.7	0.6	0.3	0.9	0.2	1
T	0	0	0.4	0	0.2	0

4.3 An example

As shown in Fig. 1, $q=C[1,2]A[1,2]G$ and the sequence $s$ is shown in Table 3. There are $m+1$ layers in total. In each layer, $j\in[i+(i-1)\ast N,\,n-(m-i)\ast(N+1)]$ . In the internal layers, such as $i=1$ and $i=2$ , only those $P_{q(i),j}^{\vee y}$ with $y\in[\vee y_{\min},\vee y_{\max}]$ need to be calculated and stored in each column. The rightmost node of each row gets $\vee y_{\max}$ for its column, where $\vee y_{\max}=\min(j-i-(i-1)\ast N,\,M-N)$ , whilst the leftmost node of each row gets $\vee y_{\min}$ for its column, where $\vee y_{\min}=\max(g-(n-(m-i)\ast(N+1)-j),\,0)$ and $g=M-N$ . Except for the bottom layer, bit sequences are stored with $P_{q(i),j}^{\vee y}$ , $y\in[\vee y_{\min},\vee y_{\max}]$ , to get the exact probability.
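The column and tail-gap ranges stated above are plain formulas, so they can be tabulated with a small Python sketch. The function names `column_range` and `tail_gap_range` are hypothetical; the code only evaluates the formulas as written and does not reproduce the contents of Fig. 1.

```python
def column_range(i, m, n, N):
    """Range of j for layer i: [i + (i-1)*N, n - (m-i)*(N+1)]."""
    return i + (i - 1) * N, n - (m - i) * (N + 1)


def tail_gap_range(i, j, m, n, N, M):
    """Range [y_min, y_max] of tail gaps kept for node (i, j)."""
    g = M - N
    _, j_hi = column_range(i, m, n, N)
    y_max = min(j - i - (i - 1) * N, g)
    y_min = max(g - (j_hi - j), 0)
    return y_min, y_max


# Setting of Section 4.3: q = C[1,2]A[1,2]G on the sequence of Table 3.
m, n, N, M = 3, 6, 1, 2
for i in range(1, m + 1):
    j_lo, j_hi = column_range(i, m, n, N)
    print("layer", i, "columns", list(range(j_lo, j_hi + 1)))
for i in (1, 2):  # internal layers
    j_lo, j_hi = column_range(i, m, n, N)
    for j in range(j_lo, j_hi + 1):
        print("node", (i, j), "tail gaps", tail_gap_range(i, j, m, n, N, M))
```

With these parameters each layer spans only two columns, which shows how the gap bounds prune the dynamic programming table.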

Figure 1.

Details of how to calculate $P_{q(3),6}^{2}$ .

5. Applications and extensions

In this section, we will sketch an extension that leverages our solution for computing the subsequence matching probability.

Since the difference between substrings matching and pattern matching with wildcards is that the former allows no gaps while the latter allows limited gaps, our approach can also be used to calculate substring matching probability after it is slightly modified.

Definition 9. The probability that $q$ is a substring of an uncertain sequence $s$ is called substring matching probability.

Lemma 3 shows how to use our modified dynamic programming scheme, called CoDPstr, to compute substring matching probability, denoted as $P_{q(m),n}$ .

Lemma 3. Entry:

For $i=m$ , $j=n$ :

$\displaystyle P_{q(i),j}=P(s[j]\neq q[i])*P_{q(i),j-1}+P(s[j]=q[i])*\left(P_{q(i-1),j-1}+P_{q(i),j-1}-\textit{common}\right)$ (10)

For $1\leqslant i<m$ :

$\displaystyle P_{q(i),j}=P_{q(i-1),j-1}*P(s[j]=q[i])$ (11)

$\displaystyle P_{q(0),j}=1\quad\forall 1\leqslant j\leqslant n,\quad P_{q(i),j}=0\quad\forall i>j$ (12)

Equation (10) is the entry of our modified dynamic programming approach CoDPstr with $i=m$ and $j=n$ . Then $P_{q(i-1),j-1}$ of Eq. (10) is calculated by Eq. (11) if $1\leqslant i<m$ . The recursion termination conditions are shown in Eq. (12). In Eq. (10), the label common stands for the probability that $q(i-1)$ and $q(i)$ exist at the first $j-1$ positions at the same time. Details of how to compute the value of common are shown in Algorithm 1.

Algorithm 1 Computing value of common

Input: string $q$ , matrix of uncertain sequence prob, array $k$

Output: the value of common

1. for $j=m$ ; $j<=n$ ; $j++$ do
2.     if $j<m+m-k[1]$ then
3.         Common $=0$
4.     else
5.         if $k[1]==0$ then
6.             Common $=P_{q(m-1),j-1}\ast P_{q(m),j-m}$
7.         else
8.             if $j<m+m$ then
9.                 Common $=P_{q(m-1),j-1}\ast P_{q(m-k[1]),j-m}$
10.             else
11.                 $co=P_{q(m-1),j-1}\ast P_{q(m),j-m}$
12.                 for $i=1$ ; $i<m$ ; $i++$ do
13.                     if $k[i]==0$ then
14.                         break
15.                     else
16.                         $aa=P_{q(m),j-m+k[i]}-P_{q(m),j-m+k[i]-1}$
17.                         if $aa!=0$ then
18.                             for $l=k[i]+1$ ; $l<=m-1$ ; $l++$ do
19.                                 $aa=aa\ast\textit{prob}[q[l]][j-m+l]$ // $\textit{prob}[q[l]][j-m+l]$ is the probability that the $l^{th}$ element of pattern $q$ exists at the $(j-m+l)^{th}$ position of sequence $s$
20.                         $co=co+aa$
21.                 Common $=co$

In Algorithm 1, four cases should be taken into account to compute the value of common. In the first case (lines 2–3), $j$ is not big enough to let $q(m-1)$ and $q(m)$ exist at the same time, and the value of common is 0. In the second case (lines 5–6), $j$ is bigger and there is no overlap between the former and the latter part of $q$ , for example $q=$ ATTGAC. In the third case (lines 8–9), there is overlap between the former and the latter part of $q$ , for example $q=$ ACTGAC, and $j$ is smaller than $2m$ ; the value of common is the product of the probability that $q(m-1)$ exists at the first $j-1$ positions and the probability that $q(m-k[1])$ exists at the first $j-m$ positions of sequence $s$ . The fourth case is shown in lines 11–21. The array $k$ records the positions where there is overlap between the former and the latter part of $q$ . Details of how to get these positions are shown in Algorithm 2.

Algorithm 2 Finding overlapping positions

Input: string $q$

Output: array $k$

1. $l=1$
2. for $i=0$ ; $i<m+1$ ; $i++$ do
3.     $k[i]=0$
4. for $i=2$ ; $i<=m$ ; $i++$ do
5.     if $q[i]!=q[1]$ then
6.         break
7. if $i>m$ then
8.     for $j=1$ ; $j<m$ ; $j++$ do
9.         $k[j]=m-j$
10. else
11.     for $i=m-1$ ; $i>=1$ ; $i--$ do
12.         if $q[i]==q[m]$ then
13.             $kk=m-1$
14.             for $j=i-1$ ; $j>=1$ ; $j--$ do
15.                 if $q[j]==q[kk]$ then
16.                     $kk--$
17.                 else
18.                     break
19.             if $j==0$ then
20.                 $k[l]=i$
21.                 $l++$

In Algorithm 2, array $k$ is initialized as 0 at first. In lines 4–9, $k[j]=m-j,1\leqslant j\leqslant m-1$ , if all the characters in string $q$ are the same. For example, if $q=$ CCCC, $k=\{3,2,1\}$ , it indicates that the former three characters of $q$ are the same as the latter three ones, the former two letters are the same as the latter two, and the first is the same as the last. In lines 11–21, we scan the string $q$ in reverse. If $q$ [i] $=$ $q$ [m], we will check whether the former $i$ characters are the same as the latter $i$ ones. If yes, the position $i$ will be stored in array $k$ . For example, if $q=$ CACACA, $k=\{4,2\}$ .
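The overlap positions described above are the lengths $i<m$ for which the first $i$ characters of $q$ equal its last $i$ characters. A compact Python sketch of that definition follows; it is a direct restatement rather than a line-by-line port of Algorithm 2, the function name `overlap_positions` is hypothetical, and it returns a plain list instead of a zero-padded array.

```python
def overlap_positions(q: str) -> list:
    """Lengths i < len(q), longest first, with q[:i] == q[-i:]."""
    m = len(q)
    return [i for i in range(m - 1, 0, -1) if q[:i] == q[m - i:]]


print(overlap_positions("CCCC"))    # [3, 2, 1]
print(overlap_positions("CACACA"))  # [4, 2]
print(overlap_positions("ATTGAC"))  # []
print(overlap_positions("ACTGAC"))  # [2]
```

The slicing comparison costs up to $O(m)$ per candidate length, so this form favours readability over speed; the KMP failure function yields the same lengths in $O(m)$ total.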

Using our modified dynamic programming scheme CoDPstr, as shown in Lemma 3, we can compute the substring matching probability in linear time.
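When implementing a scheme such as CoDPstr, an independent reference value is helpful for testing. The Python sketch below is not CoDPstr and not DPstr: it is a standard alternative that tracks, position by position, the probability of each state of the KMP matching automaton of $q$ , with the full-match state made absorbing. It assumes independent positions, and the names `TABLE_2A` and `substring_match_probability` are hypothetical.

```python
# Table 2a: P(s[j] = x) for each character x and position j = 1..4.
TABLE_2A = {
    "A": [0.1, 0.9, 0.0, 1.0],
    "C": [0.3, 0.0, 0.95, 0.0],
    "G": [0.1, 0.0, 0.0, 0.0],
    "T": [0.5, 0.1, 0.05, 0.0],
}


def substring_match_probability(table, q):
    """P(q occurs as a contiguous substring of the uncertain sequence)."""
    m = len(q)
    n = len(next(iter(table.values())))
    # fail[i] = length of the longest proper border of q[:i].
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k > 0 and q[i] != q[k]:
            k = fail[k]
        if q[i] == q[k]:
            k += 1
        fail[i + 1] = k

    def step(state, c):
        """Next automaton state after reading c in a state < m."""
        while state > 0 and q[state] != c:
            state = fail[state]
        return state + 1 if q[state] == c else 0

    dist = [0.0] * (m + 1)   # dist[t] = P(automaton is in state t)
    dist[0] = 1.0
    for j in range(n):
        new = [0.0] * (m + 1)
        new[m] = dist[m]     # once matched, stay matched
        for state in range(m):
            if dist[state] == 0.0:
                continue
            for c, probs in table.items():
                new[step(state, c)] += dist[state] * probs[j]
        dist = new
    return dist[m]


print(round(substring_match_probability(TABLE_2A, "CA"), 4))
```

This reference runs in $O(n\cdot m\cdot|\zeta|)$ time in the worst case. For Table 2a and $q=$ CA it gives $1-(1-0.27)(1-0.95)=0.9635$ , since CA can only occur at positions (1,2) or (3,4).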

6. Experiments

In this section, we evaluate the performance of our proposed algorithms. All the experiments were carried out on the Windows 7 operating system, on a machine with a 3.7 GHz Intel(R) Core(TM) i3-4170 processor and 4GB memory. The programs were written in C++ and compiled on Microsoft Visual Studio 2013. All the results are the average of ten runs.

6.1 Evaluation of algorithm CoDP

We test the scalability of our proposed algorithm CoDP on a real-world uncertain DNA sequence database [22]. The database contains 593 uncertain sequences where $\zeta=$ {A,C,G,T}. Each uncertain sequence is relatively short (average size 10.79). Among all the sequences, the longest sequence’s size is 30 and the shortest one’s size is 5.

Figure 2 shows the average time needed to calculate the subsequence matching probability on the real-world dataset. The pattern is $q(3)=q[1]\varepsilon_{1}(N_{1},M_{1})q[2]\varepsilon_{2}(N_{2},M_{2})q[3]$ , and the length of the wildcard gap is varied. In Fig. 2a the upper bound of the gap varies from 0 to 5, and the elapsed time increases with it: when the upper bound increases, the range of admissible gap lengths grows, and the time overhead is proportional to that range. In Fig. 2b the lower bound of the gap varies from 0 to 5, and the elapsed time decreases with it, because the range of admissible gap lengths shrinks as the lower bound increases.

Figure 2.

The average time needed to compute the subsequence matching probability on the real-world dataset.

Figure 3.

The average time needed to compute the substring matching probability on the first synthetic dataset.

6.2 Comparison between our method CoDPstr and DPstr

We compare our modified method CoDPstr with the substring matching approach proposed by Li et al. [19], which we call DPstr, on three datasets. The first dataset consists of several synthetic uncertain sequences. For each uncertain sequence $s$ , the probability $P(s[j]=x)$ is randomly drawn from [0,1], where $j\in[1,n]$ , $x\in\zeta$ , $\zeta=$ {A,C,G,T} and $\sum\nolimits_{x\in\zeta}P(s[j]=x)=1$ . The second dataset is the same as the one used in Section 6.1. The third dataset is a synthetic weather dataset. For each uncertain sequence $s$ , the probability $P(s[j]=x)$ is randomly drawn from [0,1], where $j\in[1,n]$ , $x\in\zeta$ , $\zeta=$ {F, r, R, O, C, s, S, T} and $\sum\nolimits_{x\in\zeta}P(s[j]=x)=1$ . The letters ‘F’, ‘r’, ‘R’, ‘O’, ‘C’, ‘s’, ‘S’ and ‘T’ represent fine day, light rain, heavy rain, overcast, cloudy, light snow, heavy snow and sleet, respectively.
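One way to generate synthetic sequences of this kind is sketched below in Python. The text does not specify how the drawn values are made to sum to one, so the per-column normalization used here is an assumption, as are the function name `random_uncertain_sequence` and the fixed seed.

```python
import random


def random_uncertain_sequence(n, alphabet, rng):
    """Return {character: [P(s[1]=x), ..., P(s[n]=x)]} with columns summing to 1."""
    table = {c: [0.0] * n for c in alphabet}
    for j in range(n):
        # Draw one weight per character from [0, 1), then normalize the column.
        weights = [rng.random() for _ in alphabet]
        total = sum(weights)
        if total == 0.0:  # practically impossible, but avoids division by zero
            weights = [1.0] * len(alphabet)
            total = float(len(alphabet))
        for c, w in zip(alphabet, weights):
            table[c][j] = w / total
    return table


rng = random.Random(2024)  # fixed seed so that runs are repeatable
dna = random_uncertain_sequence(8, "ACGT", rng)
weather = random_uncertain_sequence(90, "FrROCsST", rng)
print(len(dna["A"]), len(weather["F"]))
```

Other schemes, such as sampling each column from a Dirichlet distribution, would also satisfy the sum-to-one constraint but give a different distribution over columns.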

Figure 4.

The average time needed to compute the substring matching probability on the real-world dataset.

Figure 5.

The average time needed to compute the substring matching probability on the weather synthetic dataset.

First, we give efficiency results obtained by utilizing different methods of computing the substring matching probability. Then, we discuss the performance and utility of these two substring matching algorithms.

Figure 3 shows the average time needed to compute the substring matching probability on the first synthetic dataset, across different choices of $q$ and $s$ . Figure 3a compares the elapsed time for computing $p_{q(m),10^{4}}(O)$ of our approach against the method proposed by Li et al. The size of the query substring $q$ varies from 5 to 1000. In Fig. 3b, the size of the uncertain sequence $s$ varies from $n=5\times 10^{5}$ to $n=10^{7}$ , and the size of the query substring $q$ is $m=10$ .

Figure 4 compares the average time overhead of these two methods on a real-world dataset, which is needed to compute the substring matching probability. The size of the query substring $q$ varies from 1 to 8, because the average size of the 593 uncertain sequences is 10.79.

Figure 5 shows the average time needed to calculate the substring matching probability on the synthetic weather dataset, across different choices of $q$ and $s$ . Figure 5a compares the elapsed time for computing $p_{q(m),90}(O)$ of our approach against the method DPstr. The size of the query substring $q$ varies from 3 to 60. In other words, we ask how the weather has fluctuated over three days, five days and so on during the past three months. In Fig. 5b, the size of the uncertain sequence $s$ varies from $n=30$ to $n=360$ , and the length of the query substring $q$ is $m=7$ . From the results $p_{q(7),n}(O)$ , we can learn about week-long weather changes over periods ranging from the past month to the past year. In Fig. 5c, the size of the query substring $q$ varies from 1 to 9.

In Figs 3–5, the red dotted line shows the time needed to compute the substring matching probability with our approach, CoDPstr. The blue line with diamond markers shows the elapsed time of the algorithm DPstr.

As we can see, the elapsed time of our approach is much lower than that of Li et al.'s method on both synthetic datasets and on the real-world dataset. Both approaches use dynamic programming, but their matching processes differ. In DPstr, four scenarios must be considered at each step: forward matching, backward matching, tail matching and reset. The more scenarios there are to consider, the more time elapses, and each scenario is itself complicated and time consuming. In our approach, CoDPstr, only one scenario needs to be taken into account at each step when $i<m$. Our approach therefore takes much less time than Li et al.'s, and the time gap between the two methods widens as the size of the query pattern $q$ increases. The gap also widens as the size of the uncertain sequence grows.
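To make concrete what "computing the substring matching probability" involves, the following is a minimal sketch of a generic automaton-based dynamic program. It is neither DPstr nor CoDPstr, whose per-step scenarios are not reproduced here; it swaps in the standard KMP automaton as the state space. It assumes a character-level uncertainty model in which each position of $s$ is an independent distribution over single-character symbols. The names `build_automaton`, `substring_match_probability` and `brute_force_probability` are hypothetical and chosen for illustration only.

```python
from itertools import product


def build_automaton(q, alphabet):
    """KMP automaton for q (assumed helper, not from the paper).

    delta[state][c] is the length of the longest prefix of q that is a
    suffix of the text read so far, after reading symbol c in `state`.
    """
    m = len(q)
    fail = [0] * (m + 1)  # fail[i]: longest proper border of q[:i]
    k = 0
    for i in range(1, m):
        while k and q[i] != q[k]:
            k = fail[k]
        if q[i] == q[k]:
            k += 1
        fail[i + 1] = k
    delta = [dict() for _ in range(m)]
    for state in range(m):
        for c in alphabet:
            k = state
            while k and q[k] != c:
                k = fail[k]
            delta[state][c] = k + 1 if q[k] == c else 0
    return delta


def substring_match_probability(q, s):
    """Probability that q occurs as a substring of the uncertain sequence s.

    s is a list of dicts mapping a single-character symbol to its
    probability; positions are assumed independent.
    """
    m = len(q)
    alphabet = {c for dist in s for c in dist} | set(q)
    delta = build_automaton(q, alphabet)
    probs = [0.0] * (m + 1)  # probs[j]: probability of being in state j
    probs[0] = 1.0
    for dist in s:
        nxt = [0.0] * (m + 1)
        nxt[m] = probs[m]  # state m is absorbing: q has already occurred
        for state in range(m):
            if probs[state] == 0.0:
                continue
            for c, p in dist.items():
                nxt[delta[state][c]] += probs[state] * p
        probs = nxt
    return probs[m]


def brute_force_probability(q, s):
    """Reference value obtained by enumerating every possible world."""
    total = 0.0
    for world in product(*(dist.items() for dist in s)):
        if q in "".join(c for c, _ in world):
            weight = 1.0
            for _, p in world:
                weight *= p
            total += weight
    return total


# Toy uncertain sequence of length 3 (illustrative values only).
example_s = [{"a": 0.5, "b": 0.5}, {"a": 0.3, "b": 0.7}, {"a": 1.0}]
print(substring_match_probability("ab", example_s))  # approximately 0.35
print(brute_force_probability("ab", example_s))      # same value, by enumeration
```

The dynamic program touches each position of $s$ once and keeps only $m+1$ state probabilities, whereas the brute-force check grows with the number of possible worlds and is usable only on very short sequences.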

7. Conclusions

In this paper, we studied efficient algorithms for computing subsequence matching probabilities with periodical wildcards in uncertain sequence databases, a problem that, to the best of our knowledge, has not been studied before. We then extended our algorithm to handle the substring matching problem. Our experiments showed that the proposed dynamic computation technique effectively avoids the problem of “possible world explosion”. In future work, we plan to study the calculation of matching probabilities for other uncertain data models.
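The scale of the “possible world explosion” can be illustrated with a small sketch. It assumes independent per-position distributions and a plain subsequence query with unrestricted gaps, so it is a simplification and not the CoDP algorithm with periodical wildcards. The function names `subsequence_match_probability` and `possible_world_count` are hypothetical.

```python
from math import prod


def subsequence_match_probability(q, s):
    """Probability that q is a subsequence of the uncertain sequence s.

    Simplifying assumptions (not from the paper): positions of s are
    independent, and any number of symbols may be skipped between
    consecutive symbols of q.
    """
    m = len(q)
    matched = [0.0] * (m + 1)  # matched[i]: greedy match has consumed q[:i]
    matched[0] = 1.0
    for dist in s:
        nxt = [0.0] * (m + 1)
        nxt[m] = matched[m]  # once all of q is matched, it stays matched
        for i in range(m):
            hit = dist.get(q[i], 0.0)  # probability this position emits q[i]
            nxt[i + 1] += matched[i] * hit
            nxt[i] += matched[i] * (1.0 - hit)
        matched = nxt
    return matched[m]


def possible_world_count(s):
    """Number of deterministic sequences that s can instantiate."""
    return prod(len(dist) for dist in s)


# Toy data (illustrative values only).
short_s = [{"a": 0.5, "b": 0.5}, {"a": 0.3, "b": 0.7}, {"a": 0.2, "b": 0.8}]
print(subsequence_match_probability("ab", short_s))  # approximately 0.59

long_s = [{"a": 0.5, "b": 0.5}] * 1000
print(possible_world_count(long_s) == 2 ** 1000)     # True
print(subsequence_match_probability("ab", long_s))   # still n * m updates
```

Under these assumptions, a sequence of 1000 binary positions has $2^{1000}$ possible worlds, while the dynamic program performs on the order of $n \cdot m$ updates.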

Acknowledgments

This work was supported by the National Key Research and Development Program of China under grant No. 2016YFB1000901, the National Natural Science Foundation of China (NSFC) under grants Nos. 61202227 and 61602004, and the US National Science Foundation (NSF) under grant IIS-1613950.

References

1. Pisanti, Crochemore, Grossi and Sagot M.-F., Bases of motifs for generating repeated patterns with wild cards, IEEE/ACM Trans Comput Biol Bioinform 2 (2005), 40–50.

2. B.-W. and Lee, Meta similarity, Appl Intell 35(3) (2011), 359–374.

3. Pasquier, Sanhes, Flouvat and Selmaoui-Folcher, Frequent pattern mining in attributed trees: algorithms and applications, Knowledge and Information Systems 46 (2016), 491–514.

4. Deshpande, Guestrin, Madden S.R., Hellerstein J.M. and Hong, Model-Driven Data Acquisition in Sensor Networks, in: Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004, pp. 588–599.

5. Chen, W.S., Wang and Sun M.T., Leveraging Spatio-Temporal Redundancy for RFID Data Cleansing, in: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indiana, USA, 2010, pp. 51–62.

6. Lei P.-R., A framework for anomaly detection in maritime trajectory behavior, Knowledge and Information Systems 47 (2016), 189–214.

7. Aggarwal C.C. and Yu P.S., A Survey of Uncertain Data Algorithms and Applications, IEEE Trans. Knowl. Data Eng. 21(5) (2009), 609–623.

8. Guo, Xie and Wu, Pattern matching with wildcards and gap-length constraints based on a centrality-degree graph, Appl Intell 39 (2013), 57–74.

9. Xie and Zhu, Document-Specific Keyphrase Extraction Using Sequential Patterns with Wildcards, in: 2014 IEEE International Conference on Data Mining (ICDM), Shenzhen, China, 2014, pp. 1055–1060.

10. Wang, Ren et al., Mining sequential patterns with periodic wildcard gaps, Applied Intelligence 41(1) (2014), 99–116.

11. Jiang and Wu, Strict approximate pattern matching with general gaps, Appl Intell 42 (2015), 566–580.

12. Aggarwal C.C., Wang and Wang, Frequent Pattern Mining with Uncertain Data, KDD’09, Paris, France, 2009, 29–38.

13. Leung C.K.-S., MacKinnon R.K. and Tanbeer S.K., Fast Algorithms for Frequent Itemset Mining from Uncertain Data, in: 2014 IEEE International Conference on Data Mining (ICDM), Shenzhen, China, 2014, pp. 893–898.

14. Xia, Wang, Nadungodage C.H. and Prabhakar, Sequential pattern mining in databases with temporal uncertainty, Knowledge and Information Systems 51 (2017), 821–850.

15. Tong, Chen and Ding, Discovering Threshold-based Frequent Closed Itemsets over Probabilistic Data, in: IEEE 28th International Conference on Data Engineering, 2012, pp. 270–281.

16. Wang and Chen, Finding Probabilistic Prevalent Colocations in Spatially Uncertain Data Sets, IEEE Transactions on Knowledge & Data Engineering 25(4) (2013), 790–804.

17. Lee and Yun, A new efficient approach for mining uncertain frequent patterns using minimum data structure without false positives, Future Generation Computer Systems 68 (2017), 89–110.

18. Zhao, Yan and Ng, Mining Probabilistically Frequent Sequential Patterns in Large Uncertain Databases, IEEE Transactions on Knowledge and Data Engineering 26(5) (2014), 1171–1184.

19. Bailey, Kulik and Pei, Efficient Matching of Substrings in Uncertain Sequences, in: Proceedings of the 14th SIAM International Conference on Data Mining (SDM’14), pp. 767–775.

20. and Li, Approximate Substring Matching over Uncertain Strings, Proceedings of the VLDB Endowment 4(11) (2011), 772–782.

21. Dalvi and Suciu, Efficient query evaluation on probabilistic databases, The VLDB Journal 16(4) (2007), 523–544.

22. Bryne, Valen, Tang, Marstrand, Winther, Da Piedade, Krogh, Lenhard and Sandelin, Jaspar, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update, Nucleic Acids Research, 2008.
