Suffix array for multi-pattern matching with variable length wildcards

Abstract

Approximate multi-pattern matching is an important and widely used problem, especially when the patterns contain variable-length wildcards. In this paper, two suffix array-based algorithms are proposed to solve this problem. The suffix array is an efficient data structure for exact string matching in existing studies, as well as for approximate pattern matching and multi-pattern matching. The first algorithm, MMSA-S, handles patterns with short exact character groups by dynamic programming, while the second, MMSA-L, handles patterns with long exact character groups by the edit-distance method. Experimental results on the Pizza & Chili corpus demonstrate that, in most cases, the two proposed algorithms are more time-efficient than the state-of-the-art comparison algorithms.

Keywords

Pattern matching, suffix array, wildcards

1. Introduction

Research on pattern matching is an essential topic in many fields such as big data, knowledge graphs, text mining, bioinformatics, and information retrieval. Compared with exact pattern matching, approximate pattern matching has a wider range of applications. Examples include finding patterns in a given text to detect text similarity in plagiarism detection [1] and text indexing [2], as well as finding subsequences in a given DNA sequence to detect mutations in bioinformatics [3]. Approximate pattern matching refers to finding a pattern in a given text while allowing a finite or infinite number of errors in the pattern [4]. Such an error is also known as a gap [5], a don't-care character [6], or a wildcard [7]. In this paper, the errors are called wildcards, and the length of the wildcards is variable but bounded.

There are three cases for pattern matching with wildcards. (1) Fixed-length wildcards, e.g., a***b indicates that exactly three wildcards lie between the characters ‘a’ and ‘b’. (2) Variable-length wildcards, e.g., a*[2, 3]b indicates that two or three wildcards lie between ‘a’ and ‘b’. (3) Wildcards of arbitrary length, where the number of wildcards can be infinite, e.g., a*[2, $\infty$]b indicates that the minimum number of wildcards is two while the maximum is unbounded. In practice, the infinite upper bound can be replaced by the length of the given text.

Some exact multi-pattern matching methods have been proposed based on the classic exact single-pattern matching algorithms [8, 9, 10, 11]. However, these methods are simple extensions of single-pattern matching. For example, the matching is repeated $k$ times, where $k$ is the number of patterns in the multi-pattern set. In addition, these methods are not suitable for approximate multi-pattern matching with wildcards. This paper studies the problem of approximate multi-pattern matching, which can be applied to full-text retrieval or full-text indexing [12], DNA sequence alignment in bioinformatics [13], network security detection [14], and so on. For instance, in order to carry out relevant professional research, it may be necessary to compare the translations of different books in certain subject areas, or to find the occurrences of a keyword or a synonym with the same meaning in a text. Manual completion is time-consuming, labor-intensive, and error-prone. Therefore, the research on approximate multi-pattern matching with wildcards is of great significance.

In this paper, two efficient algorithms, MMSA-S and MMSA-L, are proposed based on the suffix array data structure [15]. As long as the given sequence remains unchanged, the suffix array can be reused. The difficulty lies in how to achieve approximate pattern matching with wildcards on the suffix array. According to the length of the exact characters in the pattern, the two algorithms use the dynamic programming method and the edit-distance method, respectively. When the length is short, the MMSA-S algorithm searches for possible suffixes by the dynamic programming method; when the length is long, the MMSA-L algorithm computes the possible position intervals by the edit-distance method. Combined with the properties of the suffix array, a large number of impossible positions are excluded, which greatly improves the efficiency of both algorithms. Based on the experimental results on DNA and protein sequences, the MMSA-S and MMSA-L algorithms outperform the comparison algorithms in most cases.

The contribution of this work consists of two aspects:

• To design and implement two efficient MMSA algorithms for multi-pattern matching with variable-length wildcards. These algorithms are evaluated experimentally on the Pizza & Chili corpus and are more efficient than the comparison algorithms.
• To extend the application of the suffix array to approximate multi-pattern matching. The work takes advantage of the fact that the suffix array can be reused when the object sequence remains unchanged, and augments the suffix array construction algorithm to produce the longest common prefix (LCP) array and the rank array, which improves the search efficiency.

The structure of this paper is as follows. In Section 2, the related work on multi-pattern matching and suffix arrays is described. In Section 3, the problem of multi-pattern matching with variable-length wildcards is defined, and some preliminaries of suffix arrays are introduced. In Section 4, our algorithms for multi-pattern matching with variable-length wildcards based on the suffix array are illustrated with pseudo-code and figures. Section 5 discusses the experimental and analytical results of the two proposed algorithms compared to other algorithms. Section 6 summarizes the challenges and future research issues.

2. Related work

This section introduces the related work of pattern matching with wildcards, approximate multi-pattern matching, and the applications of suffix arrays and suffix trees in multi-pattern matching or approximate pattern matching.

The problem of pattern matching with wildcards has been studied by many researchers. Fischer et al. [7] first considered the problem of single-pattern matching with wildcards or gaps in 1974. However, in their research, the range of wildcards was fixed. In later studies, many methods were applied to single-pattern matching with wildcards, gaps, or don't-care characters, such as dynamic programming [16], the Fast Fourier Transform (FFT) [17], and bit parallelism [18]. In 2002, a pattern matching algorithm with constant wildcards was put forward by Cole et al. [19], with a time complexity of O(n log n). In 2007, Zhu et al. [20] proposed an algorithm for pattern matching with flexible wildcards. In 2011, Haapasalo et al. [21] proposed an algorithm based on the Aho-Corasick automaton to solve pattern matching with arbitrary-length wildcards. In 2012, Bille et al. [22] studied an algorithm for finding certain substrings in a given text, in which the ending position of the substrings was fixed. In these studies, the length of wildcards varied from fixed to variable, or even infinite. Furthermore, these studies focused on the problem of single-pattern matching with wildcards.

However, there are few studies on the problem of multi-pattern matching with wildcards, especially when the length of wildcards is variable. Based on the classic exact single-pattern matching algorithms, some exact multi-pattern matching methods have been proposed, such as MultiBDM [23] and Dawg-Match [24]. However, compared with the exact multi-pattern matching, approximate multi-pattern matching has been less studied, and its research problem is more complicated [25]. In 1996, Muth et al. [26] proposed the MultiHash algorithm for multi-pattern matching with one error. In 1997, Baeza-Yates et al. [27] designed an extension of PEX algorithm, known as MultiPEX algorithm, which matched the exact pieces splitting the patterns. These algorithms were not specific to the problem of multi-pattern matching with wildcards. In 2007, the TARA algorithm [28] was proposed by bit parallelism for the fixed-length wildcard in multi-pattern matching. Nonetheless, the method of bit parallelism is efficient only when the length of the pattern is small. In 2011, three sub-algorithms based on the method of Fast Fourier Transforms (FFT) were proposed by Zhang et al. [29] for multi-pattern matching with fixed wildcards. These algorithms are applicable even if the number of patterns is small. In 2018, Biswas et al. [30] provided a linear spatial index of the ranked document retrieval for multiple-pattern queries through theoretical analysis.

The suffix array was developed on the basis of the trie and suffix tree data structures. The name trie is derived from the word ‘retrieval’, and the structure was proposed by Edward Fredkin [31] in 1960. It is a variant of the Hash tree and stores all sub-strings in an ordered tree structure. Its retrieval efficiency is higher than that of the Hash tree. In 1973, Weiner [32] first proposed a new data structure, the suffix tree, based on the trie structure. It stores all the suffixes of a given sequence in a tree structure, and each path from the root node to a leaf node is a suffix. As an efficient and popular data structure in exact pattern matching, the suffix tree was then used for string retrieval, string matching, natural language processing, etc. However, few existing studies focus on pattern matching with wildcards based on the suffix tree. In 2005, the inexact-suffix tree was introduced by Chattaraj et al. [33] to detect extensible patterns. However, the time complexity of this algorithm was very large. In 2009, Ukkonen [34] introduced an approach based on a suffix tree algorithm to find the equivalent representation in gapped or non-gapped motifs. In 2014, Bille [35] put forward two methods to deal with the problem of pattern matching with wildcards through the suffix tree. They constructed a suffix tree to store all suffixes, including all possible positions of wildcards. However, these methods were described only theoretically, and their time complexity increased with the number of wildcards in the pattern. Therefore, the methods mentioned above are difficult to implement.

Similar to suffix tree, the suffix array has more achievements in the single-pattern matching and exact pattern matching than that in the multi-pattern matching and approximate pattern matching. In 2006, Rahman et al. [36] presented an algorithm to find patterns with variable length gaps through the suffix array method. However, the algorithm was just a theoretical description, and the pattern had fixed length gaps. In 2014, Shrestha et al. [37] not only provided a detailed description of SA-IS algorithm [38], but also described techniques that suffix arrays adapt to inexact matching in biological application. In this work, it directly supports the exact ‘subset’ matching by adopting the lexical ordering of suffixes in the suffix array. Some subset matchings allow certain characters to be ignored in predefined position-specific way by the edit-distance method. However, this work can only be applied to genomic sequence with short inexact length. In 2016, Thankachan et al. [39] proposed a suffix model to cope with the approximate sequence matching problem involving a selected bounded set of perturbed suffixes. However, this method was limited to theoretical analysis and lacked practical demonstration. In 2017, Hon et al. [40] studied two dictionary matching problems with one gap and one missing substring based on the index data structures, such as suffix trees and suffix arrays. They had theoretically described the solution in terms of space efficiency and succinct space. In 2018, two algorithms for multi-pattern matching with variable-length wildcards based on the suffix tree were presented in our previous work [41].

Inheriting the advantages of suffix index data structures, two new algorithms based on the suffix array are proposed, which are more efficient and more widely applicable than the method based on the suffix tree, especially because all suffixes are sorted in lexicographic order.

3. Preliminaries

In this section, some necessary background information will be presented for the understanding of the new algorithm MMSA (Multi-pattern Matching based on Suffix Array), which involves techniques related to the pattern matching with wildcards, the multi-pattern matching, and the suffix array.

3.1 Pattern matching with wildcards

The basic problem of pattern matching with wildcards is to find all possible positions in the object sequence that match the exact characters of the pattern and satisfy the range constraints of the wildcards. Let the pattern be $P=p_{0}*[l_{0},h_{0}]p_{1}\ldots*[l_{i-1},h_{i-1}]p_{i}\ldots*[l_{m-1},h_{m-1}]p_{m}$ and the object sequence be $S=s_{0}s_{1}\ldots s_{j}\ldots s_{n-1}$, where $p_{i},s_{j}\in\Sigma$ ($0\leqslant i\leqslant m$, $0\leqslant j\leqslant n-1$), $*[l_{i-1},h_{i-1}]$ denotes the range of wildcards, and $l_{i-1}$ and $h_{i-1}$ are the minimum and maximum numbers of wildcards between $p_{i-1}$ and $p_{i}$. These two numbers must be integers. According to the relationship between $l_{i-1}$ and $h_{i-1}$, the following problems are distinguished in the research of pattern matching with wildcards.

Pattern matching with fixed-length wildcards. When 0 $\leqslant l_{i-1}=h_{i-1}$ , and $h_{i-1}\neq\infty$ , the range of wildcards is a constant and fixed length.

Pattern matching with periodic-length wildcards. When 0 $\leqslant l_{i-1}\leqslant h_{i-1}$ , and $l_{0}=l_{1}\ldots=l_{i-1}\ldots=l_{m-1}$ , and $h_{0}=h_{1}\ldots=h_{i-1}\ldots=h_{m-1}$ , it indicates that the range of wildcards is a periodic length.

Pattern matching with arbitrary-length wildcards. When 0 $\leqslant l_{i-1}\leqslant h_{i-1}$ , and $h_{i-1}=\infty$ , the range of wildcards is infinite, and sometimes it can be converted to the length of the object sequence.

Pattern matching with variable-length wildcards. When 0 $\leqslant l_{i-1}\leqslant h_{i-1}$ , it does not belong to any of the above special cases, indicating that the range of wildcards is in variable length, which is the most common case and the above special problem can be converted to this problem.

Definition 1. The gap constraint size of the $i$-th wildcard is $G_{i}=h_{i}-l_{i}+1$ ($0\leqslant i\leqslant m-1$). The minimum length of the pattern is $P_{\textit{min}}=m+1+\sum_{i=0}^{m-1}l_{i}$, and the maximum length of the pattern is $P_{\textit{max}}=m+1+\sum_{i=0}^{m-1}h_{i}$.
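As a quick check of Definition 1, the following sketch computes $G_{i}$, $P_{\textit{min}}$ and $P_{\textit{max}}$ for a pattern given as its exact characters plus one wildcard range between each pair of neighbours. The function name `pattern_lengths` and the list-based representation are illustrative assumptions and are not part of the original method; adjacent exact characters inside one character group are assumed to be joined by the range [0, 0].

```python
from typing import List, Tuple

def pattern_lengths(chars: str, gaps: List[Tuple[int, int]]):
    """Return (G, P_min, P_max) following Definition 1.

    chars holds the exact characters p_0..p_m; gaps[i] = (l_i, h_i) is the
    wildcard range between p_i and p_{i+1}, so len(gaps) == len(chars) - 1.
    """
    if len(gaps) != len(chars) - 1:
        raise ValueError("need exactly one wildcard range between neighbours")
    G = [h - l + 1 for l, h in gaps]                 # gap constraint sizes
    p_min = len(chars) + sum(l for l, _ in gaps)     # m + 1 + sum of l_i
    p_max = len(chars) + sum(h for _, h in gaps)     # m + 1 + sum of h_i
    return G, p_min, p_max

# a*[2,3]b from the introduction: two exact characters, one wildcard range
print(pattern_lengths("ab", [(2, 3)]))
```

For a pattern with character groups, such as t*[1,2]cc*[0,1]a, the two characters of the group "cc" are written as neighbours with the range (0, 0), which leaves both lengths unchanged.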

Definition 2. Given a pattern $P=p_{0}*[l_{0},h_{0}]p_{1}\ldots*[l_{i-1},h_{i-1}]p_{i}\ldots*[l_{m-1},h_{m-1}]p_{m}$, pattern matching with wildcards refers to finding all occurrences of $P$ in the object sequence $S=s_{0}s_{1}\ldots s_{j}\ldots s_{n-1}$, where an occurrence $T=(t_{0},t_{1}\ldots t_{k}\ldots t_{m})$ must satisfy the following conditions:

$\displaystyle\left\{\begin{array}{l}s_{t_{k}}=p_{k}\quad(0\leqslant k\leqslant m)\\ l_{k}\leqslant t_{k+1}-t_{k}-1\leqslant h_{k}\quad(0\leqslant k\leqslant m-1)\\ P_{\textit{min}}\leqslant t_{m}-t_{0}+1\leqslant P_{\textit{max}}\leqslant n\end{array}\right.$

where $0\leqslant t_{0}\leqslant t_{1}\leqslant\ldots\leqslant t_{k}\leqslant\ldots\leqslant t_{m}\leqslant n-1$.

Example 1. Suppose a pattern with wildcards $P$ and an object DNA sequence $S$ are given, as shown in Fig. 1.

Figure 1. An example of pattern matching with wildcards.

For the first character group CA, the starting positions are 3, 15, and 18; for the second character group GA, the starting positions are 0, 7, and 13; the minimum number of wildcards is 1 while the maximum is 3. Hence $P_{\textit{min}}=3+1+1=5$, $P_{\textit{max}}=3+1+3=7$, and $G_{0}=3-1+1=3$. According to Definition 2, the position of $p_{1}$ must be larger than that of $p_{0}$. When $p_{0}$ is at position 15 or 18 and $p_{1}$ is at position 0, this condition is not satisfied. When $p_{0}$ is at position 18, $18+P_{\textit{min}}=18+5>n$, so the position is eliminated. When $p_{0}$ is at position 3 and $p_{1}$ is at position 13, the gap is 8, which exceeds the wildcard range, so this pair is eliminated as well. When $p_{0}$ is at position 3 and $p_{1}$ is at position 7, the gap is 2, which satisfies the range. Therefore, the only matching occurrence of the pattern $P$ in $S$ is (3, 7).
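The elimination steps of Example 1 can be reproduced with a brute-force matcher that follows Definition 2 directly. This is only an illustrative sketch: the function `match_groups` is hypothetical, and, since Fig. 1 is not reproduced here, the sequence is assumed to be the one used in Example 3 and the pattern is assumed to be ca*[1,3]ga, both of which agree with the positions listed above. Occurrences are reported as the 0-indexed starting positions of the character groups.

```python
def match_groups(text, groups, gaps):
    """Enumerate occurrences as tuples of group start positions (0-indexed).

    groups are the exact character groups; gaps[k] = (l_k, h_k) is the wildcard
    range between group k and group k+1.
    """
    results = []

    def extend(k, start, acc):
        # group k must occur exactly at position start
        if text[start:start + len(groups[k])] != groups[k]:
            return
        acc = acc + (start,)
        if k == len(groups) - 1:
            results.append(acc)
            return
        end = start + len(groups[k])          # first position after group k
        lo, hi = gaps[k]
        # try every gap length allowed by the wildcard range
        for nxt in range(end + lo, min(end + hi, len(text) - 1) + 1):
            extend(k + 1, nxt, acc)

    for start in range(len(text)):
        extend(0, start, ())
    return results

S = "gatcaatgaggtggacacca"
print(match_groups(S, ["ca", "ga"], [(1, 3)]))
```

The matcher inspects every starting position, which is exactly the cost that the suffix array based algorithms in Section 4 aim to avoid.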

3.2 Multi-pattern matching with wildcards

Multi-pattern matching with wildcards is more complicated than single-pattern matching with wildcards. The following relationships between patterns and matching methods are required to be considered.

• To satisfy all patterns;
• To satisfy any pattern;
• To satisfy pattern A but not pattern B;
• To match one pattern at a time;
• To match multiple patterns simultaneously.

This work only considers the problem of matching the patterns with variable-length wildcards one at a time, and the occurrences of every pattern that can be matched are reported.

Definition 3. Let $\Phi=\{P^{1},P^{2},\ldots P^{k}\}$ be the pattern set, where $k$ is the number of patterns. Each pattern $P^{k}=p_{0}*[l_{0},h_{0}]p_{1}\ldots*[l_{i-1},h_{i-1}]p_{i}\ldots*[l_{m-1},h_{m-1}]p_{m}$ has variable-length wildcards. The object sequence is $S=s_{0}s_{1}\ldots s_{j}\ldots s_{n-1}$. The problem of multi-pattern matching with variable-length wildcards is to find all occurrences of each pattern with variable-length wildcards $P^{k}$ in the object sequence $S$. All patterns are stored in the pattern set $\Phi$ and are matched against the same object sequence $S$.

Example 2. Suppose a pattern set $\Phi=\{P^{1},P^{2},P^{3}\}$, in which each pattern $P^{k}$ has variable-length wildcards, and an object DNA sequence $S$ are given, as shown in Fig. 2.

Figure 2. An example of multi-pattern matching with wildcards.

According to Example 1, the satisfying matching occurrence of pattern $P^{1}$ in $S$ is (3, 7). Similarly, the satisfying occurrence of pattern $P^{2}$ in $S$ is (5, 9); the satisfying occurrences of the pattern $P^{3}$ in $S$ are (1, 4), (5, 8), (16, 19). Therefore, the number of occurrences of the pattern set $\Phi$ in object sequence $S$ is 5, and the satisfying occurrences are {{(3, 7)}, {(5, 9)}, {(1, 4), (5, 8), (16, 19)}}.

3.3 Suffix tree

Let $S=s_{0}s_{1}\ldots s_{i}\ldots s_{n-1}$ be an object sequence of $n$ characters from an alphabet set $\Sigma$. According to Ukkonen [42], the end marker ‘$’ is appended after the last character of $S$. The suffix tree of $S$ is a tree structure in which each path from the root to a leaf represents a suffix of $S$. The suffix $S[i]=s_{i}s_{i+1}\ldots s_{n-1}{\$}$ starts at the $i$-th position of $S$ and corresponds to the path from the root to the $i$-th leaf node in the suffix tree (S$).

Lemma 1. The suffix tree of the object sequence $S$, which consists of $n$ characters and ‘$’, has $n+1$ leaves. Each edge of this suffix tree is labelled with a non-empty substring of $S$, and every internal node, except for the root node, has at least two child nodes. Each leaf node $i$ of the suffix tree is labelled with the starting position of its suffix in $S$.

Lemma 2. The edges to the child nodes of the root node must begin with different characters, each of which is an element of the alphabet set $\Sigma$ or the terminal symbol ‘$’. The number of branches from the root node equals $|\Sigma|+1$, where the additional branch belongs to ‘$’. If a character does not occur in the sequence $S$, it appears neither in the alphabet set $\Sigma$ nor in the suffix tree (S$).

Lemma 3. Every substring $T_{1}$ of the object sequence $S$ is a prefix of the path label $T_{2}$ of some node in the suffix tree (S$). The number of occurrences of $T_{1}$ in $S$ equals the number of leaf nodes in the subtree below $T_{1}$. If $T_{3}$ is the path label of the deepest non-leaf node, measured in characters from the root node, in the suffix tree (S$), then $T_{3}$ is the longest repeated substring of $S$.

3.4 Suffix array

According to Manber et al. [15], the suffix array is an array of suffixes in lexicographic order. Let $S=s_{0}s_{1}\ldots s_{i}\ldots s_{n-1}$ be an object sequence of $n$ characters with the alphabet set $\Sigma$. The terminal symbol ‘$’ again indicates the end of the sequence. In essence, the suffix array SA represents all the root-to-leaf paths of the suffix tree in an array structure rather than a tree structure. Because the suffix array stores all suffixes of $S$ in lexicographic order, suffixes with the same prefix are stored in a contiguous region of SA.

Definition 4. Given an object sequence $S=s_{0}s_{1}\ldots s_{i}\ldots s_{n-1}$ and the suffix $\textit{Suf}(S_{i})=s_{i}s_{i+1}\ldots s_{n-1}{\$}$, the array $R[i]$ stores the rank of $\textit{Suf}(S_{i})$ in the lexicographic order of all suffixes. Therefore, $R$ is an array of integers ranging from $0$ to $n$, where rank $0$ belongs to the suffix ‘$’. The suffix array SA[j] stores the starting position of the suffix with rank $j$, so the starting positions are listed in the lexicographic order of their suffixes. The array LCP[j] stores the length of the Longest Common Prefix (LCP) between the suffixes starting at SA[j-1] and SA[j].

Lemma 4. Let the symbol ‘$\prec$’ denote the lexicographic order. In the suffix array, $\textit{Suf}(S_{\textit{SA}[j]})\prec\textit{Suf}(S_{\textit{SA}[j+1]})$ for $0\leqslant j\leqslant n-1$. In addition, when the suffix $\textit{Suf}(S_{i})$ has rank $j$ in the lexicographic order, $R[i]=j$ and SA[j] $=i$. Therefore, the rank array and the suffix array are inverses of each other, i.e., R[SA[j]] $=j$.

Example 3. As shown in Fig. 3, the object sequence is $S=$ gatcaatgaggtggacacca$, with the suffix tree for this sequence on the left and the suffix array on the right. From the suffix tree, the size of the alphabet set $|\Sigma|$ is four, and the terminal symbol is ‘$’. The prefix ‘ac’ has two leaf nodes, illustrating that this prefix occurs twice in $S$. The largest value in the LCP array is 2, attained by the common prefixes ‘ac’, ‘at’, ‘ca’, ‘ga’, ‘gg’, and ‘tg’.

Figure 3. An example of the suffix tree and suffix array for a sequence.

Compared with the suffix tree, the suffix array stores all suffixes in lexicographic order. For example, the suffix ‘a$’ starts at position 19 and has rank 1, so SA[1] $=$ 19 and R[19] $=$ 1. LCP[1] $=$ 0 indicates that the suffix ‘a$’ has no common prefix with the previous suffix ‘$’. A ‘0’ in the LCP array indicates that the suffixes beginning with a new character of the alphabet set start at that rank.

In addition, LCP[4] $=$ 2 indicates that the suffix ‘acca$’ shares a common prefix of length two with the preceding suffix ‘acacca$’. Correspondingly, the common prefix ‘ac’ has two leaf nodes in the suffix tree, indicating that it occurs twice in $S$. Based on these properties, the suffix array plays an important role in multi-pattern matching with variable-length wildcards.
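The values quoted in Example 3 can be checked with a small sketch that builds the three arrays for the same sequence. For brevity it sorts the suffixes naively instead of using SA-IS, so it is a stand-in for the construction in Section 4.1 rather than the authors' implementation; the function name `build_arrays` is hypothetical. The LCP loop follows the usual text-order scan that reuses the previous match length.

```python
def build_arrays(text):
    """Naive suffix array (sorting suffixes), rank array and LCP array."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # quadratic-ish, fine for a demo
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r                               # R[SA[i]] = i
    lcp = [0] * n
    k = 0
    for i in range(n):                              # visit suffixes in text order
        if rank[i] == 0:                            # first suffix has no predecessor
            k = 0
            continue
        j = sa[rank[i] - 1]                         # suffix preceding Suf(i) in SA
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1
        lcp[rank[i]] = k
        if k > 0:
            k -= 1                                  # next suffix shares at least k-1
    return sa, rank, lcp

S = "gatcaatgaggtggacacca$"
SA, R, LCP = build_arrays(S)
print(SA[1], R[19], LCP[1], LCP[4])
```

The sketch relies on ‘$’ sorting before the letters a, c, g, t, which holds for their ASCII codes.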

4. Algorithm design

This section introduces the idea of the algorithm MMSA for multi-pattern matching with variable-length wildcards based on suffix array. First, this study gives the construction and pre-processing steps of the suffix array. The suffix array is constructed according to the algorithm SA-IS of Nong’s method [38], in which the rank array and the longest common prefix array are added. Subsequently, the two algorithms, MMSA-S and MMSA-L, are designed according to the length of the pattern. Finally, the complexities of the two algorithms are analyzed.

4.1 Suffix array construction

Unlike the suffix tree, the suffix array is not simply built by traversing the suffix tree from the root to the leaves in order. Several linear-time construction algorithms have been presented, and their time and space complexities vary with the sorting method. In this paper, the MMSA algorithm is based on the suffix array construction algorithm SA-IS by Nong et al. [38], which applies induced sorting. Each character $S[i]$ is classified as S-type or L-type, depending on whether the suffix starting at position $i$ is lexicographically smaller or larger than the suffix starting at position $i+1$. By scanning the type array, all the leftmost S-type (LMS) substrings are found and then sorted. The time complexity of the SA-IS algorithm is ${\rm O}(n)$ and the space complexity is ${\rm O}(n\log n)$.
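The type classification can be illustrated on the sequence of Example 3. The sketch below covers only the classification and the detection of LMS positions, not the induced sorting itself; the function names `classify_types` and `lms_positions` are hypothetical. The scan runs from right to left because the type of $S[i]$ depends on the type of $S[i+1]$ when the two characters are equal.

```python
def classify_types(text):
    """Return the S/L type string of text, which must end with the sentinel '$'."""
    n = len(text)
    types = ["S"] * n                     # the sentinel itself is S-type
    for i in range(n - 2, -1, -1):        # right to left: T[i] depends on T[i+1]
        if text[i] > text[i + 1]:
            types[i] = "L"
        elif text[i] == text[i + 1]:
            types[i] = types[i + 1]       # equal characters inherit the type
        # text[i] < text[i + 1] keeps the default "S"
    return "".join(types)

def lms_positions(types):
    """An LMS position is an S-type character whose left neighbour is L-type."""
    return [i for i in range(1, len(types))
            if types[i] == "S" and types[i - 1] == "L"]

T = classify_types("gatcaatgaggtggacacca$")
print(T)
print(lms_positions(T))
```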

According to the requirement of this work, the rank array and the longest common prefix (LCP) array are added on the basis of the SA-IS algorithm. The improved algorithm is given in the form of Algorithm 1.

Algorithm 1: Suffix Array Construction SA-IS(S, SA)
Input: S is the object sequence
Output: the suffix array SA, the rank array R, and the LCP array of S
Add(‘$’, S[n]);    // append the end marker, so that S has n+1 characters
T[n] = S;    // ‘$’ is defined as S-type; T is an array to save the types
for (i = n-1; i >= 0; i--) do    // classify S[i] as S-type or L-type into T
    if S[i] < S[i+1] or (S[i] = S[i+1] and T[i+1] = S)
        T[i] = S;
    else    // S[i] > S[i+1] or (S[i] = S[i+1] and T[i+1] = L)
        T[i] = L;
end for
Scan T to find all the leftmost S-type (LMS) substrings in S into P;    // P is an array to save all the LMS-substrings
Induced sort all the LMS-substrings using P and B;    // B is an array used as buckets
Get a new string S1 by naming each LMS-substring in S by its bucket index;
if each character in S1 is unique
    then compute SA1 from S1;
    else SA1 = SA-IS(S1, SA1);    // recursively compute SA1
Induce SA from SA1;
for (i = 0; i <= n; i++)
    R[SA[i]] = i;    // R is an array to save the rank of each suffix
end for
k = 0;
for (i = 0; i <= n; i++) do    // compute the LCP array in text order
    if R[i] = 0 then k = 0; continue;    // the first suffix has no predecessor
    j = SA[R[i] - 1];
    while (S[i+k] == S[j+k]) k++;
    LCP[R[i]] = k;
    if (k > 0) k--;
end for
return SA, R, LCP;

When the object sequence $S$ remains unchanged, the suffix array can be reused for different patterns. It is appropriate for multi-pattern matching with variable-length wildcards.
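The reuse can be sketched as follows: the suffix array is built once, and each exact character group is then located by binary search, because all suffixes that share a prefix occupy one contiguous interval of SA. The helper `sa_interval` is a hypothetical name, and the naive sort again stands in for SA-IS.

```python
def sa_interval(text, sa, group):
    """Return [lo, hi) such that sa[lo:hi] are the suffixes starting with group."""
    def first(pred):
        # smallest rank whose truncated suffix satisfies the monotone predicate
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(text[sa[mid]:sa[mid] + len(group)]):
                hi = mid
            else:
                lo = mid + 1
        return lo
    return first(lambda pre: pre >= group), first(lambda pre: pre > group)

S = "gatcaatgaggtggacacca$"
SA = sorted(range(len(S)), key=lambda i: S[i:])    # built once, reused below
for group in ("ca", "g", "ccat"):
    lo, hi = sa_interval(S, SA, group)
    print(group, (lo, hi), sorted(SA[lo:hi]))
```

An empty interval, as for the group "ccat", shows immediately that no pattern containing this group can match.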

4.2 Pre-processing

In order to exploit the properties of the suffix array, two sub-algorithms are proposed according to the length of the exact characters in the pattern. The pre-processing steps differ between the two algorithms.

For the MMSA-S algorithm, the length of exact characters in a pattern is short, e.g. a*[0, 3]g, t*[1, 2]cc*[0, 1]a. In the pre-processing phase, all patterns are sorted alphabetically according to the first exact character. In addition, the maximum and minimum lengths of each pattern are computed. According to Definitions 1 and 2, this pre-processing improves the efficiency of multi-pattern matching in the MMSA-S algorithm.

For the MMSA-L algorithm, the length of exact characters in a pattern is long, e.g. aggt*[1, 2]ccat, gagat*[1, 3]ttag*[0, 2]cccg. In the pre-processing step, all exact characters are divided into groups by the wildcards, and the groups are then sorted alphabetically. If these groups have a common prefix, the MMSA-L algorithm is simplified. During matching, the algorithm starts with the last character group of each pattern. There is no match if the last character group does not occur in the object sequence. By combining the maximum and minimum lengths of each pattern, positions that do not satisfy the matching requirements are excluded according to Definition 2. This pre-processing technique avoids unnecessary steps and improves the efficiency of the MMSA-L algorithm.
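A minimal sketch of this pre-processing is given below, under the assumption that patterns are written as strings in the notation used above, e.g. "t*[1, 2]cc*[0, 1] a". The names `parse_pattern` and `preprocess` are hypothetical. Each pattern is split at its wildcards into exact character groups and wildcard ranges, and the pattern set is ordered alphabetically by the first group.

```python
import re

# one wildcard of the form *[l,h]; the two groups capture l and h
WILDCARD = re.compile(r"\*\[(\d+),(\d+)\]")

def parse_pattern(spec):
    """Split 'aggt*[1,2]ccat' into exact character groups and wildcard ranges."""
    parts = WILDCARD.split(re.sub(r"\s+", "", spec))
    # re.split keeps the captured bounds: [group, l, h, group, l, h, ..., group]
    groups = parts[0::3]
    gaps = [(int(l), int(h)) for l, h in zip(parts[1::3], parts[2::3])]
    return groups, gaps

def preprocess(specs):
    """Parse every pattern and order the set alphabetically by its first group."""
    return sorted((parse_pattern(s) for s in specs), key=lambda p: p[0][0])

patterns = preprocess(["t*[1, 2]cc*[0, 1] a",
                       "gagat*[1,3] ttag*[0,2] cccg",
                       "a*[0, 3] g"])
for groups, gaps in patterns:
    print(groups, gaps)
```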

4.3 The MMSA-S algorithm

The MMSA-S algorithm is suitable for patterns with short exact characters. Suppose the pattern set is $\Phi=\{P^{1},P^{2},\ldots P^{k}\}$, and each pattern is $P^{k}=p_{0}*[l_{0},h_{0}]p_{1}\ldots*[l_{i-1},h_{i-1}]p_{i}\ldots*[l_{m-1},h_{m-1}]p_{m}$. First, in the pre-processing phase, patterns are sorted alphabetically by the first character $p_{0}$, which reduces the matching time since some patterns share the same first character $p_{0}$. Secondly, according to the maximum and minimum lengths of each pattern, unsatisfying positions are excluded from all the $p_{0}$ positions in the suffix array, so that the matching range is further narrowed. The remaining possible positions of $p_{0}$ are examined by the dynamic programming method. The next exact character $p_{1}$ is searched to judge whether its position satisfies the wildcard range $[l_{0},h_{0}]$. If the position of $p_{1}$ does not meet the range, the position of $p_{0}$ is abandoned and the next position of $p_{0}$ is examined. If the position of $p_{1}$ satisfies the range, the next exact character $p_{2}$ is searched. This process is repeated until the last exact character $p_{m}$ is found and all positions meet their wildcard ranges. The pseudo-code of MMSA-S is shown in Algorithm 2.

Algorithm 2: MMSA-S
Input: SA, R, LCP, pattern set $\Phi$
Output: match numbers and positions
initialize the arrays and parameters;
sort the pattern set;
for (i = 0; i < k; i++)    // k is the size of the pattern set
    for (j = 0; j < P[i].length; j++)
        if (p[j] $\in\Sigma$) then    // p[j] is an element of the alphabet
            search p[j] in SA;
            if ((p[j].position + Pmin) <= n)    // exclude unsatisfying positions
                save temp = SA[p[j].position];
        else    // p[j] is a wildcard
            save min[j] = lj, max[j] = hj;
            save next = getchar(p[j+1]);
            for (t = 0; t < temp.length; t++)
                search the next character in temp;
                if (next.position matches the gap between min and max) then
                    save the Position;
                    temp = Position;
                else continue
            end for
    end for
    save MatchPosition[i] and MatchNumber[i];
end for

Figure 4. An example of the MMSA-S algorithm.

In Fig. 4, an example of suffix array construction is given for the object sequence $S=$ gatcaatgaggtggacacca$, and two patterns $P^{1}=$ a*[1, 2]c and $P^{2}=$ ca*[1, 4]g. The first exact character $`a^{\prime}$ of pattern $P^{1}$ is alphabetically placed before the first exact character $`c^{\prime}$ of $P^{2}$ . Therefore, pattern $P^{1}$ first matches. The character $`a^{\prime}$ appears seven times in the suffix array SA. Therefore, it excludes other suffix sequences. Position 19 is also excluded because it adds the minimum length of the pattern to the length of object sequence, as shown in Definition 2. In the other six suffixes, only two positions 16 and 1 satisfy the wildcard range when the next exact character $`c^{\prime}$ is found. Therefore, the matching occurrences of pattern $P^{1}$ are (16, 18), (1, 3), and the number of occurrence is two.

Next, the pattern $P^{2}$ is matched. The exact character group ‘ca’ appears three times in the suffix array SA, which excludes all other positions. Positions 18 and 15 are excluded because their suffixes do not contain the next exact character ‘g’. Position 3 is preserved because its suffix contains two occurrences of the exact character ‘g’ that satisfy the wildcard range [1, 4]. Position 10 of ‘g’ is excluded because it falls outside this range. Therefore, the matching occurrences of pattern $P^{2}$ are (3, 7) and (3, 9), and the number of occurrences is two.
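The matching steps of this example can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the suffix array is built by naive sorting instead of SA-IS, the rank and LCP arrays are not used, and the function names (`build_suffix_array`, `find_occurrences`, `match_with_gaps`) as well as the list-based encoding of a pattern are hypothetical choices made for illustration.

```python
def build_suffix_array(s):
    """Naive construction: sort suffix start positions lexicographically (not SA-IS)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, group):
    """Start positions of `group` in s, in suffix-array order, via two binary searches."""
    m = len(group)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose m-prefix is >= group
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < group:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose m-prefix is > group
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= group:
            lo = mid + 1
        else:
            hi = mid
    return sa[first:lo]


def match_with_gaps(s, sa, groups, gaps):
    """groups[i] and groups[i + 1] are separated by gaps[i] = (low, high) wildcards.

    Returns (start of first group, start of last group) pairs.
    """
    # Shortest possible occurrence; candidates that cannot fit in s are dropped early.
    min_len = sum(map(len, groups)) + sum(low for low, _ in gaps)
    partial = [(p, p) for p in find_occurrences(s, sa, groups[0])
               if p + min_len <= len(s)]
    for prev, group, (low, high) in zip(groups, groups[1:], gaps):
        extended = []
        for first, cur in partial:
            end = cur + len(prev)       # index just past the previous group
            for gap in range(low, high + 1):
                nxt = end + gap
                if s.startswith(group, nxt):
                    extended.append((first, nxt))
        partial = list(dict.fromkeys(extended))   # drop duplicates, keep order
    return partial


s = "gatcaatgaggtggacacca$"
sa = build_suffix_array(s)
print(find_occurrences(s, sa, "ca"))                   # [18, 3, 15]
print(match_with_gaps(s, sa, ["ca", "g"], [(1, 4)]))   # [(3, 7), (3, 9)]
```

For the pattern ca*[1, 4]g, the sketch visits the candidate positions of ‘ca’ in suffix-array order and keeps only position 3, which agrees with the occurrences (3, 7) and (3, 9) derived above.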

4.4 The MMSA-L algorithm

The MMSA-L algorithm is suitable for patterns with long exact characters. Suppose the pattern set is $\Phi=\{P^{1},P^{2},\ldots P^{k}\}$ , and each pattern is $P^{k}=p_{0}*[l_{0},h_{0}]p_{1}\ldots*[l_{i-1},h_{i-1}]p_{i}\ldots*[l_{m-1},h_{m-1}]p_{m}$ . Before matching, these long exact character groups are split in the pre-processing phase. Each of the groups $p_{0},p_{1}\ldots p_{m}$ can be considered as a common prefix of suffixes. Suffixes whose longest common prefix in the LCP array is shorter than the exact character group are excluded. In addition, the position of the exact character group $p_{1}$ should be larger than that of the previous character group $p_{0}$ . After excluding the impossible positions, few possible positions are left for each exact character group. Next, the edit-distance method is used to judge whether these possible positions satisfy the wildcard range. Therefore, the longer the exact character group is, the less often it appears, and the faster the matching is. The pseudo-code of the MMSA-L algorithm is shown in Algorithm 3.

Algorithm 3: MMSA-L
Input: SA, R, LCP, Pattern set $\Phi$
Output: Match numbers and positions
1. initialize the arrays and parameters;
2. split these exact character groups and store them in the array $P$;
3. sort $P$;
4. for ($i=0$; $i\leqslant k$; i++) // $k$ is the number of patterns in the Pattern set
5.   for ($j=P[i]$.length; $j\geqslant 0$; j--) // start from the last exact character group
6.     search $p[j]$ in the suffix array;
7.     save $p[j]$.position;
8.     if (subp[j].position > subp[j-1].position) then // exclude impossible positions
9.       gap = subp[j].get(0) - subp[j-1].get(1);
10.      if (lmin $\leqslant$ gap $\leqslant$ lmax) then
11.        save the Position;
12.      else continue
13.    else continue
14.  end for
15.  save MatchPosition[i] and MatchNumber[i];
16. end for
17. return MatchPosition and MatchNumber;

Figure 5 shows an object sequence $S=$ gatcaatgaggtggacacca$, for which the suffix array is built, and two patterns $P^{1}=$ ac*[1,2]ca and $P^{2}=$ tgag*[0,3]gg. In the pre-processing phase, the exact character groups are split and saved as ‘ac’, ‘ca’, ‘tgag’, and ‘gg’. The first character group ‘ac’ of pattern $P^{1}$ sorts before ‘tgag’ of pattern $P^{2}$ , and thus pattern $P^{1}$ is matched first. The character group ‘ac’ appears twice, at positions 14 and 16. The character group ‘ca’ appears three times, at positions 18, 3, and 15. Since the position of ‘ca’ must be greater than that of ‘ac’, positions 3 and 15 of ‘ca’ are excluded. Combined with the wildcard range, only position 14 of ‘ac’ and position 18 of ‘ca’ are satisfactory. Therefore, the matching occurrence of pattern $P^{1}$ is (14, 18).

In pattern $P^{2}$ , the exact character group ‘tgag’ appears only once, at position 6, while the group ‘gg’ appears twice, at positions 12 and 9. Since the group ‘tgag’ has length 4, its end position (10) is larger than the starting position (9) of one occurrence of ‘gg’. Therefore, position 9 is excluded. Then, the edit-distance method is used to judge whether the remaining possible positions satisfy the wildcard range. If they do, the matching results are saved; if not, the next pattern is matched.

This example demonstrates that the longer the exact character group is, the fewer times it appears. A longer group therefore eliminates more impossible positions and improves the efficiency of the algorithm.
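The position filtering in this example can be sketched as follows. The sketch is an illustration under assumptions rather than the authors' code: it builds the suffix array by naive sorting, processes the groups from first to last instead of from the last group backwards as in Algorithm 3, and replaces the edit-distance check with a direct subtraction of positions. The names `split_pattern`, `locate`, and `match_long` are hypothetical.

```python
import re


def build_suffix_array(s):
    """Naive construction by sorting suffix start positions (not SA-IS)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def split_pattern(pattern):
    """Split 'ac*[1,2]ca' into groups ['ac', 'ca'] and gaps [(1, 2)]."""
    parts = re.split(r"\*\[(\d+),(\d+)\]", pattern.replace(" ", ""))
    groups = parts[0::3]                       # text between wildcards
    gaps = [(int(lo), int(hi)) for lo, hi in zip(parts[1::3], parts[2::3])]
    return groups, gaps


def locate(s, sa, group):
    """All start positions of `group`, sorted by position, via binary search on sa."""
    def bound(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = s[sa[mid]:sa[mid] + len(group)]
            if prefix < group or (strict and prefix == group):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return sorted(sa[bound(False):bound(True)])


def match_long(s, sa, pattern):
    """Return (start of first group, start of last group) pairs for one pattern."""
    groups, gaps = split_pattern(pattern)
    positions = [locate(s, sa, g) for g in groups]
    partial = [(p, p) for p in positions[0]]
    for i, (low, high) in enumerate(gaps):
        extended = []
        for first, cur in partial:
            prev_end = cur + len(groups[i])    # index just past the previous group
            for nxt in positions[i + 1]:
                if nxt < prev_end:
                    continue                   # overlaps or precedes: impossible
                if low <= nxt - prev_end <= high:
                    extended.append((first, nxt))
        partial = list(dict.fromkeys(extended))   # drop duplicates, keep order
    return partial


s = "gatcaatgaggtggacacca$"
sa = build_suffix_array(s)
print(match_long(s, sa, "ac*[1,2]ca"))     # [(14, 18)]
```

Because each group is looked up once and only its few positions are compared afterwards, the work per pattern shrinks as the groups get longer and rarer, which is the effect described above.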

Figure 5.

An example of the MMSA-L algorithm.

4.5 Complexity analysis

In this section, the complexities of MMSA-S and MMSA-L are discussed in turn. Both algorithms adopt the SA-IS method to construct the suffix array, with a time complexity of ${\rm O}(n)$ and a space complexity of ${\rm O}(\textit{nlogn})$ [38]. Here, $n$ is the length of the object sequence $S=s_{0}s_{1}\ldots s_{i}\ldots s_{n-1}$ . Although MMSA-S and MMSA-L add the rank array and the LCP array, the complexities remain of the same order of magnitude. In the pre-processing phase, the algorithms MMSA-S and MMSA-L follow different procedures. Given the pattern set $\Phi=\{P^{1},P^{2},\ldots P^{k}\}$ , and each pattern $P^{k}=p_{0}*[l_{0},h_{0}]p_{1}\ldots*[l_{i-1},h_{i-1}]p_{i}\ldots*[l_{m-1},h_{m-1}]p_{m}$ , $k$ is the number of patterns, $m+1$ denotes the length of each pattern, $|\Sigma|$ refers to the size of the alphabet set, and $w$ is the number of wildcards in each pattern.
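To make the three input arrays (SA, R, LCP) concrete, the following Python sketch computes them for the example sequence. It is only an illustration under assumptions: the suffix array is built by naive sorting rather than SA-IS, and the LCP array is computed with Kasai's algorithm, a standard linear-time method that the text above does not name. The function names are hypothetical.

```python
def suffix_array(s):
    """Naive construction by sorting suffix start positions (not SA-IS)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def rank_array(sa):
    """Inverse permutation of the suffix array: rank[pos] = index of pos in sa."""
    rank = [0] * len(sa)
    for r, pos in enumerate(sa):
        rank[pos] = r
    return rank


def lcp_array(s, sa, rank):
    """Kasai's algorithm: lcp[r] is the length of the longest common prefix
    of the suffixes starting at sa[r - 1] and sa[r]; lcp[0] is 0."""
    n = len(s)
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]          # suffix just before suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1                   # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


s = "gatcaatgaggtggacacca$"
sa = suffix_array(s)
rank = rank_array(sa)
lcp = lcp_array(s, sa, rank)
for r in range(len(sa)):
    print(r, sa[r], lcp[r], s[sa[r]:])
```

Given the suffix array, both the rank array and the LCP array are filled in a single pass over the sequence, so adding them does not change the linear order of the construction time.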

For MMSA-S, there is one ‘for’ loop to sort the $k$ patterns in the pre-processing phase. Therefore, the time complexity of this phase is ${\rm O}(k)$ , and the space complexity is ${\rm O}(\textit{km})$ . In the matching phase, there are three ‘for’ loops, the time complexity is ${\rm O}(k(n/|\Sigma|)w)$ , and the space complexity is ${\rm O}(\textit{nlogn})$ . Overall, the time complexity of the MMSA-S algorithm is ${\rm O}(n+k+(kw/|\Sigma|)n)$ . Because $k$, $m$, and $w$ are much smaller than $n$ , the time complexity is ${\rm O}(n)$ as well, and the space complexity is ${\rm O}(\textit{nlogn})$ .

The MMSA-L algorithm has two ‘for’ loops in the pre-processing phase and two ‘for’ loops in the matching phase. The time complexity of the MMSA-L algorithm is ${\rm O}(n+kw+kw)$ . Similarly, because $k$ and $w$ are far smaller than $n$ , the time complexity is ${\rm O}(n)$ and the space complexity is ${\rm O}(\textit{nlogn})$ .

According to the above analysis, the time complexity is ${\rm O}(n)$ and the space complexity is ${\rm O}(\textit{nlogn})$ for MMSA-S and MMSA-L.
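As a rough illustration of why the terms in $k$ and $w$ are absorbed into ${\rm O}(n)$ , the two bounds can be evaluated with the DNA settings used in Section 5.1 ( $n=1{,}000{,}000$ , $k=10$ , $w=2$ , $|\Sigma|=4$ ). The figures below are only the values of the expressions inside the O-notation; they ignore constant factors and are not measured operation counts.

```latex
\begin{aligned}
\text{MMSA-S:}\quad & n + k + \frac{kw}{|\Sigma|}\,n
   = 10^{6} + 10 + \frac{10\cdot 2}{4}\cdot 10^{6}
   = 6{,}000{,}010 \approx 6n,\\
\text{MMSA-L:}\quad & n + kw + kw
   = 10^{6} + 20 + 20
   = 1{,}000{,}040 \approx n.
\end{aligned}
```

In both cases the result is a small constant multiple of $n$ , which is consistent with the ${\rm O}(n)$ bound stated above.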

In addition, these complexities are compared with those of MMST, our previous work on multi-pattern matching using the suffix tree [41]. The time and space complexities of MMST for the construction of the suffix tree are ${\rm O}(n)$ . The time complexity of MMST for pattern matching is ${\rm O}(n+k(n/|\Sigma|)(\sum_{i=0}^{m-1}G_{i}))$ . When $n$ is much larger than the other parameters, the time and space complexities of MMST are ${\rm O}(n)$ . However, when the number of patterns is small, the algorithm MMSA based on the suffix array is more suitable for the construction of the suffix array and the ordering of all suffixes. When the number of patterns $k$ is large, the time complexity of MMSA-S is ${\rm O}(n+k+(kw/|\Sigma|)n)$ and that of MMSA-L is ${\rm O}(n+kw+kw)$ , both of which are lower than the MMST complexity ${\rm O}(n+k(n/|\Sigma|)(\sum_{i=0}^{m-1}G_{i}))$ , especially when all suffixes are sorted and some impossibilities are eliminated. When $n$ is much larger than the other parameters, the time complexity of MMSA is ${\rm O}(n)$ , and the space complexity of MMSA, ${\rm O}(\textit{nlogn})$ , is greater than that of MMST.

5. Experiments

In this section, the time performance of the MMSA-S and MMSA-L algorithms is compared experimentally with that of other algorithms. The experiments were performed on a computer with an Intel Core i7-7700 3.6 GHz CPU and 8 GB of DDR3 RAM, running a 64-bit Windows 7 operating system. All code was written in Java and compiled with IntelliJ IDEA CE. The comparison algorithms include WM-gap, BG-gap, ST-TWEC-gap, MMST-S and MMST-L. The WM-gap algorithm was reconstructed according to the Wu-Manber algorithm [11], while the BG-gap algorithm was reconstructed according to the Multiple BNDM algorithm [23]. The reconstruction method follows the ideas of Salmela [43] and Ukkonen [44], while the ST-TWEC-gap algorithm was recomposed based on the T-TWEC algorithm [45].

The test dataset is the Pizza & Chili corpus [46]. The size of the alphabet in natural language is larger than that of DNA sequences and protein sequences. The matching repetition rate of natural language is lower than that of DNA sequences. Therefore, the DNA and PROTEINS datasets were used in this work. The length of the DNA sequence is 2,293,760, and the size of its alphabet $|\Sigma|$ is 4, namely {A, G, C, T}. The length of the PROTEINS sequence is 1,699,484, and the size of its alphabet $|\Sigma|$ is 20, namely {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.

5.1 Experiments on DNA sequence

With the same environment and datasets, the performance is compared while one variable changes at a time. In these experiments, the variables are the length of the object sequence, the number of patterns, the length of exact characters in the pattern, and the number of wildcards.

5.1.1 Different length of object sequence

The length of the object sequence is varied over 10,000, 100,000, 1,000,000, and 2,293,649, while the other variables remain unchanged. The number of patterns $k$ is 10. Each pattern is generated by the pattern generator, and its elements are sampled from the alphabet $\Sigma=$ {A, G, C, T}. The length of the exact characters $m$ is 2, and the number of wildcards $w$ is 2. The range of gaps is from 0 to 9. As shown in Fig. 6, when the length of the object sequence is 10,000, the runtimes of these algorithms differ little. The two MMSA algorithms require more time to sort the suffixes and construct the suffix array. However, when the object sequence is long, the two MMSA algorithms exclude some impossible cases and are more efficient than the other comparison algorithms. Because the exact characters in the test patterns are short, MMSA-S performs better than MMSA-L.
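The pattern generator itself is not described in detail here. The following Python sketch shows one way such test patterns could be produced, purely as an assumption for illustration: characters are drawn uniformly from the alphabet, each of the $w+1$ exact character groups has length $m$, and each wildcard receives a random range with both bounds between 0 and 9. The function names and the fixed seed are hypothetical.

```python
import random


def generate_pattern(alphabet, m, w, gap_min=0, gap_max=9, rng=random):
    """Build one pattern such as 'AG*[1,5]CT*[0,9]GA' with w wildcards
    and w + 1 exact character groups of length m each (assumed layout)."""
    groups = ["".join(rng.choice(alphabet) for _ in range(m)) for _ in range(w + 1)]
    pieces = [groups[0]]
    for group in groups[1:]:
        low = rng.randint(gap_min, gap_max)    # lower bound of the gap
        high = rng.randint(low, gap_max)       # upper bound, never below low
        pieces.append(f"*[{low},{high}]{group}")
    return "".join(pieces)


def generate_pattern_set(k, alphabet="AGCT", m=2, w=2, seed=0):
    """Generate k patterns reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [generate_pattern(alphabet, m, w, rng=rng) for _ in range(k)]


for pattern in generate_pattern_set(10):
    print(pattern)
```

Fixing the seed keeps the pattern set identical across the compared algorithms, so that runtime differences come from the algorithms rather than from the patterns.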

Figure 6.

Comparison of different lengths of DNA sequences.

5.1.2 Different number of patterns

The number of patterns $k$ is varied over 1, 5, 10, 20, and 50, while the other variables remain unchanged. The length of the object sequence $n$ is 1,000,000. The length of exact characters $m$ is 2, and the number of wildcards $w$ is 2. The range of gaps varies from 0 to 9. As shown in Fig. 7, when there is a single pattern or the number of patterns is small, the two MMSA algorithms and the two MMST algorithms are less efficient than the other three algorithms. Once the suffix array and the suffix tree have been constructed, these four algorithms outperform the other three when the number of patterns is greater than 10. In addition, the two MMSA algorithms based on the suffix array are superior to the two MMST algorithms based on the suffix tree.

Figure 7.

Comparison of different numbers of patterns in DNA sequences.

5.1.3 Different lengths of exact characters in patterns

The length of exact characters in a pattern $m$ is varied over 2, 4, 6, 8, and 10, while the other variables remain unchanged. The length of the object sequence $n$ is 1,000,000. The number of patterns $k$ is 10, and the number of wildcards $w$ is 2. The range of gaps varies from 0 to 9. Figure 8 clearly shows the efficient time performance of the MMSA-L algorithm. The longer the exact characters are, the more impossible positions are excluded, and the more efficient this method is. The three comparison algorithms, WM-gap, BG-gap and ST-TWEC-gap, require character-by-character matching and take more time. The two MMST algorithms are slower than the two MMSA algorithms, because the suffix array has already sorted all suffixes, which improves the time performance. The MMSA-L algorithm and the MMST-L algorithm are better than the MMSA-S algorithm and the MMST-S algorithm when the exact characters in the pattern are longer.

Figure 8.

Comparison of different lengths of exact character in DNA sequence.

5.1.4 Different number of wildcards

The number of wildcards in each pattern $w$ is varied over 1, 2, 3, 4, and 5, while the other variables remain unchanged. The length of the object sequence $n$ is 1,000,000. The number of patterns $k$ is 10, and the length of exact characters in a pattern $m$ is 2. The range of gaps varies from 0 to 9. Figure 9 shows the time performance of these algorithms with different numbers of wildcards. When the number of wildcards is large, the three comparison algorithms WM-gap, BG-gap, and ST-TWEC-gap take more time. The two MMSA algorithms are better than the two MMST algorithms when the number of wildcards is large. When the first wildcard of a pattern is not satisfied, the two MMSA algorithms quickly move on to the next pattern. The greater the number of wildcards, the less likely the pattern is to occur.

Figure 9.

Comparison of different number of wildcards in DNA sequence.

Figure 10.

Comparison of different lengths of Protein sequences.

5.2 Experiments on Protein sequence

The protein sequence is closer to natural language than the DNA sequence. The alphabet size of the protein sequence is 20, compared with 26 for English. Compared with the DNA sequence, matching on the protein sequence is more efficient because of its larger alphabet size, fewer repeated positions, and lower frequency of satisfied pattern matches. In this research, the protein sequences were taken from the PROTEINS dataset of the Pizza & Chili corpus. The length of this protein sequence is 1,699,484, and it consists of 20 letters, namely {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.

The following four comparative experiments compare the time performance of the seven algorithms with respect to four variables: the length of the object sequence, the number of patterns, the length of the exact characters in the pattern, and the number of wildcards. With the rest of the environment kept the same, one variable is changed at a time.

5.2.1 Different lengths of object sequences

As shown in Fig. 10, the length of the object sequence is varied over 10,000, 100,000, 500,000, 1,000,000, and 1,699,484, while the other variables remain the same. Similar to the DNA sequence, each pattern is generated by the pattern generator, the size $k$ of the pattern set $\Phi=\{P^{1},P^{2},\ldots P^{k}\}$ is 10, and the pattern elements are from the alphabet $\Sigma=$ {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}. The length of the exact characters $m$ is 2, and the number of wildcards $w$ is 2. The range of gaps is random from 0 to 9. As the length of the object sequence increases, the two MMSA algorithms spend more time constructing the suffix array, but exclude more impossible positions because of the large alphabet. Therefore, the time performance of the two MMSA algorithms is better than that of the comparison algorithms. When the exact character length $m$ is 2, the time performance of the MMSA-S algorithm is better than that of the MMSA-L algorithm.

5.2.2 Different numbers of patterns

In this experiment, the number of patterns $k$ is 1, 5, 10, 20, and 50. The length of the object sequence $n$ is 5,000,000, the length of exact characters $m$ is 2, and the number of wildcards $w$ is 2. The range of gaps is random from 0 to 9. Figure 11 shows the experimental results in this case. As the number of patterns increases, the two MMSA algorithms and the two MMST algorithms outperform the other three comparison algorithms. Due to the large size of the alphabet and the small number of matching positions, the two MMSA and the two MMST algorithms take less time on the protein sequence than on the DNA sequence.

Figure 11.

Comparison of different numbers of patterns in Protein sequences.

5.2.3 Different lengths of exact characters in patterns

To compare the influence of the exact character length $m$ in a pattern, $m$ is varied over 2, 4, 6, 8, and 10, while the other variables remain the same. The length of the object sequence $n$ is 5,000,000. The number of patterns $k$ is 10, and the number of wildcards $w$ is 2. The range of gaps is random from 0 to 9. Similar to the DNA sequence, Fig. 12 also shows that the MMSA-L algorithm and the MMST-L algorithm have better time performance than the other comparison algorithms. As the exact character length increases, the MMSA-L algorithm becomes more efficient: it excludes more impossible positions, and only the remaining positions enter the matching process.

Figure 12.

Comparison of different lengths of exact characters in Protein sequences.

Figure 13.

Comparison of different numbers of wildcards in Protein sequences.

5.2.4 Different numbers of wildcards

As shown in Fig. 13, the number of wildcards in each pattern $w$ is varied over 1, 2, 3, 4, and 5, the length of the object sequence $n$ is 5,000,000, $k$ is 10, and $m$ is 2. The range of gaps is random from 0 to 9. As the number of wildcards increases, the number of pattern occurrences decreases, and the time consumption of these algorithms in this experiment decreases as well. In the two MMSA algorithms and the two MMST algorithms, if the range of the first wildcard is not satisfied, matching of this pattern ends and the next pattern is matched. The time performance of these four algorithms is better than that of the other three comparison algorithms, WM-gap, BG-gap and ST-TWEC-gap.

6. Conclusions and future work

In this paper, two MMSA algorithms based on the suffix array are proposed, which achieve better time performance than the other comparison algorithms in multi-pattern matching and approximate pattern matching. By the properties of the suffix array, all suffixes are sorted alphabetically, and the longest common prefix (LCP) array is easily available. These attributes exclude many impossible positions and make the two algorithms more efficient. The MMSA-S algorithm is suitable for short exact characters in a pattern and uses dynamic programming, while the MMSA-L algorithm is applied to long exact characters in a pattern and uses the edit-distance method.

The disadvantages and limitations of these two MMSA algorithms lie in two aspects. Firstly, the suffix array requires more space to store all suffixes than the other comparison algorithms. In addition, it takes some time to construct the suffix array, although this effort pays off in the subsequent query and matching processes. Secondly, when the number of patterns is small, the efficiency of the two MMSA algorithms is not as good as that of the other comparison algorithms. These two MMSA algorithms are therefore better suited to large-scale pattern sets.

This research leaves several problems for further study. One important problem is to make full use of the two MMSA algorithms in practical applications. Once the suffix array has been constructed, it can be reused, which makes it suitable for full-text indexing and full-text searching. In addition, the longest common prefix property of the suffix array is suitable for Chinese word segmentation to eliminate ambiguity in natural language processing. Chinese words can be segmented or disambiguated by taking the frequency of words or phrases in the suffix array as the longest common substring, which should be feasible and more efficient. This will be the subject of our future research.

Acknowledgments

This research was supported by the National Key Research and Development Program of China (Grant No. 2016YFB1000900) and the National Natural Science Foundation of China (NSFC) (Grant Nos. 61503116 and 61229301). In addition, we would like to thank the anonymous reviewers who have helped to improve the paper.

References

1. Navarro, A guided tour to approximate string matching, ACM Computing Surveys (CSUR) 33(1) (2001), 31–88.

2. Gog, Kärkkäinen, Kempa, Petri and Puglisi S.J., Fixed Block Compression Boosting in FM-Indexes: Theory and Practice, Algorithmica 81(4) (2019), 1370–1391.

3. Navarro, Baeza-Yates R.A., Sutinen and Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24(4) (2001), 19–27.

4. Hon W.K., Lam T.W., Shah, Thankachan S.V., Ting H.F. and Yang, Dictionary matching with a bounded gap in pattern or in text, Algorithmica 80(2) (2018), 698–713.

5. Rahman M.S., Iliopoulos C.S., Lee, Mohamed and Smyth W.F., Finding patterns with variable length gaps or don’t cares, In International Computing and Combinatorics Conference, Springer, Berlin, Heidelberg, 2006, pp. 146–155.

6. Akutsu, Approximate string matching with variable length don’t care characters, IEICE Transactions on Information and Systems E Series D 79 (1996), 1353–1354.

7. Fischer M.J. and Paterson M.S., String-Matching and Other Products (No. MAC-TM-41), Massachusetts Inst Of Tech Cambridge Project Mac, 1974.

8. Knuth D.E., Morris J.H. Jr. and Pratt V.R., Fast pattern matching in strings, SIAM Journal on Computing 6(2) (1977), 323–350.

9. Aho A.V. and Corasick M.J., Efficient string matching: an aid to bibliographic search, Communications of the ACM 18(6) (1975), 333–340.

10. Commentz-Walter, A string matching algorithm fast on the average, In International Colloquium on Automata, Languages, and Programming, Springer, Berlin, Heidelberg, 1979, pp. 118–132.

11. Wu and Manber, A fast algorithm for multi-pattern searching, University of Arizona, Department of Computer Science, 1994, pp. 1–11.

12. Chiu S.Y., Hon W.K., Shah, et al., I/O-efficient compressed text indexes: From theory to practice, 2010 Data Compression Conference, IEEE, 2010, pp. 426–434.

13. Li and Homer, A survey of sequence alignment algorithms for next-generation sequencing, Briefings in Bioinformatics 11(5) (2010), 473–483.

14. Clark C.R. and Schimmel D.E., Efficient reconfigurable logic circuits for matching complex network intrusion detection patterns, International Conference on Field Programmable Logic and Applications, Springer, Berlin, Heidelberg, 2003, pp. 956–959.

15. Manber and Myers, Suffix arrays: a new method for on-line string searches, SIAM Journal on Computing 22(5) (1993), 935–948.

16. Min and Lu, Pattern matching with independent wildcard gaps, In 2009 Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing, IEEE, 2009, pp. 194–199.

17. Zhang and Hu, A faster algorithm for matching a set of patterns with variable length don’t cares, Information Processing Letters 110(6) (2010), 216–220.

18. Guo, Hong X.L., Hu X.G., Gao, Liu Y.L., Wu G.Q. and Wu, A bit-parallel algorithm for sequential pattern matching with wildcards, Cybernetics and Systems 42(6) (2011), 382–401.

19. Cole and Hariharan, Verifying candidate matches in sparse and wildcard matching, In Proc. the 34th Annual ACM Symposium on Theory of Computing, 2002, pp. 592–601.

20. Zhu and Wu, Mining complex patterns across sequences with gap requirements, In Proc. the 20th Int. Joint Conf. Artificial Intelligence, 2007, pp. 2934–2940.

21. Haapasalo, Silvasti, Sippu and Soisalon-Soininen, Online dictionary matching with variable-length gaps, In Proceedings of the 10th International Symposium, SEA Kolimpari, Chania, Crete, Greece, Springer, Berlin, Heidelberg, 2011, pp. 76–87.

22. Bille, Gørtz I.L., Vildhøj H.W., et al., String matching with variable length gaps, Theoretical Computer Science 443(20) (2012), 25–34.

23. Raffinot, On the multi backward dawg matching algorithm (MultiBDM), In Proceedings of the 4th South American Workshop on String Processing, Carleton University Press, 1997, pp. 149–165.

24. Crochemore, Czumaj, Gasieniec, Lecroq, Plandowski and Rytter, Fast practical multi-pattern matching, Information Processing Letters 71(3-4) (1999), 107–113.

25. Zhou, Zhang, Chow S.S., Zhang and Zhang, Efficient authenticated multi-pattern matching, In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, ACM, 2016, pp. 593–604.

26. Muth and Manber, Approximate multiple string search, In Annual Symposium on Combinatorial Pattern Matching, Springer, Berlin, Heidelberg, 1996, pp. 75–86.

27. Baeza-Yates and Navarro, Multiple approximate string matching, In Workshop on Algorithms and Data Structures, Springer, Berlin, Heidelberg, 1997, pp. 174–184.

28. Kulekci M.O., Tara: An algorithm for fast searching of multiple patterns on text files, In 2007 22nd International Symposium on Computer and Information Sciences, IEEE, 2007, pp. 1–6.

29. Zhang, Tang and Bai, Multi-pattern Matching with Wildcards, JSW 6(12) (2011), 2391–2398.

30. Biswas, Ganguly, Shah and Thankachan S.V., Ranked document retrieval for multiple patterns, Theoretical Computer Science 746 (2018), 98–111.

31. Fredkin, Trie memory, Communications of the ACM 3(9) (1960), 490–499.

32. Weiner, Linear pattern matching algorithm, 14th Annual IEEE Symposium on Switching and Automata Theory, 1973, pp. 1–11.

33. Chattaraj and Parida, An inexact-suffix-tree-based algorithm for detecting extensible patterns, Theoretical Computer Science 335(1) (2005), 3–14.

34. Ukkonen, Maximal and minimal representations of gapped and non-gapped motifs of a string, Theoretical Computer Science 410(43) (2009), 4341–4349.

35. Bille, Gørtz I.L., et al., String indexing for patterns with wildcards, Theory of Computing Systems 55(1) (2014), 41–60.

36. Rahman M.S., Iliopoulos C.S., Lee, Mohamed and Smyth W.F., Finding patterns with variable length gaps or don’t cares, In International Computing and Combinatorics Conference, Springer, Berlin, Heidelberg, 2006, pp. 146–155.

37. Shrestha A.M.S., Frith M.C. and Horton, A bioinformatician’s guide to the forefront of suffix array construction algorithms, Briefings in Bioinformatics 15(2) (2014), 138–154.

38. Nong, Zhang and Chan W.H., Two efficient algorithms for linear time suffix array construction, IEEE Transactions on Computers 60(10) (2011), 1471–1484.

39. Thankachan S.V., Apostolico and Aluru, A Provably Efficient Algorithm for the k-Mismatch Average Common Substring Problem, Journal of Computational Biology (6) (2016), 472–482.

40. Hon W.K., Lam T.W., Shah, Thankachan S.V. and Yang, Dictionary matching with a bounded gap in pattern or in text, Algorithmica 80(6) (2017), 1–16.

41. Liu, Xie and Wu, Multi-pattern matching with variable-length wildcards using suffix tree, Pattern Analysis and Applications 21(4) (2018), 1151–1165.

42. Ukkonen, On-line construction of suffix trees, Algorithmica 14(3) (1995), 249–260.

43. Salmela, Tarhio and Kytöjoki, Multipattern string matching with q-grams, Journal of Experimental Algorithmics (JEA) 11 (2007), 1.

44. Ukkonen, Approximate string-matching with q-grams and maximal matches, Theoretical Computer Science 92(1) (1992), 191–211.

45. Arın İ., Erpam M.K. and Saygın, I-TWEC: Interactive clustering tool for Twitter, Expert Systems with Applications 96 (2018), 1–13.

46. Pizza&Chili corpus: http://pizzachili.dcc.uchile.cl/texts.html.