MADMX: A Strategy for Maximal Dense Motif Extraction

Abstract

We develop, analyze, and experiment with a new tool, called madmx, which extracts frequent motifs from biological sequences. We introduce the notion of density to single out the “significant” motifs. The density is a simple and flexible measure for bounding the number of don't cares in a motif, defined as the fraction of solid (i.e., different from don't care) characters in the motif. A maximal dense motif has density above a certain threshold, and any further specialization of a don't care symbol in it or any extension of its boundaries decreases its number of occurrences in the input sequence. By extracting only maximal dense motifs, madmx reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by madmx.

1. Introduction

The discovery of frequent patterns (motifs) in biological sequences has attracted wide interest in recent years, due to the understanding that sequence similarity is often a necessary condition for functional correlation. Among other applications, motif discovery is an important tool for identifying regulatory regions and binding sites in the study of functional genomics. From a computational point of view, a major complication for the discovery of motifs is that they may feature some sequence variation without loss of functionalities. The discovery process must therefore target approximate motifs, whose occurrences are similar but not necessarily identical. Approximate motifs are often modeled through the use of the don't care character in certain positions, which is a wild card matching all characters of the alphabet, called solid characters (Parida, 2008).

Finding interesting approximate motifs is computationally challenging. The output may explode combinatorially for an increasing number of don't cares and/or a decreasing value of the minimum frequency threshold. This explosion is not mitigated if the discovery targets are just the maximal motifs—a subset of the motifs which implicitly represents the full set. Even if the final output is not too large, partial data during the intermediate steps might lead to memory saturation or to extensive computation.

In this paper, we focus on the discovery of approximate motifs, which contain blocks of solid characters (solid blocks) separated by one or more don't cares. We propose a general approach for controlling the number of don't cares in these motifs. Specifically, we introduce the notion of dense motif, a motif where the fraction of solid characters is above a given threshold. Our density notion is flexible and general, since it allows for arbitrarily long runs of don't cares as long as the fraction of solid characters in the pattern is above the threshold. The rationale is that sparse motifs are less interesting in biological sequences, unlike what happens to frequent itemsets in baskets (where any subsets of correlated items are of interest). We define a natural notion of maximality for dense patterns and devise an efficient algorithm, called madmx (pronounced Mad Max), which performs complete maximal dense motif extraction from an input sequence, with respect to user-specified frequency and density thresholds.

The key technical result at the core of our extraction strategy is a closure property which affords the complete generation of all maximal dense motifs in a breadth-first fashion, through an apriori-like strategy (Agrawal and Srikant, 1994). It starts from a relatively small set of solid blocks, and then repeatedly applies a suitable combining operator, called fusion, to pairs of previously generated motifs. In this fashion, our strategy avoids the generation and consequent storage of intermediate patterns which are not in the output set, which ensures time and space complexities polynomial in the combined size of the input and the output.

The problem of motif discovery and its variants have been addressed by a large body of literature in the last decade (Parida, 2000; Apostolico and Parida, 2004; Pisanti et al., 2005; Apostolico and Tagliacollo, 2007, 2009; Ukkonen, 2007; Morris et al., 2008; Arimura and Uno, 2008; Apostolico et al., 2009), and an excellent survey of known results can be found in Parida (2008). In order to alleviate the computational burden of motif extraction and to limit the output to the most promising or interesting discoveries, some works combine the traditional use of a frequency threshold with restrictions on the flexibility of the extracted motifs, often captured by limitations on the number of occurring don't cares.

In a recent work, Apostolico et al. (2009) studied the extraction of extensible motifs, comprising standard don't cares and extensible wild cards. The latter are spacers of variable length that can take different size (within pre-specified limits) in each occurrence of the motif. An efficient tool, called varun, is devised in Apostolico et al. (2009) for extracting all maximal extensible motifs (according to a suitable notion of maximality defined in the article) which occur with frequency above a given threshold σ and with upper limits D on the length of the spacers. varun returns the extracted motifs sorted by decreasing z-score, a widely adopted statistical measure of interestingness. The authors demonstrate the effectiveness of their approach both theoretically, by proving that each maximal motif features the highest z-score within the class of motifs it represents, and experimentally, by showing that the returned top-scored motifs comprise biologically relevant ones when run on protein families and dna sequences. It has to be remarked that the motifs studied in our work are rigid, in the sense that extensible wild cards are not allowed.

An alternative way of limiting the number of don't cares in a motif has been explored in Rigoutsos and Floratos (1998). The authors define 〈L,W〉 motifs, for L ≤ W, where at least L solid characters must occur in each substring of length W of the motif. They propose a strategy for extracting 〈L,W〉 motifs which are also maximal, although their notion of maximality is not internal to the class of 〈L,W〉 motifs. As a consequence, the algorithm is not complete, since it disregards all those 〈L,W〉 motifs that are not subsumed by a maximal 〈L,W〉 one.

We performed a number of experiments on madmx to assess the biological significance of maximal dense motifs and to compare madmx against its most recent and close competitor varun. For the first objective, we used madmx to extract maximal dense motifs from a number of human dna fragments. We compared the output set against those in RepBase (Jurka et al., 2005), the largest repository of repetitive patterns for eukaryotic species, using repeatmasker (Smit et al., 1996), a popular tool for masking repetitive dna. The experiments show that all of our returned motifs are occurrences of patterns in RepBase, and fully characterize the family of sine/alu repeats (and partially the line/l1 family). This provides evidence that the notion of density, when applied to rigid motifs, captures biological significance.

Next we compared the z-score performance of madmx and varun. We ran both algorithms on several families of dna fragments, limiting varun to the generation of rigid motifs and setting the parameters so as to obtain comparable output sizes, with motifs listed by decreasing z-score. The experiments show that the top-m highest-ranking motifs returned by madmx almost always feature higher z-scores than the corresponding top-m ones returned by varun, even for large values of m, with only a modest increase in running time, which may be partly due to the fact that coding of madmx is yet to be optimized.

This article is organized as follows. In Section 2, several technical definitions and properties of motifs with don't cares are given. Section 3 proves the closure property at the base of madmx and provides a high-level description of the algorithm. In Section 4, the experimental validation of madmx is presented.

2. Preliminary Definitions and Properties

Let Σ be an alphabet of m characters and let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$s = s [ 0 ] s [ 1 ] \ldots s [ n - 1 ]$$ \end{document} be a string of length n over Σ. We use \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$s [ i \ldots j ]$$ \end{document} to denote the substring \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$s [ i ] s [ i + 1 ] \cdots s [ j ]$$ \end{document} of s, for i ≤ j. Characters in Σ are also called solid characters. We use \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \circ} \notin \Sigma$$ \end{document} to denote a distinguished character called wild card or don't care character. Let ε denote the empty string. A pattern x is a string in {ε} ∪ Σ ∪ Σ(Σ ∪ { ◯ })*Σ. However, whenever necessary, we will assume that patterns are implicitly padded to their left and right with arbitrary sequences of don't care characters.

Given two patterns x, y we say that y is more specific than x, and write x ⪳ y, iff for every i ≥ 0 either x[i] = y[i] or x[i] = ◯. If x ⪳ y and x ≠ y, we write that x ≺ y. Given two patterns x, y, we say that x occurs in y at position ℓ or, alternatively, that y contains x iff \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x \preceq y [ \ell \ldots \ell + \mid x \mid - 1 ]$$ \end{document} . If x ≺ y we say that y strictly contains x. For a string s, the location list \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal L}_x$$ \end{document} of a pattern x in s is the complete set of positions at which x occurs in s. We refer to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$f (x) = \mid { \cal L}_x \mid$$ \end{document} as the frequency of pattern x in s. (Note that f (ε) = n.) As in Ukkonen (2007), the translated representation of the location list \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal L}_x = \{ l_0, l_1, l_2, \ldots, l_k \}$$ \end{document} is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\tau{ (\cal L}_x) = \{ l_1 - l_0,l_2 - l_0, \ldots, l_k - l_0 \}$$ \end{document} . Given two patterns x, y, we say that y subsumes x in s if f (x) = f (y) and y contains x. As a consequence, y subsumes x if and only if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\tau ({ \cal L}_x) = \tau ({ \cal L}_y)$$ \end{document} . A pattern x is maximal if it is not subsumed by any other pattern y. (We observe that this notion of maximality coincides with that of Pisanti et al., 2005.) Given a pattern x, its maximal extension \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} is the maximal pattern that subsumes x, which can be shown to be unique (Pisanti et al., 2005).

In what follows, we call solid block a string in Σ⁺ and a don't care block a string in {◯}⁺. Furthermore, given a pattern x, the number of don't care characters contained in x is denoted by dc(x).

Definition 1

The density δ(x) of x is: δ(x) = 1 − dc(x)/|x|. Given a (density) threshold ρ, 0 < ρ ≤ 1, we say that a pattern x is dense if δ(x) ≥ ρ.

Note that a solid block is a dense pattern with respect to every threshold ρ.

We concentrate our attention on dense patterns that are not subsumed by any other dense pattern: they are the most interesting dense representatives in the equivalence classes induced by the relation that puts any two dense patterns x and y together if and only if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\tau ({ \cal L}_x) = \tau ({ \cal L}_y)$$ \end{document} , as defined below.

Definition 2

A dense pattern x is a maximal dense pattern in s if it is not subsumed by any dense pattern x′ ≠ x.

Observe that a maximal dense pattern x needs not be a maximal pattern in the general sense, since \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} might be a nondense pattern. However, every dense pattern x is subsumed by at least one maximal dense pattern. In fact, it is easy to see that all of the maximal dense patterns that subsume x are dense substrings of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} , namely, those that contain x and are not substrings of any other dense substring of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} . We want to stress that there might be several maximal dense patterns that subsume x. As an example, for ρ = 2/3, the dense pattern x = B in the string S = AdBeCfAgBhC is subsumed by maximal dense patterns A ◯ B and B ◯ C, while \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x) = { \rm A} \circ { \rm B} \circ { \rm C}$$ \end{document} is not dense.

Definition 3

Given a frequency threshold σ and a density threshold ρ, a pattern x is a dense maximal motif in s if x is a maximal dense pattern in s with respect to ρ, and f (x) ≥ σ. A dense maximal motif for ρ = 1 is also referred to as maximal solid block.

Problem of interest

We are given an input string s, a frequency threshold σ, and a density threshold ρ. Find all the maximal dense motifs in s.

In the example above, the maximal dense motifs for σ = 2, ρ = 2/3 are A ◯ B and B ◯ C. In the rest of the article, we will omit referencing the input string s when clear from the context.

3. An Algorithm for Maximal Dense Motif Extraction

In this section, we describe our algorithm, called madmx for maximal dense motif extraction. The algorithm adopts a breadth-first apriori-like strategy (Agrawal and Srikant, 1994), similar in spirit to the one developed in Apostolico et al. (2009), using maximal solid blocks as building blocks. Indeed, an important property of maximal dense patterns, which we will exploit in our mining strategy, is that all of their solid blocks are maximal solid blocks. The following proposition extends a similar result holding for arbitrary maximal patterns (Ukkonen, 2007; Pisanti, 2002).

Proposition 1

Let x be a maximal dense pattern with respect to a density threshold ρ, and let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$b = x [ i \ldots j ]$$ \end{document} be any solid block in x such that x[i − 1] = x[j + 1] = ◯, for j ≥ i. Then, b is a maximal solid block.

Proof

Let us assume by contradiction that b is not a maximal solid block. This means that there must exist another solid block b′ ≠ b which subsumes b, that is, such that f (b) = f (b′) and there exists ℓ with 0 ≤ ℓ ≤ |b′| − |b|, for which \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$b = b^{ \prime} [ \ell \ldots \ell + \mid b \mid - 1 ]$$ \end{document} . Note that it must be |b′| > |b|, hence we have that ℓ > 0 or ℓ + |b| < |b′|. W.l.o.g, assume that ℓ > 0, whence b′[ℓ − 1] ≠ ◯ (the case ℓ + |b| < |b′| is analogous). Consider now x′, which is equal to x except that the ◯ symbol preceding b inside x is replaced by b′[ℓ − 1] in x′. Note that x′ is a dense pattern since x is so. Since f (x) = f (x′), because f (b) = f (b′), and x′ contains x by construction, then x′ subsumes x, which contradicts the maximality of x. ▪

madmx begins by extracting the maximal dense motifs from the maximal extensions of the maximal solid blocks. It then operates by repeatedly combining together, in a suitable fashion, pairs of maximal dense motifs, and extracting less frequent maximal dense motifs from these combinations. The combining operation is called fusion and is a key notion for the algorithm. Below, we first define fusion for characters, and then extend it to patterns.

Definition 4

Given three characters \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$c, c_1, c_2 \in \Sigma \cup \{ \circ \}$$ \end{document} , we say that c is the fusion of c₁ and c₂, and write c = c₁ ▿ c₂, if one of the following holds:
1. c = c₁ = c₂;

2. c₁ = ◯, c = c₂ ≠ ◯;

3. c = c₁ ≠ ◯, c₂ = ◯.

Observe that c₁ ▿ c₂ is not defined when c₁ and c₂ are different solid characters. The above notion of fusion generalizes to patterns as follows.

Definition 5

Given three patterns x, y, z and an integer d, we say that z is the d-fusion of x and y, and write z = x ▿_d y, if z can be obtained by removing the leading and trailing don't care characters from the pattern m defined as m[i] = x[i + d] ▿ y[i], for all indices i.

Observe that x ▿_d y is defined only when the involved individual character fusions are defined.

The breadth-first strategy adopted by madmx crucially relies on the following theorem, which highlights the structure of dense motifs.

Theorem 1

Let x be a maximal dense motif with dc(x) > 0. Then:
(a) there exists a maximal solid block b ⪳ x such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x) = { \cal M} (b)$$ \end{document} , or

(b) there exist two maximal dense motifs y ₁ ,y ₂ such that the following conditions hold:

\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x) = { \cal M} (y_1 \bigtriangledown_d y_2)$$ \end{document} for some d;

there are two maximal solid blocks b ₁ , b ₂ in x and an integer \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\hat{d} > 0$$ \end{document} such that b ₁ is a maximal solid block in y ₁ , b ₂ is a maximal solid block in y ₂ , and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$b_1 \circ^{ \hat{d}} b_2 \preceq y_1 \bigtriangledown_dy_2$$ \end{document}

f (x) < min{f (y₁), f (y₂)};

For the proof of Theorem 1, we need to define another type of pattern combination, namely the operation of merge between two patterns, which is similar to the one introduced in Pisanti et al. (2005). Given two characters c₁, c₂, we define the operator ⨁ between them such that c₁ ⨁ c₂ = ◯ , if c₁ ≠ c₂, and c₁ ⨁ c₂ = c₁ = c₂, otherwise.

Definition 6

Given two patterns x, y and an integer d, the d-merge of x and y is the pattern z = x ⨁_d y which can be obtained by removing all leading and trailing don't cares from the pattern m defined as m[i] = x[i + d] ⨁ y[i] for all i.

We want to stress the difference between the notions of merging and d-fusing: the merge of two patterns x, y is always well defined and more general than x, y, while the d-fusion of x, y may not exist and, if it does, is more specific than x, y.

For the proof of Theorem 1, we also need the property established by the following lemma.

Lemma 1

Let x and y be maximal patterns, and d be an integer such that z = x ⨁_d y is a nonempty pattern. Then z is maximal. Moreover, if z ≠ x (resp., z ≠ y) then f (z) > f (x) (resp., f (z) > f (y)).

Proof

First we prove that z is maximal. By contradiction, suppose that this is not the case. Then, there exists a position i such that z[i] = ◯ can be replaced by a solid character c without decreasing the frequency of the pattern. (Note that the position of the substitution can be to the left of the first character in z or to the right of the last character in z.) Since z is contained both in x and y, every occurrence of x and y in the string gives raise to an occurrence of z. Hence, every occurrence of x (resp., y) in the string, contains c in its (i + d)th (resp., ith) position. Therefore, by maximality of x and y, it must be z[i] = x[i + d] = y[i] = c, which is a contradiction. The relations between the frequencies of x,y and z follow trivially from their maximality. ▪

We are now ready to prove Theorem 1.

Proof

Given a pattern x and two nonnegative integers i ≤ j, let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ i \ldots j ]$$ \end{document} denote the pattern obtained by removing all the leading and trailing don't care characters from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x [ i \ldots j ]$$ \end{document} . We first show that there exists an index s₁, with 0 < s₁ < |x| − 1, such that x[s₁] = ◯ and both \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ 0 \ldots s_1 - 1 ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ s_1 + 1 \ldots \mid x \mid - 1 ]$$ \end{document} are dense. As a first case, assume that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\delta (x [ 0 \ldots j ]) \geq \rho$$ \end{document} , for every 0 ≤ j < |x|, and let s₁ < |x| − 1 be the position in x of the rightmost don't care. Thus, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\delta (x [ 0 \ldots s_1 - 1 ]) \geq \rho$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\delta (x [ s_1 + 1 \ldots \mid x \mid - 1 ]) = 1 \geq \rho$$ \end{document} . Otherwise, let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$s_1 = \min \{ s: \delta (x [ 0 \ldots s ]) < \rho \}$$ \end{document} . Then, it must be that x[s₁] = ◯ and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\delta (x [ 0 \ldots s_1 - 1 ]) \geq \rho$$ \end{document} . Moreover, since x is dense and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\delta (x [ 0 \ldots s_1 ]) < \rho$$ \end{document} , we must have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\delta (x [ s_1 + 1 \ldots \mid x \mid - 1 ]) \geq \rho$$ \end{document} , or otherwise \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} \begin{align} \delta (x) & = 1 - \frac { dc (x [ 0 \ldots s_1 ]) + dc (x [ s_1 + 1 \ldots \mid x \mid - 1 ]) } { \mid x \mid } \\ & = \frac { \left(\mid x [ 0 \ldots s_1 ] \mid - dc (x [ 0 \ldots s_1 ]) \right) + \left(\mid x [ s_1 + 1 \ldots \mid x \mid - 1 ] \mid - dc (x [ s_1 + 1 \ldots \mid x \mid - 1 ]) \right) } { \mid x \mid } \\ & < \frac { \rho (\mid x [ 0 \ldots s_1 ] \mid + \mid x [ s_1 + 1 \ldots \mid x \mid - 1 ] \mid) } { \mid x \mid } = \rho \end{align} \end{document}

which contradicts the assumption on δ(x).

We call the pair of dense patterns \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ 0 \ldots s_1 - 1 ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ s_1 + 1 \ldots \mid x \mid - 1 ]$$ \end{document} a level-1 decomposition of x (observe that many such decompositions may exist). Now, consider the following iterative process to obtain a sequence of decompositions of x of increasing level, starting from the level-1 decomposition. Initially, set i = 1, ℓ₁ = 0 and r₁ = |x| − 1:
If both \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ \ell_i \ldots s_i - 1 ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ s_i + 1 \ldots r_i ]$$ \end{document} have frequency strictly greater than f (x), or at least one of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ \ell_i \ldots s_i - 1 ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ s_i + 1 \ldots r_i ]$$ \end{document} is a solid block with frequency equal to f (x), then the process terminates.

Otherwise, let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$y = x^* [ \ell_{i + 1} \ldots r_{i + 1} ]$$ \end{document} be one (arbitrarily chosen) of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ \ell_i \ldots s_i - 1 ]$$ \end{document} or \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ s_i + 1 \ldots r_i ]$$ \end{document} which is not a solid block and has frequency equal to f (x). Since y is dense, there exists an index s_i+₁ such that ℓ_i+₁ < s_i+₁ < r_i+₁ and both \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ \ell_{i + 1} \ldots s_{i + 1} - 1 ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ s_{i + 1} + 1 \ldots r_{i + 1} ]$$ \end{document} are dense. Call these two patterns the level-(i + 1) decomposition of x. Set i = i + 1 and go to Step 1.

Suppose that the decomposition process ends by finding a solid block b in x, hence a maximal solid block by Proposition 1, with f (b) = f (x). Then, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (b) = { \cal M} (x)$$ \end{document} and the theorem follows. Otherwise, let j be the last level of the decomposition. Then, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$ f (x) < \min \{ f (x^* [ \ell_j \ldots s_j - 1 ]), f (x^* [ s_j + 1 \ldots r_j ]) \} $$ \end{document} . In the latter case, as explained in Section 2 (after Definition 2), we can determine two maximal dense patterns y₁, y₂ such that y₁ subsumes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ \ell_j \ldots s_j - 1 ]$$ \end{document} and y₂ subsumes \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ s_j + 1 \ldots r_j ]$$ \end{document} . Therefore, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (y_1) = { \cal M} (x^* [ \ell_j \ldots s_j - 1 ]), { \cal M} (y_2) = { \cal M} (x^* [ s_j + 1 \ldots r_j ])$$ \end{document} , and f (x) < min{f (y₁), f (y₂)}. Let b₁ (resp., b₂) be the last (resp., the first) solid block of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ \ell_j \ldots s_j - 1 ]$$ \end{document} (resp., \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ s_j + 1 \ldots r_j ]$$ \end{document} ). Observe that by construction b₁ and b₂ are maximal solid blocks and there exists an integer \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\hat{d}$$ \end{document} such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$b_1 \circ^{ \hat{d}} b_2$$ \end{document} is a substring of x.

Next, we show that there exists an integer d such that the d-fusion y₁ ▿_dy₂ is well defined, contains \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$b_1 \circ^{ \hat{d}} b_2$$ \end{document} , and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (y_1 \bigtriangledown_d y_2) = { \cal M} (x)$$ \end{document} . We proceed as follows. First, we show that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} contains both y₁ and y₂. To this end, let us “align” \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (y_1)$$ \end{document} with respect to the occurrence of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ \ell_j \ldots s_j - 1 ]$$ \end{document} , which is contained in both, and let p the integer such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x) [ i + p ]$$ \end{document} is aligned with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (y_1) [ i ]$$ \end{document} . Now, assume for the sake of contradiction, that there exists an index j such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (y_1) [ j ]$$ \end{document} corresponds to a position of y₁, it is solid and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (y_1) [ j ] \neq { \cal M} (x) [ j + p ]$$ \end{document} . This implies that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$z = { \cal M} (x) \oplus_p { \cal M} (y_1) \neq { \cal M} (y_1)$$ \end{document} . Moreover, z contains \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ \ell_j \ldots s_j - 1 ]$$ \end{document} , and, by Lemma 1, it is maximal and has frequency strictly greater than f (y₁), which is impossible because we have chosen y₁ such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x^* [ \ell_j \ldots s_j - 1 ]) = { \cal M} (y_1)$$ \end{document} and therefore \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$f (x^* [ \ell_j \ldots s_j - 1 ]) = f (y_1)$$ \end{document} . A similar argument shows that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} contains y₂.

Since y₁ and y₂ are contained in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} , there must exist a d such that y₁ ▿_d y₂ is also contained in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} , and can be aligned with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} in such a way to match the blocks b₁ and b₂ of y₁ and y₂ with the corresponding blocks in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} . Moreover, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$f (y_1 \bigtriangledown_d y_2) \geq f ({ \cal M} (x)) = f (x)$$ \end{document} . However, since y₁ ▿_d y₂ contains both \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ \ell_j \ldots s_j - 1 ]$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ s_j + 1 \ldots r_j ]$$ \end{document} , it contains also \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x^* [ \ell_j \ldots r_j ]$$ \end{document} , which, by the decomposition process, has frequency equal to f (x). Therefore, f (y₁ ▿_d y₂) ≤ f (x), and the theorem follows since f (y₁ ▿_d y₂) = f (x). ▪

In essence, Theorem 1 guarantees that we can find any maximal dense motif x either within \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (b)$$ \end{document} , for some maximal solid block b, or by d-fusing two higher-frequency maximal dense motifs y₁, y₂, for some d, finding \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$z = { \cal M} (y_1 \bigtriangledown_d y_2)$$ \end{document} and then possibly “trimming” z on both sides to obtain x. Also, the theorem shows that in the latter case the trimmed sequence must contain at least one maximal solid block b₁ of y₁ and one maximal solid block b₂ of y₂. Moreover, we can disregard those d-fusions y₁ ▿_d y₂ for which no pair of maximal solid blocks b₁ of y₁ and b₂ of y₂ exists such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$b_1 \circ^{ \hat{d}} b_2$$ \end{document} is contained in y₁ ▿_d y₂ for some \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\hat{d} > 0$$ \end{document} .

Algorithm madmx, whose pseudocode is reported in Figure 1, implements the strategy inspired by Theorem 1. It employs three (initially empty) sets previous, current, and next. In line 2, the algorithm first stores the maximal solid blocks b in s for the given frequency σ in the set blocks (see Section 2). Then, it extracts all of the appropriate maximal dense motifs from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (b)$$ \end{document} in lines 3–6, using the function extractMaximalDense, as implied by Theorem 1(a). Finally, lines 7–16 implement the strategy as implied by Theorem 1(b). (In line 10, a fusion y₁ ▿_d y₂ is considered valid if it satisfies the second property of Theorem 1(b).) In practice, the function extractMaximalDense( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} ) produces all the maximal dense patterns contained in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x)$$ \end{document} and that contains x. The second property of Theorem 1(b) guarantees that even with this restriction all the maximal dense motifs will be produced in output.

FIG. 1.
Pseudocode of algorithm madmx.

3.1. Efficient implementation of MADMX

An important issue for the efficiency of madmx is that it needs to compute the exact frequency of each generated pattern. For what concerns the fusion operation, we observe that x₁ ▿_d x₂ is valid if and only if there exist \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\ell_1 \in{ \cal L}_{x_1}$$ \end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\ell_2 \in{ \cal L}_{x_2}$$ \end{document} such that ℓ₁ − ℓ₂ = d; thus, a simple computation on the pairs \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$(\ell_1, \ell_2) \in { \cal L}_{x_1} \times { \cal L}_{x_2}$$ \end{document} is sufficient to yield the frequencies of all the valid fusions of two patterns.

Let z = x₁ ▿_d x₂. Observe that while the exact frequency of a maximal dense pattern w extracted from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (z)$$ \end{document} is equal to f (z) in case w contains z in its entirety, for a maximal dense pattern w extracted from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (z)$$ \end{document} which does not contain z we can only conclude that f (w) ≥ f (z). In this latter case, naively determining the actual frequency may be computationally expensive. Therefore, in the course of the algorithm we generate two classes of maximal dense motifs: those whose exact frequencies are known, and those for which only a lower bound to their frequencies is known.

We modify function extractMaximalDense so to label each generated motif as either final or tentative depending on whether its frequency is exact or only estimated through a lower bound. Note that for each tentative dense motif w Theorem 1 ensures that there exist two maximal dense motifs x₁, x₂ and a valid d-fusion x₁ ▿_d x₂ such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$f ({ \cal M} (x_1 \bigtriangledown_d x_2)) = f (w)$$ \end{document} . Hence, we are assured that if w is generated from x₁ and x₂ and the true frequencies of x₁ and x₂ are known, the estimated frequency for w is also the true one, even if w is labeled as tentative.

Note that algorithm madmx may generate the same maximal dense motif w several times, from fusions of different pairs of patterns. The algorithm can be modified in such a way that each time a tentative motif w is (re)generated, if its exact frequency f (w) is inferred then w is (re)labeled final, otherwise, the lower bound to f (w) is updated, if necessary. A further modification to the algorithm consists in requiring that x₁ and x₂ in Lines 8 and 9 of the pseudocode be final. Whenever the set current contains no final motifs, we can label as final the motif in current with the highest lower bound to its frequency, and continue with the generation. This is justified by the following proposition.

Proposition 2

Suppose that at some point during the execution of madmx all motifs in the set current are labeled tentative, and let w be the motif belonging to current with the highest lower bound ℓ on its frequency. Then f (w) = ℓ.

Proof

For the sake of contradiction, assume that f (w) ≠ ℓ. In particular, it must be f (w) > ℓ. From Theorem 1, we know that there must be two dense motifs x₁, y₁ with min {f (x₁), f (y₁)} > f (w) and an integer d such that w can be obtained, with its exact frequency, from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $${ \cal M} (x_1 \bigtriangledown_d y_1)$$ \end{document} . If both x₁ and y₁ have already been moved to the previous list from Algorithm madmx, we have f (w) = ℓ. The only possibility is then that at least one of x₁ and y₁ has not been moved to previous. Let x₁ be this dense motif. Then x₁ is either a tentative motif or has not been generated by any fusion yet. Applying the same reasoning to x₁, we have that there exists two dense motifs x₂, y₂ such that at least one of them (let say x₂) has not been put in previous, min {f (x₂), f (y₂)} > f (x₁) and x₁ can be obtained, with its real frequency, from a valid fusion of x₂, y₂. Iterating this reasoning, we can find a sequence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$x_1, x_2, \ldots$$ \end{document} of dense motifs such that
∀i, x_i has not been put in previous,

f (x_i+₁) > f (x_i), and

x_i is derived from the fusion of x_i+₁ with another pattern.

Theorem 1 implies that this sequence must be finite, and that the last element of this sequence, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\tilde{x}$$ \end{document} , is either a solid block or can be found in the maximal extension of a solid block. Therefore, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\tilde{x}$$ \end{document} has been generated by the algorithm (lines 3–5) with its correct frequency, thus it is in previous, that is a contradiction. ▪

A crude upper bound on the running time of madmx can be derived by observing that, for each pair of dense maximal motifs in output, the time spent during all the operations concerning that pair is (naively) O (n³), where n is the length of the input string. If P patterns are produced in output, the overall time complexity is O (n³P²).

4. Experimental Validation of MADMX

We developed a prototype-based implementation of madmx in C++ also including an additional feature which eliminates, from the set of initial maximal solid blocks, those shorter than a given threshold min_ℓ. The purpose of the latter heuristics is to speed up motif generation driving it towards the discovery of (possibly) more significant motifs, with the exclusion of spurious, low-complexity ones. (The code is available for download at www.dei.unipd.it/wdyn/?IDsezione=4534.)

We performed two classes of experiments to evaluate how significant is the set of motifs found using our approach. The first class of experiments is described in Section 4.1. It compares the motifs extracted by madmx with the known biological repetitions that are available in RepBase (Jurka et al., 2005)—a very popular genomic database—using the repeatmasker tool described in Smit et al. (1996). The second class of experiments is described in Section 4.2 and aims at comparing the motifs extracted by madmx with those extracted by varun using the same z-score metric employed in Apostolico et al. (2009) for assessing their relative statistical significance.

4.1. Evaluating significance by known biological repetitions

RepBase (Jurka et al., 2005) is one of the largest repositories of prototypic sequences representing repetitive dna from different eukaryotic species, collected in several different ways. RepBase is used as a reference collection for masking and annotating repetitive dna through popular tools such as repeatmasker (Smit et al., 1996). The latter screens an input dna sequence s for simple repeats and low complexity portions, and interspersed repeats using RepBase. Sequence comparisons are performed through Smith-Waterman scoring (Smith and Waterman, 1981). repeatmasker returns a detailed annotation of the repeats occurring in s, and a modified version of s in which all of the annotated repeats are masked by a special symbol (N or X). With the current version of RepBase, on average, almost 50% of a human genomic dna sequence will be masked by the program (Smit et al., 1996).

Most of the interspersed repeats found by repeatmasker belong to the families called sine/alu and line/l1: the former are Short INterspersed Elements that are repetitive in the dna of eukaryotic genomes (the Alu family in the human genome); the latter are Long Interspersed Nucleotide Elements, which are typically highly repeated sequences of 6K–8K bps, containing rna polymerase II promoters. The line/l1 family forms about 15% of the human genome.

We have conducted an experimental study using madmx and repeatmasker on Human Glutamate Metabotropic Receptors hgmr 1 (410277 bps) and hgmr 5 (91243 bps) as input sequences. We have downloaded the sequences from the March 2006 release of the UCSC Genome database (http://genome.ucsc.edu), with genomic coordinates hg18_dna range=chr6:146385472–146805427 (hgmr 1) and hg18_dna range=chr11:87872389–88443761 (hgmr 5). repeatmasker version was open-3.2.7, sensitive mode, with the query species assumed to be homologous; it ran using blastp version 2.0a19MP-WashU, and RepBase update 20090120.

The experiments to assess the biological significance of the maximal dense motifs extracted by madmx involved three separate stages.

In the first stage, we ran repeatmasker on the input sequences hgmr 1 and hgmr 5, searching for interspersed repeats using RepBase. We considered the summary (tbl file) provided by repeatmasker and one of its output files (.out) containing the list of found repeats: for each occurrence, the .out file provides the substring \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$s [ i \ldots j ]$$ \end{document} of the input sequence s which is locally aligned with (a substring of ) the repeat and is annotated with extra information (which part of the repeat is aligned, its Smith-Waterman score, and so on).

In the second stage, we ran madmx on the same DNA sequences, with density threshold ρ = 0.8, frequency threshold σ = 4, and min_ℓ = 15. In order to filter out simple repeats and low complexity portions, which are dealt with by repeatmasker without resorting to RepBase, we modified madmx eliminating periodic maximal solid blocks (with short periods), which are the seeds of simple repeats. Then, we identified the occurrences of the motifs returned by madmx in the input sequences, using repeatmasker as a pattern matching tool (i.e., replacing RepBase with the set of motifs returned by madmx as the database of known repeats). The underlying idea behind this use of repeatmasker was to employ the same local alignment algorithms, so to make the comparison fairer.

In the third stage, we cross-checked the intervals associated with the occurrences of the RepBase repeats against those associated with the occurrences of our motifs. Specifically, we mapped the motif occurrences in s (seen as intervals \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$[ i^{ \prime} \ldots j^{ \prime} ]$$ \end{document} found by madmx) into the repeats (seen as intervals \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$[ i \ldots j ]$$ \end{document} found by repeatmasker) using an approximate notion of interval inclusion. Specifically, a motif occurrence \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$[ i^{ \prime} \ldots j^{ \prime} ]$$ \end{document} is mapped into a repeat \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$[ i \ldots j ]$$ \end{document} whenever [i′, j′] ⊆ [i − δ, j + δ] and [i, j] ⊆ [i′ − δ, j′ + δ], for a very small constant δ. Surprisingly, through the above mapping madmx was able to identify and characterize all of the intervals of the known sine/alu repeats in hgmr 1 and hgmr 5 (respectively, 56 repeats plus an extra unclassified one for hgmr 1, and 20 repeats plus an extra unclassified one for hgmr 5). The remaining occurrences of the motifs permitted to identify 29 repeats out of 78 of the line/l1 family in hgmr 1.

4.2. Evaluating significance by statistical z-score ranking

The z-score is the measure of the distance in standard deviations of the outcome of a random variable from its expectation. Consider a dna sequence s of length n as if it were generated by a stationary i.i.d. source with equiprobable symbols. An approximation to the z-score for a motif of length m that contains c solid characters and appears f times in s is given by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$Z = \frac { f - (n - m + 1) \times p } { \sqrt { (n - m + 1) \times p \times (1 - p) \, } }$$ \end{document} , where p = (1/4)^c. This metric was used in Apostolico et al. (2009) to assess the significance of the motifs extracted by varun and to rank them in the output.

We employed the code for varun provided by the authors to extract the rigid motifs from the dna sequences analyzed in Apostolico et al. (2009). We then ran madmx on the same sequences (provided to us by the authors of Apostolico et al., 2009) using the same frequency parameters, and setting the minimum density threshold ρ in such a way to obtain a comparable yet smaller output size. In this fashion, we tested the ability of madmx to produce a succinct yet significant set of motifs, by virtue of its more flexible notion of density.

The results are shown in Table 1. For varun we used D = 1, thus allowing at most one don't care between two solid characters, and ran madmx with min_ℓ = 1, so to obtain the complete family of maximal dense motifs. In the table, there is a row for each sequence (identified in the first column). Each sequence, whose total length is reported in the second column, is obtained as the concatenation of a number of smaller subsequences, reported in the third column. On each sequence, both tools were run with the same frequency threshold σ, and the table reports for both the output size in terms of the number of motifs returned and the execution time in seconds. Also, for madmx, the table reports the density threshold ρ used in each experiment.

Table 1.
Results of the Comparison with varun

varun madmx Best top-m z-scores

Name Length No. σ |Output| Time ρ |Output| Time m = 10 m = 50 m = 100 m* \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\hat{m}$$\end{document}

ace2 500 1 2 1866 3s 0.7 1762 18s 10 50 100 1571 1067

ap1 500 1 2 1555 1s 0.7 1304 5s 10 50 100 392 13

gal4 3000 6 4 9764 12s 0.67 7606 67s 10 49 99 16 16

gal4^() 3000 6 4 9764 12s 0.65 11733 191s 10 50 100 9764 301

uasgaba 1000 2 2 4586 30s 0.70 4194 90s 10 50 100 175 175

For each experiment, we compared the best top-m z-scores, with m = 10, 50, and 100, as follows. Note that, in general, the top-m motifs found by madmx and varun differ. Thus, we let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$z_{M}^{i}$$ \end{document} (resp., \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$z_{V}^{i}$$ \end{document} ) be the z-score of the ith motif in decreasing z-score order obtained by madmx (resp., varun). For each m, the table reports how many times it was \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$z_{M}^{i} > z_{V}^{i}$$ \end{document} , for 1 ≤ i ≤ m. Also, column m (resp., column \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$\hat{m}$$ \end{document} ) gives the maximum m such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$z_{M}^{i} \geq z_{V}^{i}$$ \end{document} (resp., \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6} \begin{document} $$z_{M}^{i} > z_{V}^{i}$$ \end{document} ) for every 1 ≤ i ≤ m.

Even when madmx is calibrated to yield a slightly smaller output, the quality of the motifs extracted, as measured by the z-score, is higher than those output by varun. Indeed, for sequences ace2 and uasgaba, a very large prefix of the top-ranked motifs extracted by madmx features strictly greater z-scores of the corresponding top-ranked ones extracted by varun. For all of the four sequences, at least the thirteen top-ranked motifs have this property. To shed light on the slightly worse performance of madmx on gal4, we re-ran madmx with a different density threshold, so to obtain a slightly larger output (see row gal4^()). In this case, the top-301 motifs extracted by madmx have z-score strictly greater than the corresponding motifs extracted by varun, while the execution time still remains acceptable.

For all runs, the top z-score of a motif discovered by madmx is considerably higher than the one returned by varun. Specifically, on ace2 our best z-score is 387,763 versus 12,027 of varun; on ap1, we have 12,027 versus 1,490; on gal4 it is 75 versus 28; on gal4^() it is 150 versus 28; on uasgaba we have 134,532 versus 67,059. This reflects the high selectivity of madmx, which is to be attributed mostly to adoption of a more flexible density constraint.

We must remark that madmx (in its current nonoptimized version) is slower than varun, but it still runs in time acceptable from the point of view of a user. To further investigate the tradeoff between execution time and significance of the discovered motifs, we repeated the experiments running madmx with min_ℓ = 2 and ρ = 0.65, for all sequences. The running time of madmx was almost halved, while the small output produced still featured high quality. In fact, for sequences ace2, ap1, and uasgaba the top-100 motifs extracted by madmx have z-score greater or equal than the corresponding ones returned by varun.

We also have attempted a comparison between varun and madmx on longer sequences (such as hgmr 1) at higher frequencies (since, unfortunately, varun does not seem to be able to handle low frequencies on very long sequences). Even allowing a higher number of don't cares between solid characters (D = 2) for the motifs of varun, all of the top-m z-scores featured by the motifs extracted by madmx are greater than or equal to the corresponding scores in the ranking of varun, with m reaching the size of varun's output. In fairness, we remark that varun was designed to work at its best on protein sequences, while madmx's main target are dna sequences. Hence, these two tools should be regarded as complementary. Moreover, varun has the advantage of retrieving flexible motifs, while madmx focuses only on rigid ones.

5. Conclusion

In this article, we introduced the notion of density to single out the most relevant motifs during pattern discovery in biological sequences. We showed how to perform the efficient extraction of maximal dense motifs and experimented the corresponding software tool madmx on real data sets. The efficiency of our approach relies on the newly defined operation of fusion, which avoids the generation of nonmaximal motifs during the intermediate steps of the extraction. The experimental results give evidence of the efficiency and the quality of the motifs returned by madmx.

			varun	madmx	Best top-m z-scores
ace2	500	1	2	1866	3s	0.7	1762	18s	10	50	100	1571	1067
ap1	500	1	2	1555	1s	0.7	1304	5s	10	50	100	392	13
gal4	3000	6	4	9764	12s	0.67	7606	67s	10	49	99	16	16
gal4^(*)	3000	6	4	9764	12s	0.65	11733	191s	10	50	100	9764	301
uasgaba	1000	2	2	4586	30s	0.70	4194	90s	10	50	100	175	175

Footnotes

Acknowledgments

We wish to thank Alberto Apostolico and Matteo Comin for providing the code and the sequences used in , and giving valuable insights on varun; Ben Raphael for suggesting the use of repeatmasker; and Roberta Mazzucco and Francesco Peruch for coding madmx. A preliminary version of this work has been presented in WABI 2010. Support for R.G., A.P., N.P., and G.P. was provided, in part, by MIUR of Italy under Project AlgoDEEP prot. 2008TFBWL4. Support for A.P. and G.P. was provided, in part, by the University of Padova under the Strategic Project STPD08JA32 and Project CPDA099949/09. Support for E.U. was provided, in part, by NSF awards IIS-0325838 and DMI-0600384, and ONR Award N000140610607. Part of this work was done while F.V. was affiliated to Dipartimento di Ingegneria dell'Informazione, Università di Padova, Italy.

Disclosure Statement

No competing financial interests exist.

References

Agrawal

, Srikant

1994. Fast algorithms for mining association rules. Proc. 20th VLDB, 487–499.

Apostolico

, Comin

, Parida

2009. VARUN: discovering extensible motifs under saturation constraints. IEEE Trans. Comput. Biol. Bioinform. http://doi.ieeecomputersociety.org/10.1109/TCBB.2008.123.

Apostolico

, Parida

2004. Incremental paradigms of motif discovery. J. Comput. Biol., 11:15–25.

Apostolico

, Tagliacollo

2007. Optimal offline extraction of irredundant motif bases. Lect. Notes Comput. Sci., 4598:360–371.

Apostolico

, Tagliacollo

2008. Incremental discovery of the irredundant motif bases for all suffixes of a string in O (n² log n) time. Theor. Comput. Sci., 408:106–115.

Arimura

, Uno

2008. Mining maximal flexible patterns in a sequence. Lect. Notes Comput. Sci., 4914:307–317.

Jurka

, Kapitonov

V.V.

, Pavlicek

et al. 2005. Repbase update: a database of eukaryotic repetitive elements. Cytogenet. Genome Res., 110:462–467.

Morris

, Nicolas

, Ukkonen

2008. On the complexity of finding gapped motifs. CoRR abs/0802.0314.

Parida

2000. Some results on flexible-pattern discovery. Lect. Notes Comput. Sci., 1848:33–45.

10.

Parida

2008. Pattern Discovery in Bioinformatics. Chapman & Hall/CRC: Boca Raton, FL.

11.

Pisanti

2002. Segment-based distances and similarities in genomic sequences [Ph.D. dissertation] University of Pisa: Italy.

12.

Pisanti

, Crochemore

, Grossi

et al. 2005. Bases of motifs for generating repeated patterns with wild cards. IEEE Trans. Comput. Biol. Bioinform., 2:40–50.

13.

Rigoutsos

, Floratos

1998. Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics, 14:55–67.

14.

Saha

, Bridges

, Magbanua

Z.V.

et al. 2008. Empirical comparison of ab initio repeat finding programs. Nucleic Acids Res., 36:2284–2294.

15.

Smit

A.F.A.

, Hubley

, Green

1996. RepeatMasker Open-3.0. www.repeatmasker.org. 2011 January 15.

16.

Smith

T.F.

, Waterman

M.S.

1981. Identification of common molecular subsequences. J. Mol. Biol., 147:195–197.

17.

Ukkonen

2007. Structural analysis of gapped motifs of a string. Lect. Notes Comput. Sci., 4708:681–690.

			varun			madmx			Best top-m z-scores
Name	Length	No.	σ	\|Output\|	Time	ρ	\|Output\|	Time	m = 10	m = 50	m = 100	m*	\documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\hat{m}$$\end{document}
ace2	500	1	2	1866	3s	0.7	1762	18s	10	50	100	1571	1067
ap1	500	1	2	1555	1s	0.7	1304	5s	10	50	100	392	13
gal4	3000	6	4	9764	12s	0.67	7606	67s	10	49	99	16	16
gal4^(*)	3000	6	4	9764	12s	0.65	11733	191s	10	50	100	9764	301
uasgaba	1000	2	2	4586	30s	0.70	4194	90s	10	50	100	175	175