Efficient Computation of Approximate Gene Clusters Based on Reference Occurrences

Abstract

Whole genome comparison based on the analysis of gene cluster conservation has become a popular approach in comparative genomics. While gene order and gene content as a whole randomize over time, it is observed that certain groups of genes which are often functionally related remain co-located across species. However, the conservation is usually not perfect which turns the identification of these structures, often referred to as approximate gene clusters, into a challenging task. In this article, we present an efficient set distance based approach that computes approximate gene clusters by means of reference occurrences. We show that it yields highly comparable results to the corresponding non-reference based approach, while its polynomial runtime allows for approximate gene cluster detection in parameter ranges that used to be feasible only with simpler, e.g., max-gap based, gene cluster models. To illustrate further the performance and predictive power of our algorithm, we compare it to a state-of-the art approach for max-gap gene cluster computation.

1. Introduction

With the increasing availability of completely sequenced and assembled genomes, gene-order based comparisons of whole genomes have recently become an important field in comparative genomics. It is well known that genomes evolve not only on the level of nucleotide sequence but also by means of large-scale rearrangements operations, such as inversions and transpositions, as well as changes of the gene content. Focusing on this large-scale structure, genomes are usually modeled as strings of integers so that genes belonging to the same gene family are represented by the same integer. If no selective pressure was acting on whole genome evolution, gene order and gene content would randomize over time. In reality, we observe a low overall gene order conservation between species that is contrasted by a number of small, well-conserved segments, often referred to as gene clusters (Fig. 1).

FIG. 1.

Visualization of gene order conservation between the two γ-proteobacteria Escherichia coli (4183 genes) and Buchnera aphidicola (564 genes). The boxes indicate areas of conserved gene content/gene order.

Such local aberrations from genome randomization are known to provide highly informative signals for functional analysis (Price et al., 2005; Rogozin et al., 2004; Wolf et al., 2001; Yanai et al., 2002; Huynen and Snel, 2000; Overbeek et al., 1999; Tamames et al., 1997). When comparing a large number of genomes, the identification of these structures can be a challenging task, as conservation patterns may be highly variable across species. Due to micro-rearrangements, gene order can vary across cluster occurrences, and due to gene insertions and gene losses, cluster occurrences may be interrupted by genes that do not belong to the cluster, and contain only a subset of the clustered genes. To cope with such variations, different approximate gene cluster models have been proposed in the last years. We briefly review the most common ones and illustrate their predictive power in Figure 2.

FIG. 2.

Approximate occurrences of an artificial gene cluster predicted under different gene cluster models. Gray-shaded genes belong to the predicted cluster; dotted genes indicate intermitting genes; the remaining genes are located outside the predicted cluster occurrences. (a) Using the common intervals based model, the cluster is not detected. Only for a subset of the genomes, parts of the cluster are detected, e.g., genes 1, 2, and 3 are clustered in genomes 1 and 2. (b) For a gap size of at least four, the max-gap approach clusters genes 2 and 7, the only ones present in all four occurrences. (c) Under the median gene cluster model, genes 1, 2, 3, 4, 5, 6, 7, and 8 form a cluster that underwent in total six gene losses, one gene insertion and one neutral gene duplication that are distributed over its four approximate occurrences. (d) The predicted median gene cluster has no reference occurrence, but the same approximate occurrences are reported when choosing the occurrence in genome 1 as reference. However, due to the absence of gene 8 there, three additional insertions are needed in the other occurrences.

One of the first formal gene cluster models are common intervals (Uno and Yagiura, 2000; Heber and Stoye, 2001a,b; Schmidt and Stoye, 2004; Didier et al., 2007; Didier, 2003), which allow for variable gene order and multiple gene copies within cluster occurrences, but not for differences in the set of contained genes. Due to this restriction, the computation of gene clusters under the common intervals model runs in polynomial time with respect to the maximum genome length n.

To represent more diverse conservation patterns, the so-called max-gap clusters (Bergeron et al., 2002; He and Goldwasser, 2005), also known as gene teams or homology teams, were developed. This widely used gene cluster model covers gene insertions but lacks a proper treatment of incomplete conservation of gene content. Its basic idea is to allow for an arbitrary number of gaps in every cluster occurrence, each up to a fixed length, that can be filled with intermitting genes. Gene losses are treated only indirectly: Any gene that is lost in a cluster occurrence is counted as an intermitting gene in those occurrences where it is still present. Therefore, the set of genes representing a gene cluster reduces to a minimal consensus: the genes that occur in all cluster occurrences. In consequence, it may become necessary to tolerate artificially high gap sizes to bridge seemingly intermitting genes. These phenomena are illustrated in Figure 2b. The asymptotic complexity of identifying max-gap clusters increases exponentially with the number of compared genomes, but practical running times were shown to be feasible (Ling et al., 2009).

To improve the treatment of gene losses, a number of set distance based models arose recently, for example, median gene clusters (Rahmann and Klau, 2006; Böcker et al., 2009), that extend the concept of common intervals towards approximate conservation of gene content. The basic idea is to define a maximum distance δ between a consensus gene set and its approximate occurrences which can be freely distributed over gene losses and insertions located anywhere in the approximate cluster occurrences. A reported gene cluster is not a minimal consensus but a set of genes that is optimized in the sense that the total distance to its approximate occurrences is minimized. In principle, this approach allows for the detection of gene clusters with diverse conservation patterns in a large number of genomes. Moreover, it is tolerant of errors in gene homology assignment which is an essential preprocessing step of most gene cluster detection tools. In some cases, it may even give hints at overlooked homologies; for example, in Figure 2c, the replacement of gene 3 with gene 9 in the fourth genome may suggest that the two genes are in fact unrecognized homologs. The cost for these generalizations is a search space that grows exponentially, either with the distance threshold δ (Rahmann and Klau, 2006) or the number of compared genomes k (Böcker et al., 2009). Practical computation times are feasible for many, but not all interesting parameter ranges. An alternative set distance based approach to gene cluster modeling are center gene clusters which were also briefly introduced in Böcker et al. (2009). This model differs from median gene clusters in the way the consensus gene set of approximate cluster occurrences is defined. It is based on minimizing pairwise distances, not the total distance. The complexity issue described above is not affected by this modification. In both models, the exponential search space is due to the optimality criterion imposed on the consensus gene set. However, if a gene cluster is not represented by an optimal consensus but a closely related set with a reference occurrence in the given genomes (i.e., one without intermitting or missing genes), the search space becomes polynomially bounded. An example of such a reference based gene cluster can be seen in Figure 2d. The computation of reference based conservation patterns was studied by Amir et al. (2007). However, it is possible to construct a counter example for which their graph-based, O(kn³ + output size) time and O(kn³) space, algorithm does not detect the complete solution set. (For details, see the Appendix of this article.)

In this article, we show that the complete set of reference-based approximate gene cluster occurrences can be identified in time O(k²n²(δ + 1)²), δ ≪ n, using O(kn²) space. To assess the relevance of the reference-based gene cluster model and the performance of our algorithm, we compare it to the center gene cluster model and to mcmusec (Ling et al., 2009), the state-of-the-art approach to max-gap gene cluster computation.

2. Basic Definitions And Notation

We model a genome as a string over a finite alphabet \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$\Sigma = \{ 1, \ldots, \sigma \}$$\end{document} of gene family ids, such that genes belonging to the same homology family are represented by the same integer. Given a string S, we denote its length by |S| and refer to its ith character by S[i], 1 ≤ i ≤ |S|. We say that character \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \in \Sigma$$\end{document} occurs at position i in S if and only if S[i] = c. To capture the character content of S regardless of sequential arrangement and multiple character occurrences, we define the character set of S as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S) = \{ S [i] | 1 \leq i \leq | S | \}$$\end{document} . To simplify notation, we assume that a string S is extended on both ends by a terminal character \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$0 \notin \Sigma$$\end{document} , i.e. S[0] = S[|S| + 1] = 0. We define S[i, j] to be the substring of S that starts with its ith and ends with its jth character, 1 ≤ i ≤ j ≤ |S|. The corresponding index interval [i, j] is called a location of C ⊆ Σ if and only if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$C = { \cal CS} (S [i, j])$$\end{document} . We distinguish different types of intervals in a string S. We call an interval [i, j] with j ≥ i left-maximal if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [i - 1] \notin { \cal CS} (S [i, j])$$\end{document} , right-maximal if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [j + 1] \notin { \cal CS} (S [i, j])$$\end{document} and maximal if it is both left- and right-maximal. Given a set of strings \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal S} = \{ S_1, \ldots, S_k \}$$\end{document} , k ≥ 2, we call a k-tuple of maximal intervals ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$[i_1, j_1], \ldots, [i_k, j_k]$$\end{document} ) common intervals of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal S}$$\end{document} if and only if there is a C ⊆ Σ with: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align*} C = { \cal CS} (S_1 [i_1, j_1]) = \ldots = { \cal CS} (S_k [i_k, j_k]). \end{align*} \end{document}

Such character sets correspond to clusters with perfectly conserved gene content. The above definitions are illustrated in Figure 3.

FIG. 3.

Each of the gray-shaded intervals is maximal and a location of the character set {4, 6, 8, 9}. The interval combinations ([8, 11], [1, 4], [7, 11]) and ([8, 11], [8, 11], [7, 11]) are common intervals of S₁, S₂ and S₃.

To quantify differences in the gene content of approximate gene cluster occurrences, we use the symmetric set distance, which defines the distance between two sets C and C′ as the cardinality of their symmetric difference: D(C, C′) = |C \C′| + |C′\C|. This measure constitutes a metric and therefore meets all intuitive notions of a distance measure such as validity of the triangle inequality. We extend the concept of character set locations towards approximate conservation: Given an integer δ ≥ 0, we say an interval [i, j] in a string S is a δ-location of a character set C if and only if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D (C, { \cal CS} (S [i, j])) \leq \delta$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$C \cap { \cal CS} (S [i, j]) \neq \emptyset$$\end{document} . The number of δ-locations of a C ⊆ Σ can grow quadratically with the string length n, even for δ = 0. This means that a gene cluster can not only have a large number of approximate occurrences, but also many overlapping ones. To avoid this redundancy, we introduce an optimality criterion for placing approximate cluster occurrences in a genomic area. Like in the definition of common intervals, we require interval maximality. Also, it is intuitive to forbid intervals that start/end next to a gene belonging to the gene cluster. Clearly, extending the interval to incorporate this gene yields a better approximate occurrence. Finally, it is not desirable to place an approximate occurrence such that one or both of its border genes do not belong to the gene cluster. Excluding such genes increases the compactness of the occurrence. However, to avoid conflict with interval maximality, we relax this condition slightly and allow such genes if they are copies of an intermitting gene occurring more centrally in the approximate occurrence. For an illustration of these concepts, assume we are given a gene set C = {1, 4, 8, 9}, the genomes shown in Figure 3 and a distance threshold δ = 2. Interval [8, 11] in genome S₂ defines a δ-location of C, but this approximate occurrence can be improved by extending the borders to incorporate gene “1” on both ends. Interval [1, 4] also defines a non-optimal δ-location of C. It can be improved by excluding the left-most gene. For interval [7, 11] in genome S₃ the compactness criterion conflicts with interval maximality. If we exclude gene “6” at position 11, we get a more compact occurrence, but the interval would be no longer maximal. Therefore, we prefer interval [7, 11] over [7, 10] when it comes to finding the optimal placement of C in this region of genome S₃.

To formalize these optimality objectives, we define two subtypes of maximal intervals: Given a character set C, we say a maximal interval [i, j] in S with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S [i, j]) \cap C \neq \emptyset$$\end{document} is closed with respect to C, or C-closed for short, if and only if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [i - 1] \notin C$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [j + 1] \notin C$$\end{document} . For the second subtype of maximal intervals, we define the left-most essential position i* of [i, j] with respect to C as the smallest index i′, i ≤ i′ ≤ j, such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [i^{ \prime}] \in C$$\end{document} . Analogously, we define the right-most essential position j* of [i, j] with respect to C as the largest index j′, i ≤ j′ ≤ j, such that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [j^{ \prime}] \in C$$\end{document} . Interval [i*, j*] is called the C-essential subinterval of [i, j]. The characters at positions i* and j* are called left-most essential character, respectively right-most essential character, with respect to C. The second subtype of maximal intervals comprises all maximal intervals [i, j] with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S [i, j]) \cap C \neq \emptyset$$\end{document} for which \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S [i, j]) = { \cal CS} (S [i^*, j^*])$$\end{document} holds. Maximal intervals fulfilling this property are called compact with respect to C or C-compact for short. If a maximal interval [i, j] is both closed and compact with respect to C, we call it optimal with respect to C, or C-optimal for short. A C-optimal δ-location of C is called an optimal δ-location. One can show that every region of approximate conservation is still detected:

Observation 1

Given an interval [i, j] in a string S and a character set C. If [i, j] is a δ-location of C for a δ ≥ 0, then there is also a δ-location that is C-optimal and contains the C-essential subinterval of [i, j].

Proof

We prove this by giving a construction algorithm: First, we identify the C-essential subinterval [i, j] which exists because \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S [i, j]) \cap C \neq \emptyset$$\end{document} in each δ-location [i, j] of C. Then, we extend [i, j] on both ends to [i′, j′] until \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [i^{ \prime} - 1] \notin C \cup { \cal CS} (S [i^, j^])$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [j^{ \prime} + 1] \notin C \cup { \cal CS} (S [i^, j^])$$\end{document} . This happens at the latest when the extended string boundaries are reached, i.e., i′ = 1, respectively j′ = |S|. By construction such a [i′, j′] is C-optimal and contains [i, j]. It is also a δ-location of C: By definition \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D ({ \cal CS} (S [i^, j^]), C) \leq D ({ \cal CS} (S [i, j]), C)$$\end{document} and in the following interval extension only two types of characters are added: those already contained in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S [i^, j^])$$\end{document} , which are neutral with respect to the distance to C, and elements from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$C \setminus { \cal CS} (S [i^, j^])$$\end{document} , which reduce the distance. Therefore, also the distance constraint is met. ■

The construction used in the proof of Observation 1 is illustrated in Figure 4. Optimal δ-locations can still overlap, but compared to general δ-locations there is a tighter upper bound for their number:

FIG. 4.
Construction of the C-optimal interval [4, 16] from the interval [2, 13] via its C-essential subinterval [5, 13] for C = {1, 3, 4, 6, 8, 9, 10, 11}. The distance between C and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S_1 [2, 13])$$\end{document} equals 5 and decreases to 2 when [2, 13] is replaced by [4, 16].

Observation 2

The number of optimal δ-locations of a character set C ⊆ Σ in a string S of length n is in O(n(δ + 1)) for all δ ≥ 0.

Proof

We count how many optimal δ-locations of C can have the same left-most position a in S. For that purpose, we show that every two such intervals [a, b₁] and [a, b₂], b₁ < b₂, contain a different number of characters from Σ \ C. We have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [b_1 + 1] \notin C$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [b_1 + 1] \notin { \cal CS} (S [a, b_1])$$\end{document} , while \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [b_1 + 1] \in { \cal CS} (S [a, b_2])$$\end{document} . However, it also holds that every \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \in{ \cal CS} (S [a, b_1]) \setminus C$$\end{document} is contained in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S [a, b_2])$$\end{document} . Therefore, the number of characters from Σ \ C cannot be the same in the two intervals. Moreover, a δ-location of C can have only between 0 and δ characters from Σ \ C. Thus, there are at most δ + 1 optimal δ-locations of C with left-most position a and O(n(δ + 1)) optimal δ-locations of C in the complete string. ■

We now have all the prerequisites to define the problems studied in this paper. At first, we study the detection of optimal δ-locations:

Problem 1

Given two strings S₁ and S₂ over an alphabet Σ and a distance threshold δ, find for each maximal [i, j] in S₁ all optimal δ-locations of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S_1 [i, j])$$\end{document} in S₂.

For δ = 0, this problem is equivalent to common intervals detection. For reference-based approximate gene cluster detection in multiple genomes the definition can be extended as follows:

Problem 2

Given a set of strings \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal S} = \{ S_1, \ldots, S_k \}$$\end{document} over an alphabet Σ, a distance threshold δ and a quorum parameter q ≤ k, find each C ⊆ Σ with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$C = { \cal CS} (S_{ \ell} [i_ \ell, j_ \ell])$$\end{document} for some 1 ≤ ℓ ≤ k, 1 ≤ i_ℓ ≤ j_ℓ ≤ |S_ℓ| that has δ-locations in at least q different strings and report all its optimal δ-locations in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S_1, \ldots, S_k$$\end{document} .

We call character sets of the form of C conserved reference sets and their optimal δ-locations reference-based approximate common intervals.

3. Computation Of Optimal δ-Locations

The algorithm presented in the following adopts the basic search strategy of the Connecting Intervals (CI) Algorithm for the computation of common intervals (Schmidt and Stoye, 2004). Therefore, we begin with a short review of this algorithm.

3.1. The connecting intervals algorithm

The CI Algorithm uses two static data structures that are computed in a preprocessing step: an array of length |Σ| called Pos that lists for each character \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \in \Sigma$$\end{document} its occurrences in S₂ from left to right, and a |S₂| × |S₂| table named Num that stores for every interval in S₂ how many different characters are contained, i.e., \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \sc Num} [i, j] = | { \cal CS} (S_2 [i, j]) |$$\end{document} . For an example, see Figure 5.

FIG. 5.
Example of data structures Pos and Num for alphabet Σ = {1, 2, 3, 4, 5, 6, 7} and two strings S₁ = 1 2 4 6 4 3 1 5 6 2 and S₂ = 5 7 3 2 4 3 2 6 5 3 6 4.

The basic idea of the CI Algorithm is that while going systematically through all maximal intervals [i, j] in S₁, referred to as reference intervals in the following, one iteratively generates and extends marked intervals in S₂ that consist only of characters occurring in the current reference character set \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S_1 [i, j])$$\end{document} using array Pos for their identification. An example of this procedure is given in Figure 6.

FIG. 6.
Iterative generation of reference intervals in S₁ for start position 2 and corresponding interval marking in S₂ for S₁ = 1 2 4 6 4 3 1 5 6 2 and S₂ = 5 7 3 2 4 3 2 6 5 3 6 4.

Common intervals are retrieved by comparing the character content of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S_1 [i, j])$$\end{document} and the marked intervals in S₂. Since by construction the character sets of the marked intervals are subsets of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S_1 [i, j])$$\end{document} , this can be tested by comparing their sizes, using the table Num, and the current size of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S_1 [i, j])$$\end{document} . Only those intervals in S₂ that were extended by an occurrence of the most recent element of the reference character set need to be considered for this test. Other intervals do not contain this character and thus have a smaller character set. It can be shown that this algorithm detects all common intervals in time O(n²) using O(n²) space.

3.2. Extension of the connecting intervals algorithm

The changes necessary to the CI Algorithm to compute optimal δ-locations are presented together with the pseudocode given in Algorithm 1. The presented algorithm is a further development of the first step in the median gene cluster computation scheme where for a given δ only the existence of a δ-location is tested. To solve Problem 1, we need to enumerate all optimal δ-locations.

Algorithm 1
Computation of optimal δ-locations

1: for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$i = 1, \ldots, | S_1 |$$\end{document} do

2: j ← i, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C} \leftarrow \emptyset$$\end{document}

3: while j ≤ |S₁| and [i, j] is left-maximal do

4: c ← S₁[j]

5: while [i, j] is not right-maximal do

6: j ← j + 1

7: end while

8: for each interval \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$[a, b] \in { \cal C}$$\end{document} do

9: remove [a, b] from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} unless it is an optimal δ-location of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${\cal CS} (S_1 [i, j])$$\end{document}

10: end for

11: for each position p in S₂ with S₂[p] = c do

12: mark position p in S₂

13: find unmarked positions \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$l_1, \ldots, l_{ \delta + 1}$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$r_1, \ldots, r_{ \delta + 1}$$\end{document} around p

14: for each interval [l_x + 1, r_y − 1] with 1 ≤ x, y ≤ δ + 1 do

15: if [l_x + 1, r_y − 1] is an optimal δ-location and not contained in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} then

16: add [l_x + 1, r_y − 1] to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document}

17: end if

18: end for

19: end for

20: if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C} \neq \emptyset$$\end{document} then

21: output \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$([i, j], { \cal C})$$\end{document}

22: end if

23: j ← j + 1

24: end while

25: unmark all positions in S ₂

26: end for

We adopt the iterative generation of reference intervals from the original CI Algorithm, where for a fixed i the maximal intervals starting at i are processed one after the other for increasing values of j (lines 1–7). Also the marking of intervals in S₂ that consist only of characters from the current reference character set \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$C = { \cal CS} (S_1 [i, j])$$\end{document} (line 12) is useful for this purpose: Since approximate locations need to have character sets that intersect with C, these intervals are starting points for detecting C-optimal δ-locations. However, unlike with perfect locations, it is not sufficient to consider only recently extended maximal marked intervals. For δ > 0, intervals that are partially unmarked and/or contain no occurrence of c, the character most recently added to C, can as well be C-optimal δ-locations.

However, we observe that it is not necessary to compute optimal δ-locations from scratch for every reference character set. For a fixed left border i successive reference intervals [i, j′] and [i, j] with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S_1 [i, j]) = { \cal CS} (S_1 [i, j^{ \prime}]) \cup \{ c \}$$\end{document} can share some optimal δ-locations as the example in Figure 7 shows. Therefore, we store the optimal δ-locations of one reference character set in a list \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} and test for the following reference character set which of these intervals can be inherited. Only in the case where the left border of the reference interval is shifted, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} is emptied as the generation of reference character sets starts anew (line 2). The optimal δ-locations that can be inherited to the next reference character set can be characterized as follows:

FIG. 7.
Optimal δ-locations of two successive reference character sets in two example strings S₁ and S₂ for distance threshold δ = 3. Shared optimal δ-locations are printed bold.

Observation 3

Let C, C′ ⊆ Σ with C = C′ ∪ {c} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \notin C^{ \prime}$$\end{document} . Every interval [a, b] in a string S over Σ that is an optimal δ-location of C′ is also an optimal δ-location of C if and only if either:
(i) \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \in { \cal CS} (S [a, b])$$\end{document} , or

(ii) \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \notin { \cal CS} (S [a, b])$$\end{document} , c ≠ S[a − 1], c ≠ S[b + 1] and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D (C^{ \prime}, { \cal CS} (S [a, b])) < \delta$$\end{document} .

Proof

(i) Since [a, b] is C′-optimal and contains an occurrence of c, it is C-optimal. Moreover, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D({ \cal CS} (S [a, b]), C) < D ({ \cal CS} (S [a, b]), C^{ \prime}) < \delta$$\end{document} holds. (ii) ⇒ : From [a, b] being C-closed and from \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \in C$$\end{document} follows directly that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [a - 1] \notin C$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [b + 1] \notin C$$\end{document} . Furthermore, we have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D (C, { \cal CS} (S [a, b])) \leq \delta$$\end{document} . Removing a single character from C that is not contained in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S [a, b])$$\end{document} reduces the distance by one. Therefore, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D (C^{ \prime}, { \cal CS} (S [a, b])) < \delta$$\end{document} holds. ⇐ : It follows from c ≠ S[a − 1], c ≠ S[b + 1] and C = C′ ∪ {c} that [a, b] is C-optimal. Furthermore, we have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D (C^{ \prime}, { \cal CS} (S [a, b])) < \delta$$\end{document} . Adding a single character to C′ increases this distance by at most one so that it cannot exceed δ. Therefore, [a, b] is an optimal δ-location of C. ■

Testing the elements of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} for the conditions given in Observation 3, we can remove all non-inheritable intervals (line 9). Next, we show how we can compute the optimal δ-locations that cannot be inherited from the previous reference character set. These intervals are characterized as follows:

Observation 4

Let C, C′ ⊆ Σ with C = C′ ∪ {c} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \notin C^{ \prime}$$\end{document} . Every interval [a, b] in a string S over Σ that is an optimal δ-location of C is an optimal δ-location of C′ if and only if either:
(i) \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \notin { \cal CS} (S [a, b])$$\end{document} , or

(ii) \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \in { \cal CS} (S [a, b]), D ({ \cal CS} (S [a, b]), C) < \delta$$\end{document} and ∃ p_ℓ, p, p_r : a ≤ p_ℓ < p < p_r ≤ b, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [p_{ \ell}] \in C^{ \prime}$$\end{document} , S[p] = c, and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S [p_r] \in C^{ \prime}$$\end{document} .

Proof

(i) With [a, b] being C-optimal, S[a − 1] and S[b + 1] are not in C and therefore not in C′. Due to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \notin { \cal CS} (S [a, b])$$\end{document} , [a, b] has the same left- and right-most essential positions with respect to C and C′. Thus, [a, b] is C′-optimal. Also, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D ({ \cal CS} (S [a, b]), C^{ \prime}) = D ({ \cal CS} (S [a, b]), C) - 1 \leq \delta - 1$$\end{document} holds. Hence, [a, b] is an optimal δ-location of C′.

(ii) ⇒ : From \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D ({ \cal CS} (S [a, b]), C^{ \prime}) \leq \delta$$\end{document} follows \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D ({ \cal CS} (S [a, b]), C) < \delta$$\end{document} because \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \in C$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \in { \cal CS} (S [a, b])$$\end{document} , but \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \notin C^{ \prime}$$\end{document} . Let p_ℓ and p_r be left-most/right-most essential positions of [a, b] with respect to C′. Then S[p_ℓ] ≠ c, S[p_r] ≠ c and ∃ p with p_ℓ < p < p_r and S[p] = c due to C′-optimality and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \in { \cal CS} (S [a, b])$$\end{document} . ⇐ : \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D ({ \cal CS} (S [a, b]), C^{ \prime}) \leq \delta$$\end{document} holds, as \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D ({ \cal CS} (S [a, b]), C^{ \prime}) = D ({ \cal CS} (S [a, b]), C) + 1$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D ({ \cal CS} (S [a, b]), C) < \delta$$\end{document} . It follows from p_ℓ < p < p_r that c is contained in the C′-essential subinterval of [a, b]. Since C and C′ differ only in c, [a, b] has to be C′-optimal. ■

An important consequence of this observation is that only intervals with an occurrence of c need to be considered for the computation of non-inheritable optimal δ-locations. Moreover, we can infer that there are exactly two types of them: intervals whose character set has exactly distance δ to C, and intervals with left-most and/or right-most essential character c that contain no “inner occurrences” of c, i.e., positions that are separated from both interval boundaries by characters from C other than c. Only these intervals need to be computed anew for every reference interval.

To detect these non-inheritable optimal δ-locations, we identify for each occurrence p of c in S₂ all intervals around p that contain at most δ different unmarked characters. All other intervals either contain no occurrence of c or have a distance to C greater than δ. To find these intervals, we compute positions to the left and right of p with increasing numbers x, y ≥ 1 of unmarked characters (line 13):
l_x = l_x(p) = max({l | S₂[l, p] contains x different unmarked characters} ∪ {0})

r_y = r_y(p) = min({r | S₂[p, r] contains y different unmarked characters} ∪ {|S₂| + 1}).

By definition, the substrings S₂[l_x + 1, r_y − 1] contain at most x + y − 2 different characters not occurring in S₁[i, j]. (The number is smaller if the same unmarked characters occur left and right of p.) Clearly, not all of them are C-optimal or fulfill the distance constraint, as the example in Figure 8 illustrates.

FIG. 8.
All intervals of the form [l_x + 1, r_y − 1] for p = 12, δ = 3. For intervals that are not an optimal δ-location of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S_1 [3, 11])$$\end{document} , the missing property is given.

But together, they form a superset of the C-optimal δ-locations around p. Thus, we only need to test every interval of the form [l_x + 1, r_y − 1] for being an optimal δ-location for the reference interval [i, j] and add it to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} if it passes this test and cannot be inherited from the previous reference character set (line 16).

It follows from Observations 3 and 4 that once all occurrences of c have been processed, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} contains all optimal δ-locations of C. To avoid that intervals with multiple occurrences of c are redundantly inserted, we add a rule by which only intervals in which p is the left-most occurrence of c are added to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} . From the previous considerations, it follows directly that the presented algorithm solves Problem 1.

3.3. Implementation details and data structures

Before we can analyze the time complexity of Algorithm 1, we need to have a closer look at some of the involved operations and the data structures that can be employed for their efficient implementation.

Precomputation of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$l_1, \ldots, l_{ \delta + 1}$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$r_1, \ldots, r_{ \delta + 1}$$\end{document}

We begin with the identification of unmarked positions around the newly marked positions in S₂ (line 13). An efficient approach to detect these positions was recently introduced in the context of median gene cluster computations (Böcker et al., 2009). It is based on two basic observations: First, the values l_x and r_y, 1 ≤ x, y ≤ δ + 1, can be precomputed for all positions p in S₂ once the left border i of the next class of reference intervals [i, j], j ≥ i, is fixed. Second, the changes of these values between two successive positions i follow a simple pattern that allows for an efficient update. To see this pattern, we rank the characters in the strings S₁[i,|S₁|] and S₂ by the order of their first occurrence in the concatenated string S₁[i,|S₁|]S₂, initially for i = 1, and after each shift of the left border i. Such a bijection, Rank: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal CS} (S_1 [i, | S_1 |] S_2) \to \{ 1, \ldots, | {\cal CS} (S_1 [i, | S_1 |] S_2) | \}$$\end{document} , can be generated in time O(n) which accumulates to O(n²) for a complete run of the algorithm. At the time the occurrences of c are marked in S₂, the remaining unmarked characters c′ are such that Rank[c′] > Rank[c]. When i is shifted to i + 1, the rank of all characters occurring between i and the next occurrence of c_old = S₁[i] decreases by one while the rank of c_old increases by the number of different characters between the two occurrences.

This leads to the following computation scheme: For i = 1, we simply scan the left and right neighborhood of each position p, look for the first δ + 1 different characters with a rank greater than Rank[S₂[p]] and store these values in two δ × |S₂| tables L and R such that L[p][x] = l_x(p) and R[p][y] = r_y(p) holds. Once i is shifted to i + 1, we update L and R separately for occurrences of c_old = S₁[i] and the remaining positions in S₂. For occurrences of c_old in S₂ the entries in L and R can change completely due to a possibly large change of the rank of c_old. We compute these values anew by going through S₂ once from left to right and once from right to left remembering the last δ + 1 occurrences of characters with rank greater than the new rank of c_old. If a character is read more than once, we only remember its latest occurrence. Once we reach an occurrence of c_old, we fill the corresponding entries in L (respectively, R) with the remembered positions. For all other positions, the entries in L and R change only if the rank of the occurring character is smaller than the new rank of c_old. For these positions, we need to check whether an occurrence of c_old is close enough to become an entry in L and/or R. We test this by going through S₂ once from left to right and once from right to left remembering the most recent occurrence of c_old. Once we reach a position with a character of smaller rank than the new rank of c_old, we go through its δ + 1 entries in L (respectively, R) and insert the remembered occurrence of c_old at the appropriate position. For a complete run of Algorithm 1, the time spent on keeping L and R up to date accumulates to O(n²(δ + 1)).

Identification of left- and right-most essential positions

To test the candidate intervals [l_x + 1, r_y − 1] for being C-compact, we need to know their C-essential subintervals. We observe that the left-most essential position of an interval depends only on its left border and C while the right-most essential position depends only on the right border and C. Therefore, we need to determine for each l_x and r_y only one left-most (respectively right-most) essential position that is valid for all intervals with left border l_x + 1, respectively right border r_y − 1. We precompute and update these values parallel to the values of l_x and r_y using two additional tables L′ and R′ that are of the same format as L and R. In the following, we study only the computation and maintenance of L′. (The operations dealing with R′ are analogous.)

In terms of the character ranking, this task reads as follows: L′ needs to be maintained such that for all 1 ≤ p ≤ |S₂| and 1 ≤ x ≤ δ + 1 entry L′[p][x] corresponds to the closest position to the right of L[p][x] that contains a character with rank smaller or equal to Rank[S₂[p]]. We can initialize L′ for i = 1 at no extra cost during the initialization of L which involves a scan of the neighborhood of each position in S₂. The update of L′ at positions p with S₂[p] = c_old is also straight-forward: During the scan of S₂ for updating L, we track not only the last δ + 1 non-redundant occurrences of characters with rank greater than Rank[c_old] but also for each of them the next position with rank at most Rank[c_old]. We find such a position at the latest when we reach p. Therefore, all entries L′[p][x] for S₂[p] = c_old are set during this process.

For all other p, the entries L′[p][x] can only change if Rank[c_old] > Rank[S₂[p]]. To update L′ at these positions, we precompute for all occurrences of c_old the δ + 1 next positions to the right with strictly decreasing rank lower than Rank[c_old], referred to as lower ranked neighbors in the following. These positions can be identified for all occurrences of c_old in a single scan of S₂ which takes time O(nδ). (Details are left to the reader.) Then we check for each p that is not an occurrence of c_old every L′[p][x] for being an occurrence of c_old, and replace each L′[p][x] for which this is the case by the first lower ranked neighbor of L′[p][x] that has rank at most Rank[S₂[p]]. We always find such a position among the lower ranked neighbors of L′[p][x], because there can occur no more than δ different characters with rank greater than δ between L[p][x] and p. The time spent on the complete update process of L′ is in O(nδ) which accumulates to O(n²δ) for a complete run of the algorithm. For symmetry reasons, the same is true for R′. The space consumption of L′ and R′ is in O(nδ).

Testing intervals for being optimal δ-locations

Next, we show how a candidate interval [l_x + 1, r_y − 1] can be tested for being an optimal δ-location of C (line 15). First, we test for interval maximality which is given if and only if there are no occurrences of S₂[l_x] and S₂[r_y] in S₂[l_x + 1, r_y − 1]. This test can be done in constant time if we compute in a preprocessing for every position p in S₂ the next and previous occurrence of S₂[p] in S₂ and store this information in two static arrays of length |S₂|. Every maximal interval of the form [l_x + 1, r_y − 1] is automatically C-closed, because neither S₂[l_x] nor S₂[r_y] can be contained in C. C-compactness can be tested in constant time by comparing the entries in Num for [l_x + 1, r_y − 1] and its C-essential subinterval. To test the distance constraint, we compute \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$D ({ \cal CS} (S_1 [i, j]), { \cal CS} (S_2 [l_x + 1, r_y - 1]))$$\end{document} . This is equal to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$| C | - | { \cal CS} (S_2 [l_x + 1, r_y - 1]) |$$\end{document} plus twice the number of different unmarked characters in S₂[l_x + 1, r_y − 1]. \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$| { \cal CS} (S_2 [l_x + 1, r_y - 1]) |$$\end{document} can be looked up in Num. Also the value of |C| is directly available if it is tracked during reference interval generation. If we do a systematic enumeration of candidate intervals [l_x + 1, r_y − 1], we can track also the number of unmarked characters at no extra cost, and the complete distance computation can be performed in constant time. Finally, we need to test whether [l_x + 1, r_y − 1] is already contained in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} . Rather than testing condition (ii) of Observation 4, we generate first a separate list of the newly detected C-optimal δ-locations and then merge it with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} . This can be done efficiently if both lists are sorted based on the index positions of their elements, i.e., [a₁, b₁] occurs before [a₂, b₂] if and only if a₁ < a₂ or a₁ = a₂ and b₁ < b₂. We get such a sorting automatically if the occurrences of c in S₂ and the corresponding [l_x + 1, r_y − 1] are always processed from left to right.

The second step where intervals are tested for being optimal δ-locations is in line 9 of Algorithm 1. To identify intervals in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} that contain an occurrence of c, we perform a combined iteration through \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} and Pos[c]. We iterate for the first occurrence p of c through \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} as long as we read elements [a, b] with a ≤ p. We test for each of them whether b ≥ p holds and, if this is the case, whether the interval contains an occurrence of c. As soon as we reach an element with a > p, we replace p by the next occurrence of c that is not located left of a and go on with the testing of the elements of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} until the end of either Pos[c] or \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} is reached. For the intervals that contain no occurrence of c, condition (ii) of Observation 3 can be tested in constant time if we track for each interval the distance between its character set and the current C. To achieve this, we remember the initial distance when the element is added to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} an increment it by one each time it does not contain an occurrence of the recent c.

3.4. Complexity analysis

We now have all prerequisites to analyze the complexity of Algorithm 1. It follows from the analysis of the CI Algorithm that the generation of reference intervals in S₁ and the marking of character occurrences in S₂ is in O(n²). The first interesting step in Algorithm 1 is the iteration through \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} to remove intervals that are not an optimal δ-location of the new reference character set (lines 8–10). The complete cost for the combined iterations through \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} and Pos[c] is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$O (n^2 + \sum \nolimits_{i \leq j} | { \cal C}_{i, j} |)$$\end{document} , where \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}_{i, j} = { \cal C}$$\end{document} at the time when reference interval [i, j] has been processed after line 22 of Algorithm 1. The first summand is due to the fact that every \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$c \in \Sigma$$\end{document} is O(n) times the most recent element in C and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$n \cdot \sum \nolimits_{c \in \Sigma} | { \sc Pos} [c] | = n^2$$\end{document} , while the second summand stands for the combined length of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} over all O(n²) iterations through \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} in line 8. As all elements of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} are optimal δ-locations of the previous reference character set, and since there are in total O(n²) reference intervals, it follows from Observation 1 that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$\sum \nolimits_{i \leq j} | { \cal C}_{i, j} | \in O (n^3 (\delta + 1))$$\end{document} . We will derive a better bound later.

The next step of the algorithm is the computation of the non-inherited optimal δ-locations (lines 11–19). Every position p in S₂ gets at most O(n) times newly marked, such that in total the for-loop in line 11 is executed O(n²) times. Using tables L and R, the unmarked positions \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$l_1, \ldots, l_{ \delta + 1}$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$r_1, \ldots, r_{ \delta + 1}$$\end{document} are immediately available. As we have seen above, the precomputation and maintenance of these data structures is in time O(n²(δ + 1)). It follows the processing of candidate intervals of the form [l_x + 1, r_y − 1]. For each occurrence p, there are O((δ + 1)²) many of these intervals, while there are |Pos[c]| many occurrences of each c in S₂. With every position p in S₂ being marked at most n times, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$O (n \cdot \sum \nolimits_{c \in \Sigma} (| { \sc Pos} [c] | \cdot (\delta + 1) ^2)) = O (n^2 (\delta + 1) ^2)$$\end{document} intervals are considered during the complete run of Algorithm 1. Using the data structures described above, testing a single candidate for being an optimal δ-location of C takes constant time. Finally, the list of new optimal δ-locations is merged with \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} . As both lists are sorted, the time spent on this operation equals the length of the new list plus the length of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} . Since all elements of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} are optimal δ-locations of the current C, this accumulates to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$O (\sum \nolimits_{i \leq j} | { \cal C}_{i, j} | + n^2 (\delta + 1) ^2)$$\end{document} for the complete run of the algorithm.

To find a tighter bound for \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$\sum \nolimits_{i \leq j} | { \cal C}_{i, j} |$$\end{document} , we need to estimate how many intervals are added to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} in total and how long they can persist in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} . We can infer from Observation 4 that there are exactly two types of optimal δ-locations [a, b] that can not be inherited from the previous reference interval: the first type are intervals having exactly distance δ to C, and the second type are intervals with left-most and/or right-most essential character c that contain no “inner occurrences” of c, i.e., positions that are separated from both interval boundaries by characters from C other than c.

Non-inherited optimal δ-locations of the second type are necessarily either of the form [l_x + 1, r₁ − 1] or [l₁ + 1, r_y − 1]. Only such intervals can be C-optimal intervals with right-most essential position p, respectively, left-most essential position p. Since there are only δ + 1 many values of both, x and y, for every newly marked position, only O(δ + 1) many of these intervals exist. Each of them can persist in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} for at most O(δ + 1) iterations of the algorithm. Each time, either the new character of the reference character set is present in the interval which can happen at most δ times, otherwise the distance would have been above δ in the beginning, or the new character is not present. The latter can also happen at most δ times. Afterwards, the distance is above δ due to missing characters alone. With O(n²) newly marked positions, all intervals of this type make up at most O(n²(δ + 1)²) of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$\sum \nolimits_{i \leq j} | { \cal C}_{i, j} |$$\end{document} .

Of the first interval type, we have O((δ + 1)²) many for each newly marked position and each can persist O(δ + 1) iterations in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} . However, one can show that every such interval is re-detected at least as often in line 14 of Algorithm 1 as it is inherited “silently,” i.e., without being generated anew in line 14. This is because the distance to the reference character set increases by one, each time an interval is silently inherited. Since the initial distance is δ for intervals of the first type, we have at least one re-detection for every silent inheritance. At the same time, every re-detection of an interval reduces the maximal number of newly detected optimal δ-locations by one, as the old interval occurs in line 14 as one of the O((δ + 1)²) intervals around a newly marked position p. In total, every optimal δ-location of the second type reduces the total number of intervals of this type by the same number as there are re-detections of this interval. As a consequence, also intervals of this type make up at most O(n²(δ + 1)²) of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$\sum \nolimits_{i \leq j} | { \cal C}_{i, j} |$$\end{document} . Since all elements of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C}$$\end{document} were initially either of the first or the second type, we have \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$\sum \nolimits_{i \leq j} | { \cal C}_{i, j} | \in O (n^2 (\delta + 1) ^2$$\end{document} ).

Concerning space complexity, table Num is the most costly data structure. Based on these findings, we claim the following theorem:

Theorem 1

Using the data structures described above, Algorithm 1 solves Problem 1 in time O(n²(δ + 1)²) using O(n²) space.

4. Reference-Based Approximate Common Intervals

The extension of Algorithm 1 to solve Problem 2 is straightforward. It is outlined in Algorithm 2. As the search space of conserved reference sets is limited to character sets that have a location in one of the input genomes, we can reuse the reference interval generation to traverse the search space. We only need to run it one after the other on all strings in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal S}$$\end{document} . This is because for a combination of reference-based approximate common intervals, the best conserved reference set (or possibly the only one for a small enough distance threshold) can be located in any of the k given strings.

Algorithm 2
Outline of pairwise distance constrained reference gene cluster computation in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal S} = \{ S_1, \ldots, S_k \}$$\end{document} for distance threshold δ and quorum parameter q

1: for each \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S_{ \ell} = S_1, \ldots, S_k$$\end{document} do

2: for each maximal interval [i, j] in S_ℓ do

3: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$C \leftarrow { \cal CS} (S_{ \ell} [i, j])$$\end{document}

4: if C has no location in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S_1, \ldots, S_{ \ell - 1}$$\end{document} and S_ℓ[1, i − 1] then

5: for each \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$S_{ \ell^ \prime} = S_1, \ldots, S_k$$\end{document} do

6: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C} [\ell^{ \prime}] \leftarrow$$\end{document} set of optimal δ-locations of C in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${S_{ \ell^ \prime}}$$\end{document}

7: end for

8: if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C} [\ell^{ \prime}] \neq \emptyset$$\end{document} for at least q different \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C} [\ell^{ \prime}] \in \{ { \cal C} [1], \ldots { \cal C} [k] \}$$\end{document} then

9: output C and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal C} [1], \ldots, { \cal C} [k]$$\end{document}

10: end if

11: end if

12: end for

13: end for

To compute during reference interval generation the optimal δ-locations of the current reference character set and to decide whether they are distributed over enough strings, we need to run Algorithm 1 in parallel on all k strings tracking the optimal δ-locations separately for each string. If the number of strings that contain a δ-location of the current reference character set C is at least q, we report C along with its optimal δ-locations.

The correctness of Algorithm 2 follows directly from the correctness of Algorithm 1 and the definition of Problem 2. Since we consider in total k reference genomes and need to apply Algorithm 1 for each of them in parallel on the k genomes, the time complexity for the complete algorithm accumulates to O(k²n²(δ + 1)²), while the space complexity increases to O(kn²).

5. Experimental Results

To illustrate the performance and predictive power of reference-based gene cluster computation, we compared the implementation of our algorithm named rgca with two existing approaches. At first, we analyzed what impact the requirement of a reference occurrence has on approximate gene cluster prediction. To this end, we evaluated our algorithm against the center gene cluster approach (Böcker et al., 2009; Jahn, 2010) which is non-reference based. We chose this gene cluster model over the more widely known median gene clusters. The reason for this is that center gene clusters, like our reference-based approach, impose a pairwise distance constraint between the consensus gene set and its approximate occurrences, while median gene clusters constrain the sum of these distances.

In a second experiment, we evaluated rgc on a larger dataset (that would not be feasible with non-reference based approaches) and compared it to the mcmusec algorithm (Ling et al., 2009), which is currently the fastest approach for max-gap gene cluster prediction.

5.1. Reference-based versus non-reference based approach

The experimental evaluation against the non-reference based center gene cluster approach was done on a data set of five γ-proteobacteria that are summarized in Table 1. The genomes were downloaded from the NCBI database (Pruitt et al., 2006):

Table 1.
Genome Set Used in the Comparison with the Non-Reference Based Approach

Species name refSeq number No. of genes

Buchnera aphidicola APS NC_002528 564

Escherichia coli K12 NC_000913 4183

Haemophilus influenzae Rd NC_000907 1709

Pasteurella multocida Pm70 NC_002663 2015

Xylella fastidiosa 9a5c NC_002488 2680

For grouping genes into homology families, we employed the GhostFam tool (Schmidt and Stoye, 2007). Using the standard parameters, this program distributed the 11,184 genes occurring in the studied genomes into 5086 gene families of sizes between 1 and 63. The computations were performed on an 8 × 2.6 GHz AMD Opteron 8218 Dual-Core processor with 32 GB main memory.

The main focus of our evaluation is on the impact that the requirement of a reference occurrence has on approximate gene cluster prediction. To this end, we estimated for a series of distance thresholds how many of the gene clusters detected with the center gene clusters approach, have a reference-based consensus set that complies with the same distance threshold. Center gene clusters consist like median gene clusters (Böcker et al., 2009) of a combination of approximate cluster ocurrences \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$([i_1, j_1], \ldots, [i_k, j_k])$$\end{document} and a consensus gene set C chosen such that its set distance to the approximate occurrences is minimized. The difference is that the center gene cluster model minimizes not the sum of these distances but the largest pairwise distance: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align} \max_{1 \leq \ell \leq k} D (C, { \cal CS} (S_{ \ell} [i_{ \ell}, j_{ \ell}])) \leq \max_{1 \leq \ell \leq k} D (C^{ \prime}, { \cal CS} (S_{ \ell} [i_{ \ell}, j_{ \ell}])) \ { \rm for \ all} \ C^{ \prime} \subseteq \Sigma. \end{align} \end{document}

Like median gene clusters, center gene clusters do not necessarily have a reference occurrence in the given genomes, which leads to an exponential search space size as basically all interval combinations \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$([i_1, j_1], \ldots, [i_k, j_k])$$\end{document} need to be processed. However, as with median gene clusters, filter techniques can be employed that rule out a large number of combinations without testing them explicitly if it is clear that their consensus will exceed the pre-defined distance threshold δ. (Details of the computational approach can be found in Jahn [2010].)

For the comparison, we ran the center gene cluster program with different combinations of δ and s, a minimum size threshold for the center-based consensus set. To estimate how far off the closest reference occurrence is, we computed for each “successful” interval combination \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$([i_1, j_1], \ldots, [i_k, j_k])$$\end{document} , i.e., one whose center set complies with the distance threshold δ, the smallest distance to a reference-based consensus chosen among the character sets of the k intervals: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document} \begin{align} dist = \min_{1 \leq \ell \leq k} \max_{1 \leq \ell^{ \prime} \leq k} D (CS (S_{ \ell} [i_{ \ell}, j_{ \ell}]), CS (S_{ \ell^{ \prime}} [i_{ \ell^{ \prime}}, j_{ \ell^{ \prime}}])) \end{align} \end{document}

Afterwards, we computed the fraction of successful interval combinations for which such a reference-based consensus set complies with the center distance threshold δ or with a slightly increased threshold. Results are shown in Table 2. We observe that for the vast majority of interval combinations the best reference-based consensus complies with the initial distance threshold. These center gene clusters can be directly computed with our reference based approach. The remaining combinations have a reference-based consensus that complies with a slightly increased distance threshold. In fact, one can show via the triangle inequality that δ needs to double at most to find a center gene cluster with the reference-based approach. However, to find the exact same interval combination, one could not apply the optimality criterion for the placement of approximate cluster occurrences, that we use in our algorithm to reduce output redundancy. If this is still done, the reported interval borders may differ slightly from those for the center gene cluster, as they are optimized with respect to a reference set, not the center-based consensus.

Table 2.
Computation of Approximate Gene Clusters with a Center-Based Consensus

δ = 1 δ = 2 δ = 3 δ = 4

s = 4 running time 4s 59m 16s > 24h > 24h

# non-nested interval combinations 162 2.6 · 10⁴ - -

reference-based consensus for δ 100.0% 99.8% - -

reference-based consensus for δ + 1 - 99.9% - -

reference-based consensus for δ + 2 - 100.0% - -

s = 6 running time 4s 32s 6h 0m > 24h

# non-nested interval combinations 61 856 1.6 · 10⁴ -

reference-based consensus for δ 100.0% 99.6% 97.8% -

reference-based consensus for δ + 1 - 99.9% 99.8% -

reference-based consensus for δ + 2 - 100.0% 100.0% -

s = 8 running time 3s 6s 12m 51s > 24h

# non-nested interval combinations 16 181 2560 -

reference-based consensus for δ 100.0% 100.0% 98.0% -

reference-based consensus for δ + 1 - - 100.0% -

s = 10 running time 3s 5s 14s 3h 21m

# non-nested interval combinations 9 71 1120 1.2 · 10⁴

reference-based consensus for δ 100.0% 100.0% 97.1% 90.3%

reference-based consensus for δ + 1 - - 100.0% 96.8%

reference-based consensus for δ + 2 - - - 99.4%

reference-based consensus for δ + 3 - - - 100.0%

The output size (# non-nested interval combinations) refers to the number of interval combinations that have a center-based consensus for the given distance threshold δ and are not suboptimal in the sense that they are completely nested into a larger and better conserved interval combination. For comparison with the reference based approach, the fraction of interval combinations is given that have a reference-based consensus for the initial δ or slightly higher values. No results are shown for computations that did not finish within 24 hours.

To assess the practical performance of our program for reference-based gene cluster computation, we ran rgc for all distance threshold values between 1 and 14 and a fixed minimum cluster size s = 4. As the minimum cluster size is not used in the main part of the algorithm, but only to filter afterwards too small reference gene clusters, we do not distinguish runtimes for different values of s. As we can see in Table 3, runtimes increase slowly with increasing values of δ and range from 36s to 45s. The number of detected reference-based gene clusters ranges from 339 to 5410 when counting only optimal ones that they are not nested into a larger better conserved cluster. Among the predicted clusters are many known gene clusters like the operons for ATP biosynthesis, histidine biosynthesis, as well as the cell division and cell wall biosynthesis gene cluster, but also some clusters for which no obvious classification exists. An example of such a cluster is given in Figure 9/Table 4. This cluster is also interesting for its conservation pattern: It contains several micro-rearrangements in the different cluster occurrences and shows an incomplete pattern of gene conservation. In fact, only two of the clustered genes (“101” and “319”) are conserved in all five occurrences. The latter property makes the gene cluster hard to detect with max-gap based or simpler gene cluster models.

FIG. 9.
Example of an unclassified gene cluster predicted with the reference-based gene cluster detection algorithm with distance threshold δ = 4. Annotations of the contained genes are given in Table 4.

Table 3.
Runtimes of rgc and Number of Predicted Reference-Based Gene Clusters on a Dataset of Five γ-Proteobacteria

δ = 1 δ = 2 δ = 3 δ = 4 δ = 5 δ = 6 δ = 8 δ = 10 δ = 12 δ = 14

running time 36s 36s 37s 37s 38s 38s 39s 41s 42s 45s

# non-nested RGCs (s = 4) 339 1029 3921 4382 4718 4974 5272 5349 5410 5368

Table 4.
Functional Annotation of the Genes Occurring in the Unclassified Gene Cluster Prediction Shown in Figure 9

Genome Gene ID Annotation

E. coli 324 structural component; Ribosomal proteins - modification

323 factor; Proteins - translation and modification (umk)

322 enzyme; Central intermediary metabolism: interconversions (rrf)

321 factor; Proteins - translation and modification (yaeM)

320 1-deoxy-D-xylulose 5-phosphate reductoisomerase (yaeS)

319 undecaprenyl pyrophosphate synthase

861 enzyme; Fatty acid and phosphatidic acid biosynthesis (yaeL)

657 zinc metallopeptidase (yzzY)

101 conserved protein (ompH)

1273 factor; Basic proteins - synthesis, modification

62 UDP-3-O-(3-hydroxymyristoyl)-glucosamine N-acyltransferase

P. multocida 324 RpS2

323 Tsf

322 PyrH

321 Errf

320 Dxr

319 unknown

861 CdsA

657 unknown

101 unknown

1273 Skp

62 FirA

B. aphidicola 324 30S ribosomal protein S2 (tsf)

323 elongation factor Ts (pyrH)

322 uridylate kinase (frr)

321 ribosome recycling factor (dxr)

320 1-deoxy-D-xylulose 5-phosphate reductoisomerase (uppS)

319 undecaprenyl pyrophosphate synthetase (yaeT)

101 hypothetical protein (dnaE)

X. fastidiosa 321 ribosome recycling factor

319 undecaprenyl pyrophosphate synthetase

861 phosphatidate cytidylyltransferase

320 1-deoxy-D-xylulose 5-phosphate reductoisomerase

657 conserved hypothetical protein

101 outer membrane antigen

62 UDP-3-O-(R-3-hydroxymyristoyl)-glucosamine N-acyltransferase

H. influenzae 324 ribosomal protein S2 (rpS2)

323 elongation factor Ts (tsf)

62 UDP-3-O-(3-hydroxymyristoyl)-glucosamine (lpxD)

1273 outer membrane protein, putative

101 protective surface antigen D15

657 conserved hypothetical transmembrane protein

861 CDP-diglyceride synthetase (cdsA)

319 conserved hypothetical protein

5.2. Comparison to the mcmusec algorithm on a large dataset

The comparison with mcmusec was performed on a dataset of 30 bacterial genomes listed in Table 5. For these genomes, partitioning of genes into homology families is available based on COG (Clusters of Orthologous Groups) numbers (Tatusov et al., 2003).

Table 5.
Data Set of 30 Genomes Used in the Comparison with the mcmusec Program

Species name refSeq number No of genes

Thermotoga maritima MSB8 NC_000853 1858

Haemophilus influenzae Rd KW20 NC_000907 1657

Mycoplasma genitalium G37 NC_000908 477

Mycoplasma pneumoniae M129 NC_000912 689

Escherichia coli str. K-12 substr. MG1655 NC_000913 4132

Helicobacter pylori 26695 NC_000915 1576

Aquifex aeolicus VF5 NC_000918 1529

Helicobacter pylori J99 NC_000921 1489

Chlamydophila pneumoniae CWL029 NC_000922 1052

Rickettsia prowazekii str. Madrid E NC_000963 835

Bacillus subtilis subsp. subtilis str. 168 NC_000964 4105

Borrelia burgdorferi B31 NC_001318 851

Xylella fastidiosa 9a5c NC_002488 2766

Chlamydophila pneumoniae J138 NC_002491 1069

Buchnera aphidicola str. APS (Acyrthosiphon pisum) NC_002528 564

Bacillus halodurans C-125 NC_002570 4066

Lactococcus lactis subsp. lactis Il1403 NC_002662 2321

Pasteurella multocida subsp. multocida str. Pm70 NC_002663 2015

Mycobacterium leprae TN NC_002677 1605

Mesorhizobium loti MAFF303099 NC_002678 6743

Escherichia coli O157:H7 str. Sakai NC_002695 5230

Caulobacter crescentus CB15 NC_002696 3737

Mycobacterium tuberculosis CDC1551 NC_002755 4189

Mycoplasma pulmonis UAB CTIP NC_002771 782

Treponema denticola ATCC 35405 NC_002967 2767

Streptococcus pneumoniae TIGR4 NC_003028 2105

Clostridium acetobutylicum ATCC 824 NC_003030 3672

Streptococcus pneumoniae R6 NC_003098 2043

Rickettsia conorii str. Malish 7 NC_003103 1374

Neisseria meningitidis MC58 NC_003112 2063

Neisseria meningitidis Z2491 NC_003116 2049

Yersinia pestis CO92 NC_003143 3885

Salmonella typhimurium LT2 NC_003197 4425

Listeria innocua Clip11262 NC_003212 2968

Ralstonia solanacearum GMI1000 NC_003295 3440

Fusobacterium nucleatum subsp. nucleatum ATCC 25586 NC_003454 2067

Xanthomonas axonopodis pv. citri str. 306 NC_003919 4312

Staphylococcus aureus subsp. aureus MW2 NC_003923 2632

Helicobacter hepaticus ATCC 51449 NC_004917 1875

Corynebacterium glutamicum ATCC 13032 NC_006958 3057

To evaluate the practical performance of rgc against mcmusec, we ran both programs for a variety of parameter settings. For a better understanding of the results, we first review the three parameters used in max-gap based gene cluster computation:
• minsupp: minimum number of genomes in which a gene cluster must occur

• minSize: minimum number of genes a gene cluster must contain

• maxgap: upper bound of a single gap size in a cluster occurrence

While the first two parameters each have a perfect equivalent in rgc (minsupp is identical with the quorum parameter q and minSize corresponds to a threshold s used as an optional filter in rgc to discard too small predictions), there is no such simple connection between maxgap and δ, the distance threshold used in reference-based gene cluster computation. In the max-gap model, the size of each gap in the cluster occurrences is constrained by maxgap. Therefore, the total number of intermitting genes is only indirectly constrained via the size of the occurrences. (If a cluster occurrence contains h genes from the minimum consensus set, at most h − 1 gaps can occur, each of maximum size maxgap.) As a consequence, to find the same gene clusters, the threshold may have to be chosen higher for rgc than for mcmusec. In exchange, rgc has the potential to find these clusters in a more complete form (Fig. 9), as well as additional clusters that would be overlooked by mcmusec for any parameter setting due to the minimum consensus set problem described in the Introduction.

For the performance evaluation, we conducted four series of runs, each time fixing all but one of the four parameter types: maxgap resp. δ, minsupp resp. q, minSize resp. s and the number of input genomes k. The results are given in Figure 10. Like for the small dataset, we see that the runtime of rgc grows rather moderately with δ, while the increase is much stronger in mcmusec. The quorum parameter has almost no impact on the runtime of rgc. As it is only used in the final step before a predicted gene cluster is reported, this is an expected result. In contrast, mcmusec shows a strong dependency on minsupp which can be explained by its search strategy that finds gene clusters first in a subset of the genomes and then adds step by step additional genomes. We observe the same for the minimum cluster size: In rgc, the runtime is almost independent of s, as it is, like q, only used as a filter right before a predicted gene cluster is reported. With mcmusec, we observe a stronger dependency on minSize. This has again its root in the search strategy of mcmusec, that repeatedly breaks up large cluster candidates into smaller ones until they either fulfill the cluster requirements or fall below the minimum size threshold. The number of input genomes has a large impact on the runtime of both algorithms. Since this parameter directly influences the search space size, e.g., in the rgc algorithm the number of pairwise genome comparisons increases quadratically with k, this is an expected result.

FIG. 10.
Comparison of practical runtimes of rgc and mcmusec on a dataset of 30 bacterial genomes for different parameter settings.

For the qualitative comparison of the two algorithms, we focussed on the 18 “novel” E. coli gene clusters predicted with mcmusec in a dataset of 133 COG annotated genomes (Ling et al., 2009). These clusters are novel in the sense that they are not covered by any multi-gene operon listed in the RegulonDB database (Salgado et al., 2006). Running the rgc program on the dataset with parameters δ = 7 and q = 4 and s = 2 using E. coli as a fixed reference genome, we re-detected all 18 cluster predictions, many of them in a larger and more compact form.

6. Conclusion

In this article, we presented a polynomial-time algorithm for the detection of set-distance based approximate gene clusters in multiple genomes. The limitation of the search space to polynomial size was achieved by the use of a gene cluster model that requires the existence of a reference occurrence. Unlike a previous algorithm, our algorithm detects the complete solution set of reference-based approximate common intervals.

We evaluated the relevance and performance of our approach on real genomic data sets. In an initial study, we have shown that our program yields highly comparable results to the corresponding non-reference based approach, while using only a fraction of the running time. Only for a small fraction of gene cluster predictions, the reference-based cluster version can only be found if the distance threshold is slightly increased. Since runtimes increase only slowly with δ, this is not a problem in practice. We demonstrated the algorithm performance on a large dataset in a comparison with the max-gap based mcmusec approach. In general, both algorithms perform well on a wide range of parameter settings. For both, the limiting factor seems to be the number of input genomes, and for mcmusec additionally the choice of the quorum parameter. As the number of available genomes will increase further in the near future, improvements of the algorithms may become necessary soon. Concerning the prediction quality, we have seen that our approach finds known gene clusters as well as novel putative clusters with complex conservation patterns. As the underlying gene cluster model considers both, gene insertions and gene deletions, in approximate cluster occurrences, our gene cluster predictions do not suffer from the minimum consensus set problem that keep many predicted max-gap clusters small and incomplete.

7. Appendix

A. Incompleteness of the ACI algorithm

As mentioned above, a similar approach to the detection of reference-based approximate gene clusters can be found in the literature, namely the ACI Algorithm by Amir et al. (2007). However, one can show that this graph-based O(kn³ + output size) time and O(kn³) space approach misses a certain subclass of approximate cluster locations. In the following, we sketch this algorithm as far as necessary to understand the incompleteness problem, present an example for which the algorithm fails to report the complete solution set, and categorize the subclass of δ-locations that are detected by the approach. Unfortunately, there is no apparent remedy for this problem in the ACI Algorithm, at least none that would not substantially increase the algorithmic complexity.

A.1 Search strategy of the ACI algorithm

The problem that Amir et al. (2007) intend to solve is as follows:

Problem 3

Given a set of k strings \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal S} = \{ S_1, \ldots, S_k \}$$\end{document} over an alphabet Σ and a distance threshold δ, report for each character set \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$C = { \cal CS} (S_{ \ell} [i_{ \ell}, j_{ \ell}])$$\end{document} all character sets \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$C^{ \prime} = { \cal CS} (S_{ \ell^ \prime} [i_{ \ell^ \prime}, j_{ \ell^ \prime}])$$\end{document} with D(C, C′) ≤ δ and all locations of C′.

The ACI Algorithm works in two steps: First, a graph G = (V, E) is built such that
• the set of distinct character sets \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$C = { \cal CS} (S_{ \ell} [i_{ \ell}, j_{ \ell}])$$\end{document} in \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$${ \cal S}$$\end{document} maps to the vertex set \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$V = \{ v_C | C = { \cal CS} (S_{ \ell} [i_{ \ell}, j_{ \ell}]) \}$$\end{document} , and

• edges are drawn between vertices v_C, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$v_{C^ \prime}$$\end{document} if and only if D(C, C′) = 1, labeled with the unique character in (C \ C′) ∪ (C′\ C).

By construction, this graph represents all pairs of character sets in S whose pairwise distance is exactly one. An example of such a graph is given in Figure 11. For the second step of the algorithm, recall from graph theory that a path in a graph G = (V, E) is a sequence of vertices \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$v_0, v_1, \ldots, v_m$$\end{document} such that there is an edge (v_i, v_i+₁), 0 ≤ i < m, between every pair of successive vertices, and the length of the path corresponds to the number of edges in the path. This concept is used in the ACI Algorithm to collect δ-locations from the graph: Beginning from every vertex \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$v_C \in V$$\end{document} , all paths of length δ are traversed in a depth-first manner as long as no edge label is read that occurred earlier in the path.

FIG. 11.
Graph constructed by the ACI Algorithm for the two strings S₁ = 1, 2, 3, 4, 5, 6 and S₂ = 7, 2, 5, 8. The locations of the gray-shaded character sets should be reported as δ-locations of each other for every δ ≥ 2. However, the shortest path between these vertices has length four and contains redundant edge labels. It is therefore not considered in the algorithm.

A.2 Incompleteness of the algorithm

Clearly, every vertex on the traversed paths corresponds to the character set of a δ-location of C. However, not all such character sets are detected by this method, as we can see in Figure 11: The shortest path between the two gray-shaded vertices has length four, although the distance between the corresponding character sets is only two. Therefore, the algorithm will not report the locations of the respective character sets as δ-locations of each other.

We find the same type of problem for all pairs of character sets for which at least one intermediate character set is not represented in the graph. This is no exceptional event, as it can happen with all intervals that share some characters but are not nested. Moreover, we observe that every path between the two gray-shaded vertices contains redundant edge labels, a second reason why the algorithm would not detect the connection between the two character sets.

A simple remedy against the incompleteness of the algorithm is to allow for both, longer paths and redundant edge labels. However, these changes have a negative effect on the asymptotic time complexity. In the original algorithm, the complete cost for the depth-first traversal of paths can be hidden in the output size, as all vertices on these paths are reported as δ-locations of the character set associated with the start vertex. If we allowed for longer paths to cope with missing intermediate character sets, this is no longer possible.

We leave, as an open question, whether a more sophisticated way of modifying the algorithm can be found that preserves its time complexity. For the ACI Algorithm in its current state, we can summarize our findings as follows:

Observation 5

The depth-first search of the ACI Algorithm detects for a given character set C ⊆ Σ only those character sets C′ of δ-locations for which there is a sequence of character sets \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland, xspace}\usepackage{amsmath, amsxtra}\pagestyle{empty}\DeclareMathSizes {10} {9} {7} {6}\begin{document}$$C_0, C_1, \ldots, C_d$$\end{document} , with C₀ = C, C_d = C′, d ≤ δ, and |C_i\C_i+j| + |C_i+j\C_i| = j for all j ≤ δ − i.

Species name	refSeq number	No. of genes
Buchnera aphidicola APS	NC_002528	564
Escherichia coli K12	NC_000913	4183
Haemophilus influenzae Rd	NC_000907	1709
Pasteurella multocida Pm70	NC_002663	2015
Xylella fastidiosa 9a5c	NC_002488	2680

		δ = 1	δ = 2	δ = 3	δ = 4
s = 4	running time	4s	59m 16s	> 24h	> 24h
	# non-nested interval combinations	162	2.6 · 10⁴	-	-
	reference-based consensus for δ	100.0%	99.8%	-	-
	reference-based consensus for δ + 1	-	99.9%	-	-
	reference-based consensus for δ + 2	-	100.0%	-	-
s = 6	running time	4s	32s	6h 0m	> 24h
	# non-nested interval combinations	61	856	1.6 · 10⁴	-
	reference-based consensus for δ	100.0%	99.6%	97.8%	-
	reference-based consensus for δ + 1	-	99.9%	99.8%	-
	reference-based consensus for δ + 2	-	100.0%	100.0%	-
s = 8	running time	3s	6s	12m 51s	> 24h
	# non-nested interval combinations	16	181	2560	-
	reference-based consensus for δ	100.0%	100.0%	98.0%	-
	reference-based consensus for δ + 1	-	-	100.0%	-
s = 10	running time	3s	5s	14s	3h 21m
	# non-nested interval combinations	9	71	1120	1.2 · 10⁴
	reference-based consensus for δ	100.0%	100.0%	97.1%	90.3%
	reference-based consensus for δ + 1	-	-	100.0%	96.8%
	reference-based consensus for δ + 2	-	-	-	99.4%
	reference-based consensus for δ + 3	-	-	-	100.0%

	δ = 1	δ = 2	δ = 3	δ = 4	δ = 5	δ = 6	δ = 8	δ = 10	δ = 12	δ = 14
running time	36s	36s	37s	37s	38s	38s	39s	41s	42s	45s
# non-nested RGCs (s = 4)	339	1029	3921	4382	4718	4974	5272	5349	5410	5368

Genome	Gene ID	Annotation
E. coli	324	structural component; Ribosomal proteins - modification
	323	factor; Proteins - translation and modification (umk)
	322	enzyme; Central intermediary metabolism: interconversions (rrf)
	321	factor; Proteins - translation and modification (yaeM)
	320	1-deoxy-D-xylulose 5-phosphate reductoisomerase (yaeS)
	319	undecaprenyl pyrophosphate synthase
	861	enzyme; Fatty acid and phosphatidic acid biosynthesis (yaeL)
	657	zinc metallopeptidase (yzzY)
	101	conserved protein (ompH)
	1273	factor; Basic proteins - synthesis, modification
	62	UDP-3-O-(3-hydroxymyristoyl)-glucosamine N-acyltransferase
P. multocida	324	RpS2
	323	Tsf
	322	PyrH
	321	Errf
	320	Dxr
	319	unknown
	861	CdsA
	657	unknown
	101	unknown
	1273	Skp
	62	FirA
B. aphidicola	324	30S ribosomal protein S2 (tsf)
	323	elongation factor Ts (pyrH)
	322	uridylate kinase (frr)
	321	ribosome recycling factor (dxr)
	320	1-deoxy-D-xylulose 5-phosphate reductoisomerase (uppS)
	319	undecaprenyl pyrophosphate synthetase (yaeT)
	101	hypothetical protein (dnaE)
X. fastidiosa	321	ribosome recycling factor
	319	undecaprenyl pyrophosphate synthetase
	861	phosphatidate cytidylyltransferase
	320	1-deoxy-D-xylulose 5-phosphate reductoisomerase
	657	conserved hypothetical protein
	101	outer membrane antigen
	62	UDP-3-O-(R-3-hydroxymyristoyl)-glucosamine N-acyltransferase
H. influenzae	324	ribosomal protein S2 (rpS2)
	323	elongation factor Ts (tsf)
	62	UDP-3-O-(3-hydroxymyristoyl)-glucosamine (lpxD)
	1273	outer membrane protein, putative
	101	protective surface antigen D15
	657	conserved hypothetical transmembrane protein
	861	CDP-diglyceride synthetase (cdsA)
	319	conserved hypothetical protein

Species name	refSeq number	No of genes
Thermotoga maritima MSB8	NC_000853	1858
Haemophilus influenzae Rd KW20	NC_000907	1657
Mycoplasma genitalium G37	NC_000908	477
Mycoplasma pneumoniae M129	NC_000912	689
Escherichia coli str. K-12 substr. MG1655	NC_000913	4132
Helicobacter pylori 26695	NC_000915	1576
Aquifex aeolicus VF5	NC_000918	1529
Helicobacter pylori J99	NC_000921	1489
Chlamydophila pneumoniae CWL029	NC_000922	1052
Rickettsia prowazekii str. Madrid E	NC_000963	835
Bacillus subtilis subsp. subtilis str. 168	NC_000964	4105
Borrelia burgdorferi B31	NC_001318	851
Xylella fastidiosa 9a5c	NC_002488	2766
Chlamydophila pneumoniae J138	NC_002491	1069
Buchnera aphidicola str. APS (Acyrthosiphon pisum)	NC_002528	564
Bacillus halodurans C-125	NC_002570	4066
Lactococcus lactis subsp. lactis Il1403	NC_002662	2321
Pasteurella multocida subsp. multocida str. Pm70	NC_002663	2015
Mycobacterium leprae TN	NC_002677	1605
Mesorhizobium loti MAFF303099	NC_002678	6743
Escherichia coli O157:H7 str. Sakai	NC_002695	5230
Caulobacter crescentus CB15	NC_002696	3737
Mycobacterium tuberculosis CDC1551	NC_002755	4189
Mycoplasma pulmonis UAB CTIP	NC_002771	782
Treponema denticola ATCC 35405	NC_002967	2767
Streptococcus pneumoniae TIGR4	NC_003028	2105
Clostridium acetobutylicum ATCC 824	NC_003030	3672
Streptococcus pneumoniae R6	NC_003098	2043
Rickettsia conorii str. Malish 7	NC_003103	1374
Neisseria meningitidis MC58	NC_003112	2063
Neisseria meningitidis Z2491	NC_003116	2049
Yersinia pestis CO92	NC_003143	3885
Salmonella typhimurium LT2	NC_003197	4425
Listeria innocua Clip11262	NC_003212	2968
Ralstonia solanacearum GMI1000	NC_003295	3440
Fusobacterium nucleatum subsp. nucleatum ATCC 25586	NC_003454	2067
Xanthomonas axonopodis pv. citri str. 306	NC_003919	4312
Staphylococcus aureus subsp. aureus MW2	NC_003923	2632
Helicobacter hepaticus ATCC 51449	NC_004917	1875
Corynebacterium glutamicum ATCC 13032	NC_006958	3057

Footnotes

Acknowledgments

The author would like to thank Jens Stoye for his valuable comments on a draft version of this article and Xu Ling for providing the sources of mcmusec for the experimental evaluation. This research was partially funded by DFG grant STO 431/5.

Disclosure Statement

No competing financial interests exist.

a

The program is integrated into the GECKO2 software package (available for download at: ). A stand-alone version of rgc is available upon request.

References

Amir

, Gasieniec

, Shalom

2007. Improved approximate common interval. Inf. Process. Lett., 103:142–149.

Bergeron

, Corteel

, Raffinot

2002. The algorithmic of gene teams. Lect. Notes Comput. Sci., 2452:464–476.

Böcker

, Jahn

, Mixtacki

et al. 2009. Computation of median gene clusters. J. Comput. Biol., 16:1085–1099.

Didier

2003. Common intervals of two sequences. Lect. Notes Bioinform., 2812:17–24.

Didier

, Schmidt

, Stoye

et al. 2007. Character sets of strings. J. Discr. Alg., 5:330–340.

, Goldwasser

M.H.

2005. Identifying conserved gene clusters in the presence of homology families. J. Comput. Biol., 12:638–656.

Heber

, Stoye

2001a. Algorithms for finding gene clusters. Lect. Notes Comput. Sci., 2149:252–263.

Heber

, Stoye

2001b. Finding all common intervals of k permutations. Lect. Notes Comput. Sci., 2089:207–218.

Huynen

, Snel

2000. Gene and context: integrative approaches to genome analysis. Adv. Protein Chem., 54:345–379.

10.

Jahn

2010. Approximate common intervals based gene cluster models[Ph.D. dissertation].Faculty of Technology, Bielefeld University: Bielefeld, Germany.

11.

Ling

, He

, Xin

2009. Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics, 25:571.

12.

Overbeek

, Fonstein

, D'Souza

et al. 1999. The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. USA, 96:2896–2901.

13.

Price

, Huang

, Alm

et al. 2005. A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res., 33:880.

14.

Pruitt

, Tatusova

, Maglott

2007. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35:D61–D65.

15.

Rahmann

, Klau

G.W.

2006. Integer linear programs for discovering approximate gene clusters. Lect. Notes Bioinform., 4175:298–309.

16.

Rogozin

I.B.

, Makarova

K.S.

, Wolf

Y.I.

et al. 2004. Computational approaches for the analysis of gene neighborhoods in prokaryotic genomes. Brief. Bioinform., 5:131–149.

17.

Salgado

, Gama-Castro

, Peralta-Gil

et al. 2006. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions. Nucleic Acids Res., 34:D394–D397.

18.

Schmidt

, Stoye

2004. Quadratic time algorithms for finding common intervals in two and more sequences. Lect. Notes Comput. Sci., 3109:347–358.

19.

Schmidt

, Stoye

2007. Gecko and GhostFam—rigorous and efficient gene cluster detection in prokaryotic genomes, 165–182. Bergman

Comparative Genomics. Volume 2 of Methods in Molecular Biology. Humana Press: New York.

20.

Tamames

, Casari

, Ouzounis

et al. 1997. Conserved clusters of functionally related genes in two bacterial genomes. J. Mol. Evol., 44:66–73.

21.

Tatusov

, Fedorova

, Jackson

et al. 2003. The COG database: an updated version includes eukaryotes. BMC Bioinform., 4:41.

22.

Uno

, Yagiura

2000. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica, 26:290–309.

23.

Wolf

Y.I.

, Rogozin

I.B.

, Kondraskov

A.S.

et al. 2001. Genome alignment, evolution of procaryotic genome organization, and prediction of gene function using genomic context. Genome Res., 11:356–372.

24.

Yanai

, Mellor

, DeLisi

2002. Identifying functional links between genes using conserved chromosomal proximity. Trends Genet., 18:176–179.