A Novel Technique for Detecting Putative Horizontal Gene Transfer in the Sequence Space

Abstract

Horizontal transfer (HT) is the event of a DNA sequence being transferred between species not by inheritance. This phenomenon violates the tree-like evolution of the species under study turning the trees into networks. At the sequence level, HT offers basic characteristics that enable not only clear identification and distinguishing from other sequence similarity cases but also the possibility of dating the events. We developed a novel, self-contained technique to identify relatively recent horizontal transfer elements (HTEs) in the sequences. Appropriate formalism allows one to obtain confidence values for the events detected. The technique does not rely on such problematic prerequisites as reliable phylogeny and/or statistically justified pairwise sequence alignment. In conjunction with the unique properties of HT, it gives rise to a two-level sequence similarity algorithm that, to the best of our knowledge, has not been explored. From evolutionary perspective, the novelty of the work is in the combination of small scale and large scale mutational events. The technique is employed on both simulated and real biological data. The simulation results show high capability of discriminating between HT and conserved regions. On the biological data, the method detected documented HTEs along with their exact locations in the recipient genomes. Supplementary Material is available online at www.libertonline.com/cmb.

1. Introduction

Horizontally transferred elements (HTEs) are DNA fragments that are transferred between organisms not on a vertical descent basis. The alien fragment, thus, is inserted in the recipient genome.

The phenomenon is ubiquitous among prokaryotes, implying mainly the transfer of genes (coding sequences) and is termed horizontal gene transfer (HGT) (Doolittle, 1999a; Koonin et al., 2001; Nakamura et al., 2004; Ochman et al., 2000). The existence of HTEs in non-coding sequences was less investigated; however, see Liu et al. (2004). Similarity between sequences has been primarily attributed to conservation and, thus, putative functionality (Bejerano et al., 2004; Siepel et al., 2005). However, similarity can naturally be caused by an event of horizontal transfer (HT), when not a particularly conserved sequence (say, non-coding sequence) is transferred. The very fact of the transfer in such cases may suggest some function of the transferred sequence.

HT has a fundamental evolutionary importance as it turns the traditional tree-like evolutionary history into an evolutionary network (Doolittle, 1999a,b; Wolf et al., 2002). Genetically, HT is a primary source of new genes that are acquired by bacteria and archaea and often result in adaptations to new environments and conditions (Daubin and Ochman, 2004). Therefore, identification of HT can shed light on many significant evolutionary processes that cannot be explained by the traditional tree-like approach. From the medical perspective, HT is a substantial factor by which bacteria develop resistance to antibiotics (Ambur et al., 2009; Chen and Novick, 2009; Koonin, 2009).

Currently, there are two prevailing methods for detecting HT. The phylogeny-based approach takes a relatively large set of homologous (originated from a common ancestor) coding sequences, construct their corresponding phylogeny, and contrast it to the phylogeny of their originating species. When conflicts are found between the two trees, they are reconciled by introducing HTEs (Delwiche and Palmer, 1996; Jin et al., 2007; Daubin et al., 2003; Beiko et al., 2005; Lerat et al., 2003). While this approach has the advantage of identifying relatively old events, it is incapable of coping with events residing in non-homologous regions. Moreover, the approach is based on a very stringent assumption of where to seek the events. Finally, it also requires alignment of the sequences and inferring a reliable species tree (two major problems by themselves (Linder and Warnow, 2005).

The composition-based approach contrasts genomic sequences of different compositional structure such as G + C content, dinucleotide frequencies or codon usage biases, striving to infer different origins (Karlin, 2001; Nakamura et al., 2004; Ochman et al., 2000; Ochman, 2003; Garcia-Vallve et al., 2000). This approach suffers from the fact that the species involved might share similar compositional patterns. Moreover, the length of a HTE may be too short to reliably reveal these differences. As concluded in Lawrence and Ochman (2002) and Wang (2001), “atypical G + C content and pattern of codon usage are not reliable indicators of horizontal gene transfer events.”

A recent work, DarkHorse (Podell and Gaasterland, 2007) combines these two approaches by incorporating phylogenetic information into the sequence based search. However, it aims at protein, annotated data, and is parameter dependent as we discuss below. Concluding this part, there is no gold standard for HT detection and each approach can be used either to detect unknown cases or verify rival approaches (Podell and Gaasterland, 2007).

In this work, a novel technique is introduced for identifying putative horizontally transferred sequences between two organisms. It operates in the sequence space and has no similarity to the two previously outlined approaches. It rather complements these approaches and bypasses their drawbacks. The inserted fragment “defines” a HTE-environment where similarity between the copied fragment at the donor and the inserted fragment at the recipient, is very high, whereas the corresponding flanking regions at both genomes, are base-wise unrelated (with respect to the HTE-environment; Fig. 1). Naturally, we are not interested in very old HTEs where the contrast at the HTE borders is not clearly visible. A conclusive identification of these cases is problematic. The goal is to distinguish between three types of inter genomic fragments' similarity: (1) random similarity—one that is expected to occur by chance considering the evolutionary distance between the two genomes; (2) similarity by functional conservation—within a whole family; and (3) HTEs (and rare formally similar events).

Fig. 1.

(a) Two completely identical homologous sequences. (b) After a HT event of GTTG from the upper genome to the lower. Sequences are aligned so that the HTE match, forming the HTE-environment in which the segments corresponding to the HTE at both genomes are similar (or match) and flanking sequences (in the example, GTAC vs. TGCT at the left, and CTAA vs. AATC at the right) are base-wise unrelated.

The technique proposed is not confined only to events between a donor and a direct recipient, as the donors of HTEs have not necessarily survived or their genome is not fully sequenced. Rather, it can also detect HTEs in two recipients in cases of insertion sequences (IS) or plasmids, as well as HTEs between a recipient and a close relative of the donor.

We assume a rate of point mutations that is, at the order of magnitude, approximately fixed along the time course and even across different species of the same family (Doolittle et al., 1996; Gu, 1998; Kumar and Subramanian, 2002). Although there are many biological examples of rate acceleration during evolution (Jensen-Seaman et al., 2004), these do not violate the model as it deals with different orders of magnitude. This constant rate induces a measure of dissimilarity between different species that is proportional to the time since divergence. In the mathematical derivations of this work, the simple but powerful Jukes-Cantor model of sequence evolution (JC) (Jukes and Cantor, 1969) is used. Note, that this simplification is not necessary for the detection itself, as demonstrated in the Results section, but rather for the approximate dating of events.

We now give a brief, general overview of the algorithm: Given two genomes with their lengths, one can calculate the expectancy of each identical region's length. This gives a measure of confidence as to exceptional similarity. Next matching seeds for putative HT are identified such that their size is chosen to have very low probability of having occurred by chance (based on the evolutionary distance between the two genomes). This stage filters out random similarities (but not conserved regions). We comment that filtering random similarity has also appeared in other HGT works, (Darling et al., 2004), although with no rigor. In order to uniquely detect HTEs, we rely on the statistical, as well as biological (Omelchenko et al., 2003), property that a HTE normally is not inserted exactly in its homologous (orthologous) counterpart location. This implies that the flanking regions of a HTE are non-homologous (Fig. 1), strictly distinguishing it from conserved regions. Therefore, a special sliding window algorithm is used to detect these HTE borders, searching for sharp borders (or walls). Finally, as events with non-homologous flanking regions may also be a result of some rare cases of large scale mutational events (e.g., a very conserved region is duplicated or translocated), we discard any putative HTE appearing twice (or more) in a genome.

The method was applied to a set of simulated data as well as to real bacterial genomes. The simulation studies produced a challenging data with a large proportion of conserved regions and multiple HT events. We are interested in HTEs of gene size that is too short to convey some detectable statistical signal, therefore cannot be identified by traditional sequence-based techniques relying on CpG, codon usage, or alike (Karlin, 2001; Nakamura et al., 2004; Garcia-Vallve et al., 2000). Alternatively, a simple Blast search between the genomes does detect these short regions but along with hundreds of other conserved regions. In contrast to these two later approaches, our method demonstrates high capability of detecting these events (low false negative rate) but also filters out the regions where similarity is due to simple conservation. The sensitivity of the proposed method was also checked on known documented HT cases and supported by phylogenetic data. Very accurate locations of these events were found along with other potential HTEs. Comparison to other techniques (Podell and Gaasterland, 2007; Darling et al., 2004) was also done for the task of HT detection. The code is available by request from one author (S.S.).

2. Methods

2.1. Preliminaries

We consider a (evolutionary) model in which the nucleotides along a genome are identically and independently distributed (IID). The value of the nucleotide is the state (we sometimes use just “nucleotide” to denote its state). A single mutation (or just a mutation for short) is the event of a nucleotide changing its value to different one. An evolutionary model \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\cal M}$$\end{document} models the (stochastic) process of mutations occurring at a site as a function of mutation rates α_i,j modeling the rate of transitions from state i to j, and a specified time period t. We use the transition notation in the context of Markov chains and note it has nothing to do with the type of mutation baring the same notation (Felsenstein, 2003) Given \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\cal M}$$\end{document} , mutation rates [α_i,j], and a time period t, the transition probability p_i,j from nucleotide i to j during t is uniquely defined by an appropriate function (determined by \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\cal M}$$\end{document} ). An evolutionary model \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\cal M}$$\end{document} is said to be time reversible if it is not possible to determine the direction of time given two states of a nucleotide, separated by a time period t. The JC evolutionary model is a reversible, one parameter model, postulating that at any position the rate of substitutions from one state to another, α_i,j is the same - α. Under this model, the expected number of substitutions at a site during t time units is 3αt and is sometimes denoted as (evolutionary) distance.3 The probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s$$\end{document} of seeing the same nucleotide at a position in a sequence mutating at a rate α after t time units is (note that this does not mean the nucleotide has not been substituted): \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}{{\mathbb P}}_s = {\frac {1} {4}} + {\frac {3} {4}} e^{- 4 \alpha t} \tag {1}\end{align*}\end{document}

and the probability to see a different nucleotide at a position, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_d$$\end{document} , is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}{\mathbb P}_d = 1 - {\mathbb P}_s = {\frac {3} {4}} - {\frac {3} {4}} e^{- 4 \alpha t}. \tag {2}\end{align*}\end{document}

We denote \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_d$$\end{document} as the “no transition” and “transition” probabilities resp. and we observe that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s \ge {\frac {1} {4}}$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_d \le {\frac {3} {4}}$$\end{document} .

A HT is the event of a subsequence of a genome, the donor genome, being copied and inserted at some position at another genome, the recipient genome. The right (left) border of a HT is a pair of indices indicating the right (left) endpoints of the HT segment at both genomes.

2.2. First level: perfect matches

Overview

We start here with an overview. In this part, we focus on modern HGT events, that is, recent events that have not yet been dissolved (heavily mutated) in the recipient genome. Consider the following (oversimplified) model: At time 0, some ancestral genome bifurcates into two identical genomes. Next, the two genomes start to accumulate mutations at an equal rate, constant over time, resulting in two contemporary genomes. Mutations are distributed randomly and uniformly along each genome. For the sake of clarity, we assume the following simplification that will be removed later: A position is mutated at most once over both sequences; that is, the same nucleotide is not mutated in both sequences and no two mutations at the same nucleotide at a given sequence. It can easily be seen that a mutated nucleotide cannot maintain its original state. Based on these assumptions, we can analyze the distribution of lengths of unmutated segments in both genomes.

Longest identical segment

We now formalize the idea outlined in the overview. We want to analyze the probabilities of seeing identical segments among the two genomes assuming the scenario above. Note that the time between the two genomes is twice the time since divergence so the “no transition” probability under the JC model (assuming rate α and time since divergence t) is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s = {\frac {1} {4}} + {\frac {3} {4}} e^{- 8 \alpha t}$$\end{document} . Assume G ₁ and G₂ are the two genomes, both of length ℓ and assume s_x is a subsequence of G₁ of length k. Now, there are two possibilities to see s_x at G₂:

Convergence: s_x appears by chance at position i in G₂. This event has probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\frac {1} {4^{k}}}$$\end{document} and by summing over all \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\ell$$\end{document} positions, \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\frac {\ell} {4^{k}}}$$\end{document} bounds this probability. (Note that this bound can be greater than 1 for small values of k.)

Retention: There is no transition between any nucleotide of s_x and its homologue at G₁. The probability of this event is the “no transition” probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s$$\end{document} . We require this for every site in s_x. This event occurs with probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_{s}^{k}$$\end{document} .

Now, since \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s \ge {\frac {1} {4}}$$\end{document} , we get that the probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_c$$\end{document} of seeing s_x by chance at G₂ is at most \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${(\ell + 1) {\mathbb P}_s^{k}}$$\end{document} . The latter bounds the probability of a single subsequence occurring by chance at both genomes. Therefore, the expected number of such s_x from G₁ being found by chance at G₂ is at most \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\ell \cdot (\ell + 1) {\mathbb P}_s^k$$\end{document} . We now use Markov's inequality stating that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P} \left[X > \beta \right] < {\mathbb E} [X ] / \beta$$\end{document} (Alon and Spencer, 1992) to bound the probability of seeing a single occurrence (i.e., β = 1). Then for X the number of occurrences of such s_x, we get \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}{\mathbb P} \left[X \ge 1 \right] \le \ell \cdot (\ell + 1) {\mathbb P}_s^k. \tag{3}\end{align*}\end{document}

Suppose we want to bound the probability of finding by chance one or more subsequences of length k at G₂ (i.e., the right hand side of Equation (3)) by δ, then by substituting P_s and simple arithmetic we obtain that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}k \ge - \log \left(\frac {{\ell (\ell + 1)}} {\delta} \right) / \log {\mathbb P}_s. \tag {4}\end{align*}\end{document}

Equation (4) shows the relationship between the probability of finding a perfect match (bounded by δ) and the length of this match (k). This probability is exponentially decreasing with the length of that match. Therefore, by calibrating the probability of finding matches by chance (by setting large enough k), we can eliminate finding matches by chance and this is the subject of the next part.

We end this part with an example: For two mammalian species mutating at an average rate of 2.2 × 10⁻⁹ (Kumar and Subramanian, 2002), diverged 6 million years ago (hence t = 12 × 10⁶), we get “no transition” probability \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s = {\frac {1} {4}} + {\frac {3} {4}} \exp (- 4 \times 2.2 \times 10^{- 9} \times 12 \times 10^6) = 0.925$$\end{document} . Assuming each has genome of 3 * 10⁹ nucleotides, then with probability at most 0.01 we get \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\log \left({\frac {\ell (\ell + 1)} {\delta}} \right) = 48.25$$\end{document} . Then \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$48.25 / - \log\, 0.925$$\end{document} yields that we can see by chance identical segments of k ≥ 619. We, however, work with much shorter genomes and much bigger divergence time, yielding a much smaller k.

Finding significant HTEs

The discussion above dealt with the task of how to decrease the probability of finding matches by chance. Our algorithm for detecting statistically significant HTEs starts by finding seeds of matches of length beyond reasonable probability (e.g., 0.01) of being found by chance. The algorithm builds the k-spectrum of one genome for large enough k. We now move sequentially (left to right) over the other genome. Every k-mer in this genome is checked against the k-spectrum to find k-matches (identical segments of length k). Finally k-matches are expanded (obviously only forward, since backward matches were already found), until a mismatch is found and then reported.

2.3. Second level: coping with mutations

The previous section dealt with probabilities of non-mutated segments as a means for detecting HTEs between two segments. The algorithm in that part finds seeds of identities in both genomes. These seeds are identical regions in the genomes with small likelihood of being created by chance. This section extends the scope to deal with HTEs that underwent mutations. First, we argue that the probability for some seed to be detected is fairly high. Next, the Butterfly algorithm, that extends these seeds over regions where similarity is not complete but very high, is introduced. Therefore, one goal of the Butterfly is to bridge between neighboring perfect match segments of a horizontal transfer. More importantly, the main goal of the Butterfly is to distinguish between HTEs and similarity by conservation as will be detailed in the sequel.

Sufficient conditions for seeds detection

After a HT event has occurred, the HTE is exposed to mutations. HTEs that have occurred sufficiently long time before present are dissolved in both genomes by accumulating mutations in the copied segment, so their similarity diminishes and therefore cannot be identified. In the previous section, a long enough seed was required to discard random matching. However, due to mutations over a too long time since the HTE event, there might not be long enough seeds (as explained above). Therefore, we say that a HTE survives if some k-match is found between the two segments corresponding to it in the two genomes. Our algorithm will fail already in the seeds stage if no seed will be found. This is equivalent to the case when every consecutive k nucleotides in the HTE have at least one mutation. The likelihood of this case is fairly low for relatively recent events.

Given a known mutation rate α, we want to determine what is the age \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\hat{t}$$\end{document} of HT events that can be recognized with high probability.

Let \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s$$\end{document} be the “no transition” probability at a site, i.e., \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s = {\frac {1} {4}} + {\frac {3} {4}} e^{- 8 \alpha \hat {t}}$$\end{document} . The HTE H is partitioned into consecutive disjoint fragments [H_i], each of length k. Note that if there is a single fragment with no transition between the genomes (i.e., a k-match), a seed will be discovered (Fig. 2). Also note that there is still the possibility that every fragment has a transition but there is a k-match between transitions of neighboring fragments (see Figure S1 and accompanying explanation; Supplementary Material at www.liebertonline.com/cmb). Therefore, the condition of at least one “untouched” segment is sufficient but not necessary.

Fig. 2.

The HTE H is segmented to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\frac {\ell_h} {k}$$\end{document} adjacent segments of length k each ( \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$h_{0} \ldots h_{7}$$\end{document} in the figure). A transition is marked by a red (dark) bar in one of the genomes. If a certain segment (e.g., h₂) was not exposed to a transition, H will k-survive. Note also that although there was a transition in adjacent segments h₅ and h₆, there is an untouched fragment between two red bars of length > k residing in both h₅ and h₆.

The probability of having a k-match at some H_j is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s^k$$\end{document} and \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$1 - {\mathbb P}_s^k$$\end{document} otherwise. Denote by δ the probability that no fragment has a k-match, that is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}\delta = \left(1 - {\mathbb P}_s^k \right)^{\ell_H / k}. \tag{5}\end{align*}\end{document}

From Equation (5), we get \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s = \left(1 - \delta^{k / \ell_H} \right)^{\frac {1} {k}}$$\end{document} and by the definition of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s$$\end{document} , \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\hat{t}$$\end{document} is obtained. Recall, by the discussion above, that \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\delta \ge {\mathbb P} {[H \ does \ not \ survive]}$$\end{document} . Therefore, if H is a HTE of length ℓ_H transferred \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\hat {t}$$\end{document} time units before present, then H will survive with probability at least 1 − δ if \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\hat {t}$$\end{document} satisfies: \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}\begin{align*}\hat {t} \le - {\frac {1} {8 \alpha}} \log \left[{\frac {4} {3}} \left[\left(1 - \delta^{k / \l_H} \right)^{\frac {1} {k}} - {\frac {1} {4}} \right] \right]. \tag {6}\end{align*}\end{document}

The Butterfly algorithm

We now describe the Butterfly algorithm, which is the heart of the method and serves to locate the HTE borders as well as to discriminate HTEs from conserved regions. We start with an overview of the algorithm. When a segment is inserted to a genome, it creates a HTE-environment. In this environment, the similarity between the copied fragment at the donor and the inserted fragment at the recipient, is very high. However, beyond the borders of the HTE the corresponding flanking regions at both genomes, are base-wise unrelated. Now, suppose we color regions of high similarity by blue and regions of low similarity by red. Hence, we define the notion of a Butterfly to capture the intuition that when the Butterfly sits on a HTE border, one wing is red and another is blue. More formally, a (single) wing of the Butterfly is a pair of sequences of a specified length at two specified positions in the two genomes. The Butterfly has two adjacent, equally long, sliding windows, the end of the left wing is the starting position of the right one. These are the Butterfly's wings. The Butterfly travels along the two genomes by moving sequentially the two adjacent sliding windows, seeking for high contrast between the wing colors. The right mismatch score, is the relative number of mismatches between the two segments of length ℓ composing the right wing. Similarly, the left mismatch score, is when the comparison is done between the ℓ positions composing the left wing.

Alternatively, the mismatch score is the normalized Hamming distance between the two segments in the wing/sliding window.

Algorithm

Procedure Butterfly(p₁,p₂,ℓ,τ)

1. for (i ← 0;; (i + + )

(a) m_r ← mismatch^r(p₁ + i, p₂ + i,ℓ).

(b) m_l ← mismatch^ℓ(p₁ + i, p₂ + i,ℓ).

(d) wall(p₁ + i, p₂ + i,ℓ) ← m_r − m_l.

(e) if (wall(p₁ + i, p₂ + i,ℓ) > τ) record i as a right border and goto 2.

2. for (j ← 0;; j + + )

(a) m_r ← mismatch^r(p₁ − j, p₂ − j,ℓ).

(b) m_l ← mismatch^ℓ(p₁ − j, p₂ − j,ℓ).

(d) wall(p₁ − j, p₂ − j,ℓ) ← m_l − m_r.

(e) if (wall(p₁ − j, p₂ − j,ℓ) > τ) record j as a left border and leave.

3. return j + i + 2ℓ

The Butterfly works similarly to the expansion stage of BLAST (Altschul et al., 1990) by trying to expand a seed to both sides. Its stopping conditions are however different. It starts at a (perfect match) seed found by the exact matching algorithm. Then it moves to the right one position at a time and compares the mismatch score between the two windows. The wall of a Butterfly is the difference between the mismatch scores of the two wings. The Butterfly stops when it finds a wall greater than some threshold parameter τ. (Fig. 3). After encountering a wall at the right, the process repeats from the seed to the left (now the mismatch at the right wing is subtracted from the mismatch at the left).

Fig. 3.

A butterfly starting at positions 19, 13 in the lower and upper genomes resp. The wings are of length 8. This is a right border of seven nucleotides HTE (accumulated one substitution), hence the right mismatch score is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\frac {7} {8}$$\end{document} and the left \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\frac {2} {8}$$\end{document} defining a high enough wall of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\frac {7} {8}} - {\frac {2} {8}} = {\frac {5} {8}}$$\end{document} .

As two string matching algorithms, the BLAST and the Butterfly are similar. However, BLAST is not geared to detect sharp walls appearing at the HTE borders. Indeed, BLAST will detect many conserved regions and will not distinguish between conserved regions and HTEs (cases (2) and (3) above).

Following is the formal algorithm:

The goal of the Butterfly is to locate the borders of a HTE. Assuming we are proceeding rightward, this occurs when the HTE border is exactly between the two wings, such that the right wing is completely outside the HTE and the left wing is completely inside it. Since the two processes under consideration—random similarity at the right (outer) wing and mutation rate at the left (inner) wing—are stochastic, we augment the Butterfly with two parameters to handle the noise generated by these processes. In order to have enough confidence in the decision to stop, the butterfly needs to “see” enough information. A very strong signal but with low confidence can result from too short wings (e.g., a 100% wall resulting from two mismatches at the right wing and two matches at the left can be pure noise). This is controlled by the wing lengths parameter ℓ. On the other hand, the wall threshold parameter τ, controls the stopping condition: A too low threshold can stop the Butterfly prematurely at a false wall inside the HTE. Figure 4 illustrates both scenarios.

Fig. 4.

Wall values of a simulated HTE-environment of length 50 (middle of figure between the two peaks). As can be seen, wing of size 8 produces a smaller signal to noise ratio compared to wing's size 16 hampering the identification of the HTEs border and consequently HTEs identification. Also note that a too small threshold value τ can lead to a false early wall identification inside the HTE as a result of noise (mutations) inside the HTE. (Remark: the wall on the right side is calculated oppositely to the left wall by moving the sliding windows leftward.)

Since the segments beyond the HTE border are base-wise uncorrelated (i.e., non-homologous), the expected relative mismatch score outside of the HTE is \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\frac {3} {4}$$\end{document} . Therefore, we chose a wall of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\frac {1} {2}$$\end{document} , leaving a high enough mismatch score inside the HTE, to prevent premature stopping.

2.4. Third level: discriminating conserved regions

We define a conserved region as two homologous segments in G₁ and G₂ with similarity above \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\frac {1} {4}$$\end{document} (random). Similarly to HTEs, conserved regions can have sub regions of complete similarity, producing seeds by the exact match algorithm what can lead to false positive identification of HTEs. However, as opposed to HTEs, conserved regions have the distinguishing property that the sequences beyond a border of a conserved region are still homologous, and therefore have no clear borders. Figure S1 depicts two outputs of the Butterfly algorithm on real genomic data. In one, the borders are very clear yielding clear walls whereas in the second the walls are not clear. Figure 5 illustrates the difference of the borders at conserved regions versus real HTEs.

Fig. 5.

While the mismatch value at HTE border jumps abruptly from 0 (or nearby values) to \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\frac {3} {4}$$\end{document} (red dashed line), at a conserved region the slope is much more gradual (blue solid line).

The above implies that, if the Butterfly starts at a seed corresponding to a conserved region, it might not hit a wall and might run through excessive regions with expected mismatch of \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\frac {3} {4}$$\end{document} at both wings. To remedy this phenomenon, the Butterfly halts the algorithm when the back wing hits a high mismatch score, and discards the putative HTE.

Accounting for large-scale mutational events

Large-scale mutational events (also denoted as genome rearrangement events) are events in which large genomic fragments are either copied or moved to other location in the genome. These events are denoted as duplications and translocations, respectively. Note that in such events, the flanking regions of the copied fragment and the flanking regions of some homologous fragment at another species, are not homologous between themselves, creating a spurious wall. Our method accounts for this as follows: if the translocation event is old enough, there were enough mutations in the homologous segments at both genomes, preventing a seed (first level) to be found in them. If the event is recent, then the translocated fragment is divergent enough from its homolog at the other genome, preventing a seed again. The only case where a seed is detected is when a conserved (in contrast to a ordinary non-conserved) region has been copied recently. To account for this case, we discard all cases where a putative HTE has a similar copy in the genome. The only case that cannot be detected is when a very conserved fragment is being translocated; this however is very rare and cannot be detected by any other method.

3. Results

The method is implemented in software, one program for the seeds (exact match) stage and another for the Butterfly, both in C. A simulation study was conducted in which artificial genomes and HT events were generated. The simulation's results were plugged into our equations in order to set the algorithms parameters. Subsequently, known biological HTEs were analyzed and located in their host genomes. Finally, the method was applied to raw genomes, and our findings were compared with known annotated data.

3.1. Simulation study

The simulation study was performed to evaluate the performance of the new method. The goal was to evaluate the capability of the method to detect HTEs and to discriminate between HTEs and conserved regions. We first clarify the setting. Equation (6) gives a bound on event ages for which we are likely to find seeds. We can not guarantee detection of the events above this age. Also, recall that distance between genomes is irrelevant as regions flanking a HTE are base-wise uncorrelated (expected similarity \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\frac {1} {4}$$\end{document} ). Indeed, simulations with one genome segment being transplanted in another genome yielded expected obvious results of 100% success. This has motivated us to formulate a more challenging input with multiple genomes and conserved regions (forming false seeds) and multiple HTE copies, intermixing with these conserved regions.

Simulated datasets

The r8s (Sanderson, 2003) software was used to produce random birth-death trees (i.e. random topology and branch lengths) over 20 taxa. In order to have enough divergence time between any two species, we normalized branch length to a minimum of d = 0.2 yielding a distance of d = 0.4 between any two species, where d = 3αt (see distance definition in Methods).

In order to obtain bounds on random similarity we need to convert this distance to the parameters of Equation (1). Normally, distance d is defined as the expected number of substitutions and under the JC model this is 3αt. Plugging this into Equation (1) yields \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$${\mathbb P}_s = 0.69.$$\end{document} So in order to set a k large enough to avoid random matching, we use Equation (4). The genomes were about 3Mbp long and 90% confidence (δ = 0.1) was chosen, yielding a seed length, Equation (4) of k ≥ 87. Note that this seed would filter out only random similarity in non-conserved regions. However, this will not filter out the similarity among conserved regions.

Next we used Seq-Gen (Rambaut and Grassly, 1997) to evolve sequences according to the JC model, based on the trees generated in the previous step. The sequences were generated on the tree under two modes: normal rate of evolution corresponding to non-conserved regions and slow rate of evolution corresponding to conserved regions. The two sub-segments were concatenated into a segment and the whole process recurred several times to form a conserved/non-conserved alternating genome. Such a type of input generates a large number of putative seeds in the first stage, and the challenge is to discriminate between them and real HT events. Note that a simple similarity-based HT detection algorithm (e.g., Blast) will falsely report on a suspected HTE for every conserved segment between closely related sequences (in a relatively small subtree).

Now, a set of HT events was generated, several copies each. Every copy was inserted at a random location along a randomly chosen sequence (genome) generated at the previous stage. To simulate age of events (so that the inserted sequences acquire mutations in their host genome), we again used Seq-Gen with a star topology, uniform edge lengths tree as a model tree. Edge length (tree depth) varied randomly but in a much shorter range in comparison to the model species tree. Using Equation (6) (limiting age of HT events as a function of HTE length ℓ_h, detection confidence δ and rate of mutation α), if we set the HTE length ℓ_h = 2000, then WHP (δ ≤ 0.1) we can trace HTEs of distance \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\hat{d} = 3 \alpha \hat{t} \le 0.0135$$\end{document} from one to another. Hence we set the edges of the star tree for the HTE generation (see Methods) to 0.006, yielding \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\hat{d} = 0.012$$\end{document} . This way we guarantee a low false negative rate. We comment that, by the above, we limit our attention only to events recent enough that the algorithm has any chance to detect them. Illustration of a simulated input for a single event and three event copies is shown in Figure 6.

Fig. 6.

A simulated input of three genomes with alternating conserved/non conserved regions, evolved along the tree in the left with three copies of a single HT event inserted at random positions.

Simulation results

As was said above, the goal was to measure the effect of false positive/negative (FP/FN) as a result of varying number of event copies. The number of HTE pairs was counted (this is simply \documentclass{aastex}\usepackage{amsbsy}\usepackage{amsfonts}\usepackage{amssymb}\usepackage{bm}\usepackage{mathrsfs}\usepackage{pifont}\usepackage{stmaryrd}\usepackage{textcomp}\usepackage{portland,xspace}\usepackage{amsmath,amsxtra}\pagestyle{empty}\DeclareMathSizes{10}{9}{7}{6}\begin{document}$$\sum {c_i \choose 2}$$\end{document} where c_i is the number of copies generated for HT event i). In the simulation results, we counted the events reported by our procedure that deviate in length by at most 20 (bases) from the real HTE length (this requirement was quite strict as large proportion of events were discovered but with length not in that range). Any event not in that range was considered as a false positive. Figure 7 reports results for two cases: a single HT event and four HT events (both with varying number of copies for each event). The results show that under a reasonable amount of copies for each event the FN rate is fairly low meaning that most HTEs were found and with the correct length. The rate of FP is low for any level of event copies, meaning that the method discriminates very well between HTEs and conserved regions, moreover that most of the FP events were real HTEs but with incorrect length. We comment that in a separate setting in which the HT star tree was very shallow (meaning very recent events) and longer seeds, we obtained a negligible FP rate (under 1, absolute value) and FN (under 2, percentage of events generated) for all #HT copies.

Fig. 7.

False positive (lower solid line, absolute values) and false negative (upper dashed line, percentage of number of events generated) as a result of the number of HTE copies per event. Top: A single HT event. Bottom: Four HT events.

3.2. Real data results

The results with simulated data described above show that when the time of event is relatively recent, the method detects it with high probability. Moreover, the very low FP rate shows the power of the method in filtering out false seeds resulting from conserved regions. In its current form, our implementation does not handle effects caused by indel events as they create artificial walls.4

Nevertheless, bacterial genomes are comprised mainly from coding material which is less affected by indels. On the other hand, they exhibit high rate of HT activity and therefore are a perfect fit for demonstrating the concept behind the new method.

We started with demonstration of performance of the new method as a tool for detecting and locating the exact borders of the events. We chose an annotated HTE in the form of an insertion sequence IS53 (accession number M83932). Using Blast, we found it in two species of Pseudomonas: Pseudomonas stutzeri (accession number NC_009434) and Pseudomonas syringae (accession number NC_004578).

In P. stutzeri, the IS was matched in two locations, one at 673486.675897 (see Fig. S1 for blast output on P. stutzeri). Hence, we applied our software to these two genomes, seeking to find HTEs with clear borders. The technique identified this insertion and located the exact right (downstream) border of this insertion event (i.e. position 675897; see Butterfly output on these two genomes in Fig. S1). As transposable elements are known to be major vehicles for HT, the above finding testifies on suitability of the technique for detection of HT.

Moreover, this setting of differentially located segments in two genomes fits as input for other techniques for HGT detection relying on the same characteristics. Techniques relying on protein, or protein coding, sequences (Garcia-Vallve et al., 2003, 2000; Podell and Gaasterland, 2007; Karlin, 2001; Nakamura et al., 2004; Ochman et al., 2000; Ochman, 2003) are immediately ruled out, leaving only techniques operating on raw DNA sequences. The best known is Mauve (Darling et al., 2004), which, like ours, seeks for seeds in the first stage but does not employ criteria similar to the Butterfly on the second stage. Indeed, Mauve could not distinguish well between conserved elements and HTEs and produced mixed output (Fig. S1). This should be compared to the well filtered output of the proposed method. Even a detailed search for the specific event of IS53 with appropriate resolution failed: it was not detected (Fig. S1).

Several highly homologous gene pairs were accurately detected as well with different (non-homologous) flanking sequences (genes alaT, pgsA, ppsA, and purL of lengths 700–3900 bp). The accuracy of our method with respect of locating the sharp borders, as compared by the location and length to NCBI RefSeq, is shown in Table 1. These gene pairs appear as differentially translocated in P. stutzeri and P. syringae genomes. The horizontal transfer from one of the genome to another, or from a third party is not excluded as well. Since this work is geared to developing a method for detection of putative HTE, the detailed analysis of the detected elements, with involvements of other genomes will be a subject of a separate study.

Table 1.

Comparison of Location and Length of HT-Putative Genes: NCBI GenBank versus Butterfly

	stutzeri NC_009434
	GenBank		Butterfly
Element	CDS	Length	CDS	Length
alaT	2764257.2765468	1211	2764254.2765453	1199
pgsA	2269251.2269865	614	2269294.2270010	710
ppsA	2215516.2217801	2285	2215416.2217782	2366
purl	289386.1293282	3896	1289368.1293283	3915

The next task was to locate and verify well-documented cases of HTE in bacteria. We started with a HT event in Borrelia (Barbour et al., 2005). The genes hpt, purA and purB appear in the genomes of two species B. hermsii and B. miyamotoi, agents of human relapsing fever, but not in other species of the Borrelia genus (e.g., B. burgdorferi and B. garinii). In Barbour et al. (2005), the authors concluded that these genes were acquired by a horizontal transfer from an unknown donor at some time after diversification of Borrelia from their common ancestor. The possibility of gene loss at the other species was discarded by maximum parsimony arguments. They also constructed the gene trees for the purA and purB genes over sets of 34 and 36 representative species respectively. By these trees, one likely candidate donor would be Fusobacterium nucleatum. Indeed, according to our calculation among 32 protein-coding genes, within 40 kb around the purA of B. hermsii (accession number NC_010673), the genes purA and purB show the largest sequence identity seeds (34 and 22 bases, respectively) when compared to the genome of Fusobacterium (accession number NC_003454). That is, application of our method of searching for the HTE to the pair of the genomes (B. hermsii and Fusobacterium) suggests that the above genes appear, indeed, to be horizontally transferred to B. hermsii. As to the possible donor, the likely candidate is Fusobacterium: the sequence similarity of the three genes between the two species B. hermsii and B. miyamotoi was found to be 88%, whereas similarity between each of these and the respective genes in Fusobacterium is only 69%. Under the assumption that the donor is a close relative of Fusobacterium, this suggests that the HGT event has occurred before the speciation event of the B. hermsii and B. miyamotoi.

One more well-documented case of HT event is the transfer of a 2.5-kb segment that includes gene bioF from Neisseria meningitidis and Haemophilus influenzae RD (Kroll et al., 1998). At the exact match stage, we found four seeds of length 193, 132, 100, and 100. The extension of these by the Butterfly algorithm resulted in a single HTE of length 2547 at locations 1697161–1699708 and 1621851–1624398 in the Neisseria and the Haemophilus genomes respectively (Fig. S1). We note that this is in full accordance with the original observation of Kroll et al. (1998).

4. Discussion

4.1. Conclusion

In this study, we attempted at a desirably rigorous description and handling of horizontal transfer events. Specifically, we introduced the matching seeds idea to sort out random matches. The next level copes with mutations inside the HTE, implemented by the sliding window of the Butterfly algorithm. At the top level, we distinguish HTEs from natural conserved regions by requiring “high walls” between the Butterfly wings.

This article lays out the rigorous statistical/algorithmic groundwork for this new approach. While this seemingly restricted problem of finding HTEs between two genomes might appear simple, in fact it is key to a broader research where a group of organisms is studied. The proposed procedure and software can serve as a subroutine applied to all pairs in the group and therefore run in time quadratic in the group size. By our experiments, for groups of size of a hundred organisms, this should take not more than a day.

The examples on real data serve as a proof of concept for this new approach rather than a thorough search for HTEs among all published pairs of genomes. Moreover, as described in the Introduction and demonstrated in the Results, an extensive benchmark between our method and all existing HT detection methods is problematic as each operates on different inputs, uses different assumptions and aims at other goals. Instead, in the Borrelia case, we showed how a combination with a complementary method, the phylogenetic approach, endows further information on the event.

Our work operates in the sequence space. Mutational events in the sequence space are divided into small-scale mutational events (which are point mutations) and large-scale mutational events (such as duplications, translocations, and HT). There are many articles examining the sequence space, and even beyond—in the phylogenetic space. While these works handle separately each type of event, ours combines these two types of events: small scale and large scale. We believe this novel combination is interesting and potentially important.

4.2. Future research

The research presented here gives rise to several future research directions that we intend to investigate. We list them below. The first question is: Is HT a widespread phenomenon in non-coding sequences, those with no obvious functionality, and if so, among particular regions? Previous studies on HT mainly focused at coding regions (ORFs), and therefore this question has not been investigated sufficiently. A positive answer to this question can add another layer to the growing field of searching functionality in non-coding regions. Our next question is: Is the behavior of the mutational process present in HTEs? Can we infer by it the age of the event? In particular, Amelioration (Lawrence and Ochman, 1997; Ragan, 2001) is a process by which a gene that was transferred horizontally acquires features (e.g., GC content, the percentage of nitrogenous bases on a DNA molecule which are either guanine or cytosine) similar to its new environment. This is particularly true for recent events as this process diminishes in time and therefore relevant to our work. If the mutational process accelerates at HTEs, then calculation of events age should take this in considerations.

Finally, we ask: Are there common patterns of HT among different families? Specifically, horizontal transfer was found also in plants (Bergthorsson et al., 2003) and Archaea (Matte-Tailliez et al., 2002). As the method introduced here can readily be applied to higher organisms, as we showed in our hypothetical example on two mammalians, we intend to use it to seek for HTEs among these families.

Footnotes

Acknowledgments

S.S. is supported by the Young Scientists Fellowship of the Yeshaya Horowitz Association through the Center for Complexity Science.

Disclosure Statement

No competing financial interests exist.

3

We stick to the time-rate notation as it allows further flexibility in other models.

4

This can be overcome by joining adjacent segments found by the Butterfly.

References

Alon

, Spencer

J.H.

1992. The Probabilistic Method. Wiley: New York.

Altschul

S.F.

, Gish

, Miller

et al. 1990. Basic local alignment search tool. J. Mol. Biol., 215:403–410.

Ambur

O.H.

, Davidsen

, Frye

S.A.

et al. 2009. Genome dynamics in major bacterial pathogens. FEMS Microbiol. Rev., 33:453–470.

Barbour

A.G.

, Putteet-Driver

A.D.

, Bunikis

2005. Horizontally acquired genes for purine salvage in borrelia spp. causing relapsing fever. Infect. Immun., 73:6165–6168.

Beiko

R.G.

, Harlow

T.J.

, Ragan

M.A.

2005. Highways of gene sharing in prokaryotes. Proc. Natl. Acad. Sci. USA, 102:14332–14337.

Bejerano

, Pheasant

, Makunin

et al. 2004. Ultraconserved elements in the human genome. Science, 304:1321–1325.

Bergthorsson

, Adams

K.L.

, Thomason

et al. 2003. Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature, 424:197–201.

Chen

, Novick.

R.P.

2009. Phage-mediated intergeneric transfer of toxin genes. Science, 323:139–141.

Darling

A.C.E.

, Mau

, Blattner

F.R.

et al. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res., 14:1394–1403.

10.

Daubin

, Ochman

2004. Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Res., 14:1036–1042.

11.

Daubin

, Moran

N.A.

, Ochman

2003. Phylogenetics and the cohesion of bacterial genomes. Science, 301:829–832.

12.

Delwiche

C.F.

, Palmer

J.D.

1996. Rampant horizontal transfer and duplication of rubisco genes in eubacteria and plastids. Mol. Biol. Evol., 13:873–882.

13.

Doolittle

R.F.

, Feng

D.F.

, Tsang

et al. 1996. Determining divergence times of the major kingdoms of living organisms with a protein clock. Science, 271:470–477.

14.

Doolittle

W.F.

1999a. Phylogenetic classification and the universal tree. Science, 284:2124–2129.

15.

Doolittle

W.F.

1999b. Lateral genomics. Trends Cell. Biol., 9:M5–M8.

16.

Felsenstein

2003. Inferring Phylogenies. Sinauer Associates: Sunderland, MA.

17.

Garcia-Vallve

, Romeu

, Palau

2000. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res, 10:1719–1725.

18.

Garcia-Vallve

, Guzman

, Montero

M.A.

et al. 2003. Hgt-db: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Res., 31:187–189.

19.

1998. Early metazoan divergence was about 830 million years ago. J. Mol. Evol., 47:369–371.

20.

Jensen-Seaman

M.I.

, Furey

T.S.

, Payseur

B.A.

et al. 2004. Comparative recombination rates in the rat, mouse, and human genomes. Genome Res., 14:528–538.

21.

Jin

, Nakhleh

, Snir

et al. 2007. Inferring phylogenetic networks by the maximum parsimony criterion: a case study. Mol. Biol. Evol., 24:324–337.

22.

Jukes

T.H.

, Cantor

C.R.

1969. Evolution of protein molecules, 21–132. Munro

H.N.

Mammalian Protein Metabolism. Academic Press: New York.

23.

Karlin

2001. Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol., 9:335–343.

24.

Koonin

E.V.

, Makarova

K.S.

, Aravind

2001. Horizontal gene transfer in prokaryotes: quantification and classification. Annu. Rev. Microbiol., 55:709–742.

25.

Koonin

E.V.

2009. Darwinian evolution in the light of genomics. Nucleic Acids Res., 37:1011–1034.

26.

Kroll

J.S.

, Wilks

K.E.

, Farrant

J.L.

et al. 1998. Natural genetic exchange between Haemophilus and Neisseria: intergeneric transfer of chromosomal genes between major human pathogens. Proc. Natl. Acad. Sci. USA, 95:12381–12385.

27.

Kumar

, Subramanian

2002. Mutation rates in mammalian genomes. Proc. Natl. Acad. Sci. USA, 99:803–808.

28.

Lawrence

J.G.

, Ochman

2002. Reconciling the many faces of lateral gene transfer. Trends Microbiol., 10:1–4.

29.

Lawrence

J.G.

, Ochman

1997. Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol., 44:383–397.

30.

Lerat

, Daubin

, Moran

N.A.

2003. From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-proteobacteria. PLoS Biol., 1:E19.

31.

Linder

C.R.

, Warnow

2005. Overview of phylogeny reconstruction. Aluru

Handbook of Computational Molecular Biology. CRC Press: Boca Raton, FL.

32.

Liu

, Harrison

, Kunin

et al. 2004. Comprehensive analysis of pseudogenes in prokaryotes: widespread gene decay and failure of putative horizontally transferred genes. Genome Biol., 5:R64.

33.

Matte-Tailliez

, Brochier

, Forterre

et al. 2002. Archaeal phylogeny based on ribosomal proteins. Mol. Biol. Evol., 19:631–639.

34.

Nakamura

, Itoh

, Matsuda

et al. 2004. Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat. Genet., 36:760–766.

35.

Ochman

2003. Neutral mutations and neutral substitutions in bacterial genomes. Mol. Biol. Evol., 20:2091–2096.

36.

Ochman

, Lawrence

J.G.

, Groisman

E.A.

2000. Lateral gene transfer and the nature of bacterial innovation. Nature, 405:299–304.

37.

Omelchenko

, Makarova

, Wolf

et al. 2003. Evolution of mosaic operons by horizontal gene transfer and gene displacement in situ. Genome Biol., 4:R55.

38.

Podell

, Gaasterland

2007. Darkhorse: a method for genome-wide prediction of horizontal gene transfer. Genome Biol., 8:R16.

39.

Ragan

M.A.

2001. On surrogate methods for detecting lateral gene transfer. FEMS Microbiol. Lett., 201:187–191.

40.

Rambaut

, Grassly

N.C.

1997. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci., 13:235–238.

41.

Sanderson

M.J.

2003. r8s; inferring absolute rates of evolution and divergence times in the absence of a molecular clock. Bioinformatics, 19:301–302.

42.

Siepel

, Bejerano

, Pedersen

J.S.

et al. 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res., 15:1034–1050.

43.

Wang

2001. Limitations of compositional approach to identifying horizontally transferred genes. J. Mol. Evol., 53:244–250.

44.

Wolf

Y.I.

, Rogozin

I.B.

, Grishin

N.V.

et al. 2002. Genome trees and the tree of life. Trends Genet., 18:472–479.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.19 MB