New Approximate Statistical Significance of Gapped Alignments Based on the Greedy Extension Model

Abstract

Sequence alignment is a fundamental tool in bioinformatics for identifying regions of similarity among sequences. The degree of similarity is expressed as a score. Several methods exist for assessing the statistical significance of this score in the gapped and ungapped cases. In this article, we improve the accuracy of the statistical significance of the local score by introducing a new approximate p-value. It is derived from the Poisson clumping heuristic and the exact distribution of a partial sum of random variables. The efficiency of the proposed method is compared with that of previous methods on real and simulated data. The results show a clear improvement in the accuracy of the p-value in the gapped case, which makes the method a prospective candidate for sequence comparison.

1. Introduction

Comparing query sequences with databases of known sequences is one of the most important research areas in bioinformatics. There are many algorithms for aligning biological sequences, and their results in the gapped and ungapped cases can be evaluated by statistical methods.

Let $A = A_{1}, A_{2}, \dots, A_{n}$ and $ℬ = B_{1}, B_{2}, \dots, B_{m}$ be two independent sequences of i.i.d. (independent and identically distributed) random variables on a finite alphabet set $A$ . The statistical significance of the ungapped local score is derived from the extreme-value distribution under two assumptions: (1) the expected score for a random sequence alignment is negative and (2) there is at least one positive score in the score matrix (Karlin and Altschul, 1990; Karlin and Dembo, 1992). A substitution score matrix is used to assign a score to each aligned residue pair. Positive and negative scores are usually given to similar and dissimilar residue pairs, respectively. Many scoring matrices exist; BLOSUM (BLOcks SUbstitution Matrix) is among the best known (Henikoff and Henikoff, 1992).

Statistical significance for the gapped case is derived from the ungapped case, with parameters estimated by simulation and by other methods such as the method of moments (Altschul and Gish, 1996), maximum likelihood (e.g., Bailey and Gribskov, 2002), and faster methods based on the island approach (Bundschuh, 2002). In addition, empirical studies by Mott (1992), Waterman and Vingron (1994a), and Altschul and Gish (1996) showed that the extreme-value distribution can be applied in the gapped case. In 1999, a heuristic p-value was proposed using the greedy extension model (GEM) (Mott and Tribe, 1999). GEM is the main tool of this article. Siegmund and Yakir (2000) introduced a p-value in the gapped case by combining the results of Karlin and Dembo (1992) and Mott and Tribe (1999). Also, Zhang et al. (2000) introduced a greedy algorithm to align DNA sequences, which is faster than traditional methods. Fayyaz Movaghar et al. (2007) proposed a p-value for the local score by dividing the sequences into segments of length h when gaps are allowed. Hassenforder and Mercier (2007) introduced a p-value for the local alignment score in the Markov framework. Chabriac et al. (2014) introduced a new method to derive the asymptotic behavior of the local alignment score by using Brownian motion. Also, the statistical significance of the local alignment score with respect to length and position was studied by Lagnoux et al. (2016).

In Section 2, we review the GEM and the related concepts. In Section 3, we describe our new p-value for the GEM, based on the exact and an approximate distribution of the local score in the one-sequence and two-sequence frameworks, respectively. Finally, in Section 4, the accuracy of our new p-value is assessed on simulated and real data.

2. Greedy Extension Model

In this section, we review the GEM in detail (Mott and Tribe, 1999). First, some notation is introduced.

2.1. Definitions and notations

Consider two sequences $A$ and $ℬ$ such that the probability of occurrence of letter A in sequences $A$ and $ℬ$ is $p_{A}$ and $q_{A}$ , respectively. Let $h (x)$ be the probability mass function of the scoring scheme s, defined as $P (s (A, B) = x) \equiv h (x) = \sum_{\{(A, B) : s (A, B) = x\}} p_{A} q_{B} .$

As mentioned in Section 1, $E [s (A, B)] = \sum_{(A, B)} s (A, B) p_{A} q_{B} < 0$ and there exists $x > 0$ for which $h (x) > 0$ . In the following, $s (A_{i}, B_{j})$ , the score of the ith letter of $A$ aligned with the jth letter of $ℬ$ , is arranged as an $(m \times n)$ “dot matrix” whose $(i, j)$ th element represents $s (A_{i}, B_{j})$ .

Let $Σ_{i j k}$ be the partial sum of the scoring scheme, defined as $Σ_{i j k} = \begin{cases} 0, & k = 0 \\ \sum_{r = 0}^{k - 1} s (A_{i + r}, B_{j + r}), & k > 0 \end{cases} .$

In the dot matrix, $Σ_{i j k}$ is the aggregated score of k consecutive pairs of letters that start from the $(i, j)$ th element of the matrix and end at $(i + k, j + k)$ along the diagonal $d = i - j$ .

For a fixed d, a random walk is generated by the partial sums, $\{Σ_{(d + 1), 1, k}\}_{k \geq 0}$ , as illustrated in Figure 1. If $Σ_{(d + 1), 1, r} < Σ_{(d + 1), 1, k}$ for all $k < r$ , the coordinate $(d + r + 1, r + 1)$ marks a strict ladder point, and a clump is the excursion between two consecutive ladder points. The clump score is a non-negative value equal to the maximum height attained in the clump (e.g., Fig. 1). Note that the clump scores are independent and identically distributed. Let $ϑ$ be the expected clump length, that is, the expected interval between two consecutive ladder points. Therefore, a randomly selected coordinate marks a ladder point with probability $ρ = 1 ∕ ϑ$ .
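The ladder-point and clump definitions can be illustrated on the scores of a single diagonal. The following is a minimal sketch, not the authors' implementation: the function name `ladder_points_and_clumps` and the convention of treating the origin of the walk as the first ladder point are assumptions made for illustration.

```python
from itertools import accumulate

def ladder_points_and_clumps(scores):
    """Return (ladder_indices, clump_scores) for the walk of partial sums.

    A strict ladder point is an index where the walk reaches a new strict
    minimum; the clump score is the maximum height above the ladder level
    attained before the next ladder point.
    """
    walk = [0] + list(accumulate(scores))
    ladders = [0]        # assumption: the origin counts as the first ladder point
    clump_scores = []
    base = 0             # walk level at the current ladder point
    height = 0           # maximum height above `base` in the current clump
    for k in range(1, len(walk)):
        if walk[k] < base:               # new strict minimum: ladder point
            clump_scores.append(height)
            ladders.append(k)
            base, height = walk[k], 0
        else:
            height = max(height, walk[k] - base)
    clump_scores.append(height)          # last, possibly unfinished, clump
    return ladders, clump_scores
```

For the hypothetical scores `[-1, 2, 1, -4, 3, -1]` the walk is 0, -1, 1, 2, -2, 1, 0, so ladder points occur at steps 0, 1, and 4 and the clump scores are 0, 3, and 3.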

FIG. 1.

Ladder points, a clump, and the constrained and unconstrained maxima, Z and $Z'$ , respectively, in the $Σ_{i j k}$ random walk for the diagonal $d = i - j$ .

Based on $Σ_{i j k}$ , two types of maxima are presented:

Unconstrained maximum: For a fixed coordinate $(i, j)$ , $Z'_{i j} = {max}_{k \geq 0} Σ_{i j k}$ , whose value is the score of the maximum partial sum starting at $(i, j)$ or 0.

Constrained maximum: For a fixed coordinate $(i, j)$ , let $T_{i j} = \inf \{k : Σ_{i j k} < 0\}$ . Therefore, the constrained maximum is defined as $Z_{i j} = \max_{k \leq T_{i j}} Σ_{i j k}$ , the score of the first clump of the random walk $\{Σ_{i j k}\}$ started at $(i, j)$ . It is clear that $Z_{i j}$ occurs before $Z'_{i j}$ and $Z_{i j} \leq Z'_{i j}$ .
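The difference between the two maxima can be made concrete with a short sketch that works on the list of scores read along a diagonal from a fixed start $(i, j)$ . The function names are illustrative assumptions; on a finite list the maximum is taken over the available steps only.

```python
def unconstrained_max(scores):
    """Z': maximum partial sum over all k >= 0 (0 for the empty sum)."""
    best = total = 0
    for s in scores:
        total += s
        best = max(best, total)
    return best

def constrained_max(scores):
    """Z: maximum partial sum up to the first time the walk becomes negative."""
    best = total = 0
    for s in scores:
        total += s
        if total < 0:        # T_ij reached, the first clump is over
            break
        best = max(best, total)
    return best
```

With the hypothetical scores `[2, -1, 3, -6, 5, 4]` the walk is 2, 1, 4, -2, 3, 7, so the constrained maximum is 4 (attained before the walk turns negative) and the unconstrained maximum is 7.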

Let $H_{n, m}$ be the ungapped local score of two sequences $A$ and $ℬ$ :

(1) $H_{n, m} = \max_{i, j, ℓ} \sum_{r = 0}^{ℓ - 1} s (A_{i + r}, B_{j + r}),$

where $s (\cdot, \cdot)$ is the scoring scheme and $ℓ$ is the number of aligned letters. In other words, (2) $H_{n, m} = \max_{\{(i, j) : (i, j) \text{ marks a ladder point}\}} Z_{i j} .$
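As an illustration of the ungapped local score, the sketch below scans every diagonal of the dot matrix and keeps the best segment score, resetting the running sum to zero whenever it becomes negative. The function name and the match/mismatch scores used in the example are assumptions, not values from the article.

```python
def ungapped_local_score(a, b, score):
    """Best-scoring ungapped segment pair of sequences a and b."""
    n, m = len(a), len(b)
    best = 0
    for d in range(-(m - 1), n):            # diagonal d = i - j
        i, j = (d, 0) if d >= 0 else (0, -d)
        run = 0
        while i < n and j < m:
            # restart the segment whenever the running sum drops below zero
            run = max(0, run + score(a[i], b[j]))
            best = max(best, run)
            i += 1
            j += 1
    return best

def toy_score(x, y):
    # hypothetical scoring scheme: +5 for a match, -4 for a mismatch
    return 5 if x == y else -4
```

For example, `ungapped_local_score("AAAATTTT", "AAAACTTTT", toy_score)` gives 31: four matches, one mismatch, and three further matches on the main diagonal.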

Likewise, the gapped local score of two sequences $A$ and $ℬ$ is defined as (3) $M_{n, m} = {max}_{I, J} S (I, J),$

where $S (I, J) = \max \{- \sum_{i = 0}^{γ} g (l_{i}) + \sum_{k = 1}^{c} s (A_{a (k)}, B_{b (k)})\},$

the maximum is taken over all global alignments given by two increasing sequences $a (\cdot)$ and $b (\cdot)$ ; c and γ are the numbers of aligned letters and gaps, respectively. An affine gap penalty, $g (l) = Δ + δ l$ , is used, where $Δ$ and δ are the gap opening and extension penalties, respectively.
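For reference, the gapped local score with an affine penalty $g (l) = Δ + δ l$ can be computed by the standard dynamic programming recursion with three matrices (often attributed to Gotoh). The sketch below is a generic textbook version under that assumption, not code from the article; the function name and the toy scoring scheme are illustrative.

```python
def gapped_local_score(a, b, score, gap_open, gap_ext):
    """Local alignment score where a gap of length l costs gap_open + gap_ext * l."""
    neg = float("-inf")
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]     # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]   # ... ending with a gap in a
    F = [[neg] * (m + 1) for _ in range(n + 1)]   # ... ending with a gap in b
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] - gap_ext,
                          H[i][j - 1] - gap_open - gap_ext)
            F[i][j] = max(F[i - 1][j] - gap_ext,
                          H[i - 1][j] - gap_open - gap_ext)
            diag = H[i - 1][j - 1] + score(a[i - 1], b[j - 1])
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

def toy_score(x, y):
    # hypothetical scoring scheme: +5 for a match, -4 for a mismatch
    return 5 if x == y else -4
```

With `gap_open=2` and `gap_ext=1`, aligning `"AAAATTTT"` with `"AAAACTTTT"` gives 37: eight matches (40) minus one gap of length 1 (3). With a prohibitive gap opening penalty the result falls back to the ungapped score of 31.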

2.2. Procedure of GEM

The greedy algorithm was described by Neuhauser (1994) for aligning DNA sequences. The Chen–Stein method for Poisson approximation is a powerful tool for deriving the statistical significance of the local alignment score obtained by the greedy algorithm. Mott and Tribe (1999) applied GEM to compare two biological sequences by using a standard scoring scheme and the gap penalties. They obtained an approximate local score when gaps are allowed.

The following steps are performed to obtain an approximate local alignment score based on GEM, namely $W_{G E M}$ (Fig. 2):

FIG. 2.

Local alignment constructed by GEM. $(i, j)$ is the starting point of the alignment, and greedy extension steps are shown in the horizontal and vertical directions. The first step of greedy extension involves the insertion of a gap of length l in the vertical sequence, followed by a diagonal extension with score $Z'$ . So, $U_{1} = - g (l) + Z'$ . The gap can also be inserted into the horizontal sequence, as shown for U₂. GEM, greedy extension model.

Consider all ungapped clumps in the dot matrix. Let $Z_{i j} (= Z)$ be the maximum sum for a clump, starting at a ladder point $(i, j)$ and ending at $(k - 1, c - 1)$ .

Greedy step: Search the neighboring diagonals from $(k, c)$ for the best place to extend the alignment. For this purpose, a gap of length l is inserted into one of the sequences and the unconstrained maximum $Z'_{(k, c + l)}$ ( $Z'_{(k + l, c)}$ ), started at $(k, c + l)$ ( $(k + l, c)$ ), is calculated. The diagonal is found by maximizing U₁, where

(4) $U_{1} = {max}_{l > 0} (Z'_{k, c + l} - g (l), Z'_{k + l, c} - g (l)) .$

Note that when the aforesaid maximum occurs at $l = 1$ , the diagonal does not change and the greedy algorithm stops; this is called sticking.

Repeat the greedy step to produce $U_{1}, U_{2}, \dots$ and generate the U walk $U_{1}, U_{1} + U_{2}, \dots$ .

Let $T (Z)$ be the first time that the U walk drops below $- Z$ or a greedy extension step comprises just a gap of length 1. Now, a W score is defined as

$W (Z) = Z + \max (0, U_{1}, U_{1} + U_{2}, \dots, U_{1} + \dots + U_{T (Z)}) = Z + V (Z) .$

Finally, $W_{G E M}$ is defined as the maximum of all the $W (Z)$ scores:

$W_{G E M} = {max}_{i j} (W (Z)) .$
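The last steps of the procedure, from the U walk to the W score, can be sketched as follows. This is a simplified illustration under stated assumptions: the greedy scores $U_{1}, U_{2}, \dots$ are taken as given (the search over neighboring diagonals is not implemented), the list is assumed to end where sticking occurs, and the function name `w_score` is hypothetical.

```python
def w_score(z, greedy_scores):
    """W(Z) = Z + max(0, U1, U1 + U2, ...), stopping once the U walk drops below -Z."""
    total = 0
    best = 0
    for u in greedy_scores:
        total += u
        best = max(best, total)
        if total < -z:       # T(Z): the walk has dropped below -Z
            break
    return z + best
```

For example, with `z = 10` and hypothetical greedy scores `[-3, 5, -4, -12, 20]`, the U walk is -3, 2, -2, -14 and stops there, so `w_score` returns 12; the final score of 20 is never used. $W_{G E M}$ is then the maximum of such values over all starting clumps.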

To compute the p-value of the similarity score based on GEM, Mott and Tribe (1999) assumed a linear relation between the parameters of the gapped and ungapped frameworks, $K_{g} \approx K (α) = K_{u} κ (α)$ and $λ_{g} \approx λ (α) = λ_{u} θ (α)$ , where $α$ depends on the gap penalty and the subscripts u and g refer to the ungapped and gapped cases, respectively. Their proposed p-value for the gapped alignment case is estimated by an interval based on $λ (α)$ and $K (α)$ . The Poisson clumping heuristic was used to derive the p-value (Aldous, 1989; Waterman and Vingron, 1994b; Waterman, 1995) as (5) $P (H_{n, m} < t) = e^{- μ (t)},$

where $μ (t)$ is the expected number of clumps with score at least t. The approximation is valid for $H_{n, m}$ since the ungapped local alignment score is the maximum score of “independent” clumps [Eq. (2)]. Likewise, the gapped local score, $M_{n, m}$ , follows the Poisson clumping heuristic (Waterman and Vingron, 1994b).

Mott and Tribe (1999) estimated $μ (t)$ as a function of $λ (α)$ and $K (α)$ using the asymptotic results of Feller (1972) and Iglehart (1972) on the constrained and unconstrained maxima. A drawback of the GEM estimation is the short range of $α$ , which reduces accuracy and restricts the choice of gap penalty. Mott (2000) extended the range of $α$ , which improved the accuracy of the estimation and simplified the parameters $κ$ and $θ$ to linear functions of $α$ modified by length correction terms.

This article addresses the estimation of $μ (t)$ by adopting a new approach in which the exact distributions of the local score in the one-sequence framework (Mercier and Daudin, 2001) and of the unconstrained maximum (Mercier et al., 2003) are applied. Also, an approximate distribution of the local alignment score of two sequences proposed by Mercier and Daudin (2001) is used.

3. Our Proposition

In this section, we introduce a new approximate p-value based on $W_{G E M}$ , named IGEM (Improved GEM). Our method improves the accuracy of sequence similarity assessment compared with the Mott and Tribe (MT) method. We derive the p-value from the exact and near-exact distributions of the constrained and unconstrained maxima, respectively, instead of using asymptotic distributions. In this regard, note that the distribution of the local score (Mercier and Daudin, 2001) and the maximum of the partial sums of a sequence in the ungapped case are considered (Feller, 1972; Iglehart, 1972).

Similarly to Equation (2), the local alignment score of a sequence of length n, $H_{n}$ , is defined as (6) $H_{n} = \max_{\{i : i \text{ marks a ladder point}\}} Q_{i},$

where $Q_{i}$ is the maximum score of the ith clump. Note that the clumps are mutually independent. Moreover, the exact distribution of $H_{n}$ has been derived as follows: (7) $P (H_{n} < t) = 1 - P_{1} Π^{n} P'_{t + 1},$

where P_i is a $1 \times (t + 1)$ vector whose ith element is 1 and 0 elsewhere, and the $(t + 1) \times (t + 1)$ matrix $Π$ is filled with letter score distribution (Mercier and Daudin, 2001). Therefore, the combination of Equations (6) and (7) gives (8) $P (Q_{1} < t) = \sqrt[d]{P (H_{n} < t)},$
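Equations (7) and (8) can be illustrated numerically. The sketch below assumes that $Π$ is the transition matrix of the process $W_{k} = \max (0, W_{k - 1} + X_{k})$ on the levels $0, \dots, t$ , with level t absorbing; this is our reading of Mercier and Daudin (2001), and the article itself only says that $Π$ is filled with the letter score distribution. Instead of forming $Π^{n}$ explicitly, the row vector $P_{1} Π^{k}$ is propagated step by step. Function names and the example score distribution are assumptions.

```python
def local_score_cdf(score_pmf, n, t):
    """P(H_n < t) for i.i.d. integer scores with pmf `score_pmf` (dict), t >= 1."""
    state = [0.0] * (t + 1)
    state[0] = 1.0                      # P_1: the process starts at level 0
    for _ in range(n):
        nxt = [0.0] * (t + 1)
        nxt[t] = state[t]               # level t is absorbing
        for level in range(t):
            if state[level] == 0.0:
                continue
            for x, p in score_pmf.items():
                new = min(t, max(0, level + x))   # reflect at 0, absorb at t
                nxt[new] += state[level] * p
        state = nxt
    return 1.0 - state[t]               # 1 - P_1 Pi^n P'_{t+1}

def clump_score_cdf(score_pmf, n, t, rho):
    """Equation (8): d-th root of P(H_n < t), with d = rho * n expected clumps."""
    return local_score_cdf(score_pmf, n, t) ** (1.0 / (rho * n))
```

As a check on a hypothetical scheme with scores -1 and +1, each with probability 0.5, two letters reach a local score of 2 only if both scores are +1, so `local_score_cdf({-1: 0.5, 1: 0.5}, 2, 2)` is 0.75.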

where $d = ρ n$ is the expected number of clumps. By extending Equation (8) to the two-sequence case and using Equation (2), the distribution of Z is determined as (9) $P (Z < t) = \sqrt[d']{P (H_{n, m} < t)},$

where $d' = ρ n m$ is the expected number of clumps. It is worth mentioning that, unlike in previous studies, the expectation of the scoring scheme need not be negative.

For the maximal partial sum, unconstrained maximum, an exact distribution is derived (Mercier et al., 2003) as (10) $P (Z' = k) = \sum_{i = 1}^{f} δ_{i} R_{i}^{k},$

where the $R_{i}$ 's are the roots of a polynomial that satisfy $| R_{i} | < 1$ , f is the number of such roots, and the $δ_{i}$ 's are computed from recursive linear equation systems. The polynomial is based on the scoring scheme and its corresponding probabilities. Note that Equation (10) is valid in the logarithmic framework, that is, when the expectation of the scoring scheme is negative.
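A simple special case shows the form of Equation (10). For a hypothetical scoring scheme with only the scores +1 (probability p) and -1 (probability $q = 1 - p > p$ ), the maximum of the random walk is geometric, a classical result, so the sum has a single term with root $R = p ∕ q$ and coefficient $δ = 1 - R$ . The sketch below encodes this case; it does not implement the general root-finding procedure of Mercier et al. (2003), and the function names are illustrative.

```python
def max_walk_pmf(k, p):
    """P(Z' = k) = (1 - R) R**k with R = p / (1 - p), valid for p < 0.5."""
    r = p / (1.0 - p)
    return (1.0 - r) * r ** k

def max_walk_tail(k, p):
    """P(Z' >= k) = R**k."""
    return (p / (1.0 - p)) ** k
```

The tail satisfies the one-step identity $P (Z' \geq k) = p P (Z' \geq k - 1) + q P (Z' \geq k + 1)$ for $k \geq 1$ , obtained by conditioning on the first score, which can be used as a numerical check.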

3.1. Exact distribution of the constrained and unconstrained maxima relative to U walk

Let V and $V'$ be the constrained and unconstrained maxima, respectively, of the U walk. From the definition of $V (Z)$ , clearly, $V \leq V (Z) \leq V'$ . These maxima have a similar asymptotic behavior (Iglehart, 1972). Therefore, we bound the distribution of $V (Z) = \max (0, U_{1}, U_{1} + U_{2}, \dots, U_{1} + \dots + U_{T (Z)})$ by those of V and $V'$ . For this purpose, the distribution of U is required, which is obtained from Equation (10) as follows: (11) $P (U_{1} \leq t) = \prod_{l > 0} (\sum_{x = 0}^{t + g (l)} \sum_{i = 1}^{f} δ_{i} R_{i}^{x})^{2},$

where the $R_{i}$ 's are the roots of a polynomial and f is the number of roots with absolute value strictly less than 1. The inner summation is a convergent series since $| R_{i} | < 1$ . Moreover, for large l and t, the summation on x converges to 1 because it is a cumulative distribution function. Note that this distribution is exact and valid in the logarithmic case, that is, $E [s (A, B)] < 0$ . As already mentioned, the diagonals of the dot matrix are assumed to be independent; thus, the $U_{i}$ 's are independent.
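Equation (11) can be evaluated numerically once the cumulative distribution of $Z'$ is available. The sketch below reads the product as $\prod_{l > 0} P (Z' \leq t + g (l))^{2}$ , the exponent 2 accounting for the two sequences into which the gap can be inserted, and truncates the infinite product once the factors are numerically 1. The geometric distribution used for $Z'$ is the one-root toy case, and all names, tolerances, and parameter values are assumptions for illustration.

```python
def geometric_max_cdf(x, r):
    """P(Z' <= x) when P(Z' = k) = (1 - r) r**k (one-root toy case)."""
    return 0.0 if x < 0 else 1.0 - r ** (x + 1)

def greedy_step_cdf(t, zprime_cdf, gap_open, gap_ext, tol=1e-15, max_len=10_000):
    """P(U_1 <= t) as the product over gap lengths l of P(Z' <= t + g(l))**2."""
    prob = 1.0
    for l in range(1, max_len + 1):
        f = zprime_cdf(t + gap_open + gap_ext * l)   # g(l) = gap_open + gap_ext * l
        prob *= f * f
        if prob == 0.0 or 1.0 - f < tol:             # remaining factors are ~1
            break
    return prob
```

With gap opening 11 and extension 2, $U_{1}$ cannot be smaller than $- g (1) = - 13$ , so the function returns 0 for $t = - 14$ and increases towards 1 as t grows.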

The exact distribution of $V'$ is obtained similarly to that of U by (12) $P (V' = t) = \sum_{i = 1}^{f'} δ'_{i} (R'_{i})^{t},$

where the parameters are computed similarly to those of Equation (11), with the difference that the greedy score, U, is used rather than the scoring scheme, s. Note that the average of the greedy extension scores is required to be negative.

The exact distribution of V is obtained from Equation (8), which is calculated for U sequence. Let $H_{n_{u}}$ be the local alignment score of U sequence of length n_u, so (13) $P (V < t) = \sqrt[d_{u}]{P (H_{n_{u}} < t)},$

where d_u is the expected number of clumps. Note that, for the calculation of $P (H_{n_{u}} < t)$ from Equation (7), the matrix $Π$ is filled with the distribution of U. Also, the power of $Π$ , the length of U sequence, is a positive random variable. So, Equation (13) is rewritten as (14) $P (V < t) = \sqrt[d_{u}]{\sum_{n} P (H_{n_{u}} < t | n_{u} = n) P (n_{u} = n)},$

where $d_{u} \approx ρ' E (n_{u})$ with $ρ'$ being the probability that a coordinate of U sequence marks a ladder point. The greedy extension number can be interpreted as the gap numbers of the optimal global alignment of two sequences $A$ and $ℬ$ . Of course, the gap number is always equal to or larger than n_u since the greedy extension procedure may be stopped before the end of sequences. In the next section, we propose a distribution for n_u and it is confirmed by simulation.

3.2. Distribution of $W_{G E M}$

As already mentioned, the distribution of $W (Z)$ is derived from the obvious relation $W \leq W (Z) \leq W'$ , where $W = Z + V$ and $W' = Z + V'$ . First, we need the distribution of Z, which is calculated by Equation (9). The distributions of W and $W'$ are then derived by assuming independence between the variable Z and the two variables V and $V'$ . This assumption seems reasonable since V and $V'$ are obtained from diagonals different from that of Z.

To determine $P (\text{the coordinate marks the start of a gapped alignment of score} > t)$ , we define a clump start. The coordinate $(i, j)$ is the beginning of a clump if

$(i, j)$ is the start of an ungapped clump of positive score, with probability $K = ρ P (Z > 0) = ρ (1 - P (Z < 1))$ computed from Equation (9). Note that IGEM gives a single value, whereas MT finds an interval for this probability; and

the unconstrained maximum of GEM algorithm that runs backward from $(i, j)$ never attains positive score (equivalent to $(i, j)$ being a ladder point) with probability $Θ = P (V' < 0)$ that is obtained from Equation (12).

So, $P (\text{the coordinate marks the start of a gapped alignment of score} > t)$ lies in the interval (15) $(K Θ P (W > t), K Θ P (W' > t)) .$

Finally, by the Poisson clumping heuristic and like Equation (5), the approximate distribution of $W_{G E M}$ is given as (16) $P (W_{G E M} \leq t) \in (exp (- m n K Θ P (W' > t)), exp (- m n K Θ P (W > t))) .$
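The bounds of Equations (15) and (16) are straightforward to evaluate once the tail probabilities of W and $W'$ are known. The sketch below assumes, as in the text, that Z is independent of V and $V'$ , so that the tails are obtained by convolving probability mass functions given as lists indexed by score. The function names and all numerical inputs in the example are hypothetical.

```python
import math

def sum_tail(pmf_z, pmf_v, t):
    """P(Z + V > t) for independent non-negative integer Z and V (pmf lists)."""
    return sum(pz * pv
               for i, pz in enumerate(pmf_z)
               for j, pv in enumerate(pmf_v)
               if i + j > t)

def igem_cdf_interval(m, n, k_prob, theta, tail_w, tail_w_prime):
    """Equation (16): bounds on P(W_GEM <= t) from P(W > t) and P(W' > t)."""
    lower = math.exp(-m * n * k_prob * theta * tail_w_prime)
    upper = math.exp(-m * n * k_prob * theta * tail_w)
    return lower, upper
```

Since $W \leq W'$ implies $P (W > t) \leq P (W' > t)$ , the first bound is the smaller one. The corresponding p-value interval for an observed score t is obtained by taking complements, that is, `(1 - upper, 1 - lower)`.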

4. Numerical Results

In this section, the efficiency of the proposed p-value is assessed. For this purpose, the gap number distribution must be determined. Here, we are interested in short and medium sequences, for which classical methods such as BLAST fail.

4.1. Gap number distribution

As is mentioned in Section 3.1, the length of U sequence, n_u, is necessary to determine the distribution of V in Equation (14). As far as we know, this is the first study on the gap number distribution that is independent of gap length. A number of studies have been carried out on gap frequency relative to gap length, which empirically shows that gap distribution follows a power law distribution (Gu and Li, 1995; Qian and Goldstein, 2001; Zhang and Gerstein, 2003; Goonesekere and Lee, 2004).

To derive the gap number distribution, several databases are independently generated from a fixed letter distribution obtained from the letter frequencies of some Homo sapiens (human) sequences. Each database contains 10,000 independent pairs of sequences of given lengths. For each sequence pair, the path and the score of an optimal global alignment are determined by using the Needleman–Wunsch algorithm, and the gap number, $N_{g}$ , is counted. The scoring scheme is BLOSUM62, and the gap opening penalty is set at 9, 11, and 13 with a gap extension penalty of 2.

It is clear that $N_{g} > 0$ when the sequence lengths differ. Let $\bar{N}$ and $σ_{N}^{2}$ be the observed mean and variance of $N_{g}$ , respectively. The distribution of $N_{g}$ can be approximated by (17) $P (N_{g} = x) = \frac{P (x - 0.5 < T < x + 0.5)}{P (T > 0.5)},$

where $T \sim N (\bar{N}, σ_{N}^{2})$ . The denominator reflects the truncation of the normal variable T to positive gap numbers. The p-values of the Kolmogorov–Smirnov test are given in Table 1. Figure 3 compares the theoretical distribution of Equation (17) with the empirical distribution of $N_{g}$ for several sequence lengths and gap penalties. The results indicate that the distribution of $N_{g}$ is reasonably described by Equation (17).
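The discretized, truncated normal of Equation (17) can be evaluated with the error function from the standard library. The sketch assumes that the normalising constant is $P (T > 0.5)$ , so that the probabilities over $x \geq 1$ sum to one; the function names and the mean and standard deviation in the example are hypothetical, not fitted values from the article.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of N(mean, sd**2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def gap_number_pmf(x, mean, sd):
    """P(N_g = x) for integer x >= 1 under the truncated, discretized normal."""
    if x < 1:
        return 0.0
    numerator = normal_cdf(x + 0.5, mean, sd) - normal_cdf(x - 0.5, mean, sd)
    denominator = 1.0 - normal_cdf(0.5, mean, sd)    # assumed to be P(T > 0.5)
    return numerator / denominator
```

For instance, with a mean of 17 and a standard deviation of 4 (illustrative values), the probabilities `gap_number_pmf(x, 17.0, 4.0)` for x = 1, 2, ... sum to one and peak at x = 17.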

FIG. 3.

Comparison between the logarithms of the empirical and theoretical [Eq. (17)] distributions of $N_{g}$ for different sequence lengths and gap opening penalties. Circles and triangles represent the empirical and truncated normal distributions, respectively.

Table 1.

p-Values of the Kolmogorov–Smirnov Test Comparing the Distribution of $N_{g}$ with Equation (17) for Different Lengths of Sequences and Gap Openings

Lengths	Gap opening
$M - N$	9	11	13
$40 - 80$	0.15	0.1	0.1
$30 - 40$	0.2	0.05	0.045
$40 - 50$	0.15	0.1	0.07
$50 - 55$	0.19	0.09	0.06
$55 - 80$	0.18	0.2	0.15
$30 - 50$	0.1	0.15	0.051
$80 - 90$	0.2	0.2	0.15
$80 - 100$	0.18	0.15	0.18
$50 - 150$	0.2	0.17	0.1
$50 - 160$	0.2	0.2	0.15
$50 - 210$	0.2	0.2	0.15
$150 - 160$	0.2	0.2	0.2
$150 - 300$	0.2	0.2	0.2
$200 - 400$	0.2	0.2	0.2
$200 - 210$	0.2	0.2	0.2
$360 - 370$	0.2	0.2	0.2
$100 - 300$	0.2	0.2	0.2
$200 - 600$	0.2	0.2	0.2

4.2. Simulation study

Here, we use two strategies to compare IGEM in Equation (16) with MT: (1) a comparison of the p-values obtained from IGEM and MT with empirical values and (2) an assessment of IGEM and MT using real and simulated sequences with different degrees of similarity.

For the first strategy, according to the letter distribution proposed by Robinson and Robinson (1991), we independently generated $L = 10^{4}$ pairs of sequences with length 300. Counting the gap number of all pairs gives $n_{u} \approx 17$ , which means the length of the U sequences is ∼17.

Local scores are calculated using the scoring scheme BLOSUM62 with gap opening and extension penalties of 11 and 2, respectively. An empirical p-value, $P_{e}$ , is computed from the generated sequences by $P_{e} (a) = \frac{\text{the number of pairs with local score} \geq a}{L}$
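The empirical p-value is a simple proportion, as the following sketch shows. The function name and the example scores are illustrative assumptions.

```python
def empirical_p_value(scores, a):
    """Fraction of simulated pairs whose local score is at least a."""
    return sum(1 for s in scores if s >= a) / len(scores)
```

For example, `empirical_p_value([10, 20, 30, 40], 30)` is 0.5. The smallest non-zero value that such an estimate can take is 1/L, so a database of $10^{4}$ pairs cannot resolve p-values much below $10^{-4}$ .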

Table 2 gives, for each z, the interval of local scores that the IGEM and MT methods require to reach a p-value of magnitude $10^{- z}$ , that is, $10^{- (z + 1)} < p\text{-value} \leq 10^{- z}$ .

Table 2.

For a p-Value of Magnitude $10^{- z}$ , the Empirical Local Score, $S_{e}$ , and Intervals of Local Score Based on IGEM and MT Methods Are Calculated from Simulated Sequences

z	$S_{e}$	$S_{I G E M}$	$S_{M T}$
4	$60.6$	$(74.0, 74.7)$	$(61.3, 63.2)$
5	$75.0$	$(74.9, 75.0)$	$(69.3, 71.3)$
6	$-$	$(75.0, 76.9)$	$(77.3, 79.3)$
8	$-$	$(85.8, 89.1)$	$(93.2, 95.2)$

IGEM, improved greedy extension model; MT, Mott and Tribe.

Table 2 gives the empirical local scores, $S_{e}$ , and two intervals of local score, $S_{I G E M}$ and $S_{M T}$ , whose p-value achieves $10^{- z}$ . For $z = 5$ , the score interval of the proposed method, $S_{I G E M}$ , is closer to the empirical score than that of MT, $S_{M T}$ . For $z = 4$ , our method is less accurate than MT. However, in practice, we are interested in $z > 4$ . Moreover, there is no empirical result for $z = 6$ and 8 because a larger database would be required to assess IGEM.

Mott and Tribe (1999) highlight the fact that their method provides a better way for calculating the p-value than BLAST by obviating the need for an empirical p-value as a reference. Our goal, however, is to propose an alternative to BLAST p-value, especially for short sequences. Hence, we need an empirical p-value. This fact complicates our simulations because the p-values of our interest are very small, which requires larger databases.

To overcome the limitation of simulation, we use the Metropolis Coupled Markov Chain Monte Carlo (MCMCMC) (Geyer, 1991). This method has been applied in several fields such as physics (e.g., Earl and Deem, 2005) and biology (e.g., Zhou et al., 2001; Hartmann, 2002; Zhou, 2004; Wolfsheimer et al., 2007). Wolfsheimer et al. (2007) used MCMCMC algorithm to generate amino acid sequences with high degree of similarity, which is our main goal in simulation.

To generate similar sequences, the sampling region can be divided into segments, and the sample can be drawn with high probability from a specific segment (in this case, the tail of the local score distribution). In other words, the simple idea for achieving a large local score is to sample from a different distribution. The statistical significance of an observed score, s, for two sequences x and y is (18) $P (s) = P (S (x, y) \geq s) = \sum_{x, y} I_{s} (x, y) P (x, y),$

where $I_{s} (x, y)$ is an indicator function with $I_{s} = 1$ if $S (x, y) \geq s$ and 0 otherwise, and $P (x, y)$ is the probability under the null model. Let $(x_{i}, y_{i}), i = 1, \dots, n,$ be a random sample of sequence pairs obtained under the null model. Then, Equation (18) can be approximated as $P (s) \approx \frac{1}{n} \sum_{i = 1}^{n} I_{s} (x_{i}, y_{i})$ . Also, with a sampling distribution $f (x, y)$ and weights $q (x, y) = I_{s} (x, y) P (x, y) ∕ f (x, y)$ , one can rewrite Equation (18) as (19) $\begin{matrix} P (s) = P (S (x, y) \geq s) = \sum_{x, y} I_{s} (x, y) P (x, y) \\ = \sum_{x, y} f (x, y) \frac{I_{s} (x, y) P (x, y)}{f (x, y)} \\ = \sum_{x, y} f (x, y) q (x, y) \\ \approx \frac{1}{n} \sum_{i = 1}^{n} q (x_{i}, y_{i}) \end{matrix} .$
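The reweighting identity in Equation (19) can be checked by exact enumeration on a very small example. The sketch below uses hypothetical ingredients throughout: a two-letter alphabet, a toy score that counts matching positions instead of a local alignment score, and an exponentially tilted sampling distribution. It sums over all pairs rather than sampling, so it verifies the identity but is not a Monte Carlo estimator.

```python
import itertools
import math

LETTER_P = {"A": 0.7, "B": 0.3}     # hypothetical null letter distribution

def null_prob(x, y):
    """P(x, y): product of the letter probabilities of both sequences."""
    return math.prod(LETTER_P[c] for c in x + y)

def toy_score(x, y):
    """Toy stand-in for S(x, y): number of matching positions."""
    return sum(1 for a, b in zip(x, y) if a == b)

def tail_probability_two_ways(length, s, temperature):
    """Return P(S >= s) computed directly and via the weights q = I_s P / f."""
    seqs = list(itertools.product(LETTER_P, repeat=length))
    pairs = [(x, y) for x in seqs for y in seqs]
    direct = sum(null_prob(x, y) for x, y in pairs if toy_score(x, y) >= s)
    tilt = {pair: null_prob(*pair) * math.exp(toy_score(*pair) / temperature)
            for pair in pairs}
    norm = sum(tilt.values())
    reweighted = 0.0
    for pair in pairs:
        f = tilt[pair] / norm                       # normalised sampling distribution
        q = null_prob(*pair) / f if toy_score(*pair) >= s else 0.0
        reweighted += f * q
    return direct, reweighted
```

For sequences of length 2 and s = 2, both values equal $0.58^{2} = 0.3364$ , since each position matches with probability $0.7^{2} + 0.3^{2} = 0.58$ . In a real application the sum over f is replaced by an average over pairs sampled from f, and the normalising constant of f has to be estimated as well.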

By drawing a sample according to $f (x, y)$ , $P (s)$ can be estimated through the average of q over this sample. Hartmann (2002) and Wolfsheimer et al. (2007) propose a distribution for $f (x, y)$ as (20) $f (x, y) \equiv p (x, y) \exp \{\frac{s (x, y)}{T}\},$

where T is a parameter that describes the temperature of a physical system. In other words, each pair of sequences, $(x, y)$ , is a state of a physical system, and a Markov chain of sequence pairs is used to generate pairs of similar sequences. A sample can be drawn from Equation (20) by using the MCMCMC algorithm. At iteration $t + 1$ , a proposed pair of sequences, $(x', y')$ , is accepted as $(x_{t + 1}, y_{t + 1})$ with probability (21) $P ((x', y') | (x_{t}, y_{t})) = \min \{1, \exp \{\frac{s (x', y') - s (x_{t}, y_{t})}{T}\}\} .$
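The acceptance rule of Equation (21) can be sketched with a single Metropolis chain. This is a deliberately simplified illustration: it runs one chain at one temperature, whereas the coupled scheme (MCMCMC) runs several chains at different temperatures and exchanges states between them, which is omitted here. The uniform four-letter null model, the toy score counting matching positions, and all names are assumptions. A proposal redraws one letter from the null model, for which the acceptance ratio for the target of Equation (20) reduces to the exponential of the score difference divided by T.

```python
import math
import random

ALPHABET = "ACGT"        # hypothetical alphabet with a uniform null model

def toy_score(x, y):
    """Toy stand-in for the alignment score: number of matching positions."""
    return sum(1 for a, b in zip(x, y) if a == b)

def metropolis_scores(length, temperature, steps, rng):
    """Run one Metropolis chain and return the score after each step."""
    x = [rng.choice(ALPHABET) for _ in range(length)]
    y = [rng.choice(ALPHABET) for _ in range(length)]
    current = toy_score(x, y)
    samples = []
    for _ in range(steps):
        seq = x if rng.random() < 0.5 else y
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(ALPHABET)          # proposal drawn from the null model
        proposed = toy_score(x, y)
        delta = proposed - current
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current = proposed                   # accept the proposed pair
        else:
            seq[pos] = old                       # reject and restore the old letter
        samples.append(current)
    return samples
```

With sequences of length 20 and T = 0.5, the chain drifts from the null average of about 5 matches towards pairs with many more matches, which is the purpose of the tilted distribution: high-scoring pairs that are rare under the null model are visited often.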

This process yields a random sample of sequences from which the p-value of similar sequences is computed (Newberg, 2008). In our calculation, the scoring scheme BLOSUM62 is used with gap opening and extension penalties of 11 and 2, respectively. The results of comparing the MT and IGEM methods are presented in Table 3. For each sequence length, the magnitude of the p-value and the corresponding intervals of local score are calculated. The local score of the MT method is larger than that of IGEM for the same p-value, which indicates the higher sensitivity of IGEM. In other words, for high-similarity cases, the MT method requires a larger local score to achieve a given small p-value.

Table 3.

For a p-Value of Magnitude $10^{- z}$ , the Intervals of Local Score Based on IGEM and MT Methods Are Calculated from Simulated Sequences, for Different Sequence Lengths, M and N

Sequence length	z	$S_{I G E M}$	$S_{M T}$
$M = 400; N = 400$	15	$(127, 133)$	$(144, 149)$
$M = 150; N = 150$	23	$(190, 195)$	$(200, 206)$
$M = 100; N = 125$	24	$(188, 193)$	$(198, 204)$
$M = 300; N = 150$	24	$(183, 189)$	$(210, 216)$
$M = 80; N = 130$	26	$(180, 184)$	$(221, 227)$
$M = 80; N = 180$	27	$(208, 213)$	$(230, 237)$

4.3. Real sequences

To evaluate the MT and IGEM methods on a real database, we use a subset of the SCOP 2.06 (Structural Classification of Proteins) database at family, superfamily, and fold levels.

The receiver operating characteristic (ROC) curve is used to assess the performance of the MT and IGEM methods (Fig. 4). A method with a larger area under the curve (AUC) is considered more precise and accurate in discrimination (Teichert et al., 2010). Table 4 shows that the IGEM method outperforms the MT method at all levels.
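The area under the ROC curve equals the probability that a randomly chosen related pair receives a more significant value than a randomly chosen unrelated pair, with ties counted as one half. The sketch below computes it directly from this definition; it assumes that larger inputs mean higher significance (for example, minus the logarithm of the p-value), and the function name and example values are illustrative.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve from pairwise comparisons of scores."""
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5      # ties count as one half
    return wins / (len(positive_scores) * len(negative_scores))
```

A perfect separation such as `auc([3, 4], [1, 2])` gives 1.0, and a value of 0.5 corresponds to no discrimination between related and unrelated pairs.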

FIG. 4.

ROC curves related to sequence comparison at family (a), super family (b), and fold (c) level based on statistical significance obtained from the MT and IGEM methods. IGEM, improved GEM; MT, Mott and Tribe; ROC, receiver operating characteristic.

Table 4.

Area Under Curve Values of the IGEM and MT Methods Based on Real Sequences

	AUC
	IGEM	MT
Family	$0.9356$	$0.8983$
Super family	$0.8990$	$0.8170$
Fold	$0.7084$	$0.6328$

AUC, area under curve.

5. Conclusion

Approximating the p-value of a sequence alignment is common practice in biological and bioinformatic studies. However, approximate p-values can lack precision, so the accuracy of the approximation is crucial. This motivates a new method with higher accuracy in terms of statistical significance. For this purpose, we propose an improved approximation of the p-value based on the greedy extension model, namely IGEM, which is derived from the exact distribution of the constrained and unconstrained maxima. As discussed in Section 4, the proposed method shows advantages over the MT method (a closer match to the empirical score for p-values of magnitude $10^{- 5}$ on simulated data and a larger AUC at the family, superfamily, and fold levels on real data), which makes it a prospective candidate for sequence alignment studies.

Footnotes

Author Disclosure Statement

The authors declare they have no competing financial interests.

Funding Information

No funding was received for this article.

References

Aldous. 1989. Probability Approximations via the Poisson Clumping Heuristic. Springer-Verlag, New York.

Altschul, S.F., and Gish. 1996. Local alignment statistics. Methods Enzymol. 266, 460–480.

Bailey, T.L., and Gribskov. 2002. Estimating and evaluating the statistics of gapped local-alignment scores. J. Comput. Biol. 9, 575–593.

Bundschuh. 2002. Rapid significance estimation in local sequence alignment with gaps. J. Comput. Biol. 9, 243–260.

Chabriac, Lagnoux, Mercier, et al. 2014. Elements related to the largest complete excursion of a reflected BM stopped at a fixed time. Application to local score. Stoch. Process. Appl. 124, 4202–4223.

Earl, D.J., and Deem, M.W. 2005. Parallel tempering: Theory, applications, and new perspectives. Phys. Chem. Chem. Phys. 7, 3910–3916.

Fayyaz Movaghar, Mercier, and Ferré. 2007. H-tuple approach to evaluate statistical significance of biological sequence comparison with gaps. Stat. Appl. Genet. Mol. Biol. 6, 1–21.

Feller. 1972. An Introduction to Probability Theory and Its Applications, vol. 2. John Wiley & Sons, New York, NY.

Geyer, C.J. 1991. Markov chain Monte Carlo maximum likelihood. To appear in Computer Science and Statistics: Proceedings of the 23rd Symposium on the Interface.

Goonesekere, N.C., and Lee. 2004. Frequency of gaps observed in a structurally aligned protein pair database suggests a simple gap penalty function. Nucleic Acids Res. 32, 2838–2843.

Gu, and Li, W.-H. 1995. The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J. Mol. Evol. 40, 464–473.

Hartmann, A.K. 2002. Sampling rare events: Statistics of local sequence alignments. Phys. Rev. E 65, 056102.

Hassenforder, and Mercier. 2007. Exact distribution of the local score for Markovian sequences. Ann. Inst. Stat. Math. 59, 741–755.

Henikoff, and Henikoff, J.G. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U. S. A. 89, 10915–10919.

Iglehart, D.L. 1972. Extreme values in the GI/G/1 queue. Ann. Math. Stat. 627–635.

Karlin, and Altschul, S.F. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U. S. A. 87, 2264–2268.

Karlin, and Dembo. 1992. Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv. Appl. Probab. 24, 113–140.

Lagnoux, Mercier, and Vallois. 2016. Statistical significance based on length and position of the local score in a model of i.i.d. sequences. Bioinformatics 33, 654–660.

Mercier, Cellier, and Charlot. 2003. An improved approximation for assessing the statistical significance of molecular sequence features. J. Appl. Probab. 40, 427–441.

Mercier, and Daudin, J.-J. 2001. Exact distribution for the local score of one iid random sequence. J. Comput. Biol. 8, 373–380.

Mott. 1992. Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull. Math. Biol. 54, 59–75.

Mott. 2000. Accurate formula for p-values of gapped local sequence and profile alignments. J. Mol. Biol. 300, 649–659.

Mott, and Tribe. 1999. Approximate statistics of gapped alignments. J. Comput. Biol. 6, 91–112.

Neuhauser. 1994. A Poisson approximation for sequence comparisons with insertions and deletions. Ann. Stat. 22, 1603–1629.

Newberg, L.A. 2008. Significance of gapped sequence alignments. J. Comput. Biol. 15, 1187–1194.

Qian, and Goldstein, R.A. 2001. Distribution of indel lengths. Proteins 45, 102–104.

Robinson, A.B., and Robinson, L.R. 1991. Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proc. Natl. Acad. Sci. U. S. A. 88, 8880–8884.

Siegmund, and Yakir. 2000. Approximate p-values for local sequence alignments. Ann. Stat. 657–680.

Teichert, Minning, Bastolla, et al. 2010. High quality protein sequence alignment by combining structural profile prediction and profile alignment using SABERTOOTH. BMC Bioinformatics 11, 251.

Waterman, M.S. 1995. Introduction to Computational Biology: Maps, Sequences and Genomes. CRC Press, London.

Waterman, M.S., and Vingron. 1994a. Rapid and accurate estimates of statistical significance for sequence data base searches. Proc. Natl. Acad. Sci. U. S. A. 91, 4625–4628.

Waterman, M.S., and Vingron. 1994b. Sequence comparison significance and Poisson approximation. Stat. Sci. 9, 367–381.

Wolfsheimer, Burghardt, and Hartmann, A.K. 2007. Local sequence alignments statistics: Deviations from Gumbel statistics in the rare-event tail. Algorithms Mol. Biol. 2, 9.

Zhang, and Gerstein. 2003. Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Res. 31, 5338–5348.

Zhang, Schwartz, Wagner, et al. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214.

Zhou. 2004. Exploring the protein folding free energy landscape: Coupling replica exchange method with P3ME/RESPA algorithm. J. Mol. Graph. Model. 22, 451–463.

Zhou, Berne, B.J., and Germain. 2001. The free energy landscape for β hairpin folding in explicit water. Proc. Natl. Acad. Sci. U. S. A. 98, 14931–14936.
