Incorporating Nearest-Neighbor Site Dependence into Protein Evolution Models

Abstract

Evolutionary models of proteins are widely used for statistical sequence alignment and inference of homology and phylogeny. However, the vast majority of these models rely on an unrealistic assumption of independent evolution between sites. Here we focus on the related problem of protein structure alignment, a classic tool of computational biology that is widely used to identify structural and functional similarity and to infer homology among proteins. A site-independent statistical model for protein structural evolution has previously been introduced and shown to significantly improve alignments and phylogenetic inferences compared with approaches that utilize only amino acid sequence information. Here we extend this model to account for correlated evolutionary drift among neighboring amino acid positions. The result is a spatiotemporal model of protein structure evolution, described by a multivariate diffusion process convolved with a spatial birth–death process. This extended site-dependent model (SDM) comes with little additional computational cost or analytical complexity compared with the site-independent model (SIM). We demonstrate that this SDM yields a significant reduction of bias in estimated evolutionary distances and helps further improve phylogenetic tree reconstruction. We also develop a simple model of site-dependent sequence evolution, which we use to demonstrate the bias resulting from the application of standard site-independent sequence evolution models.

1. Introduction

Protein alignment is an integral part of bioinformatic analyses and is a classic, widely studied problem in computational biology. Existing methods for aligning two or more proteins compare amino acid sequences and/or structures of the proteins, and encompass a variety of algorithms with different strengths and purposes. Such algorithms are a fundamental part of phylogenetic research in particular, where the degree and nature of evolutionary divergence between species are quantities of interest. Alignment procedures that are widely used in studies of protein evolution are based only on the amino acid sequence and do not incorporate the tertiary (three-dimensional) structure of the proteins. Methods that do incorporate tertiary structure, such as those mentioned in Wang et al. (2013), do not account for the evolution over time of those structures.

Recently, Challis and Schmidler (2012) introduced a stochastic evolutionary model of protein sequence and structure for this purpose; however, their approach, similar to the vast majority of alignment algorithms, assumes that “sites” (individual amino acid characters or backbone atom coordinate triples) evolve independently of one another. This assumption is well known to be violated, since amino acid identities and spatial locations are highly dependent due to a combination of physicochemical constraints and interactions, including bond lengths and excluded volume, hydrophobic and electrostatic attraction and repulsion, hydrogen bonding, and other cooperative effects in forming stable local and global protein structure. Nevertheless, alignment algorithms based on both sequence and structural information typically ignore the correlations induced by these interactions. Ignoring dependence is often justified by the computational intractability of site-dependent models (SDMs) (Challis and Schmidler, 2012; Herman et al., 2014). In this article we demonstrate that in structure-based alignment, as in sequence-based alignment, ignoring site dependence systematically biases evolutionary inference. We present an expanded version of the Challis and Schmidler model that incorporates neighbor dependence without sacrificing computational tractability.

1.1. Motivation

von Haeseler and Schöniger (1998) examined the effect of site dependence on estimates of evolutionary distance between pairs of biological sequences. Using a model of whale mitochondrial DNA evolution whereby the sequence evolves as a collection of independent subsequences, each exhibiting Markovian dependence among its amino acids, the authors demonstrated the tendency to underestimate the true evolutionary distance between two sequences when using a site-independent model (SIM). Figure 3a replicates this effect using binary sequences from a nearest-neighbor site-dependent sequence model that does not assume independent subsequences, described in Section 5. When estimating the divergence time for these sequences under a site-independent version of the same model ($b = 1$ for the model in Section 5), the posterior distribution (Fig. 3a) shows significant underestimation of the true value.

Despite a variety of efforts, no site-dependent sequence model has emerged as a widely applicable replacement for commonly used site-independent sequence models (Arenas, 2015). The primary hurdle to doing so is computational—adding realistic dependence generally prohibits the use of efficient alignment algorithms that rely on dynamic programming. We address this issue further in Section 5.

In contrast, we demonstrate in Section 3 that the site-independent structural model of Challis and Schmidler (2012) can be extended to a site-dependent structural model, incorporating site dependence while maintaining the same interpretability and mathematical and computational tractability as the SIM. Thus, we can incorporate dependence into the evolutionary structural part of the model in a relatively straightforward way. Using data simulated from the SDM, we find a systematic underestimation effect for structural data due to the independent-site assumption, similar to that observed in sequences (Fig. 3e). The new SDM can then be paired with a sequence evolution model to provide a site-dependent expansion of the joint sequence-structure model of Challis and Schmidler (2012).

The article is organized as follows. Section 2 briefly reviews the SIMs commonly used for sequence evolution as well as the structural diffusion model of Challis and Schmidler (2012). Section 3 describes the general form of a dependent structural diffusion model and the details of incorporating dependence into the model, with computational tractability being the key constraint on the model's form. Section 4 describes a reparameterization of the SDM necessary for analyzing the SDM's effect on phylogenetic inference. Section 5 describes the basic site-dependent sequence model used in this article. Section 6 revisits the motivating example described earlier and compares inferences and phylogenies from the expanded model on a number of real protein examples.

2. Site-Independent Models (SIMs) of Protein Evolution

We briefly review the site independence assumption as applied in familiar sequence evolution models, as well as more recent developments in modeling structural evolution.

2.1. Sequence evolution

Stochastic models of protein sequence evolution are widely used for statistical alignment, evolutionary inference, and phylogenetic reconstruction (Challis and Schmidler, 2012; Herman et al., 2014; Wang and Schmidler, 2014). The general form taken for such models is similar to that used in Challis and Schmidler (2012), where the joint likelihood for the two sequences $S^{X}, S^{Y}$ and an alignment $ℳ$ between them is given by $\begin{aligned} p (S^{X}, S^{Y}, ℳ | λ, μ, τ, Q) &= P (S^{X}, S^{Y} | ℳ, τ, Q) \, P (ℳ | λ, μ, τ) \\ &= P (S_{M}^{Y} | S_{M}^{X}, τ, Q) \, P (S_{\bar{M}}^{Y} | π) \, P (S^{X} | π) \, P (ℳ | λ, μ, τ), \end{aligned}$ (1)

where $S_{M}^{X}, S_{M}^{Y}$ denote the matched (aligned) positions of the amino acid sequences $S^{X}$ and $S^{Y}$, $S_{\bar{M}}^{Y}$ the unmatched positions of $S^{Y}$, $Q$ the substitution rate matrix, and $π$ the equilibrium distribution of amino acid labels. The probabilities $P (S_{M}^{Y} | S_{M}^{X}, τ, Q)$ are given by a product of independent substitution probabilities at each site through the transition probability matrix $e^{Q τ}$. $P (S_{\bar{M}}^{Y} | π)$ and $P (S^{X} | π)$ are given by the equilibrium distribution $π$, and we refer the reader to Challis and Schmidler (2012) for a discussion of the Links indel model, which we use in this study to specify $P (ℳ | λ, μ, τ)$.

2.2. Challis–Schmidler model

Challis and Schmidler (2012) introduced a stochastic model for protein structure evolution, extending a previously developed probabilistic framework for structural alignment of proteins (Schmidler, 2006; Wang and Schmidler, 2014) into a model suitable for the study of molecular evolution. This study demonstrated the ability to significantly improve phylogenetic inference when structural information about the proteins is available (see also Herman et al., 2014). We briefly review the original Challis–Schmidler model here, before introducing our extended model incorporating site dependence in the next section. Throughout the article, these structural models will be referred to as the SIM and SDM, respectively.

The diffusion of individual $C_{α}$ backbone positions in space, over time, is modeled through an Ornstein–Uhlenbeck (OU) process. Independence is assumed between each site along the backbone as well as between the $(x, y, z)$ coordinates at each site, leading to the joint structure diffusion being modeled as a product of $3 n$ independent univariate OU processes: $d C_{i j}^{(t)} = θ (ζ_{j} - C_{i j}^{(t)}) d t + σ d B,$ (2)

where $C_{i j}^{(t)}$ denotes coordinate $j \in \{x, y, z\}$ of $α$-carbon $i$ at time $t$. This setup admits tractable stationary and conditional distributions but, as noted by Challis and Schmidler, fails to account for known biophysical interactions that lead to strong observed dependence between sites, such as bond length constraints and the effect of excluded volume in the protein. Although a protein structure's coordinate frame is arbitrarily determined by the experiment, we assume the two structures in our pairwise analyses share a coordinate frame; thus, for a pair of structures $C^{X}, C^{Y}$, we assume the coordinate frame of $C^{X}$ and do not distinguish between $C^{Y}$ and any rigid body rotation $R$ and translation $η$ thereof.
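As a minimal sketch of what Eq. (2) implies, the transition density of a univariate OU process is Gaussian with a mean that decays toward $ζ_{j}$ and a variance that grows toward $σ^{2} ∕ (2 θ)$. The function names below (`ou_transition`, `sample_descendant`) and the parameter values are illustrative assumptions, not code from the original study.

```python
import math
import random

def ou_transition(c_s, tau, theta, sigma, zeta):
    """Mean and variance of C^(t) given C^(s) = c_s for the OU process
    dC = theta * (zeta - C) dt + sigma dB, with tau = t - s."""
    decay = math.exp(-theta * tau)
    mean = zeta + (c_s - zeta) * decay
    var = sigma ** 2 / (2.0 * theta) * (1.0 - decay ** 2)
    return mean, var

def sample_descendant(coords, tau, theta, sigma, zeta, rng):
    """Diffuse every (x, y, z) coordinate independently, as the SIM assumes.
    coords: list of (x, y, z) tuples; zeta: (zeta_x, zeta_y, zeta_z)."""
    descendant = []
    for site in coords:
        new_site = []
        for c, z in zip(site, zeta):
            mean, var = ou_transition(c, tau, theta, sigma, z)
            new_site.append(rng.gauss(mean, math.sqrt(var)))
        descendant.append(tuple(new_site))
    return descendant

# Hypothetical three-residue "ancestor" and arbitrary parameter values.
rng = random.Random(0)
ancestor = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
descendant = sample_descendant(ancestor, tau=0.5, theta=1.0, sigma=1.0,
                               zeta=(0.0, 0.0, 0.0), rng=rng)
```

Because each coordinate is sampled separately, nothing in this sketch ties neighboring residues together, which is exactly the limitation the SDM is meant to address.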

We refer the reader to Challis and Schmidler (2012) for a more detailed discussion of the diffusion process described earlier and for various other model details omitted here for brevity. Finally, we give the likelihood function resulting from the aforementioned structural model. Let $Θ$ represent the full set of parameters for the structural model, $Θ = (t, σ^{2}, θ, R, η, λ, μ)$. Then the structural likelihood is $\begin{aligned} p (C^{X}, C^{Y}, ℳ | Θ) = {} & P (C_{M}^{Y} | C_{M}^{X}, t, σ^{2}, θ, R, η) \, P (C_{\bar{M}}^{Y} | σ^{2}, θ) \\ & \times P (C^{X} | σ^{2}, θ) \, P (ℳ | λ, μ, t) . \end{aligned}$ (3)

2.3. Bayesian inference

The models described earlier provide the likelihood of the data which can be used to estimate alignments, evolutionary trees and distances, and model parameters. Throughout this article we will follow the Bayesian paradigm, under which inference is based on the posterior distribution: $P (Θ, ℳ | S^{X}, S^{Y}, C^{X}, C^{Y}) \propto P (C^{X}, C^{Y}, S^{X}, S^{Y} | Θ, ℳ) \cdot P (Θ, ℳ),$ (4)

where the likelihood $P (C^{X}, C^{Y}, S^{X}, S^{Y} | Θ, ℳ)$ is given by the sequence and structure models in Eqs. (1) and (3). A key advantage of the SIM is that marginal likelihoods can be calculated efficiently, summing over possible alignments or evolutionary histories (see Section 3.2).

3. A Site-Dependent Structural Diffusion Model

We now proceed to extend the aforementioned SIM of protein structure evolution to incorporate site dependence. As will be shown, computational tractability can be preserved in the case where dependence is limited to nearest-neighbor relationships.

3.1. Dependence in a multivariate OU process

The independent site model (Eq. 2) can be written as a multivariate diffusion in the form $d C = - Θ (C - ς) d t + L d B_{t},$ (5)

where $Θ$ and $Σ = L L'$ are both proportional to identity matrices ($Θ = θ I$ and $Σ = σ^{2} I$, matching Eq. 2). Here the $3 n \times 1$ vector $C = (C_{x}, C_{y}, C_{z})$ contains the backbone $α$-carbon coordinates, $ς$ is the $3 n \times 1$ long-term mean vector, and $B_{t}$ represents $3 n$ independent univariate standard Brownian motion terms. Writing the model in this form makes clear that the assumption of site- (and coordinate-) independence can be relaxed by introduction of general $Θ$ and $Σ$, enabling a more expressive model. For convenience we factor $Θ = Σ_{d} ⨂ Θ_{p}$ and $Σ = Σ_{d} ⨂ Σ_{p}$ as Kronecker products, allowing coordinate dependence (subscript d) and backbone site dependence (subscript p) to be modeled separately.

For purposes of this article we set $Σ_{d} = I_{3}$, allowing the $x, y, z$ dimensions within an individual site to diffuse independently of each other. Observed data suggest that dependence between diffusion in the $(x, y, z)$ dimensions is not strong: Table 1 shows average sample correlations between spatial dimensions for 549 structures comprising a group of globins and a large group from the manually curated MALIDUP database (Cheng et al., 2008), as well as sample lag-1 autocorrelations (i.e., correlations between consecutive backbone $α$-carbons) within each spatial dimension. Although some proteins show weak to moderate correlation between spatial dimensions, the averages indicate the correlation is relatively weak compared with the strong autocorrelation along the backbone within a given spatial dimension. Consequently, we focus on incorporating dependence along the backbone rather than among spatial dimensions $x, y, z$.

Table 1.

Mean Sample Correlations Between Dimensions and Mean lag-1 Autocorrelations Along Dimensions for 71 Globin and 478 MALIDUP Protein Structures

	Lag-1 autocorrelation			Correlation
	x	y	z	(x,y)	(x,z)	(y,z)
Globins	0.95	0.95	0.95	−0.01	0.00	0.01
MALIDUP	0.93	0.93	0.93	0.01	0.02	−0.02
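The summary statistics in Table 1 can be computed per structure along the following lines. This is a sketch under stated assumptions: the exact estimator used for the table is not specified in the text, so Pearson correlations via `numpy.corrcoef` are assumed here, and the function names and the synthetic chain are hypothetical.

```python
import numpy as np

def lag1_autocorrelation(coords):
    """Per-dimension correlation between consecutive alpha-carbons.
    coords: (n, 3) array of backbone coordinates."""
    coords = np.asarray(coords, dtype=float)
    return np.array([np.corrcoef(coords[:-1, d], coords[1:, d])[0, 1]
                     for d in range(3)])

def cross_dimension_correlations(coords):
    """Sample correlations between the x, y, and z coordinate columns."""
    corr = np.corrcoef(np.asarray(coords, dtype=float), rowvar=False)
    return {"xy": corr[0, 1], "xz": corr[0, 2], "yz": corr[1, 2]}

# Synthetic random-walk "backbone" standing in for a real structure.
rng = np.random.default_rng(0)
chain = np.cumsum(rng.normal(size=(150, 3)), axis=0)
print(lag1_autocorrelation(chain))
print(cross_dimension_correlations(chain))
```

Averaging these per-structure values over a set of proteins would give quantities of the kind reported in the table.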

Under the SDM then, the joint evolution of the $3 n$ scalar coordinates specifying all $n$ backbone positions follows a multivariate OU process governed by $3 n \times 3 n$ matrix-valued parameters $Θ$ and $Σ$. This model introduces site dependence while preserving the analytical tractability of the conditional and limiting distributions of the process, important properties for phylogenetic inference. Under the diffusion process defined by the stochastic differential equation (Eq. 5), the joint distribution of $C^{(t)}$ (the full coordinate set at time $t$) conditional on $C^{(s)}$ is multivariate normal: $P (C^{(t)} | C^{(s)}) \sim N (e^{- Θ τ} C^{(s)} + (I - e^{- Θ τ}) ς, Σ_{τ})$ (6)

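with $τ$ denoting the time difference $(t - s)$ and with conditional covariance $Σ_{τ}$ given by $\operatorname{vec} (Σ_{τ}) = (Θ ⊕ Θ)^{- 1} (I - e^{- (Θ ⊕ Θ) τ}) \operatorname{vec} (Σ),$ (7)

where $\operatorname{vec} ()$ is the linear operator converting a matrix into a column vector and $Θ ⊕ Θ = Θ ⨂ I + I ⨂ Θ$ denotes the Kronecker sum. Letting $τ \to \infty$ in the conditional mean and covariance gives the stationary distribution $P (C) \sim N (ς, Σ_{\infty}),$ (8)

where the stationary covariance $Σ_{\infty}$ is expressed as $\operatorname{vec} (Σ_{\infty}) = (Θ ⊕ Θ)^{- 1} \operatorname{vec} (Σ) .$ (9)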

Although these closed-form solutions exist for general $Σ_{p}, Θ_{p}$ , they are in general not computationally tractable when convolved with the indel process of the evolutionary model from Challis and Schmidler (2012) (i.e., the Links model of Thorne et al., 1991) because the conditional independence required for dynamic programming is not preserved. To maintain computational tractability in phylogenetic applications, we require forms of $Θ_{p}$ and $Σ_{p}$ for which both the conditional and stationary distributions of the multivariate OU exhibit certain conditional independencies, as described in the next section.
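For readers who want to evaluate these covariances numerically, the sketch below uses the standard vec/Kronecker-sum closed forms for a multivariate OU process, $\operatorname{vec} (Σ_{τ}) = (Θ ⊕ Θ)^{- 1} (I - e^{- (Θ ⊕ Θ) τ}) \operatorname{vec} (Σ)$ with $Θ ⊕ Θ = Θ ⨂ I + I ⨂ Θ$, and its $τ \to \infty$ limit. The function names and the 3-site matrices are hypothetical choices for illustration, and SciPy's `expm` is assumed to be available.

```python
import numpy as np
from scipy.linalg import expm

def kronecker_sum(theta):
    """Theta (+) Theta = Theta (x) I + I (x) Theta."""
    eye = np.eye(theta.shape[0])
    return np.kron(theta, eye) + np.kron(eye, theta)

def stationary_cov(theta, sigma):
    """Solve (Theta (+) Theta) vec(Sigma_inf) = vec(Sigma)."""
    n = theta.shape[0]
    return np.linalg.solve(kronecker_sum(theta), sigma.reshape(-1)).reshape(n, n)

def conditional_cov(theta, sigma, tau):
    """Solve (Theta (+) Theta) vec(Sigma_tau) = (I - exp(-(Theta (+) Theta) tau)) vec(Sigma)."""
    n = theta.shape[0]
    ks = kronecker_sum(theta)
    rhs = (np.eye(n * n) - expm(-ks * tau)) @ sigma.reshape(-1)
    return np.linalg.solve(ks, rhs).reshape(n, n)

# Hypothetical 3-site drift and diffusion matrices (not the paper's Eq. 11).
theta = np.array([[1.0, 0.0, 0.0], [0.4, 0.8, 0.0], [0.1, 0.2, 1.2]])
sigma = np.array([[1.0, 0.3, 0.1], [0.3, 1.0, 0.3], [0.1, 0.3, 1.0]])
tau = 0.7
cov_inf = stationary_cov(theta, sigma)
cov_tau = conditional_cov(theta, sigma, tau)

# Consistency check: Sigma_tau = Sigma_inf - e^{-Theta tau} Sigma_inf e^{-Theta' tau}.
decay = expm(-theta * tau)
assert np.allclose(cov_tau, cov_inf - decay @ cov_inf @ decay.T)
```

The linear systems here are of size $n^{2} \times n^{2}$, so this direct approach is only practical for small $n$; it is meant to make the formulas concrete rather than to be efficient.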

3.2. Computational tractability in phylogenetic models

Common uses of evolutionary models, in phylogenetic or homology detection contexts, require the ability to optimize or average over the set of possible alignments. In a Bayesian or maximum likelihood context, the alignment must be inferred simultaneously with the other parameters. Because of the (exponentially large) size of the alignment space, algorithmic efficiency considerations in these calculations play a key role. In particular, calculating the joint likelihood $p (X, Y)$ of two structures X and Y marginalized over all possible alignments $ℳ$ is possible in SIMs by use of dynamic programming (the so-called forward algorithm for pair hidden Markov models (HMMs); see Durbin et al., 1998). These algorithms depend on conditional independence properties of the (marginal) likelihood of the backbone coordinates at a single backbone site given all previous backbone sites: $P (C_{i j}^{X}, C_{i j}^{Y} | C_{1 j}^{X}, C_{2 j}^{X}, \dots, C_{(i - 1) j}^{X}, C_{1 j}^{Y}, C_{2 j}^{Y}, \dots, C_{(i - 1) j}^{Y}) = P (C_{i j}^{X}, C_{i j}^{Y})$ (10)

with X and Y denoting ancestor and descendant structures, respectively. Models with long-range dependence among sites, including the dependent diffusion model (Eq. 5) with general $Θ, Σ = L L'$ , do not exhibit these conditional independence relationships and, therefore, prohibit the recursive decomposition that forms the basis of efficient dynamic programming calculations. Since an evolutionary model without efficient alignment algorithms is far too expensive to use in the context of phylogenetic tree inference, we desire a model that incorporates site dependence while still preserving sufficient conditional independence structure to permit use of a forward-type algorithm.

3.3. Constructing a dependent structural diffusion model

A natural approach to introducing limited neighbor dependence into the diffusion model is to consider the backbone sites' coordinates as a series of nodes with forces acting upon each pair of neighboring sites, for example, as in a ball and spring model. Figure 1 shows a general ball and spring model with spring constants $k_{i j}$ . This model corresponds to a probability distribution for the equilibrium positions of the backbone coordinates, which has precision matrix $Σ^{- 1} = (b_{i j})$ , where $b_{i j} = b_{j i}, b_{i i} = k_{i - 1, i} + k_{i, i + 1}$ and $b_{i j} = 0$ for $| i - j | > 1$ .

FIG. 1.

General ball and spring model for n backbone positions.

The corresponding Gaussian model with neighbor dependence is a spatial first-order autoregressive process, denoted AR(1). However, setting the spring matrix equal to an AR(1) precision matrix gives a set of equations for the spring constants $k_{i j}$ with no solution. We, therefore, instead approach the problem of incorporating dependence by starting with a general $Θ$ and $Σ$ and determining what specific forms will correspond to an AR(1) process along the backbone.

We used symbolic algebra software to assist in solving for general matrices $Θ_{p}$ and symmetric, positive definite $Σ_{p}$ such that the constraints $Λ_{τ} (i, j) = Λ_{\infty} (i, j) = 0 \;\; \forall i, j : | i - j | > 1$ are satisfied for conditional and stationary precision matrices $Λ_{τ}, Λ_{\infty}$. Solutions to low-dimensional problems allowed us to identify the general form for a single pair of suitable $Θ_{p}, Σ_{p}$. For five backbone positions this nearest-neighbor SDM takes the form: $Θ_{p} = θ \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ ρ & 1 - ρ^{2} & 0 & 0 & 0 \\ ρ^{2} & - ρ^{3} & 1 & 0 & 0 \\ ρ^{3} & - ρ^{4} & 0 & 1 & 0 \\ ρ^{4} & - ρ^{5} & 0 & 0 & 1 \end{pmatrix}, \quad Σ_{p} = σ^{2} \begin{pmatrix} 1 & a ρ & a ρ^{2} & a ρ^{3} & a ρ^{4} \\ a ρ & 1 & ρ & ρ^{2} & ρ^{3} \\ a ρ^{2} & ρ & 1 & ρ & ρ^{2} \\ a ρ^{3} & ρ^{2} & ρ & 1 & ρ \\ a ρ^{4} & ρ^{3} & ρ^{2} & ρ & 1 \end{pmatrix},$ (11)

where $a = (3 - ρ^{2}) ∕ 2$ . The conditional and stationary distributions given by Eqs. (6) and (8) have tri-diagonal precision matrices. Thus, dynamic programming is preserved, although with some modification to the standard pair HMM recursion formulas required as described in Section 3.5.
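The stationary part of this claim is easy to check numerically. The sketch below builds the five-position $Θ_{p}$ and $Σ_{p}$ of Eq. (11), solves the stationary (Lyapunov) equation $Θ_{p} S + S Θ_{p}' = Σ_{p}$, and confirms that $S$ equals $σ^{2} ∕ (2 θ)$ times an AR(1) correlation matrix, so that its inverse is tri-diagonal. The function names and the value $ρ = 0.6$ are illustrative assumptions; the conditional precision $Λ_{τ}$ is not checked here.

```python
import numpy as np

def sdm_matrices(rho, theta=1.0, sigma2=1.0):
    """Theta_p and Sigma_p of Eq. (11) for five backbone positions."""
    n = 5
    a = (3.0 - rho ** 2) / 2.0
    theta_p = np.eye(n)
    theta_p[1, 1] = 1.0 - rho ** 2
    for i in range(1, n):
        theta_p[i, 0] = rho ** i            # first column: rho, rho^2, ...
    for i in range(2, n):
        theta_p[i, 1] = -rho ** (i + 1)     # second column: -rho^3, -rho^4, ...
    idx = np.arange(n)
    sigma_p = rho ** np.abs(idx[:, None] - idx[None, :]).astype(float)
    sigma_p[0, 1:] *= a                     # first row and column carry the factor a
    sigma_p[1:, 0] *= a
    return theta * theta_p, sigma2 * sigma_p

def stationary_cov(theta_m, sigma_m):
    """Solve Theta S + S Theta' = Sigma via the Kronecker-sum linear system."""
    n = theta_m.shape[0]
    eye = np.eye(n)
    ks = np.kron(theta_m, eye) + np.kron(eye, theta_m)
    return np.linalg.solve(ks, sigma_m.reshape(-1)).reshape(n, n)

rho = 0.6
theta_p, sigma_p = sdm_matrices(rho)
cov = stationary_cov(theta_p, sigma_p)
precision = np.linalg.inv(cov)

idx = np.arange(5)
lag = np.abs(idx[:, None] - idx[None, :])
max_off_band = np.max(np.abs(precision[lag > 1]))
print(np.round(cov, 4))
print("largest |precision| entry outside the tri-diagonal band:", max_off_band)
```

With $θ = σ^{2} = 1$ the computed covariance matches $0.5 \, ρ^{| i - j |}$ to numerical precision, which is the AR(1) structure the dynamic programming relies on.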

Similar computer algebra experiments were used to demonstrate that no such solutions exist for any diffusion of the form in Eq. (5), where $Σ_{p} = I$ . With $Θ = I_{3} ⨂ Θ_{p}$ and $Σ = I_{3} ⨂ Σ_{p}$ , Eqs. (6–9) give the marginal or conditional distributions for matched positions.

3.4. Bayesian inference for the SDM

Under the new SDM specified by Eqs. (5) and (11), the joint distribution $p (X, Y | ℳ)$ of backbone coordinates for ancestor X and descendant Y given any alignment $ℳ$ can be expressed $p (X, Y | ℳ) = \prod_{m \in M} p (X_{[m]}, Y_{[m]} | m, N_{m}) \prod_{d \in D} p (X_{[d]}, Y_{[d]} | d, N_{d}) \prod_{i \in I} p (X_{[i]}, Y_{[i]} | i, N_{i}),$ (12)

where $M$, $D$, and $I$, respectively, are the sets of matched, deleted, and inserted sites in $ℳ$. $X_{[m]}$ denotes the backbone coordinates of the positions of $X$ aligned in $m \in M$, and $N_{i}$ is the set of backbone positions neighboring position $i$. In other words, $p (X, Y | ℳ)$ can be expressed in a decomposed form, each factor of which is either the joint density for a contiguous block of matches given its neighbors or the density of an insertion or deletion distribution for a particular site given its neighbors.

Bayesian inference based on this joint distribution (and that including indels) uses priors and sampling techniques detailed in Challis and Schmidler (2012) with trivial additions to accommodate priors and sampling for the model's dependence parameter $ρ$ .

3.5. Modified dynamic programming for pair HMMs with nearest-neighbor dependence

The recursive equations used for the pair HMM underlying the SIM (Durbin et al., 1998) require several modifications to be used with the SDM. These modifications are specific to the form of $Θ$ and $Σ = L L'$ chosen for the structural diffusion parameters. The primary reason for the changes is that the backbone coordinate emission probabilities in the SIM are independent of neighboring sites, whereas in the SDM the emission probabilities depend on neighboring sites.

In the SDM, the dynamic programming equations' coordinate emission probabilities for each site will now involve preceding positions' coordinates. Because these probabilities are specified by distributions conditional on an alignment, we must know the form of the joint distribution $p (X, Y | ℳ)$ (Eq. 12) for any alignment $ℳ$ .

In our model, as in Challis and Schmidler (2012), a pair HMM is used to model the distribution of pairwise alignments between two proteins. As described in Durbin et al. (1998), the use of a pair HMM allows one to calculate the probability of two protein structures marginalized over all possible alignments between the two structures. This is accomplished through dynamic programming by using the well-known forward algorithm to recursively calculate values of $f^{k} (i, j)$ (i.e., the total probability of all partial alignments through position $(i, j)$ in the ancestor ($i$) and descendant ($j$) that end in state $k \in \{\text{Match}, \text{Delete}, \text{Insert}\}$). The forward equations typically used for this purpose are presented in Durbin et al. (1998) as follows:

$f^{M} (i, j) = p_{X_{i}, Y_{j}} \cdot (a_{M M} f^{M} (i - 1, j - 1) + a_{D M} f^{D} (i - 1, j - 1) + a_{I M} f^{I} (i - 1, j - 1))$ (13)

$f^{D} (i, j) = p_{X_{i}} \cdot (a_{M D} f^{M} (i - 1, j) + a_{D D} f^{D} (i - 1, j) + a_{I D} f^{I} (i - 1, j))$ (14)

$f^{I} (i, j) = p_{Y_{j}} \cdot (a_{M I} f^{M} (i, j - 1) + a_{D I} f^{D} (i, j - 1) + a_{I I} f^{I} (i, j - 1)),$ (15)

where $p_{X_{i}, Y_{j}}, p_{X_{i}}, p_{Y_{j}}$ are the three emission probabilities for, respectively, a matched pair $X_{i}, Y_{j}$, a deletion $X_{i}$, and an insertion $Y_{j}$. Terms of the form $a_{J K}$ give the probability of transition from state $J$ to state $K$ in the pair HMM. The emission probability terms $p_{X_{i}, Y_{j}}, p_{X_{i}}$, and $p_{Y_{j}}$ involve only the sites denoted and are independent of neighboring sites. The SDM emission probabilities are not independent of other sites, so the forward equations must be modified.
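The site-independent forward recursion can be sketched in a few lines. The version below is a simplified illustration, not the authors' implementation: it assumes the common initialization $f^{M} (0, 0) = 1$, omits an explicit end state, and takes the emission terms as user-supplied callables; the name `pair_hmm_forward` and the uniform toy parameters are hypothetical.

```python
def pair_hmm_forward(n, m, trans, emit_match, emit_delete, emit_insert):
    """Forward tables f[k][i][j] for k in {"M", "D", "I"}.
    trans[(J, K)] is the transition probability from state J to state K;
    emit_match(i, j), emit_delete(i), emit_insert(j) are emission terms
    that depend only on the sites named (the site-independent case)."""
    states = ("M", "D", "I")
    f = {k: [[0.0] * (m + 1) for _ in range(n + 1)] for k in states}
    f["M"][0][0] = 1.0  # assumed start convention
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # Match consumes one site from each structure
                f["M"][i][j] = emit_match(i, j) * sum(
                    trans[(k, "M")] * f[k][i - 1][j - 1] for k in states)
            if i > 0:             # Delete consumes a site of X only
                f["D"][i][j] = emit_delete(i) * sum(
                    trans[(k, "D")] * f[k][i - 1][j] for k in states)
            if j > 0:             # Insert consumes a site of Y only
                f["I"][i][j] = emit_insert(j) * sum(
                    trans[(k, "I")] * f[k][i][j - 1] for k in states)
    return f

# Toy example: uniform transitions and unit emissions for two length-1 structures.
uniform = {(a, b): 1.0 / 3.0 for a in "MDI" for b in "MDI"}
tables = pair_hmm_forward(1, 1, uniform,
                          emit_match=lambda i, j: 1.0,
                          emit_delete=lambda i: 1.0,
                          emit_insert=lambda j: 1.0)
total = sum(tables[k][1][1] for k in "MDI")
```

In the site-dependent setting, the single emission factor per state would be replaced by emission terms indexed by the preceding state, which is the modification discussed next.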

To illustrate the set of changes needed, we focus only on the Match equation (Eq. (13)); analogous changes are required for the other two recursive equations. Equation (13) gives the total probability of all alignments up to position $(i, j)$ that end with a Match at position $(i, j)$ . The three terms on the right-hand side arise because a path through the pair HMM could arrive at a Match at $(i, j)$ from one of three previous states in the path: either a Match, Delete, or Insert at $(i - 1, j - 1)$ . The term $p_{X_{i}, Y_{j}}$ is a single factor on the right-hand side, indicating that the Match emission probability at $(i, j)$ is the same regardless of the previous state in the path. In our case, the Match emission probability at $(i, j)$ depends on the previous state in the path. Accordingly, the first step in modifying the equation for our purposes is to define unique emission probabilities that depend on the previous state in the path through the pair HMM. We write the site-dependent version of Eq. (13) as

$\begin{aligned} f^{M} (i, j) = {} & \bar{p}_{X_{i} Y_{j}}^{M} \cdot a_{M M} f^{M} (i - 1, j - 1) \\ & + \bar{p}_{X_{i} Y_{j}}^{D} \cdot a_{D M} f^{D} (i - 1, j - 1) \\ & + \bar{p}_{X_{i} Y_{j}}^{I} \cdot a_{I M} f^{I} (i - 1, j - 1), \end{aligned}$ (16)

where the superscripts on $\bar{p}$ terms indicate the previous state before the Match at $X_{i}, Y_{j}$ . The modified equations for $f^{D} (i, j)$ and $f^{I} (i, j)$ are analogous. Any of the emission distributions $\bar{p}$ can be derived by first writing down the joint distribution for the appropriate backbone positions given an alignment (see Section 3.4) and then conditioning on that multivariate normal distribution as needed. When determining the emission distributions, obvious edge cases must be dealt with.

To highlight an important point regarding Eq. (16), note that the emission distribution for a matched pair given a previous Match ($\bar{p}_{X_{i} Y_{j}}^{M}$) depends on where in the alignment the emitted matched pair occurs. In other words, calculation of $\bar{p}_{X_{i} Y_{j}}^{M}$ should take into account two possibilities: first, that the state before the previous Match was also a Match, and second, that it was an insertion or deletion. To show why this is needed, one can write the joint distribution for three consecutive matched pairs and show that the distribution of the second matched pair conditional on previous positions is different from the distribution of the third matched pair conditional on previous positions. This characteristic arises due to the specific forms chosen for the OU process parameters $Θ$ and $Σ$ in our SDM. Thus, the term $\bar{p}_{X_{i} Y_{j}}^{M}$ in Eq. (16) will itself be calculated as a sum over possible states preceding the prior state:

The presence of the recursive term ${\bar{p}}_{X_{i - 1}, Y_{j - 1}}^{M}$ in the aforementioned equation requires that an additional dynamic programming matrix be tracked. There are no other emission probabilities that depend on more than one previous hidden state of the pair HMM.

3.5.1. Derivation of emission probabilities

Suppose $ℳ_{p}$ is a known partial alignment of all matches, aligning $n$ positions $X_{i}$ through $X_{i + n - 1}$ to positions $Y_{j}$ through $Y_{j + n - 1}$ with no indels. The joint distribution of these backbone coordinates $p (X_{i, i + n - 1}, Y_{j, j + n - 1} | ℳ_{p})$ has a block covariance matrix: $p (X_{i, i + n - 1}, Y_{j, j + n - 1} | ℳ_{p}) \sim N \left(0, \begin{pmatrix} Σ_{n \times n} & R^{T} \\ R & Σ_{n \times n} \end{pmatrix}\right),$ (18)

where $Σ_{n \times n}$ is equal to the stationary OU solution obtained using Eq. (11) and $R$ is $n \times n$, equal to $R = \frac{σ^{2} e^{- θ τ}}{2 θ} \begin{pmatrix} 1 & ρ & ρ^{2} & \dots & ρ^{n - 1} \\ ρ k & 1 & ρ & \dots & ρ^{n - 2} \\ ρ^{2} k & ρ & 1 & \dots & ρ^{n - 3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ ρ^{n - 1} k & ρ^{n - 2} & ρ^{n - 3} & \dots & 1 \end{pmatrix}$

with $k = \frac{1 - (1 - ρ^{2}) e^{θ ρ^{2} τ}}{ρ^{2}}$. The emission probability for an insertion $Y_{j}$ or deletion $X_{i}$ at a particular site given its previous neighbor has an AR(1) form:

$p (X_{i} | X_{i - 1}, Y_{j}) \equiv p (X_{i} | X_{i - 1}) \sim N (ρ X_{i - 1}, σ^{2} (1 - ρ^{2}))$ (19)

$p (Y_{j} | X_{i}, Y_{j - 1}) \equiv p (Y_{j} | Y_{j - 1}) \sim N (ρ Y_{j - 1}, σ^{2} (1 - ρ^{2})) .$ (20)

The joint distribution $p (X, Y | ℳ)$ can be specified by combining these insertion and deletion distributions with the distribution for contiguous matches in Eq. (18). Then, the nine dynamic programming emission distributions can be verified using standard techniques for conditioning multivariate normal distributions.
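The "standard techniques" referred to here amount to the Schur-complement formulas for conditioning a multivariate normal. The sketch below is a generic helper (the name `condition_gaussian` is hypothetical) applied to the simplest case: a pair of neighboring coordinates with AR(1) covariance $σ^{2} ρ^{| i - j |}$, which recovers the mean $ρ X_{i - 1}$ and variance $σ^{2} (1 - ρ^{2})$ of Eq. (19). The numerical values are arbitrary.

```python
import numpy as np

def condition_gaussian(mean, cov, idx_obs, x_obs):
    """Distribution of the unobserved components of N(mean, cov)
    given that components idx_obs take the values x_obs."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    idx_obs = np.asarray(idx_obs)
    idx_rest = np.setdiff1d(np.arange(len(mean)), idx_obs)
    s_rr = cov[np.ix_(idx_rest, idx_rest)]
    s_ro = cov[np.ix_(idx_rest, idx_obs)]
    s_oo = cov[np.ix_(idx_obs, idx_obs)]
    gain = np.linalg.solve(s_oo, s_ro.T).T          # S_ro S_oo^{-1}
    cond_mean = mean[idx_rest] + gain @ (np.asarray(x_obs, dtype=float) - mean[idx_obs])
    cond_cov = s_rr - gain @ s_ro.T                 # Schur complement
    return cond_mean, cond_cov

# Two neighboring sites (X_{i-1}, X_i) with AR(1) covariance; arbitrary values.
rho, sigma2, x_prev = 0.6, 2.0, 1.5
cov_pair = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
cond_mean, cond_cov = condition_gaussian([0.0, 0.0], cov_pair, [0], [x_prev])
print(cond_mean, cond_cov)
```

The same helper could be applied to the larger block covariance of Eq. (18) to obtain Match emission distributions conditional on preceding positions, with the observed indices chosen according to the alignment path.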

4. Joint Sequence-Structure Model for Phylogenetic Inference

Phylogenetic inference involves constructing a phylogenetic tree using estimates of the evolutionary distance between proteins, or equivalently models of the time-dependent evolution. Traditionally this is done using site-independent sequence evolution models parameterized by a matrix Q of relative substitution rates, defining a likelihood over the time $τ$ over which evolution occurs. The joint sequence-structure evolution model introduced by Challis and Schmidler (2012) multiplies this likelihood by one derived similarly from the time-dependent structure diffusion process (SIM) given by Eq. (2), allowing both structural and sequence differences to inform the estimation of divergence time $τ$ .

4.1. Amino acid sequence model

The sequence portion of our joint sequence and structure model follows that used in Challis and Schmidler (2012), given in Eq. (1).

4.2. Site-dependent random effect model

In a sequence evolution model (Eq. 1), only the product $Q τ$ is identifiable: one cannot simultaneously estimate absolute rates and $τ$ itself. It is therefore standard to scale the substitution rate matrix $Q$ to a single expected substitution per unit time (Kosiol and Goldman, 2005). Under this scaling, the time $τ$ is interpreted as the expected number of substitutions per site, which can be estimated from sequences. The structural model exhibits a similar identifiability issue: in pairwise estimation with a structure-only model, with neither rate $θ$ nor time $τ$ fixed, only the structural distance $θ τ$ would be identifiable. In the Challis–Schmidler model this was not thought to be a concern, since when the joint model is used $τ$ becomes determined by the sequence information, making $θ$ identifiable as well.
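A small numerical illustration of this identifiability issue, assuming a hypothetical two-state rate matrix: multiplying $Q$ by a constant and dividing $τ$ by the same constant leaves $e^{Q τ}$ unchanged, and rescaling $Q$ so that $- \sum_{i} π_{i} Q_{i i} = 1$ fixes the time unit at one expected substitution. The matrix exponential is computed by eigendecomposition, which assumes $Q$ is diagonalizable; the function names are illustrative.

```python
import numpy as np

def transition_matrix(q, tau):
    """e^{Q tau} via eigendecomposition (assumes Q is diagonalizable)."""
    w, v = np.linalg.eig(q)
    return np.real((v * np.exp(w * tau)) @ np.linalg.inv(v))

def normalize_rate(q, pi):
    """Rescale Q to one expected substitution per unit time at equilibrium."""
    expected_rate = -np.sum(pi * np.diag(q))
    return q / expected_rate

# Hypothetical two-state rate matrix with equilibrium distribution pi.
q = np.array([[-0.3, 0.3],
              [0.1, -0.1]])
pi = np.array([0.25, 0.75])

p_a = transition_matrix(q, 2.0)
p_b = transition_matrix(3.0 * q, 2.0 / 3.0)   # same product Q * tau
print(np.allclose(p_a, p_b))

q_unit = normalize_rate(q, pi)
print(-np.sum(pi * np.diag(q_unit)))
```

Only after such a normalization does a fitted $τ$ carry the interpretation of expected substitutions per site.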

However, this means that disagreement between the structural evolution model and sequence evolution model regarding the divergence time $τ$ will be resolved by compensation in the estimate of $θ$ . Because we do not currently have a computationally tractable site-dependent sequence evolution model, we do not wish the information in the structural SDM to be overridden by the site-independent sequence model, which we know to be susceptible to underestimation. We address this by introducing a distinct sequence time $Q τ = τ_{q}$ and structural time $τ_{s}$ related by a stochastic model. This differs from the approach of Challis and Schmidler (2012) and Herman et al. (2014), which assumed a common time shared by both structural and sequence components of the likelihood.

The importance of distinguishing these two quantities is highlighted by the plot in Figure 2, where we estimated divergence time separately using the sequence-only model of Eq. (1) and the independent structure-only model (see, e.g., Challis and Schmidler, 2012) for a set of globins. There is a strong, arguably linear relationship between the structure-only evolutionary distance $θ τ$ and the sequence-only evolutionary distance $τ$, but the relationship between them is clearly noisy. Forcing the two models to share a common parameter ignores the different amounts of information and uncertainty provided about the evolutionary distance by sequence and structural data. The sequence-only and structure-only phylogenetic trees are shown as well, where we see the implications for tree topology.

FIG. 2.

Pairwise sequence-only distance ( $τ_{q}$ ) and structure-only distance ( $θ τ_{s}$ ) estimates from a set of 24 globin proteins under the SIM. The estimates are plotted against each other in panel (b) with the respective phylogenetic tree estimates (via neighbor-joining) in panels (a, c). In panel (b), we excluded pairs whose sequence distances could not be reliably estimated due to high sequence divergence.

Instead, we introduce a random effect model defining a stochastic linear relationship between sequence and structure distances:

Here $τ_{s}, τ_{q}$ are the structural and sequence divergence times, respectively. A simple linear regression gives $\hat{β} = 0.005$ and an estimate for $ω$. Under this formulation, the sequence model is now given by

$\begin{aligned} p (S^{X}, S^{Y}, ℳ \mid λ, μ, τ_{q}, Q) &= P (S^{X}, S^{Y} \mid ℳ, τ_{q}, Q) \, P (ℳ \mid λ, μ, τ_{q}) \\ &= P (S_{M}^{Y} \mid S_{M}^{X}, τ_{q}, Q) \, P (S_{\bar{M}}^{Y} \mid π) \times P (S^{X} \mid π) \, P (ℳ \mid λ, μ, τ_{q}) \end{aligned}$ (22)

and the PDE governing the structural diffusion is

To ensure the structure distance variable $τ_{s}$ is on a similar scale to $τ_{q}$, in each pairwise estimation under this model we fix $θ$ at its posterior mean under the SIM. Hereafter we refer to this joint sequence and structure model with random effect as the SDMre.

5. An SDM for Binary Sequence Evolution

As mentioned earlier, the vast majority of applied work on statistical alignment and phylogenetic inference utilizes SIMs. Here we introduce a simple SDM of sequence evolution, analogous to the SDM of structure introduced in Section 3.3, to explore the impact of the SIM assumption when the data actually exhibit site dependence. Although we restrict our model to a simple case (binary sequences), it yields insights into the effect of mis-specifying site independence that provide a cautionary note for the widespread use of these models. Extension to more realistic SDMs for sequence is of significant interest, and would allow corrections similar to those exhibited for the structural SDM in Section 6.

5.1. Model

Let $x$ or $y$ represent a length-$n$ binary sequence. The space of all $2^{n}$ such binary sequences is $Ω = \{x_{1}, x_{2}, \dots, x_{2^{n}}\}$. When necessary, sequences in $Ω$ will be subscripted as $x_{i}$ or $x_{j}$ to emphasize that they are two distinct elements of $Ω$. Each $x_{i}$ consists of $n - 1$ pairs of neighboring labels, each label taking a value in $\{-1, 1\}$. The individual label at a particular site $s$ in a sequence $x_{i}$ will be indicated by a second subscript: $x_{i s}$. We will also work with pairs of length-$n$ sequences related by evolution. These are called “configurations,” are denoted $σ = (x, y)$, and are elements of $Ω \times Ω$. To characterize elements of $Ω$, let $k_{x}$ denote the number of neighbor pairs in $x$ with identical labels, and let $c_{x}$ denote the number of neighbor pairs in $x$ with different labels. For a sequence $x$, $λ_{x} := k_{x} - c_{x}$ is a measure of neighbor dependence. For sequences with $λ_{x} > 0$, more than half the neighboring label pairs have the same label, so the labels appear nonrandomly distributed along the sequence and neighboring labels are positively correlated. If $λ_{x} < 0$, the neighboring labels are negatively correlated.
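As a small illustration of these definitions, the following Python sketch computes $k_{x}$, $c_{x}$, and $λ_{x}$ for a single sequence. The function names and the example sequence are hypothetical and chosen only for illustration. Because the labels are $\pm 1$, $λ_{x}$ equals the sum of products of adjacent labels, which gives a convenient check.

```python
def neighbor_counts(x):
    """Return (k_x, c_x): the numbers of neighbor pairs with identical / different labels."""
    if any(label not in (-1, 1) for label in x):
        raise ValueError("labels must be -1 or 1")
    k = sum(1 for left, right in zip(x, x[1:]) if left == right)
    c = (len(x) - 1) - k  # the remaining neighbor pairs have different labels
    return k, c


def neighbor_dependence(x):
    """lambda_x = k_x - c_x, as defined in the text."""
    k, c = neighbor_counts(x)
    return k - c


# Hypothetical length-8 example sequence.
x = (1, 1, 1, -1, -1, 1, -1, -1)
k_x, c_x = neighbor_counts(x)
lam_x = neighbor_dependence(x)

# Equivalent form for +/-1 labels: the sum of products of adjacent labels.
lam_check = sum(left * right for left, right in zip(x, x[1:]))
print(k_x, c_x, lam_x, lam_check)
```

A constant sequence of length $n$ attains the maximum $λ_{x} = n - 1$, and a strictly alternating sequence attains the minimum $λ_{x} = -(n - 1)$.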

To construct a simple model for site-dependent evolution of $x$ to $y$, we introduce a Markov chain on $Ω$ whose transitions are site dependent. We first specify a set of (identical) transition rates $\{a_{i}\}$ and a corresponding jump probability matrix $P$ having entries $P_{i j}$. The generator $Q$ for the corresponding Markov chain has entries $Q_{i j} = a_{i} P_{i j}$ for $i \neq j$. In defining $P$, we follow the convention that multiple substitutions cannot occur simultaneously, so that the $(i, j)$ entry of $Q$ and $P$ will be 0 if the sequences $x_{i}$ and $y_{j}$ differ at more than one position. To induce dependence into such a model, we set, for sequences $x_{i}$ and $y_{j}$ that differ at exactly one position,

$Q_{i j} = b^{λ_{y_{j}} - λ_{x_{i}}} / Z_{i}$

with $b \geq 1$ an adjustable parameter controlling the strength of neighbor dependence ($b = 1$ represents neighbor independence) and $Z_{i}$ a normalizing constant for the row such that the off-diagonal row elements sum to 1.
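The construction above can be sketched in a few lines of Python for small $n$. This is an illustrative sketch, not the code used for the article: the function names are hypothetical, the common rate is assumed to be $a_{i} = 1$, and the diagonal of $Q$ is set to $-a_{i}$ under the usual convention that rows of a generator sum to zero, which the text does not state explicitly.

```python
from itertools import product


def lam(x):
    """lambda_x = k_x - c_x; for +/-1 labels this is the sum of adjacent products."""
    return sum(left * right for left, right in zip(x, x[1:]))


def jump_matrix(n, b):
    """Row-normalized weights b**(lambda_y - lambda_x) over single-site flips."""
    states = list(product((-1, 1), repeat=n))
    index = {x: i for i, x in enumerate(states)}
    P = [[0.0] * len(states) for _ in states]
    for i, x in enumerate(states):
        for s in range(n):
            y = x[:s] + (-x[s],) + x[s + 1:]  # flip the label at site s only
            P[i][index[y]] = b ** (lam(y) - lam(x))
        z_i = sum(P[i])  # Z_i: normalizing constant for row i
        P[i] = [w / z_i for w in P[i]]
    return states, P


def generator(P, rate=1.0):
    """Q_ij = rate * P_ij off the diagonal; diagonal -rate (assumed convention)."""
    size = len(P)
    Q = [[rate * P[i][j] for j in range(size)] for i in range(size)]
    for i in range(size):
        Q[i][i] = -rate
    return Q


states, P = jump_matrix(n=3, b=2.0)
Q = generator(P)
i = states.index((1, 1, 1))
for j, y in enumerate(states):
    if P[i][j] > 0:
        print(y, round(P[i][j], 4))
```

With $b = 1$ every single-site flip receives probability $1/n$, which recovers site-independent jumps.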

This model is a reparameterization of the one-dimensional Ising model, a well-known model for ferromagnetism in statistical mechanics, where $b$ is the usual Ising dependence parameter. The stationary distribution can be written as

$π (x) = \frac{b^{λ_{x}}}{Z},$

where $Z$ is the normalizing constant $\sum_{z \in Ω} b^{λ_{z}}$.
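For small $n$, the expression $π(x) = b^{λ_{x}} / Z$ can be evaluated by brute-force enumeration of $Ω$. The sketch below does only that; it evaluates the stated formula and does not derive it from $Q$. The function names and the values $n = 4$, $b = 2$ are illustrative assumptions. For an open chain of $\pm 1$ labels, the sum has the closed form $Z = 2 (b + 1/b)^{n - 1}$, which serves as a check on the enumeration.

```python
from itertools import product


def lam(x):
    """lambda_x = k_x - c_x; for +/-1 labels this is the sum of adjacent products."""
    return sum(left * right for left, right in zip(x, x[1:]))


def ising_distribution(n, b):
    """Evaluate pi(x) = b**lambda_x / Z over all 2**n binary sequences."""
    states = list(product((-1, 1), repeat=n))
    weights = [b ** lam(x) for x in states]
    z = sum(weights)  # Z: sum of b**lambda_z over the whole space
    return {x: w / z for x, w in zip(states, weights)}, z


n, b = 4, 2.0  # illustrative values only
pi, z = ising_distribution(n, b)
z_closed_form = 2.0 * (b + 1.0 / b) ** (n - 1)

most_probable = max(pi, key=pi.get)
print(z, z_closed_form, most_probable, pi[most_probable])
```

With $b > 1$ the two constant sequences carry the largest probability under this expression, and with $b = 1$ it reduces to the uniform distribution on $Ω$.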

Suppose the Markov chain is currently in state $i$. After an exponential waiting time elapses (given by a sample from the distribution $\mathrm{Exp}(a_{i})$), the Markov chain is more likely to transition to states $j$ having larger $λ_{y_{j}} - λ_{x_{i}}$ than to states $j$ having smaller $λ_{y_{j}} - λ_{x_{i}}$. In other words, in this model with $b > 1$, a binary sequence is more likely to evolve over a given time period into a sequence whose neighboring labels are highly correlated.
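The jump dynamics just described can also be simulated directly, one event at a time, without forming any matrix. The following is a minimal sketch of such an event-by-event simulation; it is an alternative to the matrix-based scheme and is not claimed to be the procedure used for the article. The function name `simulate_path`, the common rate of 1, and the starting sequence are assumptions made for illustration.

```python
import random


def lam(x):
    """lambda_x = k_x - c_x; for +/-1 labels this is the sum of adjacent products."""
    return sum(left * right for left, right in zip(x, x[1:]))


def simulate_path(x0, b, t, rate=1.0, rng=None):
    """Simulate the chain from x0 up to time t; return a list of (time, state) pairs."""
    rng = rng or random.Random()
    x = tuple(x0)
    clock = 0.0
    path = [(0.0, x)]
    while True:
        clock += rng.expovariate(rate)  # Exp(rate) waiting time until the next jump
        if clock > t:
            return path
        # Candidate targets differ from x at exactly one site.
        candidates = [x[:s] + (-x[s],) + x[s + 1:] for s in range(len(x))]
        # Unnormalized weights; rng.choices normalizes them, which plays the role of Z_i.
        weights = [b ** (lam(y) - lam(x)) for y in candidates]
        x = rng.choices(candidates, weights=weights, k=1)[0]
        path.append((clock, x))


x0 = (1, 1, 1, 1, -1, -1, -1, -1)  # hypothetical starting sequence of length 8
path = simulate_path(x0, b=2.0, t=0.6, rng=random.Random(0))
y = path[-1][1]  # the sequence at time t
print(len(path) - 1, "jumps; final sequence:", y)
```

Because each step costs only $O(n)$ work per candidate, this approach does not require storing the full $2^{n}$-state generator, although it yields samples rather than transition probabilities.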

Given $b$, data can be simulated from this model for modest $n$ as follows. First, the transition probability matrix $P_{t} = e^{Q t}$ must be computed numerically, which requires knowing $t$ and $b$. Taking the limit $\lim_{t \to \infty} P_{t}$ (or in practice, calculating $P_{t}$ for very large $t$) gives the stationary probabilities, which can be used to draw a random site-dependent binary sequence $x_{i}$. Then, given this $x_{i}$ and an evolutionary distance $t$, $P_{t}$ can be used to simulate $y_{j}$. A closed-form expression for $P_{t}$ for general $b$ or general $n$ was not found, owing to the difficulty of calculating a general form for the matrix exponential of this $Q$.
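The steps above can be sketched in pure Python for a small state space. This is a hedged illustration rather than the article's implementation: the matrix exponential is computed here by uniformization (a Poisson-weighted sum of powers of $I + Q/u$), a standard technique substituted for whatever routine the authors used. The common rate $a_{i} = 1$, the diagonal convention $Q_{ii} = -a_{i}$, the choice $n = 3$, and the use of $t = 50$ as a "very large" time are all assumptions.

```python
import math
import random
from itertools import product


def lam(x):
    """lambda_x = k_x - c_x; for +/-1 labels this is the sum of adjacent products."""
    return sum(left * right for left, right in zip(x, x[1:]))


def build_generator(n, b, rate=1.0):
    """Generator Q on all 2**n sequences; only single-site flips are allowed."""
    states = list(product((-1, 1), repeat=n))
    index = {x: i for i, x in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]
    for i, x in enumerate(states):
        flips = [x[:s] + (-x[s],) + x[s + 1:] for s in range(n)]
        weights = [b ** (lam(y) - lam(x)) for y in flips]
        z_i = sum(weights)
        for y, w in zip(flips, weights):
            Q[i][index[y]] = rate * w / z_i
        Q[i][i] = -rate  # assumed convention: each row of Q sums to zero
    return states, Q


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in A]


def transition_matrix(Q, t, tol=1e-12, max_terms=10_000):
    """Approximate P_t = exp(Q t) by uniformization, truncated at Poisson tail mass tol."""
    size = len(Q)
    identity = [[float(i == j) for j in range(size)] for i in range(size)]
    u = max(-Q[i][i] for i in range(size))  # uniformization rate
    if t == 0 or u == 0:
        return identity
    M = [[identity[i][j] + Q[i][j] / u for j in range(size)] for i in range(size)]
    weight = math.exp(-u * t)  # Poisson(u t) probability of zero jumps
    power = identity           # holds M**k
    P = [[weight * v for v in row] for row in power]
    total, k = weight, 0
    while 1.0 - total > tol and k < max_terms:
        k += 1
        power = matmul(power, M)
        weight *= u * t / k
        total += weight
        P = [[p + weight * v for p, v in zip(prow, vrow)]
             for prow, vrow in zip(P, power)]
    return P


rng = random.Random(0)
states, Q = build_generator(n=3, b=2.0)
P_long = transition_matrix(Q, t=50.0)  # stand-in for the t -> infinity limit
P_t = transition_matrix(Q, t=0.6)

x = rng.choices(states, weights=P_long[0], k=1)[0]               # draw x_i
y = rng.choices(states, weights=P_t[states.index(x)], k=1)[0]    # draw y_j given x_i
print(x, y)
```

The same code runs for $n = 8$ (256 states), but the pure-Python matrix products become slow; a compiled linear algebra routine would be the natural replacement in practice.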

This simulation scheme is practical only for modest $n$ because of the computational difficulty of exponentiating the $2^{n}$-by-$2^{n}$ generator matrix, whose dimension grows exponentially with the sequence length. For the binary sequence simulations used in Figure 3, we simulated subsequences of length 8. For a more detailed treatment of a similar site-dependent model based on subsequences, see von Haeseler and Schöniger (1998).

FIG. 3.

(a) Posterior distribution of evolutionary distance for sequences simulated under SDM with $b = 2, t = 0.6$ (see Section 5), when inference is performed under an assumption of site independence. Significant underestimation is seen relative to truth (vertical line). (b–d) This underestimation adversely affects phylogenetic reconstruction, as seen by comparing the true (b) and estimated trees under independent- (d) and dependent-site (c) models. (e) A similar effect is seen for three-dimensional structures, with data simulated under the SDM of Section 3.3.

6. Results

All inferences were performed on the Duke Computer Cluster (DCC), a heterogeneous network of shared computing nodes; a typical node CPU is an Intel Xeon 2.6 GHz. Average runtimes for the SIM range from 20 to 60 iterations per second depending primarily on the length of the proteins, whereas SDM computations are roughly an order of magnitude slower than the SIM. All model parameters were sampled via random walk Metropolis-Hastings, augmented with a library sampling step for rotation parameter R as described in Challis and Schmidler (2012).

6.1. Improved estimation of evolutionary distances

The posterior distribution shown in Figure 3a demonstrates the tendency of site-independent models to underestimate the true evolutionary distance between two sequences when the sequences arise from a process with site dependence (in this case, the correlated-neighbor model described in Section 5). The SIM significantly underestimates the true value, reflecting the significant bias in the SIM's MLE. This is likely a conservative estimate of the bias compared with that for real sequences, which typically exhibit some long-range dependence not present in our dependent binary sequence model.

Figure 4 similarly shows posterior distributions obtained under the (structural) SIM when applied to structures exhibiting neighbor-dependent evolution. The left panel of Figure 4 shows the posteriors from both the SIM and SDM. We see again that the SIM underestimates the true evolutionary distance, whereas the SDM corrects for this.

FIG. 4.

Estimation of evolutionary distance using SIM (light) and SDM (dark), for (a) simulated data with known true distance, and (b) real data from two cysteine proteinase pairs (b, top row) and two globins (b, bottom row). In all cases the SIM estimate is significantly lower than the SDM estimate, strongly suggesting systematic underestimation under the SIM assumption. Simulation parameters: $σ^{2} = 1, θ = 0.002, t = 0.1, ρ = 0.95$.

Although this is not surprising on data simulated from the SDM, similar results are observed on real data for which the “true” distance is unknown. The four plots at right in Figure 4 compare the SIM and SDM posterior distributions for structural distance $θ τ$ between two pairs of cysteine proteinases from Herman et al. (2014) (top row) and two pairs of globins (human-turtle and human-lamprey, bottom row). In each pairwise estimation, the SIM yields a significantly lower estimate of structural distance than the SDM. This result is consistently observed across the other pairs of globins and cysteine proteinase pairs from Challis and Schmidler (2012) and Herman et al. (2014) (results omitted for brevity). In each case the SDM posterior is somewhat more diffuse, presumably due to the lower effective sample size in the structural information induced by dependence in the structural model. Although the “true” distances for these pairs cannot be known, these results strongly suggest that including site dependence in the structural model can significantly reduce systematic bias in the estimated evolutionary distances.

6.1.1. Non-neighbor dependence

Proteins exhibit significant non-neighbor dependencies due to shared environments and physicochemical interactions between amino acids that are distant in sequence but proximal in space. Simulations were run using general (nonbanded) covariance matrices to simulate structural evolution with long-range correlations, with the SDM then used to estimate evolutionary distance. The results (omitted for brevity) are very similar to the left panel of Figure 4: the SIM noticeably underestimates the true structural distance, whereas the SDM accurately estimates it. This indicates the robustness of the nearest-neighbor approximation, required for efficient computation, to more general dependency patterns.

6.2. Effect on phylogeny of ignoring structural dependence in globin structures

Errors in estimation of pairwise evolutionary distances have the potential to undermine phylogenetic inference as well. To explore this, we compare phylogenetic trees reconstructed via neighbor-joining for a group of 16 globins using the SIM versus that obtained under the SDMre of Section 4. In each case, the respective model was used to estimate the pairwise distances for all pairs of proteins, and the resulting pairwise distance matrix was used to produce a neighbor-joining tree with the PHYLIP and Drawtree software (Felsenstein, 1989). Differences observed in these trees can be expected to also appear in trees if the SDM were used to replace the SIM component of the fully Bayesian joint sequence-structure tree estimation (Herman et al., 2014).

The phylogenetic trees estimated using posterior mean evolutionary distances are shown in Figure 5. The SIM and SDMre trees are very similar, and neither matches the accepted NCBI taxonomy exactly. However, the SDMre tree improves upon the SIM tree in that botfly and fruit fly are now placed together in a single clade with no other species, as in the NCBI taxonomy. This example demonstrates that phylogeny estimation can be adversely affected by ignoring structural dependence, even for proteins with high structure similarity such as these globins.

FIG. 5.

The SDMre tree (left) improves upon the SIM tree (right) by grouping the botfly and fruit fly in their own clade, matching the accepted NCBI taxonomy.

The SIM and SDMre models leading to the trees in Figure 5 differ in two ways: incorporation of dependence in the diffusion, and incorporation of the random effect relation between the sequence and structure time parameters. To separate these two effects, we also ran the SIM with the random effect incorporated, but without dependence in the diffusion model. This SIMre does not correctly group botfly and fruit fly, indicating that it is the site dependence that leads to the improved tree topology. For reference, the sequence-only tree is also shown (for a superset of globins) in panel (a) of Figure 2; it is highly inaccurate due to many pairs with highly divergent sequences. Without the structural component of the model included, these divergent sequences yield highly uncertain distance estimates that significantly destabilize the tree.

7. Discussion

The site-dependent structural evolution model described here allows a significant improvement in model realism while retaining the computational tractability necessary for use in phylogenetic inference. As shown, the incorporation of dependence into the model significantly reduces bias in the estimates of evolutionary distance, and can have a resulting stabilizing effect on phylogenetic tree reconstruction. These results suggest a need for continued research on computationally efficient site-dependent sequence evolution models, which can be expected to further improve inference in these problems. This is because our current combined sequence-structure model pairs the site-dependent structural model with a site-independent sequence model, which likely still retains some downward bias on the estimated evolutionary distance due to the independence assumption in the sequence side of the model.

A natural next step will be to incorporate the site-dependent structural model presented here into the fully Bayesian simultaneous alignment and phylogeny reconstruction model of Herman et al. (2014), which currently uses the site-independent structural model. This extension would be straightforward and may improve inference of multiple sequence alignments in addition to improving inference of phylogenetic trees.

8. Software

The code base for this article is written in R and C++. The full code, including instructions and sample scripts, can be downloaded at https://github.com/garylarson or obtained by contacting the author.

Footnotes

Author Disclosure Statement

The authors declare they have no conflicting financial interests.

Funding Information

This study was partially supported by NSF grant DMS-1407622 and NIH grant R01-GM090201 (S.C.S.). J.L.T. was supported by NIH grant GM118508. G.L. was partially supported by NSF training grant DMS-1045153 (S.C.S.).

4. For a detailed explanation of the standard forward equation terms, we refer the reader to the pair HMM material in Durbin et al. (1998).

References

Arenas. 2015. Trends in substitution models of molecular evolution. Front. Genet. 6, 319.

Challis, C.J., and Schmidler, S.C. 2012. A stochastic evolutionary model for protein structure alignment and phylogeny. Mol. Biol. Evol. 29(11), 3575–3587.

Cheng, Kim, B.H., and Grishin, N.V. 2008. MALIDUP: A database of manually constructed structure alignments for duplicated domain pairs. Proteins 70(4), 1162–1166.

Durbin, Eddy, Krogh, et al. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

Felsenstein. 1989. Phylip - Phylogeny inference package (version 3.2). Cladistics 5, 164–166.

Herman, J.L., Challis, C.J., Novák, et al. 2014. Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol. Biol. Evol. 31(9), 2251–2266.

Kosiol, and Goldman. 2005. Different versions of the Dayhoff rate matrix. Mol. Biol. Evol. 22(2), 193–199.

Schmidler, S.C. 2006. Bayesian Statistics, Vol. 8. Oxford University Press, Oxford, pp. 471–490.

Thorne, J.L., Kishino, and Felsenstein. 1991. An evolutionary model for maximum likelihood alignment of DNA sequences. J. Mol. Evol. 33(2), 114–124.

von Haeseler, and Schöniger. 1998. Evolution of DNA or amino acid sequences with dependent sites. J. Comput. Biol. 5(1), 149–163.

Wang, and Schmidler, S.C. 2014. Bayesian multiple protein structure alignment, 326–339. In Sharan, R., ed. Research in Computational Molecular Biology, Lecture Notes in Computer Science, vol. 8394. Springer International Publishing, Cham, Switzerland.

Wang, Ma, Peng, et al. 2013. Protein structure alignment beyond spatial proximity. Sci. Rep. 3, 1448.