Abstract
We extend the popular Jukes–Cantor evolution model and calculate the probability of an orthologous nucleotide sequence set [a reference sequence (B1) stays with the other sequences (B−1)], where the sequence evolution [from a last common ancestral sequence (ɑ)] follows the (prospective) Poisson process with the overall event rate λ prorated among mutation types (nucleotide/codon substitution, insertion, and deletion) and sites along each sequence. The corresponding retrospective process (reversing the prospective process) facilitates developing algorithms to calculate the marginal probability [Pr(B1)] (Monte Carlo integration) and sample ɑ (given B1). We calculate probability Pr(B−1|ɑ) based on the identified events (during “ɑ→B−1”) from pairwise sequence alignment to implement Pr(B−1|B1) calculation (Monte Carlo integration). Event queue sampling and probability magnifiers are used to improve the computational efficiency when the number of events is large. We finally test our procedure on both simulated and recently studied hexapod transcriptome data (Brandt et al.), where each asexual lineage pairs with its closest related sexual lineage. Rate estimates (for Phasmatodea and Zygentoma) and model comparison indicate that the asexual lineages likely mutate several times faster than their sexual relatives.
1. Introduction
As interdisciplinary research in molecular evolution continuously grows, many statistical models and computation procedures have been developed for making rigorous inference out of the phylogeny analysis. Among them, robust methods (represented by maximum likelihood estimation [MLE]) against model violations are particularly popular with tree likelihood (probability) as an optimization criterion. Thorne et al. (1991) presented a remarkable MLE framework (TKF91 model) for evolutionary models involving single-base substitution and indel (insertion, deletion) between two DNA sequences, where the likelihood involves summing over all possible alignments, the equilibrium ancestral sequence probability (to ensure the reversibility of the evolution process) and separate processes (for substitution and indel). MOLPHY (molecular phylogenetics based on maximum likelihood) (Adachi and Hasegawa, 1992) offers a number of specific models with algorithms to construct the evolutionary tree from orthologous mtDNA sequences with nucleotide and/or codon substitution. The considered components include the distance matrix, transition probability matrix, relative substitution rate, biased nucleotide composition, distinct transition and transversion rates, nucleotide dependence in a codon, fourfold degenerate sites of mitochondria, and others. Given tree topology, the relative substitution rates and branch lengths (numbers of nucleotide substitutions) are estimated possibly with the help of selected outgroup references.
Among many recent algorithms, BEAST (Suchard et al., 2018) performs Bayesian phylogeny tree sampling to do continuous phylogeographic analyses for fast-evolving avian influenza viruses, PhyML (Oliva et al., 2019) reconstructs the last common ancestor (LCA) sequence for the substitution model with multiple alignment involved, and MELTOS (Camir et al., 2020) uses somatic single nucleotide variants to construct the tumor phylogeny tree. Aside from these advancements, reformed MLE frameworks taking into account other realistic scenarios (e.g., large indels, multiple unaligned sequences) are still anticipated (Thorne et al., 1991).
In the sequel, notations “a” and “B” represent the unobserved LCA and the observed species-specific nucleotide sequence, respectively. The word “letter” represents a distinct nucleotide label (A, T, C, or G) per relevant literature. We use the star (parallel) evolution case (the left panel, Fig. 1) as an illustrative example where the joint probability of B1:3 is of our interest. A naive way might be reporting it as the proportion of observing B1:3 out of a number of simulated B1:3 sets, where each set undergoes mutation from certain a drawn from certain population. However, the chance of observing a specific B1:3 becomes negligible when ɑ follows a noninformative prior distribution (π(ɑ)) among the population. Liu et al. (2009) assumed that each nucleotide site separately follows an evolution model and proposed sampling ɑ (given a reference sequence B1) and Monte Carlo integration of transition (ɑ→B2:3) probabilities to compute Pr(B1:3) as Pr(B1) × Pr(B2:3|B1). Since each protein-coding locus (sequence) is likely a functional unit in a biological pathway, we presently take the whole sequence as the object and calculate Pr(B1:3) within the framework of a self-contained continuous-time Poisson process, where a prospective process (M) is reverted by its retrospective process (M*).

Evolution scenarios (star and phylogeny tree). r, a, and B1:3 represent the root, the LCA and the observed species sequences, respectively. The divergence time (T) and the generic event rate (λ) from the preceding node to the current node are in the brackets. LCA, last common ancestor.
Compared with other approaches, our contributions include the following advantages. We involve less model parameters by appropriately prorating the overall generic event rate (λ) among detailed events (e.g., site-and-letter-specific substitutions and indels of nucleotide and codon) based on specified prorate weights. The resultant constant detailed-event ratio (between M and M*) leads to efficient sampling (for a) and sequence set probability calculation. We do not need multiple alignment, an outgroup reference sequence, the reversibility condition, or an informative prior distribution (for a), among others. The working sequence lengths are realistic (up to 103). We test our algorithm on both simulated and real-world protein-coding loci data. Our numerical results show that maximizing the probability of the observed species sequence set enables quantifying distinct evolution rates along different lineages.
This article is organized as follows. Section 2 illustrates the Poisson process where events during the prospective process match the retrospective events (with a constant rate ratio); the probability of sequence mutation (from ɑ to any observed sequence B) is derived under realistic scenarios (e.g., involving nucleotide and/or codon indels); and algorithms are introduced for calculating Pr(B1:3) through computing Pr(B1) and sampling ɑ (given B1). Section 3 tests our method on both artificial data and a family of real orthologous protein-coding sequences from a recent study. Section 4 concludes with discussion and some future directions.
2. Methods
2.1. Prospective and retrospective sequence mutation processes
The popular Jukes–Cantor (JC) model (Jukes and Cantor, 1969) assumes a constant overall mutation rate (λ) and equal weights among four letters at nucleotide substitution. The instants of our generic events [substitution (S), insertion (I), and deletion (D)] (during mutation ɑ→B) follow a continuous-time Poisson process with up to one event occurrence at any instant. Such a prospective process (denoted by M) is described as follows (the upper panel, Table 1). We introduce Nn: = {1,…,n} and Zn: = {0,…,n − 1}. Let tB be the terminal instant and n be the number of events during
The Illustration of the Poisson Process
The temporal evolution events, sequence states, and lengths on the retrospective process (M*) reverse those corresponding events, states, and lengths on the prospective process (M) at matched instants. a and B represent the ancestor and the evolved observed sequence, respectively.
The complete information includes the transient sequence states
Here, the prorate weights satisfy
Here, subscript “1” represents the next level of hierarchical prorating [involving four letters (A,T,C,G) and those sites along the nucleotide sequence]. This definition is sequence-length (l)-specific. Here, pS/(3l) represents the prorate among three letters (possible states mutated from the original letter) and l nucleotides (for a substitution), pI/4(l + 1) represents the prorate among four letters and l + 1 links (for an insertion), and pD/l represents the prorate among l nucleotides (for a deletion). When LCA is long enough (∼103) and the evolution involves a small number of indels (relative to current sequence length), the denominator is non-zero and the definition is valid. Similarly, we denote S*, D*, and I* as the events (on M*) reverting to S, I, and D (on M). We let M* also have a generic event rate
The prorated site-letter-specific event rates are
Matching M with M* (with a constant site-letter-specific rate ratio) requires
with normalized prorated rates among {S*,I*,D*} to be
, where
. The resultant site-letter-specific rate ratio (M:M*) is ω.
2.2. Sequence transition probability
2.2.1. Nucleotide substitution–insertion–deletion model
For calculating the mutation probability (PM(ɑ→B)) (M represents the prospective process), we let t1 <…<tn be the n random event instants and tB be a right-censoring time (tn < tB < tn + 1, tn + 1 is the anticipated n + 1th event instant). The settings in Section 2.1 and relevant theories on the Poisson process (e.g., Rigdon and Basu, 2000, p54) lead to the following joint probability density function for a queue of n event instants on (0, tB]
Here, 1{S}i,1{I}i and 1{D}i are indicators for S, I, and D, respectively. Since the generic event rate λ(t) is a constant, the probability density [Eq. (2)] does not explicitly involve the event instants. In the sequel, faB(t1:n,B) stands for the left-hand side of Eq. (2). Figure 2 shows an example with n = 6 (two substitutions, two insertions, and two deletions).

Six mutation events jointly cause “a→B.”
We let
is a queue of n events such that ɑ evolves into B}. The mutation (ɑ→B) probability [conditional on n(ɑ,B)] is
Here, we integrate faB(t1:n,B) over the n-dimensional simplex {0 ≤ t1 <
Similarly, we let
is a queue of n events such that B reverts to ɑ}. Given the number of events (n(B,ɑ) = n(ɑ,B)), the retrospective mutation probability [refer to Eqs. (2)–(4)] is
Here,
2.2.2. Codon insertion and deletion
Protein-coding sequence may experience codon (a nucleotide triplet) insertion or deletion (denoted as Ic or Dc). A constant event-specific rate ratio (M:M*) requires
that is, , where
and 64 is the number of distinct nucleotide triplets. The enhanced version of Eq. (2) is
Here, 1{S}i, 1{I}i, 1{D}i, 1{Ic}i, and 1{Dc}i are indicators for nucleotide S-I-D and codon I-D, respectively. Equations (2)–(6) undergo corresponding adjustments. When n is large, exhaustive enumeration for En [in Eq. (4)] is replaced by a sampled subset [with m (<n!) queues] subject to adjustment (probability multiplied by n!/m).
2.3. Calculate sequence set probability
We specify an informative [e.g., Eq. (8)] or noninformative [a special case of Eq. (8), Sections 3.1 and 3.2] prior distribution (π(a)) when the compositional information on a is available or not. The set probability in the star scenario is presented as
Here, “A” (in the subscript of Σ) represents the ancestor sequence (a) population (not a letter for nucleotide “A”) and B1 is the reference sequence used for sampling ɑ based on the posterior distribution
Equation (7) implies the following algorithm.
[i]. Calculate the reference sequence probability: Pr(B1) = ∑a∈A π(a)Pr(B1|a).
[ii]. Sample a (given B1) from Pr(a|B1).
[iii]. Calculate mutation (a→{B2:3}) probability.
Here, [iii] is from Sections 2.2.1 and 2.2.2. We now implement [i]–[iii].
2.3.1. Calculate the marginal probability of an observed sequence
We let lɑ denotes the possible length of a, we assume the prior distribution is
Here, π(i,k) is the probability that the ith nucleotide (of a) being letter k∈{A,T,C,G}, the boundary (lmin, lmax) is the range of the length under consideration, and Pr(lɑ) is the length distribution. For example, when the observed sequences (Bis) are all around 20-nucleotide long, we may only consider a range of 10–30 for lɑ (Section 3.1). The probability of a→B (on M) is
Here,
We use Eqs. (9)–(10) to obtain Pr(B) by Monte Carlo integration
In Eq. (11),
is the integrand. The algorithm for calculating Pr(B) follows from Eq. (11).
[iv]. Set counter: k = 0.
[v]. Given B, draw the kth ancestor sample (ak) on the M* process. This immediately brings about the number (nk) of events and event information (ak→B).
[vi]. Retain a = ak and calculate the integrand
[vii]. Update counter: k = k + 1.
[viii]. Repeat [v]–[vii] until we harvest N samples (of a): {a1, …, aN}.
[ix]. Finish integration (1/N) ×
When the probabilities are tiny (causing underflow), we apply a magnifier (with probability multiplied) to make them visible.
2.3.2. Sample the LCA
The posterior distribution for a (given a reference sequence B) is
The following algorithm draws a from Eq. (13).
[x]. Choose a value for the normalizing constant (denoted as H), which will be used for accepting/rejecting the candidate ancestor sample (see [xi]–[xiii]). For example,
where φ is a tuning parameter that ensures an applicable acceptance rate.
[xi]. Given B, draw a (a candidate ancestor) on M*. This drawing immediately brings about the number of events and event information (B → a).
[xii]. Draw a random variable (U) from the uniform distribution over [0,1].
[xiii]. If U ≤ I(n,a)/H [Eq. (12)], we accept ɑ. Otherwise, we discard it and repeat [xi]–[xiii].
[xiv]. Continues till we harvest N samples of a.
φ is chosen for an applicable range of acceptance rates (in [xiii]). This works well for small n and/or ω = 1.
2.3.3. Identify mutation events (during a→B) using pairwise alignment
Event identification algorithm is crucial for implementing Eq. (7) where the other probability terms (Pr(B2|a) and Pr(B3|a)) are to be calculated (Sections 2.2.1 and 2.2.2). For the example of calculating Eq. (4), once Pr(B1) is obtained (Section 2.3.1) and LCA sampling is completed (Section 2.3.2), the global pairwise alignment algorithm (Needleman and Wunsch, 1970) is applied to {ɑ,B2} and {ɑ,B3} (match score = 1, mismatch penalty = −1, gap penalty = −2) to identify the events. Each codon insertion (deletion) is identified as a fused nucleotide insertion (deletion) triplet from the alignment. The involved algorithms are organized in Figure 3.

Computational procedure flowchart (the solid arrow represents an option).
2.4. Tree phylogeny
The preceding algorithms also apply to the tree phylogeny scenario (the right panel, Fig. 1; subject to minor adjustment). Specifically, the set probability is presented [similar to Eq. (7)] as
Here, two latent sequences [the root (r) and the ancestor (a)] are to be sampled (given B3). Once the prior distribution for the root (π(r)) is specified [similar to Eq. (8)], Pr(B3) is obtained per Section 2.3.1 (with a replaced by r). The joint prior distribution (for the root and the ancestor) is presented as π(r,a) = π(r)Pr(a|r), where Pr(a|r) is dictated by the evolution models on the M process (Sections 2.1, 2.2.1, and 2.2.2). Given B3, sampling a amounts to sampling r (given B3, on the M* process, Section 2.3.2) followed by sampling a (given the sampled r, on the M process). The overall computational procedure is similar to the star evolution case.
2.5. Challenges from very short sequences
Compared with TKF91 model, our assumption and specification (Section 2.1) appear simpler. However, the price paid for not employing an equilibrium distribution for LCA (for attaining a reversible Markov process) may be suffering computational failure from handling very short sequences (e.g., <10 nucleotides) under certain configurations. For example, pS = pD = 0 amounts to site-letter-specific event rates of (0, pI/4(l + 1),0)λ, which dictate that only nucleotide insertions occur on the M process (a→B) and transient null sequences (a) may subsequently occur on the M* process to prohibit exhibiting further deletions before reaching the terminal instant (tB*). Moreover, pairwise alignment (Section 2.3.3) does not work on a null LCA sequence and an observed sequence (B2 or B3). Such a scenario challenges or may disable our proposed algorithms (Sections 2.2 and 2.3).
3. Results
3.1. A preliminary test on a set of short artificial sequences
The following three sequences {B1:3} are artificially produced (assumed to evolve) from a specific ɑ = ATCGATCGATCGATCGATCG such that they have one, two, and three representative events, respectively.
B1 = ATCGAT-GATCGATCGATCG (1 deletion),
B2 = ATCGAT-GATTGATCGATCG (1 deletion, 1 substitution), and
B3 = ATCGAT-GATTGATCTGATCG (1 deletion, 1 substitution, 1 insertion).
We let B1 take the role of the reference sequence (B) [Eq. (13)]. In Eq. (8), we specify that Pr(la) = 1/21(10 ≤ la≤30) and π(i,k) = 1/4 for any k (1 ≤ i≤la) along with the Poisson process specification: λ = 0.1, tB = 20, and (pS,pI,pD) = (1,1,1)/3. We set H = 10–12 and harvest N = 103 ancestor samples (Section 2.3.2) with calculated probabilities presented in Table 2. We further check the probability profile (as λ varies) sensitivity to prorate weights by using another prorate weight vector (pS,pI,pD) = (4,1,1)/6. The two correspondent sets of probabilities are compared in Figure 4, where the probability profile depends on the working prorate weights and the solid symbol along each profile represents the MLE for λ. The left panel has more outstanding (larger) probabilities (at MLE: λ = 0.07) along the profiles compared with the right panel. For one set of sequences, this small-scale study implies that a more realistic prorate (e.g., the left panel) likely leads to easier MLE for the rate (λ). We shall do large-scale (long sequences) studies in the following sections.

The probability profiles versus the generic event rate (λ). Obtained from short artificial sequences. The divergence time (tB) = 2.0. The event rate prorates p = (1,1,1,0,0)/3 and p = (4,1,1,0,0)/6. The solid symbol along each profile represents the corresponding MLE for λ. MLE, maximum likelihood estimation.
Event Probabilities
The generic event rate (λ) = 0.1, the divergence time (tB) = 20, and the prorate (pS, pI, pD) = (1,1,1)/3. The ancestor length follows a uniform distribution in {10:30} where each nucleotide follows a tetranomial distribution across four letters {A,T,C,G} with equal chance. Part 1 shows the mutation (ancestor→observed sequence) probabilities, where the ancestor (a) = ATCGATCGATCGATCGATCG, B0 = ATCG ATCGATCGATCGATCG (0 event), B1 = ATCGATGATCGATCGATCG (1 event), B2 = ATCGATGATTGATCG ATCG (2 events), and B3 = ATCGATGATTGATCTGATCG (3 events). Part 2 shows the marginal probabilities of the observed sequences. Part 3 shows the conditional probabilities of the observed sequences given the reference sequence (B1). Part 4 shows the joint probabilities of the sequence sets.
3.2. Real data: nuclear protein-coding sequences
3.2.1. Data description
Figure 5 lists eight hexapod species groups studied by Brandt et al. (2019), where each asexual lineage (without asterisk) stays with its closest sexual relative (with asterisk) in the unrooted cladogram along with sequence lengths and event counts for the orthologous nuclear protein-coding loci (GenBank number: MH602874-89, positive selection excluded). Neither mutational meltdown exists in the asexual lineages nor different purifying selections exist between two reproductive modes. BLAST multiple alignment takes MH602874 as the reference, and the most common mutation type is nucleotide substitution along with rare and randomly located nucleotide and/or codon insertion–deletions. For example, pair {1,1*} differs from 2* by at least one codon insertion (GAA) in the following portion extracted from multiple alignment of 16 orthologous sequences.

The unrooted cladogram with within-group event counts and rate (λ) estimates. The event counts are numbers of “nucleotide S-I-D, codon I-D, nucleotide identity” from aligning each pair (Bi, Bi*). The rate estimates are obtained by following Section 3.2.3.2.
(2*) MH602874 121 ATGCTTGGAAATGGGCGGCTTGAAGCGATGTGCTTTGATGG-AGTG——-AAACGGCTT 174
(1) MH602886 121 .….……C.….TT.G..G..C.….……C..-GACC——-…A.…A 174
(1*) MH602878 121 .….A.….…T..TT.A.….A.….T..C..C..-.…——-..GA..T.G 174
(Note: other species sequences or nucleotide sites are omitted)
(2*) MH602874 409—-AGTGATGA—-TGATGA—-A—-GATGA—-T—-GTTGATAATATCTGA 444
(1) MH602886 409—-.….…—-G.G…—-.—-.….—-.
(1*) MH602878 409—-..C..C..—-A.GC..—-.—-.….—-.
The nuclear coding sequences have a simpler and more stochastic mutation mechanism than mtDNA (Adachi and Hasegawa,1992). The evolution mechanism underlying this data set roughly fits our model assumption and relevant probability calculation appears applicable.
3.2.2. Sequence (set) probabilities (marginal, conditional, joint)
Each group fits our model by sharing an LCA and tends to have a larger probability than a cross-group pair. Without the need of outgroup references, the noninformative π(a) induces the averaged probability that the pair would have diverged from certain LCA. We set: λ = 0.20, T = 100 (for each group), (lmin,lmax) = (440,450), m = 103 (Section 2.2.2), the magnifier = 10300 (Section 2.3.1), the prorate (pS,pI,pD,pIc,pDc) = (99,2/5,2/5,1/10,1/10)% (codon I-D is rarer than nucleotide I-D), the tuning parameter φ = 10−5 (Section 2.3.2) and harvest N = 103 ancestor samples for Monte Carlo integration. In Table 3, the marginal sequence probability has a range of 10−272–10−269 without much difference across groups (Part 1). Conditional on Bi, P(Bi*|Bi) varies substantially across groups (10−186–10−79) and is 20–70 orders of magnitude larger than P(Bj|Bi) (cross-group pair {i, j}) (Part 2). “Negligible” probabilities arise from substantial codon insertions and/or deletions, for example, sequence 6(7) differs from 5(6) by three (four) codon insertion–deletions. The probabilities are compared (λ = 0.2 vs. 0.4) with larger ones (w/
The Probabilities for a Real Data Set
The ancestor sequence length has a range of (lmin,lmax) = (440,450), the probability magnifier = 10300, and the number of ancestor samples (N) = 103. The divergence time and generic event rate prorate are simply assumed to be T = 100 and p = (99,2/5,2/5,1/10,1/10)%. Sequence labels ({1:8},{1*:8*}) matches those in Figure 5. Part 1 shows marginal probabilities, Part 2 shows conditional probabilities, and Part 3 shows joint probabilities. Each event has two probabilities (the generic event rate λ = 0.2 or 0.4).
3.2.3. Estimate the generic event rates
3.2.3.1. Identical within-group rates
MLE estimate from a set of Poisson process instants often suffices for making statistical inference. The divergence time (T) is ∼40myo for Phasmatodea (pair{6,6*}) and ∼160 myo for Zygentoma (pair{7,7*}) (Brandt et al., 2019). For T = 40 and 160, we estimate an identical λ (for two sequences in each pair) by maximizing the joint probability under two prorates ((99,2/5,2/5,1/10,1/10) and (96,2,2,0,0)%) in Figure 6. Phasmatodea possesses arching profiles (as λ varies) and slightly favors the former prorate (with a higher maximal likelihood at λ = 0.43) over the latter one (with a slightly lower maximal likelihood at λ = 0.52). An opposite result appears for Zygentoma, which possesses unstable calculations at large λ (>0.4) (with the maximal likelihood at λ = 0.32). This comparison enhances the small-scale study (Section 3.1) and suggests to explore the joint space [the event rate (λ), the prorate (p)] for achieving the ultimate maximal likelihood.

The “log-probability vs. event rate (λ)” profile. The divergence times (T) for Phasmatodea and Zygentoma are 40 and 160, respectively. The ancestor sequence length range (lmin, lmax) = (440,450) and the number of ancestor samples (N) = 1000.
3.2.3.2. Distinct within-group rates
We now investigate the feasibility of estimating distinct rates for different species-specific sequences. We first use an artificial ancestor to simulate (T = 160) two sequences (B1 (rate = 0.10), B2 (rate = 0.50)). For each experimental rate λ1, we use B1 to sample LCA and find a value of λ2, which maximizes Pr{B1:2} (the left panel, Fig. 7). In the right panel, λ1 is accurately estimated with λ2 slightly underestimated, where a burn-in period (decreasing trend) may appear [taking B1 = LCA (at λ1 = 0) locates a λ2 such that Tλ2 ≈ the number of events between B1 and B2] with the real estimate appearing later on. We assign λ1 and λ2 (λ1<λ2) to pair{6*,6} (Phasmatodea, T = 40) and the maximal probability at (λ1, λ2) = (0.12, 0.78) (Fig. 8) substantially exceeds the identical-rate case (Fig. 6). Similar results apply to Zygentoma (T = 160) with maximal probability at (λ1, λ2) = (0.06, 0.55) (Fig. 9). However, the average of distinct rates ((λ1 + λ2)/2 = 0.45 or 0.31) is close to the identical rate (λ = 0.43 or 0.32). When we simply assume an identical λ ( = 0.20) for eight groups, the divergence times are estimated as T1:8 = (19,14,16,14,10,10,20,18) × 10 (resolution = 10).

“Log-probability vs. event rate (λ2)” profiles (given event rate λ1) and “maximal log-probability (by λ2) vs. λ1” profiles. Obtained from simulated sequences. The divergence time (T) = 160, the ancestor sequence length range (lmin,lmax) = (440,450), and the number of ancestor samples (N) = 100.

“Log-probability vs. event rate (λ2)” profiles (given event rate λ1) and “maximal log-probability (by λ2) vs. λ1” profiles. Obtained from Phasmatodea sequences. The divergence time (T) = 40, the ancestor sequence length range (lmin,lmax) = (440,450), and the number of ancestor samples (N) = 100.

“Log-probability vs. event rate (λ2)” profiles (given event rate λ1) and “maximal log-probability (by λ2) vs. λ1” profiles. Obtained from Zygentoma sequences. The divergence time (T) = 160, the ancestor sequence length range (lmin,lmax) = (440,450), and the number of ancestor samples (N) = 100.
3.2.4. Biological insights
Nucleotide sequence mutation has complex biological mechanisms, which challenge statistical modeling and computation. Different species-specific nucleotide frequencies in the third codon position during amino acid substitution make the general reversible Markov model (with equilibrium assumption) inapplicable. The divergence at nonsynonymous sites (normalized for background substitution rates) (dN/dS) (Nei and Gojobori, 1986) is not robust (relying on likelihood and branch length estimates) and testing purifying selection effectiveness difference between reproductive modes of Zygentoma may be controversial (using dN/dS and CDC [codon deviation coefficient]) (Brandt et al., 2019). Our model omits certain complicated assumptions and/or components (e.g., letter-specific mutation probability, hydrophobicity fitness, time-varying rates). Moreover, since the match score ranking from BLAST multiple alignment may not completely agree with the cladogram, we simply use noninformative LCA priors rather than using an outgroup or internal reference. BLAST calculates statistical significance of alignment matches and we directly provide the joint probability P(B1:2) to quantify between-species distance (e.g., two dissimilar sequences with negligible joint probability may not share an LCA). Once tree topology and divergence time (Ti)s are determined, any subtree with an overly small probability likely suffers from poor model fitting. Our results indicate that an identical within-group rate may not be realistic and distinct rates could be reasonably estimated using a small number of orthologous sequences.
4. Discussion
For long sequences (∼103 nucleotides), we have considered a computational framework (e.g., LCA sampling, Monte Carlo integration, MLE of distinct rates among sequences) to improve evolution-based sequence analysis. Even if we do not require an outgroup reference to reconstruct the ancestor and/or gene gain–loss along sequences, a noninformative LCA prior still enables objective probability comparison and can potentially help to improve evolutionary tree construction. For very long sequences, segmentation may be an option for improving the computational efficiency. The sequence alignment algorithms (e.g., Needleman and Wunsch, 1970) on which our modeling and computational procedure depends mostly work for identifying substitutions and indels only. Of the alignment output, triple neighboring nucleotide insertions (deletions) are taken as one codon insertion (deletion) and a codon substitution arises from 1 to 3 independent nucleotide substitutions within the codon. Given these facts, the customized model (Section 2.2.2) involves codon I-D (on top of nucleotide S-I-D) and appears to work well for the nuclear protein-coding sequences (Section 3.2.1) in view of the event types identified from the multiple alignment therein. More complicated mechanisms (e.g., inversion, transposition) substantially challenge algorithm adjustment within our framework with causes including but not limited to the following: the involved segment length (along the whole sequence) per event varies and pairwise-alignment algorithms incorporating inversion-transposition identification are yet to come.
Footnotes
Acknowledgments
We appreciate the great efforts made by three anonymous referees who have contributed very constructive, insightful, and valuable comments, which have greatly improved our presentation. We are also grateful to Dr. Mona Singh (Editor-in-chief) for her very dedicated and patient coordination of the article reviewing process even after reformatting the article had taken us a prolonged time period during COVID-19 pandemic.
Author Disclosure Statement
The authors declare they have no conflicting financial interests.
Funding Information
The authors received no funding for this article.
