A profile mixture (PM) model is a model of protein evolution, describing sequence data in which sites are assumed to follow many related substitution processes on a single evolutionary tree. The processes depend, in part, on different amino acid distributions, or profiles, varying over sites in aligned sequences. A fundamental question for any stochastic model, which must be answered positively to justify model-based inference, is whether the parameters are identifiable from the probability distribution they determine. Here, using algebraic methods, we show that a PM model has identifiable parameters under circumstances in which it is likely to be used for empirical analyses. In particular, for a tree relating 9 or more taxa, both the tree topology and all numerical parameters are generically identifiable when the number of profiles is less than 74.
1. Introduction
A profile mixture (PM) model is a particular stochastic model of protein sequence evolution that describes the changes in sequences along the tree of evolutionary relationships for a collection of taxa. Such a model is often used for the inference of a tree from sequence data, using standard maximum likelihood or Bayesian statistical frameworks. Here we investigate the question of parameter identifiability for this model: Are the model parameters—both the tree topology and numerical parameters—determined by a site pattern distribution arising from the model? Parameter identifiability, which informally means that valid parameter inference is possible in ideal circumstances, is an essential component of the theoretical justification for standard statistical inference approaches.
In models of protein sequence generation, amino acid site patterns are generally assumed to be independent and identically distributed across the sites. Common continuous time models of amino acid substitutions are instances of the general time-reversible (GTR) model, which assumes a single rate matrix Q constant over a metric tree, or extensions that allow for additional scalar rate variation at individual sites. The rate matrix Q has off-diagonal entries from (π), where R is a symmetric matrix of exchangeabilities and π is a vector of frequencies of the amino acids that remains stable under the model.
In principle, one can infer R,π, and a metric tree of taxon relationships from protein sequence data using standard statistical frameworks. However, with 20 amino acids, the state space for the model is large, so an exchangeability matrix R is often fixed in advance, having been previously determined empirically for particular types of data. Well-known exchangeabilities for protein alignments include the JTT (Jones et al., 1992), WAG (Whelan & Goldman, 2001), and LG (Le et al., 2008) matrices.
When inspecting protein sequence data, however, it is often clear that the GTR assumption of identically distributed sites is a poor one, since sites have visibly different amino acid compositions. Site residue distributions, or profiles, likely differ because of the biophysical properties of amino acids (e.g., hydrophilia, polarity, or charge), and the associated structural and functional constraints on the protein. This phenomenon suggests a model with multiple classes of substitution processes, and in particular, a mixture model using a variety of profiles with the same exchangeabilities for all classes. Mixture models can provide better fit to data as they introduce more parameters, although they also increase computational time and may lead to overfitting of the data.
However, a more fundamental issue with adopting a mixture model is that one may lose parameter identifiability. If several choices, or even more worrisome, infinitely many choices of parameters lead to the same probability distribution under the model, then even with an idealized infinite data set perfectly in accordance with the model, one could not recover the parameter values under which the data arose. Since the goal of most phylogenetic analyses is to infer model parameters—generally the topological tree but often numerical parameters in addition—identifiability is an essential property for a model to be useful. Nonidentifiability poses particular challenges in Bayesian Markov chain Monte Carlo (MCMC) analyses, where it may be manifested as a lack of convergence (Rannala, 2002).
Parameter identifiability has long been established for nonmixture site substitution models in phylogenetics, but mixture models provide greater challenges. Although computational work may suggest whether it holds or fails, parameter identifiability can only be established theoretically as it is a model property, and not dependent on an inference method. In recent years, algebraic methods have been introduced and successfully applied to a number of phylogenetic mixture models (Allman and Rhodes, 2006, 2008, 2009; Allman et al., 2010, 2011, 2019; Chifman and Kubatko, 2015; Long and Sullivant, 2015; Hollering and Sullivant, 2019; Wascher and Kubatko, 2021). While one of these works (Rhodes and Sullivant, 2012) established a rather general result on parameter identifiability of phylogenetic mixture models with many components, it unfortunately does not apply to the PM model's specific structure.
In this work, we prove parameter identifiability for a PM model of amino acid site substitution. PM models were introduced in the Bayesian context (Lartillot and Philippe, 2004; Lartillot et al., 2009, 2013) where the number of profiles might be inferred using a Dirichlet process prior, and as finite mixtures with a fixed number of components in a maximum likelihood analysis (Le et al., 2008). Studies suggest that PM models perform better than single-class models, particularly on data that are saturated or with an underlying long branch attraction bias (Lartillot et al., 2007; Wang et al., 2008). Mixtures with as many as 60 classes have been investigated with empirical data sets, with indications that around 20 profiles often provide good fit (Le et al., 2008). For a recent study, assessing the performance under simulation of mixture models including the discrete- rates-across-sites and PM models, see Wang et al. (2014).
Our main result, Theorem 5.7, establishes that parameters of a PM model with up to 73 classes on a tree of 9 or more taxa are generically identifiable; that is, identifiable outside an exceptional parameter set of measure zero. For any fixed number of classes, the parameters include the tree's unrooted topology, the tree's edge lengths, the exchangeabilities, the profiles, and weights of the mixture components. The limit on the number of profiles to 73 in this result is unlikely to be a strict upper bound for identifiability, but is large enough to cover the analyses of empirical data with the PM model that we have found in the literature. This limit arises from some specific features of our proofs, and we suspect could be raised somewhat, at the expense of more complicated arguments.
The proof techniques we use are algebraic in nature, using ideas from tensor decomposition and algebraic geometry. These tools, which have been introduced and used previously for phylogenetic models (Allman and Rhodes, 2006, 2009; Rhodes and Sullivant, 2012) and for more general models in statistics (Allman et al., 2009), are based on the algebraic properties of matrices and three-way tensors obtained from rearranging the entries of the distribution of site pattern frequencies. However, the structure of the PM model, with profiles varying over classes while the exchangeabilities do not, introduces important differences that prevent any easy deduction of the result from previous work. At several points in our arguments, we use exact integer computation, performed by the software Pari/GP (The PARI Group, 2019), to establish certain generic conditions we need on ranks of matrices.
As motivated by applications to amino acid models, our main theorem is stated for the PM model with a state space of size 20. However, the techniques used for establishing it apply to arbitrary sizes of the state space. For example, might be 4 for DNA or 61 for codons. However, appropriate rank computations would need to be carried out to complete the proof in such contexts. In the setting, we also believe that the proof techniques could be pushed to establish identifiability for more than 73 profiles, at the expense of requiring more taxa on the tree.
This article is organized as follows: In Section 2 we introduce phylogenetic substitution models, and in particular the PM model under study. Section 3 provides algebraic definitions and lemmas, although removed from the biological setting of interest. Section 4 then connects the phylogenetic PM model with these algebraic notions. We conclude in Section 5 with the proof of our main theorem on identifiability of the PM model parameters.
2. Markov Models on Trees
We begin by introducing Markov models of site substitution along a tree. Throughout, let be the size of the state space, which we identify with . For protein data, . Let be a rooted topological tree, with root and leaves labeled by elements of the taxon set X. The general Markov model of -state sequence evolution along is parameterized by (1) A vector π giving the distribution of states at the root; and (2) for each edge e directed away from the root, a Markov matrix Me giving the conditional probabilities of state transitions along e. These determine the expected site pattern frequency array, or joint distribution of states at the leaves, which we view as a array or tensor, P. Each site in an alignment is modeled as independent and identically distributed according to P.
A subclass of general Markov models is composed of the GTR models. For a GTR model, there is a single underlying rate matrix Q, and for each edge e of , a length te with . Time-reversibility is the assumption that for some symmetric matrix R of non-negative exchangeabilities and root distribution π, the off-diagonal entries of Q are those of the product (π) and the diagonal entries chosen so that row sums are zero. This results in (π)(π). One consequence of time-reversibility is that the Markov matrix Me is independent of the direction of e. It follows that the tree parameter in a GTR model is de facto unrooted since the location of the root is not identifiable. We repeatedly take advantage of this to “move the root” to locations in T convenient for our arguments.
PM models are finite mixtures of GTR models, where the underlying exchangeability matrix R is the same for each class. The particular PM model examined here has parameters as follows.
Definition 2.1.Let T be a rooted topological tree,a number of states, anda number of classes. Then the numerical parameters of the PM model on T, PM = PM, are:
a collection of non-negative branch lengths, one for each edge e of T;
a symmetricmatrix R of non-negative exchangeabilities;
a collection of m class weights, withand; and
For each class,
– aroot distribution vectorπi, called a profile; and
– a scalar rate parameter.
The scalar rate parameters are used to incorporate across-site rate variation into the PM model. Specifically, for class i with Qi the rate matrix determined by R, πi, the Markov matrix on edge e in T is . We note that site rate variation for PM models may be implemented differently in software, with a rate for each site (Lartillot and Philippe, 2004) or with a discrete- (Le et al., 2008). In the first implementation, the PM model is very likely overparameterized and ideally the MCMC would limit the number of rate multipliers. Implementation of the rate variation using a discrete- has a long history in computation phylogenetics (Yang, 1994), but proofs of such rate variation identifiability are only known for the continuous (Allman et al., 2008; Chai and Housworth, 2011).
While probability distributions from mixture models are often described as weighted sums of distributions from the various classes, phylogenetic mixture models can be equivalently presented as a single model on a tree T with states at internal nodes of T, and states at the leaves. The internal states are pairs where i is a class and is a “usual” state. In this formulation, Markov matrices on internal edges e for the PM model are block diagonal matrices, where the the m blocks are the , . The block structure prevents changes from one class to another, although the “usual” states may change within the class. For the terminal edges e of T, leading to leaves where the class information is not observable, the PM Markov matrix for an edge is formed by stacking the m Markov matrices for the classes. The root distribution is an vector formed by concatenating πi for the classes.
We collect these observations for parameterizing the PM model on a tree.
Definition 2.2.Given parameters for the PM model, assume that T is rooted at r. Then the, thematriceswhere Q is block diagonal with blocksfor each internal edge e of length te, and thematrix Me formed by stacking the matricesfor each class i on a terminal edge e give a parameterization of the PM model as a Markov model of site substitution on T.
Since our main goal is to prove parameter identifiability for the PM model, we formally define the notion of generic identifiability.
Definition 2.3.Consider a parametric model, specified by a parameterization mapfrom some parameter space to a space of probability distributions. Ifis one-to-one, then the model parameters are identifiable. Ifis one-to-one except possibly on a subset of measure zero in the parameter space, then the model parameters are generically identifiable.
Since the PM models under consideration are time reversible, at best only the unrooted topology of the tree parameter is identifiable. Also, it is well known that for the GTR model, some normalization is needed for rates and branch lengths since shows that rescaling all rates in Q can be offset by also rescaling branch lengths. Once understood, this model overparameterization, or lack of identifiability, is of little consequence. Typically, the rate matrix Q is normalized so that branch lengths are measured in expected number of substitutions per site over the elapsed time. In the strictest sense, only the normalized variant of the GTR model has identifiable parameters, a result used in our proof of the main theorem.
Theorem 2.4.For a single-class GTR model on an unrooted metric tree, the tree topology and all numerical parameters are generically identifiable, up to a normalization of Q.
3. Algebraic Definitions and Lemmas
In this section, we collect algebraic definitions and theorems that will play a role in our analysis of the PM model. We present these in a purely algebraic setting, deferring the connection to the phylogenetic models, and in particular the PM model, to later sections. We begin by defining tensors and certain algebraic operations on them leading up to a theorem of J. Kruskal on the structure of 3-way tensors, an important tool that we use several times. We then briefly introduce algebraic varieties and conclude by stating a theorem for identifying generic properties, a tool also used repeatedly in our proofs.
3.1. Tensors
Our first definition is a standard one.
Definition 3.1.Let A be anmatrix and B be anmatrix. The tensor, or Kronecker, productis thematrix whose rows are indexed by the ordered pair,and whose columns are indexed by ordered pair,such that theentry is
Less standard is the following.
Definition 3.2.Let A be anmatrix and B be anmatrix. The row tensor product is thematrix with entries indexed byfor,
In the case thatandis a positive integer, then the row-tensor power of A is thematrix
We do not specify the precise order of row and column indices in these tensor products, since for our applications it will either be clear from context, or inconsequential. In particular, we often only need results on the ranks of these products, which are independent of row and column ordering.
Since Kruskal's Theorem concerns 3-way tensors, we next describe reformatting n-way tensors into 3-way ones. Suppose P is an n-way tensor with indices labeled by X. Then a tripartition of X is a collection of three disjoint nonempty subsets of X whose union is X, . A bipartition of X, or a split, is , with the disjoint sets nonempty.
Definition 3.3.Let A be an n-waytensor witha split of the index set X. Then the matrix flattening of A with respect to, denoted, is amatrix. If, by permuting indices, we assume that,, then the-entry is
forand.
Similarly for a tripartitionof X, the 3-way tensoris
Example 1. Suppose A is a 6-way tensor, let ,, and . Then is a tensor with, for example,
Kruskal's theorem requires the notion of a 3-way tensor obtained as sum of “outer products” of the rows of 3 matrices.
Definition 3.4.Let A be amatrix withrow, and similarly for matrices B and C of sizeand, respectively. Thendenotes the 3-waytensor
where the tensor products in the summands are formatted to preserve an index for each matrix. For instance,, where T denotes the transpose.
To illustrate, suppose that are , , and matrices respectively,
Then is the tensor with slices with respect to the C index given by
As a simple extension of Definition 3.4 for use with phylogenetic models, we write
where
Before stating Kruskal's Theorem, we need the following.
Definition 3.5.Let A be a matrix. The Kruskal (row) rank of a matrix A is the largest number k such that every set of k rows of A is independent.
For example, letting V denote the set of all matrices, a set of dimension 9, consider matrices of the form
where are independent. These matrices have rank 2 but Kruskal rank 1, and form a subset of lower dimension inside the nine-dimensional space V.
It is clear that Kruskal rank is less than or equal to matrix rank, but when a matrix has full row rank, the two notions coincide. In subsequent sections, we exploit this observation by creating matrices with full row rank and therefore full Kruskal rank.
Kruskal's theorem can be viewed as a generic identifiability theorem for 3-way arrays, showing that triple products satisfying a particular rank condition are decomposable in essentially a unique way.
Theorem 3.6. (Kruskal, 1977). Letbe,, andmatrices with Kruskal rank, respectively. If
thenare uniquely determined by, up to simultaneous permutation and scaling of their rows. More precisely, if, then there exist invertible diagonal matricesand a permutation matrix P such that
By way of contrast, note that for two compatible matrices A, B, the natural analogue of the bracket product is the matrix product . However, from , A and B cannot be determined uniquely, since there are many matrix products that give the same result. For instance, for any orthogonal matrix Q. Kruskal's theorem thus states a significant difference between matrices and 3-way tensors.
3.2. Generic points in parameter space
Algebraic geometry provides a convenient tool for understanding exceptional sets, such as those that fail to satisfy the rank conditions necessary to apply Kruskal's Theorem. We briefly give the needed definitions.
Definition 3.7.Let S be a finite set of polynomials in. The common zero set inof the polynomials in S is the algebraic variety . A subset of a variety that is itself a variety is called a subvariety. For any algebraic variety, the ideal is the set of all polynomialssuch thatfor all.
The main result of this work is that PM model parameters are identifiable except for “rare” choices. This is expressed using the following terminology.
Definition 3.8.A property is generic on a full-dimensional subset W oforif it holds at all points of W except possibly for those points in some subsetof measure 0. If V is an algebraic variety in, we say a property is generic on V if it holds at all points except those in a proper subvariety of V.
This notion of “generic” on a variety is stricter than that sometimes used in algebraic geometry (a property that holds on a countable intersection of Zariski dense open sets), but is simpler for our purposes. Also, since proper subvarieties of V are of measure 0, our two uses of the word “generic” in the definition are consistent.
Example 2. The set of matrices, viewed as , forms a variety with . The property of having rank, or equivalently Kruskal rank 3, is generic on V, since matrices of rank at most 2, including those of the form (1), lie on a subvariety defined by the vanishing of a single polynomial, the determinant. This exceptional set is of dimension 8.
A fundamental tool for drawing conclusions that model parameters are generically identifiable is the following variant of a proposition in Rhodes and Sullivant (2012), which we use repeatedly.
Proposition 3.9.Letbe a complex analytic map with U an open subset of. Let V be a variety in. Suppose, and that there exists a pointwith. Then for generic pointsor, the pointlies off the variety V.
Proof. This follows from basic properties of complex analytic functions of many variables (see, for instance, the text by Range, 1986). The function is analytic, and not identically zero. Its zero set is therefore of measure zero, so for generic , lies off . The real points in the zero set must similarly have measure zero.
3.3. Rank propositions
For the proof of our main theorem, the ranks and Kruskal ranks of some special matrices arising in the PM model are needed, and we compile these rank computations here. By giving these algebraic results in advance, the proof of Theorem 5.7 can be presented more cleanly. Note that our arguments depend, in part, on some computations that were performed with the software Pari/GP. As these computations were performed using exact integer arithmetic, they may be taken as valid proofs, up to the usual assumptions of correct programming and no hardware faults.
We begin by defining a particular structured matrix (an instance of the equal input model; Casanellas and Steel, 2017) that can arise from particular parameter choices for the PM model.
Definition 3.10.Withfor, and, letdenote thematrix
We next show that -th row-tensor power of a matrix M formed from m stacked matrices of the form generically has full row rank if m is not too large. Part of the argument is inductive, with as the base case. While the size of is , only 1540 of the columns turn out to be distinct, so its rank is bounded by . Thus only by restricting to classes can we hope to obtain full row rank.
Note, however, that this upper bound of profiles in the proposition below and the slightly stronger restriction to profiles in our main result Theorem 5.7 are tied to our focus on for amino acid models. Even for the restriction to of the main result is unlikely to be tight, as it is, in part, an artifact of our specific arguments. If PM models with larger m become of interest for data analysis, then our general techniques are likely to extend to their analysis.
Proposition 3.11.Forand, let M be anmatrix formed by stackingchoices of matrices of the form. Thenhas full row rank for generic choices of the ai when.
Proof. We begin with the special case of . An exact Pari/GP calculation shows that for by picking distinct random integers for , the -matrix has full row rank. By removing some of the blocks from this example if , we obtain a point p1 for which has full row rank for smaller m as well.
To show that full row rank is a generic condition when , fix , and observe that the map from the space of the ai to {M} is analytic. Since gives full row rank, there is some minor f of , which when viewed as a polynomial in the entries of M has . Taking , then Proposition 3.9 shows that generic choices of the ai give so has rank .
Now consider . Then , where is a matrix and is a matrix. Since has full row rank for generic M, its rows are independent. However, with ,
so it is enough to know that the entries of some single column of are nonzero and that has independent rows to ensure has independent rows. However, this is true for generic choices of parameters for M. □
The next proposition gives a lower bound on Kruskal row rank, valid for all .
Proposition 3.12.For, let M be anmatrix formed by stackingchoices of matrices of the form. For,has Kruskal row rank greater than or equal to 2 for generic choices of the ai.
Proof. Consider first the case that . The matrices of the form M with Kruskal rank at most 1 form a subvariety V of all such M. By Proposition 3.9, it is enough to find a single matrix M not in V to see that generically such matrices have Kruskal rank at least two. Choose distinct positive small numbers as the free entries in each block of M, so that the diagonal entries are the largest in the block. Then no two rows within any block are multiples of each other, and no two rows of different blocks are multiples either, since the ai's are distinct. Thus M has Kruskal rank greater than or equal to two.
The case when follows by an argument similar to that at the end of the proof of Proposition 3.11.
The final propositions in this section involve generic ranks of stacked matrices formed by taking certain tensor products of matrices of the form above.
Proposition 3.13.Let M be anmatrix formed by stacking m choices of matrices of the form. Then forand, the matrix M has rank greater thanfor generic choices of the ai.
Proof. A Pari/GP calculation shows that for some choices of random integers ai, M has
full row rank , when ;
full row rank , when ;
rank , when ; and
rank , when .
Furthermore, by (4), for , there exists a matrix M with rank at least for some choice of ai’s, since we may repeat some blocks. Using Proposition 3.9, the stated rank condition on M is thus generic for all .
Proposition 3.14.Letbe as in Proposition 3.13, and M2be the rearrangement of M1formed by stacking m matriceswith the same choices of the ai. Let L be andiagonal matrix with positive entries. Then forand,has rank greater thanfor generic choices of the ai.
Proof. Sylvester's rank inequality gives
Since M1 and M2 differ only by row and column permutations, they have the same rank. Moreover, since L is a diagonal matrix with positive entries. Then, by Proposition 3.13, there is a choice of ai's so that has rank at least
, when ;
, when ;
, when ; and
, when .
The rank computation for shows additionally that there exist choices of ai giving for larger m, since blocks can be repeated. However, and so by Proposition 3.9, generically the rank must be greater than for all .
4. Algebraic Aspects Of The Pm Model
Next we relate the algebraic definitions made in the previous section to phylogenetic models and the PM model in particular. We begin by describing how a row tensor product of Markov matrices relates to parameters on a star tree.
Definition 4.1.Let A be a set of taxa on a star tree rooted at its internal node, with pendant edgesand associated Markov matrices. Then
For an m-class PM model on a star tree, the matrix MA is of size . Its entries are conditional probabilities of observing different -tuples of states at the taxa in set A, given the state at the root.
Given a tree T on taxa X, tripartitions and splits of X can be associated with the topological structure of T. For instance, the tree of Figure 1 displays a tripartition with . Formally, a tripartition is displayed on a tree if there is some vertex v of T whose deletion results in three subtrees with labeling their leaves. Similarly, if and , then , and T displays the split of X, since there is an edge e whose deletion results in two subtrees with leaves labeled by and .
A tree displaying the tripartition and the split , where .
When a tree T displays a tripartition of a set of taxa, then the flattening of a joint distribution corresponding to that tripartition can be expressed using the 3-way matrix product of certain matrices built from model parameters.
Lemma 4.2.Suppose T is a tree on a set of taxa X rooted at an internal vertex v and that T displays the tripartitionassociated with v. Let P be a probability distribution for a Markov modelon T withstates at the internal nodes. Then there exist matrices,,constructed from model parameters for, each withrows, such that
Proof. From the parameters on T we may define Markov matrices whose entries are conditional probabilities of states at the leaves in each set , given the state at v. Let π be the state distribution at v. Then
where (π), , and .
For establishing generic properties of the PM model, we often consider the particular choice of the exchangeabilities given by the matrix whose entries are all 1. This is, in essence, the CAT-F81 model (Lartillot and Philippe, 2004; Le et al., 2008), with the number of profiles some fixed m. For this R, a Markov matrix has the form given in Equation (3) of Definition 3.10.
Lemma 4.3.Consider the PM modelwith, and let e be a branch of T of length 1. Then for a single-classwith profileπand rate, the Markov matrixfor e is of the formof Definition 3.10, withandsatisfying.
Conversely, anyMarkov matrix of the formwithandcomes from a choice of parameters for one class of the PM model withon an edge of length 1.
Provided(equivalently), this correspondence is one-to-one.
Proof. The first statement follows by direct computation: With the standard basis vectors, (π) has right eigenvectors with eigenvalues for , and eigenvector with eigenvalue 0.
For the converse, since , there is a unique such that . If , let for , and π. Then , and . With these choices, (π) and . If , then all the aj are zero, and M is the identity matrix. Take and π arbitrary. Then
5. Identifiability of Parameters For the Pm Model
With preliminaries completed, we now turn to establishing our main result, on generic parameter identifiability for the PM model. The first step is to understand that the ranks of matrix flattenings of a model distribution are affected by whether the associated split is, or is not, displayed on the tree T.
Proposition 5.1.Let T be an n-taxon tree on X and P a distribution from the model PM = PMwithand. Suppose thatis a split of X with.
Ifis displayed on T, then has rank at most ;
Ifis not displayed on T, thengenerically has rank greater than.
Before beginning the proof, we present a simplified example to illustrate how the matrix rank of flattenings of joint distributions from Markov models on trees carries information about the absence/presence of internal edges on T.
Example 3. Consider a single-class 2-state Markov model on the 4-taxon tree shown in Figure 2. A special case of this model is . The joint distribution of states at the leave of T is the array P, with entries indexed by leaves in the order .
A 4-taxon tree with split .
With and , the rows and columns of are indexed by elements of . For example, the entry is . In contrast, if and , the flattening has -entry is .
Now suppose that the terminal edges of T have length 0, so that the states at a and b must agree, as must those at c and d, since no substitutions occur on terminal edges. Then the matrix arises from the joint distribution of states at the internal nodes v1 and v2, and its only nonzero entries are . Thus the matrix flattening for the split displayed by T has form
with rank at most .
In contrast, the flattening for the split not displayed on T has form
which generically has rank
If the terminal edges of T are of positive length, then the resulting joint distribution P can be obtained by a simple and generically rank-preserving linear action on the rows and columns of the flattenings above, as described in Allman and Rhodes (2006). Thus, flattenings respecting the topology of T generically have rank , while those that do not generically have larger rank.
Proof of Proposition 5.1.
Claim (1) follows from Lemma 4.1 of Rhodes and Sullivant (2012), since the PM model is a submodel of a mixture of m general Markov models on a single tree.
For claim (2), suppose now is not displayed on T. Let V be the variety of matrices of size with rank at most , defined by the set of all minors. By Proposition 3.9, it suffices to find a single choice of parameters that produces a point off V, as the parameterization extends to a complex analytic function.
Since T does not display , by Theorem 3.8.6 of Semple and Steel (2003), there is an edge of T with associated split such that , , , and are all nonempty. To find the needed choice of parameters, fix all internal edges of T except e to have length so the Markov matrices on these edges are I, and fix the edge lengths of all terminal edges and e to be 1 (Fig. 3). Take and mixing weights to be uniform. Values for the parameters are specified later in the argument. For this choice of parameters, T is formed by joining two star trees at the ends of e.
Taking to be the root of T, let be the block diagonal matrix, which is the joint distribution of classes and states at v1 and v2. The probabilities of observing states i, j, k, l at leaves in , , , , respectively, , are the entries of a tensor.
Define an tensor ,
The tensor is the joint distribution of states at the leaves of the tree T of Figure 3 when terminal edges have length zero and are single taxa. Indeed, since is not displayed on T, the matrix is with entries
Since K is block diagonal, has at most nonzero entries, all appearing on the diagonal, and is generically of rank .
To see that in the general case has a similar structure, let and where are given as in Equation (4) of Definition 4.1. Then
We now establish that claim (2) holds when , so the tree is one of those shown in Figure 4. Suppose first that and , as shown for tree (a) of the figure. In this case . Since is diagonal with at most nonzero entries due to the block structure of K, in Equation (5) we can replace by a diagonal matrix Q by eliminating zero rows and columns. To do this, we must also replace with an matrix N formed by taking tensor products of the individual class components of and and then restacking. To be concrete, for class c, the Markov matrix for a terminal edge is by Lemma 4.3, and N is formed by stacking m matrices .
Since Q is diagonal with generically positive entries, using Equation (5) we have that
where . By the singular value decomposition, it follows that
The Pari/GP calculation presented in Proposition 3.13, together with Proposition 3.9, shows generically, and thus, for generic and ri it follows that .
Now continuing with suppose that and , as shown in Figure 4b. The previous argument fails for this tree because now , as the tensor products defining these matrices are taken in different orders. However, a more complicated Pari/GP calculation, presented as Proposition 3.14, shows that generically has rank greater than in this case.
Finally, for the general case of , take to be a 3-element subset of A with at least one element from and one from , and similarly take to be a 3-element subset of B with at least one element from and from . Let be the probability distribution for the taxa . Since the row indices of depend on the states at the taxa in A and the column indices depend on the states at the taxa in B, marginalizing over all possible states for the taxa in A, which are not in , and similarly for B, gives the matrix . There exist matrices, , that perform this marginalization on ,
Since generically has rank greater than and has rank greater than or equal to by this equation, it follows that generically has rank greater than .
A tree T that does not display the split with , , but displays the split with and .
Trees with (a) and , and (b) and .
As a consequence of Proposition 5.1, from a distribution P computed from generic PM model parameters we can identify every edge in the tree for which there are at least three taxa on either side, by computing ranks of flattenings of P. In the following, we see that Proposition 5.1 also helps to identify at least one tripartition on the tree.
Proposition 5.2.Let T be an n-taxon tree on X with, and P a joint distribution from generic parameters for the modelwithand. Then there is at least one tripartitiondisplayed on T, with, which can be identified from P.
Proof. By Lemma of Rhodes and Sullivant (2012), every unrooted binary tree T with has an internal vertex v, which induces a tripartition such that two of the three components contain at least leaves of T.
The two edges incident to v that corresponds to subsets of X with at least leaves are generically identifiable by Proposition 5.1, since for , . If the third edge incident to v has 3 or more taxa in its component, it also can be identified. Thus, it remains to establish that the third edge incident to v can be identified when the number of taxa in its component is 1 or 2. Examples of such trees are illustrated for in Figure 5a,b.
If the third component has only one leaf, as in Figure 5a, the two bipartitions and are identifiable by Proposition 5.1. Together this implies that the tripartition induced by v is . If the third component has two leaves as in Figure 5b, the two splits and are identifiable, but and are not displayed on T, and that can be detected by Proposition 5.1. This implies that the tripartition is on the tree.
Examples of 9-taxon trees with internal vertex v inducing with and or 2.
With a tripartition on the tree identifiable by the preceding proposition, we prepare to apply Kruskal's Theorem. Letting P be a joint distribution from , pick an internal vertex v of T inducing such a tripartition . Then by Lemma 4.2
where . Provided the Kruskal ranks of the matrices are large enough, at least generically, Kruskal's theorem can be applied. The next three lemmas establish this.
Lemma 5.3.Consider the modelwith. If, then therow tensor power of theMarkov matrix associated with a terminal edge of T has full row rank for generic parameters.
Proof. Using Proposition 3.9, it is enough to show there is a single choice of parameters for which the tensor power has full row rank. Let R = 1, and take the terminal branch lengths to be 1. Then by Lemma 4.3, the Markov matrix Me on a terminal edge has the form of stacked matrices of the form . By the Pari/GP calculation of Proposition 3.11, for generic choices of the other parameters, , , has full row rank.
Using Proposition 3.12 in a similar argument we obtain the following.
Lemma 5.4.Consider the modelwithand. Then for, therow tensor power of theMarkov matrix associated with a terminal edge of T generically has Kruskal rank at least 2.
Lemma 5.5.For a distribution from the modelwithand, letbe the matrices described above. If, and, then generically, MB have full Kruskal rank and MC has Kruskal rank at least 2.
Proof. Using Proposition 3.9, we need only show there is a single choice of parameters for which these rank claims hold. Set all internal branch lengths 0 and all terminal branch lengths 1, so that T is a star tree rooted at the central node v. Then by Lemma 5.3, since for generic choices of the profiles πi, the matrices MA (and therefore ) and MB have full row rank and therefore full Kruskal rank. Also by Lemma 5.4, MC has Kruskal rank at least 2.
We add the last ingredient before the main result.
Proposition 5.6.Suppose T is a tree on X that displays a known tripartitioncorresponding to vertex r with,. Ifandthen both T and the numerical parameters of the PMmodel are generically identifiable, up to arbitrary rescaling of the tree and the exchangeability matrix R.
Proof. Using the notation and result of Lemma 5.5, if a distribution P comes from generic parameters of , then
where have full Kruskal rank and MC has Kruskal rank at least 2. Thus Equation (2) of Theorem 3.6 is satisfied with , and are determined uniquely up to simultaneous permutation and scaling of the rows.
Also, by factoring out row sums from the matrices, we can generically identify the root distribution vector at the node r and up to simultaneous permutation of the entries of and the rows of the matrices. Considering any entry of , and supposing that this corresponds to an unknown class and state , then the same rows of correspond to the same class u and state w. Since Kruskal's theorem yields identifiability only up to permutation, we must determine which of the rows of correspond to the same fixed class u.
Consider first the special case that where . Then T, which is generically binary, has a subtree rooted at r, with leaves as shown in Figure 6, although we do not know which two taxa from form the cherry
The Markov matrix MA is of size . Choose the row of MA where for unknown . It is a row vector with entries, but we can reconfigure it as a three-dimensional tensor of size so its -entry is . Since the PM model is time reversible, take v1 as the root of the subtree in Figure 6. Then for unknown vector π, and Markov matrices for class u on this subtree, the joint distribution of states at for fixed class u is
where with denoting the column of M1. For fixed u this is simply a rescaling of the conditional distribution given in the row of MA.
Thus applying Kruskal's theorem to each row of MA reshaped into such a 3-way tensor, we can decompose for each into a triple product, as the matrices generically all have rank . Note that for each , Kruskal's theorem gives the matrices up to ordering of their rows. Two of these matrices, , will be dependent only on the class u, but not the state w. So considering all , we can find rows of MA with the same (possibly permuted rows) version of My and Mz, corresponding to a single-class u. In this way, we can group the rows of with entries of by class u. Now taking those rows of and entries of for one class u and reassembling them in a 3-way product gives a tensor for a single-class GTR model on the full tree T. Both the tree T and numerical parameters are identifiable for this single-class model by Theorem 2.4.
For the general case, suppose . Then by marginalization down to , we can identify the subtrees and parameters for . Then interchanging the roles of A and B identifies the subtree and parameters for A.
A subtree of T with leaves .
Combining Proposition 5.2 with Proposition 5.6, we have proved the main result.
Theorem 5.7.Let T be a tree with at least 9 taxa. Then under the PMmodel with, both T and numerical parameters are generically identifiable, up to arbitrary rescaling of the tree and the exchangeability matrix R.
Theorem 5.7 extends to certain tree shapes with fewer than 9 taxa. To apply Proposition 5.6, T must display a tripartition with two of its subsets of size at least 3, so that T must have at least 7 taxa. Such a tripartition will be generically identifiable by the argument given for Proposition 5.2.
Corollary 5.8.For the PM modelwith, parameters are generically identifiable if T has any of the 8-taxon tree shapes (a)-(d) shown inFigure 7, or the 7-taxon caterpillar shape.
All binary unrooted tree shapes for 8 taxa. Parameters of the profile mixture model are generically identifiable for trees (a–d). The arguments of this article do not answer the identifiability question for tree (e).
Footnotes
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This research was supported, in part, by the National Institutes of Health Grant R01 GM117590, awarded under the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.
References
1.
AllmanE., AnéC., and RhodesJ.2008. Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv. Appl. Probab. 40, 229–249.
2.
AllmanE., HolderM., and RhodesJ.2010. Estimating trees from filtered data: Identifiability of models for morphological phylogenetics. J. Theor. Biol. 263, 108–119.
3.
AllmanE., LongC., and RhodesJ.2019. Species tree inference from genomic sequences using the log-det distance. SIAM J. Appl. Algebra Geometry. 3, 1–30.
4.
AllmanE., MatiasC., and RhodesJ.2009. Identifiability of parameters in latent structure models with many observed variables. Ann. Statist. 37, 3099–3132.
5.
AllmanE., PetrovićS., RhodesJ., et al.2011. Identifiability of two-tree mixtures for group-based models. IEEE/ACM Trans. Comput. Biol. Bioinform. 8, 710–722.
6.
AllmanE., and RhodesJ.2006. The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J. Comput. Biol. 13, 1101–1113.
7.
AllmanE., and RhodesJ.2008. Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites. Math. Biosci. 211, 18–33.
8.
AllmanE., and RhodesJ.2009. The identifiability of covarion models in phylogenetics. IEEE/ACM Trans. Comput. Biol. Bioinform. 6, 76–88.
9.
CasanellasM., and SteelM.2017. Phylogenetic mixtures and linear invariants for equal input models. J. Math. Biol. 74, 1107–1138.
10.
ChaiJ., and HousworthE.2011. On Rogers's Proof of Identifiability for the GTR + Gamma + I Model. Syst. Biol. 60, 713–718.
11.
ChifmanJ., and KubatkoL.2015. Identifiability of the unrooted species tree topology under the coalescent model with time-reversible substitution processes, site-specific rate variation, and invariable sites. J. Theor. Biol. 374, 35–47.
12.
HolleringB., and SullivantS.2019. Identifiability in phylogenetics using algebraic matroids. arXiv:1909.13754.
13.
JonesD., TaylorW., and ThorntonJ.1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, 275–282.
14.
KruskalJ.1977. Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl. 18, 95–138.
15.
LartillotN., BrinkmannH., and PhilippeH.2007. Suppression of long-branch attraction artefacts in the animal phylogeny using a site heterogeneous model. BMC Evol. Biol. 7, S4.
16.
LartillotN., LepageT., and BlanquartS.2009. PhyloBayes 3: A Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics. 25, 2286–2288.
17.
LartillotN., and PhilippeH.2004. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Bio. Evol. 21, 1095–1109.
18.
LartillotN., RodrigueN., StubbsD., et al.2013. PhyloBayes MPI: Phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst. Biol. 62, 611–615.
19.
LeS., GascuelO., and LartillotN.2008. Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics. 24, 2317–2323.
20.
LongC., and SullivantS.2015. Identifiability of 3-class Jukes-Cantor mixtures. Adv. Appl. Math. 64, 89–110.
21.
RangeR.1986. Holomorphic Functions and Integral Representations in Several Complex Variables, Vol. 108. Springer-Verlag, New York.
22.
RannalaB.2002. Identifiability of parameters in MCMC Bayesian inference of phylogeny. Syst. Biol. 51, 754–760.
23.
RhodesJ., and SullivantS.2012. Identifiability of large phylogenetic mixture models. Bul. Math. Biol. 74, 212–231.
24.
SempleC., and SteelM.2003. Phylogenetics, Vol. 24. Oxford University Press, Oxford.
25.
The PARI Group. 2019. PARI/GP version 2.11.2 [Computer software manual]. University of Bordeaux. Available at: http://pari.math.u-bordeaux.fr. Accessed April15, 2021.
26.
WangH.-C., LiK., SuskoE., et al.2008. A class frequency mixture model that adjusts for site-specific amino acid frequencies and improves inference of protein phylogeny. BMC Evol. Biol. 8, 1–13.
27.
WangH.-C., SuskoE., and RogerA.2014. An amino acid substitution-selection model adjusts residue fitness to improve phylogenetic estimation. Mol. Biol. Evol. 31, 779–792.
28.
WascherM., and KubatkoL.2021. Consistency of SVDQuartets and maximum likelihood for coalescent-based species tree estimation. Syst. Biol. 70, 33–48.
29.
WhelanS., and GoldmanN.2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Bio. Evol. 18, 691–699.
30.
YangZ.1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol. 39, 306–314.