Parameter Identifiability of a Multitype Pure-Birth Model of Speciation

Abstract

Diversification models describe the random growth of evolutionary trees, modeling the historical relationships of species through speciation and extinction events. One class of such models allows for independently changing traits, or types, of the species within the tree, upon which speciation and extinction rates depend. Although identifiability of parameters is necessary to justify parameter estimation with a model, it has not been formally established for these models, despite their adoption for inference. This work establishes generic identifiability up to label swapping for the parameters of one of the simpler forms of such a model, a multitype pure birth model of speciation, from an asymptotic distribution derived from a single tree observation as its depth goes to infinity. Crucially for applications to available data, no observation of types is needed at any internal points in the tree, nor even at the leaves.

1. INTRODUCTION

Species diversification models are used in biology to make inferences about historical speciation and extinction rates over the time since a group of species, or taxa, evolved from a common ancestor. By providing information on rates of speciation and extinction, inference with these models seeks to give insight into the evolutionary dynamics leading to the present diversity of life. These models have a long history, starting with the constant-rate pure-birth model of Yule (1925), and a fairly large literature has developed.

Diversification models describe a process beginning with a single lineage at some time in the past, which as time progresses may speciate or go extinct. When a speciation occurs, the edge bifurcates into two edges, with the number of lineages increasing by 1. When an extinction occurs, the lineage ends, and the number of lineages decreases by 1. After either event, the process continues forward, independently on all lineages, producing a growing tree structure until the present time is reached. This tree, which has both topological and metric structure, constitutes an observation. (In applications, it may be necessary to consider the reconstructed tree, which is obtained by removing all tree edges with no descendants at the present; Harvey et al, 1994; Nee et al, 1994.)

Two basic sorts of these models have found common use in empirical studies. In the first, the speciation and extinction rates are functions of time and apply to all taxon lineages present at any moment. This can be thought of as modeling exogenous factors, such as environmental conditions, that affect all taxa in the tree identically. Since all lineages behave in the same probabilistic way at any moment, it is not hard to show that the exact branching pattern of the tree structure is irrelevant, with all the information in a tree observation being captured by the number of lineages as a function of time. Thus the work on time-dependent birth–death models by Kendall (1948) is foundational.

In the second sort of diversification model, which we call the multitype birth–death tree model, lineages are assigned one of a finite number of types at each moment, with the model's speciation and extinction rates dependent only on the type. Over time, however, species may change types at fixed switching rates. This models endogenous factors, such as a particular biological trait a taxon may possess, including, for instance, a morphological feature, behavior, or whether a particular gene is present and active in an organism. A given type might correlate with faster or slower speciation than another, and/or affect the extinction rate. For these models the branching structure of a tree observation does matter, as taxa present at a given time may each have different types and, thus, different tendencies to speciate or go extinct.

The Binary State-specific Speciation and Extinction (BiSSE) model of Maddison et al (2007) formalized the multitype framework for biological applications. Multitype (MuSSE) and quantitative-type (QuaSSE) variants of the model were subsequently proposed by FitzJohn (2012). Although these works assumed that the type is observed for the extant taxa at the leaves of a tree, we consider the multitype birth–death tree model with no type information observable for any lineage at any time, as type observations are unnecessary for our results. Indeed, the usefulness of these models to infer correlation between observed types and diversification rates from data with type information for extant taxa has been called into question (Rabosky and Goldberg, 2015).

Many other diversification models have been proposed, combining or extending these basic frameworks, with Stadler (2013) offering one review. New variants continue to be developed (e.g., Barido-Sottani et al, 2020; Cantalapiedra et al, 2014; Maliet et al, 2019; Rasmussen and Stadler, 2019; Stadler, 2019).

When these models are used for inference, the data are taken as a single tree assumed to show the true evolutionary relationships of the taxa. (In practice, this tree itself must be inferred, usually from sequence data using phylogenetic and/or phylogenomic methods which we do not discuss here.) Multiple trees which one can reasonably hypothesize were generated with the same parameter values are simply not available. If the tree is sufficiently large, researchers hope that it provides enough information to infer the speciation and extinction parameters reasonably well. More precisely, it has been implicitly assumed that the inference is statistically consistent, in the sense that as the number of taxa increases toward infinity (i.e., the tree grows larger), the probability of inferring model parameters arbitrarily close to the generating ones approaches 1. Establishing such a result, however, requires showing identifiability of the model parameters: a distribution derived from an observation of a single tree has a limit, as the number of taxa approaches infinity, that uniquely determines all parameter values.

Of course a full proof of the statistical consistency of a particular estimator requires additional arguments. For instance, the standard results on the consistency of maximum likelihood (ML) assume the availability of multiple independent samples and, therefore, cannot be applied. Leroux's result (Leroux, 1992) on the consistency of ML inference from a single sequence of observations from a Hidden Markov Model is analogous to what is needed for applications of these diversification models. Nonetheless, establishing parameter identifiability is the first step toward this goal.

Recent work has shown that the first type of diversification model, with time-dependent rates, does not in fact have identifiable parameters (Louca and Pennell, 2020), calling into question the conclusions of many empirical studies. This nonidentifiability result, which holds even if one allows for identification to be based on arbitrarily many independent tree observations with the same underlying rate parameters, was compellingly illustrated by construction of examples of wildly different rate functions producing identical tree distributions. An instance of this lack of identifiability had in fact appeared earlier, in an argument in which speciation rates were modified and extinction rates set to zero without changing the model distribution (Nee et al, 1994).

Little work, however, has addressed identifiability questions for multitype birth–death tree models. The strongest results on parameter identifiability for a pure birth model focus on a tree's topological features but assume that the types of both leaf nodes and their parents are observed (Popovic and Rivas, 2016). In biological applications, however, the type of a leaf of the tree may be observable, but the type of the parent nodes is virtually never known. Thus no identifiability result relevant to typical data analyses has been produced. A recent article of O'Meara and Beaulieu (2021), which broadly discusses current issues with diversification models in evolutionary biology in light of the Louca and Pennell (2020) result, argues that multitype birth–death tree models are likely to be identifiable—provided their rates are time-independent—but is careful to indicate that this has not yet been established. And as the community has seen for time-dependent models, formal mathematical analysis is essential to settle the question.

One might hope that the analysis of multitype birth–death tree models would be simpler than for a time-dependent rate model, as its parameter space is finite dimensional. In contrast, while trees produced by the time-dependent rate models can be summarized by the counts of lineages through time with no loss, this is not true for the multitype models, where the full tree structure carries additional information. Effectively extracting information from a tree with both topological and metric structure requires a new approach.

In this article, we investigate parameter identifiability of the multitype pure-birth tree (MPBT) model with any finite number of types. We thus restrict extinction rates for all types to be zero. This model has also been called the multitype Yule model (Popovic and Rivas, 2016). We assume only that the metric tree is observable, with no information on the types either at points internal to the tree or at the leaves. More formally, we establish generic identifiability of parameters up to label swapping. “Generic” means that the result holds if we exclude parameters lying in a measure-zero subset of the parameter space. We give an explicit characterization of such a measure-zero exceptional set, as the zero set of a certain polynomial. “Up to label swapping” means that there are certain symmetries of the parameter space, arising from interchanging types so that their corresponding speciation and switching rates are also interchanged, that have no effect on the model's behavior. Generic identifiability up to label swapping is often the strongest form of identifiability that holds in models with hidden variables (Allman et al, 2009), and since we treat the types as unobservable, its appearance here is not surprising.

Our explicit generic conditions are stated as four assumptions throughout the article, as need for each arises for specific arguments. Briefly, they are that speciation rates for all types are positive and distinct (Assumptions 1 and 4), all switching rates between types are positive (Assumption 2), and that a certain matrix with entries in the speciation and switching rates is nonsingular (Assumption 3). The first few of these are intuitive and plausible assumptions. Although the meaning of the last condition is less clear outside the setting of the formal mathematical proof, we illustrate that in a few special cases it also imposes a natural condition.

Our arguments draw on several earlier studies. The first is the work of Athreya (1968) on Multitype Continuous Time Markov Branching Processes. In fact, these models and the MPBT model have the same underlying structure. But much of the classical branching process literature allows only for observing type counts over time and not for observing the tree structure indicating the branching of specific lineages. The MPBT model, in contrast, treats the tree structure as observable, with type information hidden. Thus while providing an important tool in this work, the results of Athreya are not immediately applicable to the MPBT model.

The second result crucial to our work is a general theorem on identifiability up to label swapping of parameters of a mixture model of product distributions (Allman et al, 2009). In applying this to the MPBT model, we consider the joint distribution of edge lengths around a node on an edge chosen uniformly at random from a random tree, as the random tree grows arbitrarily large. Due to conditional independence of edge lengths, conditioned on the type of the shared node, this joint distribution takes the form of a mixture distribution (over types) of product distributions. Although additional work is necessary to show parameter identifiability, this theorem is a crucial ingredient in our argument.

Although we do not address the multitype birth–death tree model with nonzero extinction rates here, we believe that our approach provides a pathway toward a more general result.

Some applications of multitype birth–death models also attempt to choose an appropriate number of types based on the data, with several Bayesian software packages supporting this (e.g., Barido-Sottani et al, 2020; Rabosky, 2014). While this is an important element of some data analyses, it is not addressed in this work, where we fix the number of types. Choosing the number of types amounts to choosing among a family of nested models, each with generically identifiable parameters, where one may expect any finite data set to be naively better fit with each increase in the number of types. While in the theoretical world of exact distributions one could choose the smallest number of types giving an exact fit, the finiteness of data necessitates the use of more sophisticated approaches to model adequacy.

This article is structured as follows. In Section 2 we provide a more formal definition of the MPBT model, and begin its analysis by deriving formulas related to the generation of a single edge in the tree in Section 3. Section 4 uses the results of Athreya (1968) to obtain asymptotic results on the distribution of types across lineages in the tree at times increasingly distant from the root of the tree. Then, in Section 5, we bring these ingredients together and apply the theorem of Allman et al (2009) to obtain our main results. Concluding remarks appear in Section 6.

2. MODEL DEFINITION

In this section we formalize the MPBT model, in a form useful for our analysis.

Let m be a positive integer denoting the number of types, and denote the set of types by $[m] = \{ 1, 2, \dots, m \}$ .

The parameter space of the MPBT model with m types is all 3-tuples $(π, λ, S)$ described as follows.

A root distribution $π = (π_{1}, π_{2}, \dots, π_{m})$ , with $π_{i} \geq 0$ , $\sum_{i} π_{i} = 1$ gives probabilities $π_{i}$ of type i being chosen for the tree root. A vector $λ = (λ_{1}, λ_{2}, \dots, λ_{m})$ with nonnegative entries gives speciation rates $λ_{i}$ for type i. An $m \times m$ matrix $S = (s_{i j})$ with nonnegative off-diagonal entries and rows summing to 0 gives scalar type switching rates $s_{i j}$ from type i to type j, $i \neq j$ . Note that S is determined by the $m^{2} - m$ independent scalar switching rates.

2.1. The edge process model

We first describe how an edge of a tree is produced under the model. As edges of the tree are produced independently conditioned on their starting types, a description of a single edge is sufficient.

We view an edge as growing with time, randomly changing the type of its leading point as it does so. At any time the edge may speciate, at a rate $λ_{i}$ determined by its current type i. When speciation occurs, the edge ceases to grow, and in the full model two new edge processes are started for its descendent edges. However, in formalizing the edge process we describe the speciation of an edge as the process entering an absorbing state, for mathematical convenience.

For each type $i \in [m]$ , define two states $i_{-}, i_{+}$ . At any time, state $i_{-}$ indicates that the current leading point of the edge has type i and that the edge has not yet speciated. The absorbing state $i_{+}$ represents that a speciation has occurred and at the time of speciation the leading point had type i. The parameter $s_{i j}$ , $i \neq j$ , is thus a rate of change from state $i_{-}$ to state $j_{-}$ , while $λ_{i}$ is the rate of change from state $i_{-}$ to $i_{+}$ . No other instantaneous state changes are allowed.

Definition 1. The m-type pure-birth edge process $E_{τ} = E_{τ} (\tilde{π}, λ, S)$ with ${\tilde{π}}_{i} \geq 0$ , $\sum_{i} {\tilde{π}}_{i} = 1$ , is the $2 m$ -state continuous-time Markov process over $τ \in [0, \infty)$ with states $1_{-}, 2_{-}, \dots, m_{-}, 1_{+}, 2_{+}, \dots, m_{+},$

initial state distribution $(\tilde{π}, 0) \in \mathbb{R}^{2 m}$ , and $2 m \times 2 m$ transition rate matrix $Q := \begin{bmatrix} S - \operatorname{diag} (λ) & \operatorname{diag} (λ) \\ 0 & 0 \end{bmatrix},$

where the rows and columns of Q are ordered by states as above. Here $0$ is a vector or matrix of 0s, and $\operatorname{diag} (λ)$ is the diagonal matrix formed from vector λ.

The transition probability matrix associated to $E_{τ}$ is $P (τ) = \exp (Q τ),$

with $P_{i j} (τ)$ giving the probability that an edge is in state j at time $τ$ given that it was in state i at time 0.
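To make Definition 1 concrete, the following minimal sketch assembles the rate matrix Q for a binary-type example, using the parameter values quoted in the caption of Figure 1. The function name `edge_rate_matrix` and the nested-list representation are illustrative assumptions, not part of the model definition.

```python
def edge_rate_matrix(lam, S):
    """Return the 2m x 2m rate matrix Q = [[S - diag(lam), diag(lam)], [0, 0]].

    States are ordered 1_-, ..., m_-, 1_+, ..., m_+ as in Definition 1.
    """
    m = len(lam)
    Q = [[0.0] * (2 * m) for _ in range(2 * m)]
    for i in range(m):
        for j in range(m):
            # Upper-left block S - diag(lam): moves among the "-" states.
            Q[i][j] = S[i][j] - (lam[i] if i == j else 0.0)
        # Upper-right block diag(lam): speciation i_- -> i_+.
        Q[i][m + i] = lam[i]
    # The last m rows stay zero because the "+" states are absorbing.
    return Q


# Binary-type example with the rates from the caption of Figure 1.
lam = [0.1, 0.5]
S = [[-0.1, 0.1], [0.2, -0.2]]  # rows sum to 0
Q = edge_rate_matrix(lam, S)
for row in Q:
    print(row)
```

Each row of Q sums to zero, as required of a Markov rate matrix, and the two rows for the “+” states are identically zero.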

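Definition 2. The speciation time $T$ associated to $E_{τ} (\tilde{π}, λ, S)$ is the $[0, \infty]$ -valued random variable $T = \inf \{ τ \geq 0 : E_{τ} \in \{ 1_{+}, 2_{+}, \dots, m_{+} \} \},$

with $T = \infty$ if no “+” state is ever reached.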

A realization of the edge process that reaches a “+” state is viewed as an edge of length $T$ , the time at which a speciation occurs. Each point (time $τ$ ) along the edge is “colored” by type i if the process is in state $i_{-}$ (or state $i_{+}$ at its endpoint) at that time. Under mild assumptions, the edge length is finite with probability 1, as is shown below. Although for the MPBT model colors on edges are ultimately hidden, they play an important role in our arguments.

The terminal edges of the tree are produced by terminating edge processes at a specific time, before they may have reached an absorbing state. Formally defining such a truncated edge process and the colored edge it produces is straightforward.

Due to the time-homogeneous Markov formulation of the edge process, we may equivalently produce an edge either from a single process reaching a “ $+$ ” state, or by starting the process, truncating it before it enters a “ $+$ ” state, starting a new process in the final state of the truncated one, and then conjoining the edges produced. Likewise, to produce an edge from the truncated process, we may allow the process to continue to a later time and then truncate the edge that was produced to an initial segment.

2.2. The MPBT model

We now define the MPBT model, as a generative model producing a tree. Let $T > 0$ be the depth (length of all paths from root to any tip) of the tree to be sampled.

1. The process begins with a root node. With parameters $(π, λ, S),$ generate from an edge process a colored descendent edge from the root to a node of type i, the only current tip of the tree.

If the length of this edge is $\geq T$ , truncate it to length T, and go to Step 4.

Otherwise, at this node attach two descendent edges of length 0, with points on them colored by i. The tree now has 2 tips.

2. If the tree currently has k tips, for each tip generate a descendent edge via independent edge processes with parameters $(e_{i}, λ, S)$ , where i is the type of the tip and $e_{i}$ the standard basis vector in $\mathbb{R}^{m}$ . Truncate all edge processes at the time $τ$ when the first reaches a “+” state. The colored edges for each tip are conjoined to the edges (possibly of length 0) leading to the tip.

If the path length from the root to a tip of the tree is $\geq T$ , truncate all terminal edges so that all paths from root to leaves have length T, and go to Step 4.

Otherwise, at the tip that arose from reaching state $j_{+}$ , attach two descendent edges of length 0 with points on them colored by j.

3. Go to Step 2.

4. Uncolor all edges to obtain a sampled tree.
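The generative steps above can be sketched in code. The version below is a simplified illustration rather than a reference implementation: instead of restarting every edge process at each speciation, it draws one event at a time across all tips, which should be equivalent by the time-homogeneous Markov property discussed in Section 2.1. The function names, the edge-list output format, and the choice of root distribution (0.5, 0.5) are assumptions made for the example.

```python
import random


def simulate_mpbt(pi, lam, S, depth, rng):
    """Simulate an uncolored MPBT tree; return edges (parent, child, length).

    Node 0 is the root, which has a single descendent edge.
    """
    m = len(lam)
    root_type = rng.choices(range(m), weights=pi)[0]
    tips = [[0, root_type, 0.0]]  # [parent node, current type, edge start time]
    edges, next_id, t = [], 1, 0.0
    while True:
        # Total rate of leaving the current "-" state, for each tip.
        rates = [lam[tip[1]] + sum(S[tip[1]][j] for j in range(m) if j != tip[1])
                 for tip in tips]
        t += rng.expovariate(sum(rates))
        if t >= depth:
            break
        k = rng.choices(range(len(tips)), weights=rates)[0]
        parent, i, start = tips[k]
        # Targets: index i means speciation, any other j means a type switch.
        weights = [lam[i] if j == i else S[i][j] for j in range(m)]
        j = rng.choices(range(m), weights=weights)[0]
        if j != i:
            tips[k][1] = j  # hidden color change along the edge
        else:
            edges.append((parent, next_id, t - start))
            tips[k] = [next_id, i, t]  # two new tips of type i
            tips.append([next_id, i, t])
            next_id += 1
    for parent, _type, start in tips:  # truncate terminal edges at the depth
        edges.append((parent, next_id, depth - start))
        next_id += 1
    return edges  # types are discarded: the tree is uncolored


def root_to_tip_lengths(edges):
    """Return the path length from the root to each leaf."""
    parent_of = {child: (parent, length) for parent, child, length in edges}
    leaves = set(parent_of) - {parent for parent, _, _ in edges}
    lengths = []
    for node in leaves:
        total = 0.0
        while node != 0:
            node, length = parent_of[node][0], parent_of[node][1] + 0.0
            total += length
        lengths.append(total)
    return lengths


depth = 10.0
edges = simulate_mpbt([0.5, 0.5], [0.1, 0.5],
                      [[-0.1, 0.1], [0.2, -0.2]], depth, random.Random(0))
tip_lengths = root_to_tip_lengths(edges)
print(len(tip_lengths), "leaves,", len(edges), "edges")
```

Every root-to-leaf path in the output has length equal to the chosen depth, and a tree with n leaves has 2n - 1 edges because of the single root edge.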

An example simulation of a colored tree from a binary-type model is shown in Figure 1, with the color indicating type hidden in Figure 2.

FIG. 1.
A finite-length colored tree generated by the binary-type pure birth tree model, before colors are hidden. Here black represents type 1 and dashed gray type 2, with $λ_{1} = 0.1,$ $λ_{2} = 0.5,$ $s_{12} = 0.1,$ $s_{21} = 0.2$ . Only the uncolored tree is observed.

FIG. 2.
The uncolored tree of Figure 1, of depth T, generated by the binary-type pure birth tree model. The black line at t determines several highlighted triples of edges whose lengths are possible draws from the probability distribution $G_{T, t}$ of Definition 5 of Section 5.

Remark. Inherent in the model are several notions of time. For an individual edge process, $τ$ is a time variable, with $τ = 0$ at the parental node in the edge. For the tree generation process overall, we use t as the time variable, with $t = 0$ at the root. If the edge process starting at the root enters a “+” state at time $τ = T_{0}$ , then that root edge has length $ℓ = T_{0}$ and at its child node $t = T_{0}$ . Then if the edge process for an edge descending from the first speciation produces an edge of length $T_{1}$ , then at its child node $t = T_{0} + T_{1}$ . In general, a point on any edge e at time $τ$ has $t = τ + \sum_{e'} ℓ_{e'},$ where the sum runs over the edges $e'$ on the path from the root to the parental node of e, and $ℓ_{e'}$ denotes the length of $e'$ .

We can thus view a random tree as growing with time t, as its terminal edges lengthen while changing type, and speciate.

Remark. While we have defined the MPBT model as starting with a single edge descending from the root node, it is equally common to define diversification models as starting at a bifurcating root. The modifications to the definition that are necessary to do so are straightforward, and working in that context would have no substantive impact on the arguments which follow.

Remark. Even if $T \to \infty$ , a single observed tree does not allow for the identification of π, so we focus on identifying the pair $(λ, S)$ . This factor of the parameter space can be identified with the nonnegative orthant of $\mathbb{R}^{m^{2}}$ .

3. THE EDGE PROCESS

For parameters $(λ, S)$ , let $D = \operatorname{diag} (λ)$ and $U = S - D$ , so that the edge process $E_{τ}$ has Markov rate matrix $Q = \begin{bmatrix} U & D \\ 0 & 0 \end{bmatrix} .$

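Lemma 1. The transition probability matrix for $E_{τ}$ is $P (τ) = \begin{bmatrix} \exp (U τ) & f (U τ) D τ \\ 0 & I \end{bmatrix},$

where $f (A) = \sum_{n = 0}^{\infty} \frac{1}{(n + 1)!} A^{n}$ satisfies $f (A) A = \exp (A) - I$ .

Proof. For $n \geq 1$ , $Q^{n} = \begin{bmatrix} U^{n} & U^{n - 1} D \\ 0 & 0 \end{bmatrix},$

so summing the series $\exp (Q τ) = I + \sum_{n \geq 1} \frac{τ^{n}}{n!} Q^{n}$ block by block gives $\exp (U τ)$ in the upper left block and $\sum_{n \geq 1} \frac{τ^{n}}{n!} U^{n - 1} D = f (U τ) D τ$ in the upper right block, while the lower blocks remain 0 and I.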
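As a numerical sanity check of Lemma 1, the sketch below compares $\exp (Q τ)$ with the claimed block form for the binary-type rates of Figure 1. It uses a hand-written matrix exponential (truncated Taylor series with scaling and squaring) only to stay dependency-free; the helper names `mat_mul`, `mat_exp`, `f_series`, and `lemma1_max_error`, and the choice $τ = 2$, are assumptions of this illustration.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def mat_exp(A, squarings=10, terms=20):
    """Approximate exp(A): Taylor series of A / 2**squarings, then squaring."""
    n = len(A)
    scaled = [[a / 2 ** squarings for a in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def f_series(A, terms=60):
    """Evaluate f(A) = sum_{n >= 0} A^n / (n + 1)! by direct summation."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / (k + 1) for x in row] for row in mat_mul(term, A)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def lemma1_max_error(lam, S, tau):
    """Largest entrywise gap between exp(Q tau) and the block form of Lemma 1."""
    m = len(lam)
    U = [[S[i][j] - (lam[i] if i == j else 0.0) for j in range(m)] for i in range(m)]
    D = [[lam[i] if i == j else 0.0 for j in range(m)] for i in range(m)]
    Q = [U[i] + D[i] for i in range(m)] + [[0.0] * (2 * m) for _ in range(m)]
    P = mat_exp([[q * tau for q in row] for row in Q])
    U_tau = [[u * tau for u in row] for row in U]
    upper_left = mat_exp(U_tau)
    upper_right = [[x * tau for x in row] for row in mat_mul(f_series(U_tau), D)]
    errors = []
    for i in range(m):
        for j in range(m):
            errors.append(abs(P[i][j] - upper_left[i][j]))
            errors.append(abs(P[i][m + j] - upper_right[i][j]))
            errors.append(abs(P[m + i][j]))                        # lower left: 0
            errors.append(abs(P[m + i][m + j] - float(i == j)))    # lower right: I
    return max(errors)


max_error = lemma1_max_error([0.1, 0.5], [[-0.1, 0.1], [0.2, -0.2]], 2.0)
print(max_error)
```

The reported gap should be at the level of floating-point round-off if the block form holds.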

For technical reasons we impose the following assumption, which is also biologically plausible.

Assumption 1. The speciation rates $λ_{i}$ are positive for all i.

Lemma 2. Let (λ, S) be parameters for an MPBT edge process satisfying Assumption 1. Then U is nonsingular and all eigenvalues of U have negative real part.

Proof. The assumption implies that U is strictly diagonally dominant, that is, the absolute value of each diagonal entry is strictly greater than the sum of the absolute values of all other entries in its row. Thus U is nonsingular (Horn and Johnson, 2012). Since the diagonal entries are also negative, by the Gershgorin Circle Theorem every eigenvalue of U will have negative real part.

Proposition 1. Let $F_{i}$ denote the cdf of the speciation time $T$ conditioned on $E_{0} = i_{-}$ , and $\mathbf{1}$ be the vector of 1s. Then $F_{i} (τ)$ is given by the i-th entry of $\mathbf{1} - \exp (U τ) \mathbf{1} .$

Moreover, under Assumption 1, $T$ is finite with probability 1.

Proof. Since $T$ is the time $E_{τ}$ first enters any of the absorbing states $j_{+}$ , $F_{i} (τ)$ is the sum across the $i_{-}$ row of the upper right $m \times m$ block of $P (τ)$ . From Lemma 1, using that $D \mathbf{1} = - U \mathbf{1}$ , the column vector of the $F_{i} (τ)$ is therefore given by $f (U τ) D τ \mathbf{1} = - f (U τ) U τ \mathbf{1} = \mathbf{1} - \exp (U τ) \mathbf{1} .$

Under Assumption 1, by Lemma 2 the eigenvalues of U have negative real parts, so $\lim_{τ \to \infty} \exp (U τ) = 0$ . Thus $\lim_{τ \to \infty} F_{i} (τ) = 1$ for each i, implying that $T$ is finite with probability 1.
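The formula of Proposition 1 is easy to evaluate numerically. The sketch below tabulates the cdfs for the binary-type rates of Figure 1 and checks the single-type special case, where the formula should reduce to the exponential cdf $1 - e^{- λ τ}$. The matrix exponential is a hand-written approximation used only to avoid external dependencies, and the function names and the grid of $τ$ values are assumptions of this example.

```python
import math


def mat_mul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def mat_exp(A, squarings=10, terms=20):
    """Approximate exp(A): Taylor series of A / 2**squarings, then squaring."""
    n = len(A)
    scaled = [[a / 2 ** squarings for a in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def speciation_cdf(lam, S, tau):
    """Return [F_1(tau), ..., F_m(tau)], the entries of 1 - exp(U tau) 1."""
    m = len(lam)
    U_tau = [[(S[i][j] - (lam[i] if i == j else 0.0)) * tau for j in range(m)]
             for i in range(m)]
    return [1.0 - sum(row) for row in mat_exp(U_tau)]


lam = [0.1, 0.5]
S = [[-0.1, 0.1], [0.2, -0.2]]
for tau in (0.0, 1.0, 5.0, 200.0):
    print(tau, speciation_cdf(lam, S, tau))

# Single-type special case: the speciation time is exponential with rate 0.3.
single = speciation_cdf([0.3], [[0.0]], 2.0)[0]
print(single, 1.0 - math.exp(-0.6))
```

The tabulated values start at 0, increase with $τ$, and approach 1, consistent with the speciation time being finite with probability 1 under Assumption 1.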

Proposition 2. Let $P_{i_{-}, j_{+}} = \lim_{τ \to \infty} P_{i_{-}, j_{+}} (τ)$ denote the asymptotic probability of transition to $j_{+}$ conditioned on $E_{0} = i_{-}$ . Then under Assumption 1, $P_{i_{-}, j_{+}}$ is the $(i, j)$ -entry of $- U^{- 1} D$ .

Proof. The matrix $P_{-, +} (τ)$ with entries $P_{i_{-}, j_{+}} (τ)$ is the upper right $m \times m$ block of $P (τ)$ , so by Lemma 1, $P_{-, +} (τ) = f (U τ) D τ = (\exp (U τ) - I) U^{- 1} D,$

using that U is nonsingular by Lemma 2. But $\lim_{τ \to \infty} \exp (U τ) = 0$ because U's eigenvalues have negative real parts. Thus $\lim_{τ \to \infty} P_{-, +} (τ) = - U^{- 1} D .$
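For a binary-type model the matrix $- U^{- 1} D$ of Proposition 2 can be written out with the explicit inverse of a $2 \times 2$ matrix. The sketch below does this for the rates of Figure 1; the function name and the restriction to $m = 2$ are simplifying assumptions of the example.

```python
def speciation_type_probabilities(lam, S):
    """Binary-type case: return -U^{-1} D as a 2 x 2 nested list.

    Entry (i, j) is the probability that an edge started in type i
    has type j at the moment it speciates.
    """
    U = [[S[i][j] - (lam[i] if i == j else 0.0) for j in range(2)] for i in range(2)]
    (a, b), (c, d) = U
    det = a * d - b * c  # nonzero under Assumption 1 (Lemma 2)
    U_inv = [[d / det, -b / det], [-c / det, a / det]]
    # Right-multiplying by D = diag(lam) scales column j by lam[j].
    return [[-U_inv[i][j] * lam[j] for j in range(2)] for i in range(2)]


P_limit = speciation_type_probabilities([0.1, 0.5], [[-0.1, 0.1], [0.2, -0.2]])
for row in P_limit:
    print(row, sum(row))
```

For these rates the rows work out to (7/12, 5/12) and (1/6, 5/6). Each row sums to 1, reflecting that speciation eventually occurs with probability 1.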

4. TYPE COUNTING PROCESS

Another ingredient of our approach to establishing the identifiability of MPBT model parameters is an analysis of an associated classical branching process, in which only the type counts are observed. More specifically, it records the number of edges of the tree which have each type as a function of time, but retains no information on the topology of the tree. We call this the type counting process and in this section use established results to determine the asymptotic behavior of the relative frequencies of each type.

Definition 3. For $i \in [m]$ , let $N_{t}^{i}$ denote the number of edges in a colored random tree arising from the colored MPBT model that exist at time t and are of type i at that moment. The type counting process $N_{t}$ is the $\mathbb{Z}_{\geq 0}^{m}$ -valued continuous-time stochastic process over $[0, \infty)$ defined by $N_{t} := (N_{t}^{1}, N_{t}^{2}, \dots, N_{t}^{m})$ . The relative frequency process is $R_{t} = N_{t} ∕ (\sum_{i = 1}^{m} N_{t}^{i})$ , provided the denominator is nonzero.

The asymptotics of the relative frequencies follow from results of Athreya (1968) on multitype continuous-time Markov branching processes, specifically Theorems 1 and 2 of that work, which are paraphrased below as Theorem 1. Such a model can be described as a process where individuals of type i live an exponentially distributed length of time (whose rate only depends on type) and on death may be replaced by individuals of any type according to a distribution over $\mathbb{Z}_{\geq 0}^{m}$ .

To place the type counting process of the MPBT model into this framework, both speciation and change in type are viewed as deaths. Speciation results in replacement by two individuals of the same type, and change in type results in replacement by an individual of a different type. Since a speciation “death” of a type i individual occurs with rate $λ_{i}$ , and a type change “death” of a type i individual followed by replacement with type j occurs with rate $s_{i j}$ , the combined rate of death for type i is $λ_{i} + \sum_{j \neq i} s_{i j}$ . When a death occurs, it is a speciation with probability $\frac{λ_{i}}{λ_{i} + \sum_{j \neq i} s_{i j}},$

and a change to type j with probability $\frac{s_{i j}}{λ_{i} + \sum_{j \neq i} s_{i j}} .$

Basic properties of the type counting process are summarized in the following.

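Lemma 3. The type counting process $N_{t}$ of the MPBT model is a strong Markov, continuous-time, m-type branching process, where each type i death has an offspring distribution defined by the multivariable probability generating function $h_{i} (x_{1}, \dots, x_{m}) = \frac{λ_{i} x_{i}^{2} + \sum_{j \neq i} s_{i j} x_{j}}{λ_{i} + \sum_{j \neq i} s_{i j}} .$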

We introduce yet another matrix defined in terms of the MPBT model parameters, as its leading eigenvalue and corresponding eigenvector play a large role in the counting process's behavior.

Definition 4. Given parameters $(λ, S)$ of the MPBT model, let $A = S + D .$

A leading eigenvalue of A is an eigenvalue, $ω$ , with the largest real part, and a normalized leading left eigenvector of A is a left eigenvector $u = (u_{1}, \dots, u_{m})$ for $ω$ with $\sum_{i} u_{i} = 1$ .

The matrix A is the infinitesimal generator of the conditional expectations of the $N_{t}^{i}$ . More precisely, $\exp (A t) = M_{t} = (m_{i j} (t))$

with $m_{i j} (t) = \mathbb{E} [N_{t}^{j} | N_{0} = e_{i}],$

where $e_{i}$ is the i-th standard basis vector.
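The identity $A = S + D$ can be recovered from the death rates and offspring probabilities described above. The sketch below assumes the standard formula for the mean generator of a continuous-time multitype branching process, namely that entry $(i, j)$ equals the type i death rate times the mean number of type j offspring minus $δ_{i j}$; that formula is quoted from the general theory rather than derived in this article, and the function name is hypothetical.

```python
def mean_generator(lam, S):
    """Build A entrywise from death rates and mean offspring numbers.

    Assumes every type has a positive total death rate.
    """
    m = len(lam)
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        # Combined death rate of type i: speciation plus all type switches.
        rate = lam[i] + sum(S[i][j] for j in range(m) if j != i)
        for j in range(m):
            if j == i:
                # A speciation "death" leaves two individuals of type i.
                mean_offspring = 2.0 * lam[i] / rate
            else:
                # A switch "death" leaves one individual of type j.
                mean_offspring = S[i][j] / rate
            A[i][j] = rate * (mean_offspring - (1.0 if i == j else 0.0))
    return A


lam = [0.1, 0.5]
S = [[-0.1, 0.1], [0.2, -0.2]]
A = mean_generator(lam, S)
S_plus_D = [[S[i][j] + (lam[i] if i == j else 0.0) for j in range(2)]
            for i in range(2)]
print(A)
print(S_plus_D)
```

For the rates of Figure 1, both computations give the matrix with rows (0, 0.1) and (0.2, 0.3), up to floating-point round-off.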

We will shortly show that $ω$ and $u$ are uniquely determined, under an additional assumption.

Assumption 2. The off-diagonal entries of S are positive, that is, $s_{i j} > 0$ for $i \neq j$ .

Lemma 4. For parameters $(λ, S)$ of the MPBT model satisfying Assumption 2,

1. $M_{t} = \exp (A t)$ has positive entries for $t > 0$ .

2. A has a unique leading eigenvalue $ω$ , which is both simple and real. Moreover the corresponding normalized left eigenvector $u$ can be chosen to have all positive components.

Proof. Fix $t > 0$ . Then, using Assumption 2, A has positive off-diagonal entries, so there is a real k such that $B = A t + k I$ has positive entries. Since $B$ and $k I$ commute, it follows that $e^{A t} = e^{B - k I} = e^{- k} e^{B}$ . Since B has positive entries, $e^{B}$ does as well. Thus, $e^{A t}$ has positive entries.

The Perron–Frobenius Theorem applied to B shows that it has a unique dominant (i.e., of maximal absolute value) eigenvalue which is also positive and simple, with a unique normalized left eigenvector $u$ whose components are all positive. Since A has the same eigenvectors, and eigenvalues shifted by $- k$ and scaled by $1 ∕ t$ , the second claim follows.
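The shift used in this proof also suggests a simple way to approximate $ω$ and $u$ numerically: apply power iteration to the positive matrix $B = A + k I$ (taking $t = 1$), which has the same left eigenvectors as A. The sketch below is one such illustration for the rates of Figure 1; the function name, the particular shift k, and the iteration count are assumptions of the example, and a library eigenvalue routine would normally be preferred.

```python
def leading_left_eigenpair(A, iterations=500):
    """Approximate the leading eigenvalue and normalized left eigenvector of A.

    Assumes the off-diagonal entries of A are positive (Assumption 2).
    """
    m = len(A)
    # Shift so that B = A + k I has positive entries, as in the proof of Lemma 4.
    k = 1.0 + max(abs(A[i][i]) for i in range(m))
    B = [[A[i][j] + (k if i == j else 0.0) for j in range(m)] for i in range(m)]
    u = [1.0 / m] * m
    for _ in range(iterations):
        v = [sum(u[i] * B[i][j] for i in range(m)) for j in range(m)]  # u B
        total = sum(v)
        u = [x / total for x in v]  # keep the entries summing to 1
    # Because u sums to 1 and u A = omega u, the entries of u A sum to omega.
    omega = sum(u[i] * A[i][j] for i in range(m) for j in range(m))
    return omega, u


A = [[0.0, 0.1], [0.2, 0.3]]  # A = S + D for the rates of Figure 1
omega, u = leading_left_eigenpair(A)
print(omega, u)
```

For this matrix the characteristic polynomial is $ω^{2} - 0.3 ω - 0.02$, so the leading eigenvalue is $(0.3 + \sqrt{0.17}) ∕ 2 \approx 0.356$, with $u \approx (0.360, 0.640)$.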

Key properties of the counting process follow from the following more general theorem on classical branching processes.

Theorem 1. (Athreya, 1968) Let $X_{t}$ be a strong Markov, continuous-time, m-type branching process over $[0, \infty)$ which takes values in $\mathbb{Z}_{\geq 0}^{m}$ . Let $M_{t} = \exp (A t)$ be the conditional expectation matrix. Let $h_{i} (x_{1}, \dots, x_{m})$ be the offspring probability generating function for type i.

If $M_{t_{0}}$ has positive entries for some $t_{0} > 0$ , and $h_{i}$ is of degree $> 1$ for all i, then as $t \to \infty$ , $X_{t} e^{- ω t} \to W u$ almost surely,

where W is a nonnegative random variable, $ω$ is the leading eigenvalue of A, and $u$ is the positive normalized left eigenvector of A associated with $ω$ .

Moreover, if $ξ^{i} = (ξ_{j}^{i})$ are random variables with generating functions $h_{i}$ , then $\mathbb{E} [ξ_{j}^{i} \log (ξ_{j}^{i})] < \infty$ (1)

for all $i, j$ if and only if for all i $ℙ (W = 0 | X_{0} = e_{i}) = ℙ (X_{t} = 0 \text{ for some } t | X_{0} = e_{i}) .$

Corollary 1. Consider the counting process associated to the MPBT model for parameters $(π, λ, S)$ . Then under Assumptions 1 and 2, $\sum_{i} N_{t}^{i}$ is nonzero and as $t \to \infty$ , $R_{t} \to u$ almost surely,

where u is the positive normalized leading left eigenvector of A.

Proof. Using the assumptions and Lemmas 3 and 4, the hypotheses of Theorem 1 are met, including inequality (1). Thus $N_{t} e^{- ω t} \to W u$ almost surely,

where $ω$ is the leading eigenvalue of A, $u$ is its positive normalized left eigenvector, and W is a non-negative random variable.

Since the random variable $\sum_{i} N_{t}^{i}$ is nondecreasing, the probability of extinction is zero: $ℙ (N_{t} = 0 \text{ for some } t | N_{0} = e_{i}) = 0 .$

Thus we find $ℙ (W = 0 | N_{0} = e_{i}) = 0$ , implying $ℙ (W = 0) = 0$ regardless of $π$ . Then by the continuous mapping theorem, the i-th entry of $R_{t}$ satisfies $\frac{N_{t}^{i} e^{- ω t}}{\sum_{j} N_{t}^{j} e^{- ω t}} \to \frac{W u_{i}}{W \sum_{j} u_{j}} = u_{i}$ almost surely

for each i.
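Corollary 1 can be illustrated by simulation. The sketch below follows only the sequence of events of the type counting process (speciations and type switches), ignoring the waiting times between them, and stops once a prescribed number of lineages is reached; stopping at a lineage count rather than at a fixed time t is a simplification made for this example. The function names, the random seed, and the target of 20000 lineages are likewise assumptions, and the comparison vector u is computed from the closed form available for $2 \times 2$ matrices.

```python
import random


def simulate_relative_frequencies(lam, S, start_type, max_lineages, rng):
    """Run the type counting process until max_lineages lineages exist."""
    m = len(lam)
    counts = [0] * m
    counts[start_type] = 1
    # Event (i, j): i == j is a speciation of type i, otherwise a switch i -> j.
    events = [(i, j) for i in range(m) for j in range(m)]
    while sum(counts) < max_lineages:
        weights = [counts[i] * (lam[i] if i == j else S[i][j]) for i, j in events]
        i, j = rng.choices(events, weights=weights)[0]
        if i == j:
            counts[i] += 1  # speciation adds one lineage of type i
        else:
            counts[i] -= 1  # a switch moves one lineage from type i to type j
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts]


def binary_leading_left_eigenvector(lam, S):
    """Closed form for m = 2: normalized left eigenvector of A = S + D."""
    a11, a12 = S[0][0] + lam[0], S[0][1]
    a21, a22 = S[1][0], S[1][1] + lam[1]
    trace, det = a11 + a22, a11 * a22 - a12 * a21
    omega = (trace + (trace ** 2 - 4.0 * det) ** 0.5) / 2.0
    # u A = omega u gives u_1 (omega - a11) = u_2 a21.
    weights = [a21, omega - a11]
    return [w / sum(weights) for w in weights]


lam = [0.1, 0.5]
S = [[-0.1, 0.1], [0.2, -0.2]]
u = binary_leading_left_eigenvector(lam, S)
R = simulate_relative_frequencies(lam, S, 0, 20000, random.Random(1))
print("simulated:", R)
print("predicted:", u)
```

With this many lineages the simulated relative frequencies should lie close to u, although any single run is random and the agreement is only approximate.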

Remark. In studying diversification models with a single type but time-dependent rates of speciation and extinction, it is common to consider the random function giving the number of lineages through time in a tree. This loses no information on parameters from the full tree, as each change in its value (speciation or extinction) is equally likely to have occurred on any lineage, and the growth of this function is thus highly informative on parameter values. For the multitype pure-birth model, however, the function $\sum_{i} N_{t}^{i}$ should not capture all information in the tree, as speciation may not be equally likely on all lineages. Corollary 1 indicates that its growth is determined only by $ω$ , the leading eigenvalue of A.

5. IDENTIFIABILITY OF THE MPBT MODEL

Using the distributions of edge lengths and relative frequencies of each type of edge in a tree at a given time found in Sections 3 and 4, we are ready to establish identifiability of the MPBT parameters. To do so, we consider an asymptotic joint distribution of the lengths of 3 edges around a common node in the tree (Fig. 2). We seek to show that from this distribution the model parameters $(λ, S)$ can be determined, up to label swapping.

Due to the conditional independence of the lengths of three edges sharing a common node, given that node's type, this distribution is a mixture of product distributions, with the mixing distribution and the components of the products closely related to distributions previously computed. This structure allows for the application of the following theorem, to obtain unmixed distributions of edge lengths conditioned on the type of the parental node. Thus, even though we have no observation of type at any point in the tree, we can extract a distribution that is conditioned on type.

The following is a variant of Theorem 8 of Allman et al (2009), with the hypotheses modified as discussed on p. 3116 of that article.

Theorem 2. (Allman et al, 2009) For $1 \leq i \leq m$ , let $μ_{i} = μ_{i}^{1} \times μ_{i}^{2} \times μ_{i}^{3}$

be a product of 3 independent, absolutely continuous distributions $μ_{i}^{k}$ on $ℝ$ . With $π_{i} > 0$ , let $(π_{1}, π_{2}, \dots, π_{m})$ be a distribution on $[m]$ . For each k, suppose the set of distributions $\{μ_{i}^{k}\}_{i = 1}^{m}$ has the property that every subset of $r_{k}$ elements is linearly independent and that $r_{1} + r_{2} + r_{3} \geq 2 m + 2 .$

Then, up to label swapping in i, the $μ_{i}^{k}$ and $π_{i}$ are determined by the mixture distribution $P = \sum_{i = 1}^{m} π_{i} μ_{i} .$

More precisely, P determines distributions $ν_{i}^{k}$ and $(p_{1}, p_{2}, \dots, p_{m})$ such that for some permutation $σ$ of the set $[m]$ , $μ_{i}^{k} = ν_{σ (i)}^{k} \text{ and } π_{i} = p_{σ (i)} .$

To apply this theorem, we make a further technical assumption, denoting the vector of 1s by $1$ .

Assumption 3. Parameters $(λ, S)$ are such that the $m \times m$ matrix $M = M (λ, S) = (\begin{matrix} 1 & U 1 & U^{2} 1 & \dots & U^{m - 1} 1 \end{matrix})$

is nonsingular.

While the role of this assumption in our arguments will be clear in our proofs of Lemma 6 and Theorem 3 below, to understand its implications concretely, consider first the case $m = 2$ . Then $U = (\begin{matrix} - s_{12} - λ_{1} & s_{12} \\ s_{21} & - s_{21} - λ_{2} \end{matrix}),$

so $M = (\begin{matrix} 1 & - λ_{1} \\ 1 & - λ_{2} \end{matrix}) .$

The nonsingularity of M thus is equivalent to $λ_{1} \neq λ_{2} .$ That these speciation rates would need to be different for parameters to be identifiable is intuitively clear, since otherwise type changes governed by S would have no impact on the structure of the uncolored tree.

For general m, Assumption 3 is equivalent to the nonvanishing of $\det M$ , a degree $\sum_{i = 1}^{m - 1} i = \binom{m}{2}$ polynomial in the $m^{2}$ independent entries of $λ, S$ . Its nonvanishing thus excludes an algebraic variety of codimension 1, a set of Lebesgue measure 0 in the unrestricted parameter space. An explicit calculation in the $m = 3$ case shows the polynomial to be an irreducible polynomial in $λ_{i}$ and $s_{i j}$ , $i \neq j$ .

The nonvanishing of $\det M$ always requires that the vector $λ = - U 1$ not be a multiple of $1$ (so that the first two columns of M are linearly independent) and, hence, that not all $λ_{i}$ are the same. However, the additional restrictions it imposes on the parameters are more opaque to intuition without considering special cases.

For instance, when $m = 3$ , if all the $s_{i j}$ are equal, so the type switching behavior is identical for all types, the polynomial simplifies considerably, and factors as $(λ_{1} - λ_{2}) (λ_{2} - λ_{3}) (λ_{3} - λ_{1}) .$

Nonvanishing of the polynomial, then, requires that the three $λ_{i}$ be distinct, as one would expect is needed for identifiability, for otherwise several types would behave identically. However, for other choices of the $s_{i j}$ , two of the $λ_{i}$ can be equal without the polynomial vanishing.
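The special cases of Assumption 3 discussed above can be checked numerically. The following is a minimal Python sketch, assuming only that $U = S - D$ with $D$ the diagonal matrix of speciation rates and that $M$ has columns $1, U 1, \dots, U^{m - 1} 1$ ; the helper names (`build_U`, `build_M`, `det`) and all rate values are hypothetical choices made for illustration. With this column ordering, the $m = 3$ equal-switching example gives $\det M = - (λ_{1} - λ_{2}) (λ_{2} - λ_{3}) (λ_{3} - λ_{1})$ , so the factorization holds up to sign.

```python
from fractions import Fraction

def build_U(lam, S_off):
    """U = S - D. S_off[i][j] is the switching rate s_ij for i != j
    (diagonal entries are ignored); lam[i] is the speciation rate of type i."""
    m = len(lam)
    U = [[Fraction(0)] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i != j:
                U[i][j] = Fraction(S_off[i][j])
        # Rows of S sum to 0, so the diagonal of U is -(sum of s_ij) - lam_i.
        U[i][i] = -sum(U[i][j] for j in range(m) if j != i) - Fraction(lam[i])
    return U

def build_M(U):
    """Matrix with columns 1, U1, U^2 1, ..., U^{m-1} 1 (Assumption 3)."""
    m = len(U)
    col = [Fraction(1)] * m
    cols = [col]
    for _ in range(m - 1):
        col = [sum(U[i][j] * col[j] for j in range(m)) for i in range(m)]
        cols.append(col)
    return [[cols[k][i] for k in range(m)] for i in range(m)]

def det(A):
    """Exact determinant by Laplace expansion (adequate for small m)."""
    if len(A) == 1:
        return A[0][0]
    total = 0
    for j in range(len(A)):
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += (-1) ** j * A[0][j] * det(minor)
    return total

# m = 2: det M equals lambda_1 - lambda_2, whatever the switching rates.
U2 = build_U([3, 1], [[0, 2], [5, 0]])
print("m=2:", det(build_M(U2)))

# m = 3 with all s_ij equal: compare det M with the product of differences.
lam = [1, 2, 3]
s = 1
U3 = build_U(lam, [[0, s, s], [s, 0, s], [s, s, 0]])
product = (lam[0] - lam[1]) * (lam[1] - lam[2]) * (lam[2] - lam[0])
print("m=3, equal s:", det(build_M(U3)), "vs product", product)

# m = 3 with unequal s_ij and lambda_1 = lambda_2: det M can still be nonzero.
U3b = build_U([1, 1, 2], [[0, 1, 2], [3, 0, 1], [1, 2, 0]])
print("m=3, unequal s, two equal lambdas:", det(build_M(U3b)))
```

Exact rational arithmetic is used so that a vanishing determinant is detected without floating-point tolerance.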

Next, we define the joint edge length distribution for several edges of a tree.

Definition 5. For some $t < T$ , consider the following three random variables. Sample an (uncolored) tree of depth T under the MPBT model. From among the edges of the tree existing at time t choose one uniformly at random. Then with $t_{b} \in (t, T)$ , the time at which that edge speciates, let $ℓ_{t}^{0} = t_{b} - t$ denote the time interval until it speciates, and let $ℓ_{t}^{1}$ and $ℓ_{t}^{2}$ , respectively, denote the lengths of the immediate descendent edges (where the edges are designated 1,2 uniformly at random). Then the joint distribution of these three variables $ℓ_{t}^{0}$ , $ℓ_{t}^{1}$ , and $ℓ_{t}^{2}$ is $G_{T, t} (τ_{0}, τ_{1}, τ_{2}) : = ℙ (ℓ_{t}^{0} \leq τ_{0}, ℓ_{t}^{1} \leq τ_{1}, ℓ_{t}^{2} \leq τ_{2} | ℓ_{t}^{1}, ℓ_{t}^{2} < T - t - ℓ_{t}^{0}) .$

We call $G_{T, t}$ the joint distribution of edge lengths around a node.

The three edge lengths used in the definition of $G_{T, t}$ are depicted in Figure 2, for $t = T ∕ 2$ . The conditioning in the definition of $G_{T, t}$ ensures that it only considers edges in which the edge process has led to speciation, that is, the edge processes for the parental and child edges are not truncated.
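To make Definition 5 concrete, the following Python sketch samples the three edge lengths around a node in the untruncated regime (that is, ignoring the cutoff at depth $T$ ). It assumes the edge process of the earlier sections: an edge currently of type i switches to type j at rate $s_{i j}$ and speciates at rate $λ_{i}$ , with both child edges starting in the type held at speciation. The function names `sample_edge` and `sample_triple`, the rate values, and the seed are hypothetical choices for illustration.

```python
import random

def sample_edge(start_type, lam, S_off, rng):
    """Run one untruncated edge process from start_type.
    Returns (edge length, type at the moment of speciation)."""
    m = len(lam)
    i, t = start_type, 0.0
    while True:
        switch_total = sum(S_off[i][j] for j in range(m) if j != i)
        total = lam[i] + switch_total
        t += rng.expovariate(total)          # waiting time to the next event
        if rng.random() < lam[i] / total:    # the event is a speciation
            return t, i
        # Otherwise the type switches; pick j with probability s_ij / switch_total.
        r, acc = rng.random() * switch_total, 0.0
        for j in range(m):
            if j != i:
                acc += S_off[i][j]
                if r < acc:
                    i = j
                    break

def sample_triple(start_type, lam, S_off, rng):
    """Sample (l0, l1, l2): time until the chosen edge speciates, then the
    lengths of its two child edges, which are independent given the node type."""
    l0, node_type = sample_edge(start_type, lam, S_off, rng)
    l1, _ = sample_edge(node_type, lam, S_off, rng)
    l2, _ = sample_edge(node_type, lam, S_off, rng)
    return l0, l1, l2

# Sanity check in a degenerate case: with equal speciation rates, every edge
# length is exponential with that rate, so the sample mean should be near 1/rate.
rng = random.Random(0)
samples = [sample_triple(0, [2.0, 2.0], [[0.0, 1.5], [0.7, 0.0]], rng)
           for _ in range(20000)]
mean_l0 = sum(s[0] for s in samples) / len(samples)
print("mean of l0:", mean_l0)
```

The equal-rate case is used only because its edge-length distribution is known in closed form; it is not a setting in which the type-dependent parameters could be recovered.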

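Lemma 5. Under Assumptions 1 and 2, as $T \to \infty$ , the joint distribution $G_{T, T ∕ 2}$ at time $T ∕ 2$ of edge lengths around a node on a tree of depth T converges to $G_{\infty} (τ_{0}, τ_{1}, τ_{2}) : = \sum_{i = 1}^{m} \sum_{j = 1}^{m} u_{i} P_{i_{-}, j_{+}} (τ_{0}) F_{j} (τ_{1}) F_{j} (τ_{2}),$ (2)

where $F_{j}$ , $P_{i_{-}, j_{+}}$ , and $u_{i}$ are defined in Propositions 1, 2, and Lemma 4, respectively.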

Proof. Note that the event E which is conditioned upon in the definition of $G_{T, T ∕ 2}$ excludes edge lengths resulting from truncated edge processes, so that all edge lengths under consideration are in fact speciation times $T$ . Thus $G_{T, T ∕ 2} (τ_{0}, τ_{1}, τ_{2}) = ℙ (T_{T ∕ 2}^{0} \leq τ_{0}, T_{T ∕ 2}^{1} \leq τ_{1}, T_{T ∕ 2}^{2} \leq τ_{2}) + ε_{T} (τ_{0}, τ_{1}, τ_{2}),$

where the function $ε_{T}$ is the difference of the conditional probability $G_{T, T ∕ 2}$ and the unconditional probability. But since $ℙ (E) \to 1$ as $T \to \infty$ , it follows that $ε_{T} \to 0$ . We henceforth focus on $ℙ (T_{T ∕ 2}^{0} \leq τ_{0}, T_{T ∕ 2}^{1} \leq τ_{1}, T_{T ∕ 2}^{2} \leq τ_{2})$ rather than $G_{T, T ∕ 2}$ .

Letting $A_{i}$ denote the event that the edge chosen uniformly at random is of type i at time $\frac{T}{2}$ and $B_{j}$ denote the event that the edge speciates in color j, and recalling that edge processes around a node are independent when conditioned on the type of that node, we have

In this last expression, the only dependence on T is in $ℙ (A_{i})$ . But by Corollary 1, $ℙ (A_{i}) = ℰ [R_{T ∕ 2}^{i}] \to u_{i}$ as $T \to \infty$ , yielding Equation (2).

Remark. While the specific time $T ∕ 2$ is used in this lemma, our arguments would be essentially unchanged if it were replaced by any function $f (T)$ with $f (T)$ and $T - f (T) \to \infty$ as $T \to \infty$ .

This immediately gives that $G_{\infty}$ is a finite mixture of product distributions.

Corollary 2. The asymptotic joint distribution of edge lengths around a node, $G_{\infty}$ , can be expressed as an m-component mixture of products of 3 univariate distributions: $G_{\infty} (τ_{0}, τ_{1}, τ_{2}) = \sum_{j = 1}^{m} π_{j} μ_{j}^{1} (τ_{0}) μ_{j}^{2} (τ_{1}) μ_{j}^{3} (τ_{2}),$

where $π_{j} = \sum_{i} P_{i_{-}, j_{+}} u_{i}$ , $μ_{j}^{1} (τ) = \frac{\sum_{i} P_{i_{-}, j_{+}} (τ) u_{i}}{\sum_{i} P_{i_{-}, j_{+}} u_{i}}$ , $μ_{j}^{2} (τ) = μ_{j}^{3} (τ) = F_{j} (τ)$ , and $P_{i_{-}, j_{+}}$ is as defined in Proposition 2.

To apply Theorem 2 to $G_{\infty}$ , we need to verify that sufficiently many of the univariate distributions in its decomposition above are linearly independent. This is the content of Lemma 6 below.

We first introduce an additional assumption, which holds for generic parameters.

Assumption 4. The speciation parameters satisfy $λ_{i} \neq λ_{j}$ for all $i \neq j$ .

Lemma 6. Suppose Assumptions 1, 2, 3, and 4 hold, and consider the sets of univariate distributions $\{μ_{j}^{k}\}_{j = 1}^{m}$ defined in Corollary 2. For $k = 1$ , every pair of functions in this set is linearly independent, while for $k = 2, 3$ the full set is linearly independent.

Proof. Since $μ_{j}^{2} = μ_{j}^{3}$ , we need only consider the cases $k = 1, 2$ .

Consider first the case $k = 2$ . Consider the vector F of functions $μ_{j}^{2} = F_{j}$ . Then by Proposition 1, $F = 1 - exp (U τ) 1 .$

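Suppose $c^{T} F = 0$ for some vector $c$ . Since $\frac{d^{n}}{d τ^{n}} F (0) = - U^{n} 1$ for $n \geq 1$ , we have $c^{T} U^{n} 1 = 0$ for $n \geq 1$ ; moreover, since each $F_{j} (τ) \to 1$ as $τ \to \infty$ , also $c^{T} 1 = 0$ . It follows that $c^{T} M = 0$ where M is defined in Assumption 3. Since M is nonsingular, $c = 0$ , so the entries of F are linearly independent.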
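The two properties of $F = 1 - \exp (U τ) 1$ that drive this argument can be checked numerically: its first derivative at 0 is $- U 1 = λ$ , and each entry tends to 1 as $τ$ grows. The following Python sketch does so for one hypothetical two-type parameter choice ( $λ = (3, 1)$ , $s_{12} = 2$ , $s_{21} = 5$ ); the helpers `mat_mul`, `mat_exp`, and `F` are illustrative names, and the matrix exponential is a simple scaling-and-squaring approximation rather than a library routine.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(A, terms=30):
    """Approximate exp(A): scale A down, sum a truncated Taylor series,
    then square the result back up."""
    n = len(A)
    norm = max(sum(abs(x) for x in row) for row in A)
    k = 0
    while norm / 2 ** k > 0.5:
        k += 1
    B = [[x / 2 ** k for x in row] for row in A]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result, term = identity, identity
    for p in range(1, terms):
        term = [[x / p for x in row] for row in mat_mul(term, B)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, result)
    return result

def F(U, tau):
    """Vector with entries F_j(tau) = 1 - [exp(U tau) 1]_j."""
    E = mat_exp([[x * tau for x in row] for row in U])
    return [1.0 - sum(row) for row in E]

# Hypothetical example: lambda = (3, 1), s_12 = 2, s_21 = 5, so U = S - D is
U = [[-5.0, 2.0],
     [5.0, -6.0]]

h = 1e-6
slope_at_zero = [v / h for v in F(U, h)]   # finite-difference estimate of F'(0)
print("F'(0) is approximately", slope_at_zero)   # compare with lambda = (3, 1)
print("F(50) =", F(U, 50.0))                     # each entry should be close to 1
```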

For $k = 1$ , it is enough to show the independence of each pair of functions

From Lemma 1 the vector G of all $ν_{j}$ is given by

Suppose for some vector $c$ . Since it follows that $u^{T} U^{n - 1} D c = 0 \text{ for } n \geq 1 .$

In particular, for $n = 1$ we find $u^{T} D c = 0$ . For $n = 2$ , since $U = A - 2 D$ and $u^{T} A = ω u^{T}$ , we have $u^{T} U D c = u^{T} (ω I - 2 D) D c = 0 .$

To show that every pair of the $ν_{j}$ s is independent, consider $c$ all of whose entries except possibly two are zero. Without loss of generality suppose the exceptions are $c_{1}, c_{2}$ . Then the $n = 1, 2$ equations become $(\begin{matrix} u_{1} λ_{1} & u_{2} λ_{2} \\ u_{1} (ω - 2 λ_{1}) λ_{1} & u_{2} (ω - 2 λ_{2}) λ_{2} \end{matrix}) (\begin{matrix} c_{1} \\ c_{2} \end{matrix}) = 0 .$

Using $u_{1}, u_{2}, λ_{1}, λ_{2} > 0, λ_{1} \neq λ_{2}$ , computing the determinant of this matrix shows that it is nonsingular and, hence, $c_{1} = c_{2} = 0$ .
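For completeness, the determinant in question can be written out explicitly. Factoring $u_{1} λ_{1}$ from the first column and $u_{2} λ_{2}$ from the second, the $ω$ terms cancel:

```latex
\det\begin{pmatrix}
  u_{1}\lambda_{1} & u_{2}\lambda_{2} \\
  u_{1}(\omega - 2\lambda_{1})\lambda_{1} & u_{2}(\omega - 2\lambda_{2})\lambda_{2}
\end{pmatrix}
= u_{1}u_{2}\lambda_{1}\lambda_{2}\,\bigl[(\omega - 2\lambda_{2}) - (\omega - 2\lambda_{1})\bigr]
= 2\,u_{1}u_{2}\lambda_{1}\lambda_{2}\,(\lambda_{1} - \lambda_{2}).
```

This is nonzero exactly when $λ_{1} \neq λ_{2}$ , given that $u_{1}, u_{2}, λ_{1}, λ_{2} > 0$ , which is where Assumption 4 enters.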

We now arrive at our main result.

Theorem 3. Under the explicit generic Assumptions 1, 2, 3, and 4, the parameters $(λ, S)$ of the uncolored MPBT model are identifiable up to label swapping from the asymptotic distribution $G_{\infty}$ of edge lengths around a node.

Proof. Suppose two parameter choices, $(π, λ, S)$ and $(π^{*}, λ^{*}, S^{*})$ , induce the same asymptotic distribution $G_{\infty}$ . Denoting the various distributions of conditional branching times, asymptotic transition probabilities, eigenvectors of matrices, etc. associated to parameters $(π, λ, S)$ as earlier in this work, we use the same notation with a “*” appended to denote the corresponding entities associated to parameters $(π^{*}, λ^{*}, S^{*})$ .

By Theorem 2, Corollary 2, and Lemma 6, the distributions $π_{i}, μ_{i}^{k}$ , for $1 \leq i \leq m$ , $1 \leq k \leq 3$ , are determined from $G_{\infty} = G_{\infty}^{*}$ , up to label swapping in i. Thus $F_{i}^{*} (τ) = F_{σ (i)} (τ)$ for some permutation $σ$ .

Using Proposition 1 the equations $F_{i}^{*} (τ) = F_{σ (i)} (τ)$ for all i can be represented in matrix form as $1 - e^{U^{*} τ} 1 = Σ (1 - e^{U τ} 1),$ (3)

where $Σ$ is the permutation matrix representing $σ$ . Equating coefficients of the Maclaurin series yields for $n = 1, 2, 3, \dots$ that ${(U^{*})}^{n} 1 = Σ U^{n} 1 .$ (4)

Using Equation (4) and the definition of $M, M^{*}$ in Assumption 3 shows $M^{*} = Σ M .$ (5)

Equation (4) further implies $\begin{matrix} U^{*} M^{*} = (\begin{matrix} U^{*} 1 & {(U^{*})}^{2} 1 & {(U^{*})}^{3} 1 & \dots & {(U^{*})}^{m} 1 \end{matrix}) \\ = (\begin{matrix} Σ U 1 & Σ U^{2} 1 & Σ U^{3} 1 & \dots & Σ U^{m} 1 \end{matrix}) \\ = Σ U M . \end{matrix}$

Using Equation (5) then yields $U^{*} Σ M = Σ U M,$

and since M is nonsingular, $U^{*} Σ = Σ U .$

Since $U = S - D$ and each row of S adds to 0, multiplying the last equation by $1$ on the right gives $λ^{*} = Σ λ .$ Since this implies $D^{*} Σ = Σ D$ , it follows that $S^{*} = Σ S Σ^{T}$ as well. Thus the parameters differ only up to label swapping.
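The relations obtained in this proof are exactly those satisfied by a relabeling of types, which can be confirmed on a small example. The Python sketch below takes hypothetical parameters $(λ, S)$ and a permutation $σ$ , forms the relabeled parameters $λ_{i}^{*} = λ_{σ (i)}$ and $s_{i j}^{*} = s_{σ (i) σ (j)}$ , and checks that $U^{*} Σ = Σ U$ and $λ^{*} = Σ λ$ , where $Σ$ has a 1 in position $(i, σ (i))$ . The function names and numerical values are assumptions made for illustration.

```python
def build_U(lam, S_off):
    """U = S - D, where S_off[i][j] is s_ij for i != j (diagonal ignored)."""
    m = len(lam)
    return [[S_off[i][j] if i != j
             else -sum(S_off[i][k] for k in range(m) if k != i) - lam[i]
             for j in range(m)] for i in range(m)]

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def perm_matrix(sigma):
    """Permutation matrix with a 1 in position (i, sigma[i])."""
    m = len(sigma)
    return [[1 if j == sigma[i] else 0 for j in range(m)] for i in range(m)]

def relabel(lam, S_off, sigma):
    """Parameters after renaming type sigma[i] as type i."""
    m = len(lam)
    lam_star = [lam[sigma[i]] for i in range(m)]
    S_star = [[S_off[sigma[i]][sigma[j]] for j in range(m)] for i in range(m)]
    return lam_star, S_star

# Hypothetical three-type example and permutation.
lam = [1, 3, 2]
S_off = [[0, 1, 2], [3, 0, 1], [1, 2, 0]]
sigma = [2, 0, 1]

lam_star, S_star = relabel(lam, S_off, sigma)
U, U_star = build_U(lam, S_off), build_U(lam_star, S_star)
P = perm_matrix(sigma)

print("U* Sigma == Sigma U:", mat_mul(U_star, P) == mat_mul(P, U))
sigma_lam = [sum(P[i][j] * lam[j] for j in range(len(lam)))
             for i in range(len(lam))]
print("lambda* == Sigma lambda:", lam_star == sigma_lam)
```

Integer rates keep the comparison exact, so the equalities are tested without numerical tolerance.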

Remark. Theorem 3 establishes that an asymptotic distribution associated to the MPBT model, as tree depth $\to \infty$ , yields parameter identifiability. This suggests that with a sample of many trees of arbitrarily large size, there is potential for statistically consistent inference, where “consistency” would mean as both the number of trees and the tree depth go to infinity. However, this is not the framework in which data analysis with this model is performed, since while a tree may be large, only one tree observation is available (Maddison et al, 2007).

Fortunately, a minor modification to the proofs above again yields identifiability of parameters from an asymptotic distribution derived from a single observation, as the depth of the tree goes to infinity. Indeed, modify Definition 5 so that $G_{T, t}$ is the distribution of edge lengths around a node from a single growing tree. The proof of Lemma 5, then, is modified only in its last line, as $ℙ (A_{i}) = R_{T ∕ 2}^{i}$ , a random variable rather than its expected value. Nonetheless, by Corollary 1, we again find $ℙ (A_{i}) \to u_{i}$ , so the conclusion is unchanged.

6. DISCUSSION

Theorem 3 and the succeeding remark show that parameters $(λ, S)$ of the MPBT model can be identified from an asymptotic distribution as the tree depth grows, whether or not the number of sampled trees grows. Although this is not sufficient to conclude that estimation of parameters by ML from a single tree, as suggested by Maddison et al (2007), is statistically consistent, it does at least indicate that it is a possibility. A similar question on ML inference of parameters for a hidden Markov model from a single sequence of observations was addressed by Leroux (1992), with the consistency of ML estimation established as the sequence length goes to infinity.

For applications, it would be highly desirable to extend our identifiability result to a model incorporating constant extinction rates for each type. In most biological settings, the obtainable “data,” however, is not the tree with edges stopping at extinction events, but rather the pruned tree in which all edges with no extant descendants are removed.

For a single type, parameter identifiability of a model with pruning was essentially considered by Nee et al (1994), where it was shown that the lineages-through-time function's rate of change allowed the speciation and extinction rates to be determined, by separately considering the time regimes much earlier than the tree tips and near the tree tips. An analysis combining the insight from Nee et al (1994) with the mixture distribution framework used in this work might be successful in showing that parameters can be recovered from a single large tree observation for the multitype birth–death tree model.

We emphasize that our work here in no way suggests that a multitype model incorporating arbitrary time-dependence in its rates will have identifiable parameters. Indeed, the issues that Louca and Pennell (2020) raised are likely to only be compounded in such a setting, unless the time-dependence is restricted to some specific form. Results such as those of Legried and Terhorst (2022a, 2022b) in the single-type case, which show identifiability for piecewise constant and polynomial time-dependent rates, can be expected to generalize to more types.

Another interesting identifiability question for multitype tree models concerns what information on parameters is contained in the tree topology alone or from weaker metric information than precise branch lengths. While our analysis depends heavily on metric features of the tree, that of Popovic and Rivas (2016) required no metric information. However, it did use type observations at the tips of the tree and at their parental nodes. While types at tree tips may be observed in some biological studies, types of the parental nodes are generally not observable, as data are generally collected only from the taxa extant at the present. Even if ancient DNA or other trait data from earlier times are available, it is unlikely to be from the time of the last speciation.

AUTHORS' CONTRIBUTIONS

All authors contributed equally to this work.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no competing financial interests.

FUNDING INFORMATION

E.S.A. and J.A.R. were supported, in part, by NSF Grant DMS-2051760.

References

Allman, Matias, Rhodes. Identifiability of parameters in latent structure models with many observed variables. Ann Stat, 2009; 37(6A):3099–3132; doi: 10.1214/09-AOS689

Athreya KB. Some results on multitype continuous time Markov branching processes. Ann Math Stat, 1968; 39(2):347–357; doi: 10.1214/aoms/1177698395

Barido-Sottani, Vaughan, Stadler. A multi-type birth-death model for Bayesian inference of lineage-specific birth and death rates. Syst Biol, 2020; 69(5):973–986; doi: 10.1093/sysbio/syaa016

Cantalapiedra, FitzJohn, Kuhn, et al. Dietary innovations spurred the diversification of ruminants during the Caenozoic. Proc R Soc B Biol Sci, 2014; 281(1776):20132746.

FitzJohn RG. Diversitree: Comparative phylogenetic analyses of diversification in R. Methods Ecol Evol, 2012; 3:1084–1092.

Harvey, May, Nee. Phylogenies without fossils. Evolution, 1994; 48(3):523–529.

Horn, Johnson. Matrix Analysis. Cambridge University Press: New York, NY, USA; 2012.

Kendall DG. On the generalized “birth-and-death” process. Ann Math Stat, 1948; 19(1):1–15; doi: 10.1214/aoms/1177730285

Legried, Terhorst. A class of identifiable phylogenetic birth & death models. Proc Natl Acad Sci USA, 2022a; 119(35):e2119513119; doi: 10.1073/pnas.2119513119

Legried, Terhorst. Identifiability and inference of phylogenetic birth-death models. 2022b. Available from: https://www.biorxiv.org/content/10.1101/2022.08.26.505438v1 Last viewed on September 26, 2022.

Leroux BG. Maximum-likelihood estimation for hidden Markov models. Stoch Process Their Appl, 1992; 40(1):127–143; doi: 10.1016/0304-4149(92)90141-C

Louca, Pennell. Extant timetrees are consistent with a myriad of diversification histories. Nature, 2020; 580(7804):502–505; doi: 10.1038/s41586-020-2176-1

Maddison, Midford, Otto. Estimating a binary character's effect on speciation and extinction. Syst Biol, 2007; 56(5):701–710; doi: 10.1080/10635150701607033

Maliet, Hartig, Morlon. A model with many small shifts for estimating species-specific diversification rates. Nat Ecol Evol, 2019; 3(7):1086–1092; doi: 10.1038/s41559-019-0908-0

Nee, May, Harvey. The reconstructed evolutionary process. Philos Trans R Soc London Ser B Biol Sci, 1994; 344(1309):305–311; doi: 10.1098/rstb.1994.0068

O'Meara, Beaulieu. Potential survival of some, but not all, diversification methods. 2021. Available from: https://ecoevorxiv.org/w5nvd Last viewed on September 26, 2022.

Popovic, Rivas. Topology and inference for Yule trees with multiple states. J Math Biol, 2016; 73(5):1251–1291; doi: 10.1007/s00285-016-0992-6

Rabosky DL. Automatic detection of key innovations, rate shifts, and diversity-dependence on phylogenetic trees. PLoS One, 2014; 9(2):1–15; doi: 10.1371/journal.pone.0089543

Rabosky, Goldberg. Model inadequacy and mistaken inferences of trait-dependent speciation. Syst Biol, 2015; 64(2):340–355; doi: 10.1093/sysbio/syu131

Rasmussen, Stadler. Coupling adaptive molecular evolution to phylodynamics using fitness-dependent birth-death models. eLife, 2019; 8:e45562; doi: 10.7554/eLife.45562

Stadler. Recovering speciation and extinction dynamics based on phylogenies. J Evol Biol, 2013; 26(6):1203–1219; doi: 10.1111/jeb.12139

Stadler. Species-specific diversification. Nat Ecol Evol, 2019; 3(7):1003–1004; doi: 10.1038/s41559-019-0923-1

Yule GU. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philos Trans R Soc London Ser B Contain Papers Biol Char, 1925; 213(402–410):21–87; doi: 10.1098/rstb.1925.0002